Strange behaviour of do_tfidf #841

@hhaensel

I am trying to reproduce the example from the manual:

> res <- data.frame("text" = c("this is what it is", "which is better")) %>%
+   do_tokenize(text) %>%
+   do_tfidf(document_id, token)

which is expected to result in:

document_id  token   count_per_doc  count_of_docs  tfidf
1            is      2              2              0.0000000
1            it      1              1              0.5773503
1            this    1              1              0.5773503
1            what    1              1              0.5773503
2            better  1              1              0.7071068
2            is      1              2              0.0000000
2            which   1              1              0.7071068
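
For reference, the expected numbers for document 1 can be reproduced by hand. This is only a sketch under an assumption: that the weight is count_per_doc * log(n_docs / count_of_docs), L2-normalised within each document. I inferred that definition from the expected values above, not from the do_tfidf source, and the variable names are my own.

```r
# Assumed definition (inferred from the expected table, not from the
# do_tfidf source): raw weight = tf * log(N / df), then L2-normalised
# within the document.
n_docs <- 2

# Document 1: "this is what it is"
count_per_doc <- c(is = 2, it = 1, this = 1, what = 1)
count_of_docs <- c(is = 2, it = 1, this = 1, what = 1)

raw <- count_per_doc * log(n_docs / count_of_docs)
normalised <- raw / sqrt(sum(raw^2))
print(normalised)
```

Under this assumption "is" gets 0 because it occurs in both documents, and the three remaining tokens share the weight equally, giving 1/sqrt(3) ≈ 0.5773503 each.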

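However, I obtain the following, where "it" gets a tfidf of 0 and "this" and "what" get 0.7071068 instead of 0.5773503: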

document_id  token   count_per_doc  count_of_docs  tfidf
1            is      2              2              0.0000000
1            it      1              1              0.0000000
1            this    1              1              0.7071068
1            what    1              1              0.7071068
2            better  1              1              0.7071068
2            is      1              2              0.0000000
2            which   1              1              0.7071068

Another strange result is the following:

> data.frame("text" = c("good it was", "is nice she", "good is she")) %>%
+   do_tokenize(text) %>%
+   do_tfidf(document_id,token)
document_id  token  count_per_doc  count_of_docs  tfidf
1            good   1              2              0.327
1            it     1              1              0.327
1            was    1              1              0.887
2            is     1              2              0.327
2            nice   1              1              0.327
2            she    1              2              0.887
3            good   1              2              0.327
3            is     1              2              0.327
3            she    1              2              0.887

Here I would expect identical values for "it" and "was", since each occurs once in document 1 and in no other document.
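
To make that expectation concrete, here is a minimal base-R sketch of a reference computation. It assumes tf-idf is count_per_doc * log(n_docs / count_of_docs), L2-normalised per document, which is a guess that happens to match the manual's numbers for the first example; it is not taken from the do_tfidf implementation. The function name tfidf_sketch and its simple letters-only tokenizer are hypothetical stand-ins, not part of the package.

```r
# Hypothetical reference implementation; NOT the package's do_tfidf.
# Assumed definition: tf * log(N / df), L2-normalised within each document.
tfidf_sketch <- function(texts) {
  # Naive tokenizer (assumption): lower-case, split on non-letters.
  tokens <- strsplit(tolower(texts), "[^a-z]+")
  long <- data.frame(
    document_id = rep(seq_along(tokens), lengths(tokens)),
    token = unlist(tokens),
    stringsAsFactors = FALSE
  )
  long <- long[nzchar(long$token), ]
  long$n <- 1L

  # One row per (document, token) pair with its term count.
  counts <- aggregate(n ~ document_id + token, data = long, FUN = sum)
  names(counts)[3] <- "count_per_doc"

  # Number of documents containing each token.
  counts$count_of_docs <- ave(counts$count_per_doc, counts$token, FUN = length)

  raw <- counts$count_per_doc * log(length(texts) / counts$count_of_docs)
  # L2 norm of the raw weights within each document.
  norm <- ave(raw, counts$document_id, FUN = function(w) sqrt(sum(w^2)))
  counts$tfidf <- ifelse(norm > 0, raw / norm, 0)

  counts[order(counts$document_id, counts$token), ]
}

res <- tfidf_sketch(c("good it was", "is nice she", "good is she"))
print(res, row.names = FALSE)
```

With this definition "it" and "was" receive the same weight in document 1, and "good" receives a smaller one because it also occurs in document 3. The same function applied to the first example gives the values the manual lists.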
