I am trying to reproduce the example from the manual:
> res <- data.frame("text" = c("this is what it is", "which is better")) %>%
+ do_tokenize(text) %>%
+ do_tfidf(document_id, token)
which is expected to result in:
| document_id | token | count_per_doc | count_of_docs | tfidf |
|---|---|---|---|---|
| 1 | is | 2 | 2 | 0.0000000 |
| 1 | it | 1 | 1 | 0.5773503 |
| 1 | this | 1 | 1 | 0.5773503 |
| 1 | what | 1 | 1 | 0.5773503 |
| 2 | better | 1 | 1 | 0.7071068 |
| 2 | is | 1 | 2 | 0.0000000 |
| 2 | which | 1 | 1 | 0.7071068 |
However, I obtain:
| document_id | token | count_per_doc | count_of_docs | tfidf |
|---|---|---|---|---|
| 1 | is | 2 | 2 | 0.0000000 |
| 1 | it | 1 | 1 | 0.0000000 |
| 1 | this | 1 | 1 | 0.7071068 |
| 1 | what | 1 | 1 | 0.7071068 |
| 2 | better | 1 | 1 | 0.7071068 |
| 2 | is | 1 | 2 | 0.0000000 |
| 2 | which | 1 | 1 | 0.7071068 |
Another strange result is the following:
> data.frame("text" = c("good it was", "is nice she", "good is she")) %>%
+ do_tokenize(text) %>%
+ do_tfidf(document_id,token)
| document_id | token | count_per_doc | count_of_docs | tfidf |
|---|---|---|---|---|
| 1 | good | 1 | 2 | 0.327 |
| 1 | it | 1 | 1 | 0.327 |
| 1 | was | 1 | 1 | 0.887 |
| 2 | is | 1 | 2 | 0.327 |
| 2 | nice | 1 | 1 | 0.327 |
| 2 | she | 1 | 2 | 0.887 |
| 3 | good | 1 | 2 | 0.327 |
| 3 | is | 1 | 2 | 0.327 |
| 3 | she | 1 | 2 | 0.887 |
where I would expect to find identical values for "it" and "was"...
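
For reference, here is a minimal base R sketch of the calculation I would expect. It rests on an assumption the manual does not confirm: that tfidf is `count_per_doc * log(N / count_of_docs)`, L2-normalized within each document. The function name `tfidf_sketch` and the whitespace tokenizer are my own stand-ins, not part of the package.

```r
# Hypothetical helper, not the package's implementation.
# Assumed formula: tf * log(N / df), then L2-normalized per document.
tfidf_sketch <- function(texts) {
  # Naive whitespace tokenizer: one character vector of tokens per document.
  tokens <- strsplit(texts, "\\s+")
  n_docs <- length(tokens)

  # Long format: one row per token occurrence.
  long <- data.frame(
    document_id = rep(seq_along(tokens), lengths(tokens)),
    token = unlist(tokens),
    stringsAsFactors = FALSE
  )

  # count_per_doc: occurrences of each token within each document.
  out <- as.data.frame(
    table(document_id = long$document_id, token = long$token),
    responseName = "count_per_doc",
    stringsAsFactors = FALSE
  )
  out <- out[out$count_per_doc > 0, ]
  out$document_id <- as.integer(out$document_id)

  # count_of_docs: number of documents that contain the token.
  out$count_of_docs <- ave(out$count_per_doc, out$token, FUN = length)

  # Unnormalized weight, then divide by the document's L2 norm.
  raw <- out$count_per_doc * log(n_docs / out$count_of_docs)
  l2 <- ave(raw, out$document_id, FUN = function(x) sqrt(sum(x^2)))
  out$tfidf <- ifelse(l2 > 0, raw / l2, 0)

  out <- out[order(out$document_id, out$token), ]
  rownames(out) <- NULL
  out
}

tfidf_sketch(c("this is what it is", "which is better"))
tfidf_sketch(c("good it was", "is nice she", "good is she"))
```

Under that assumption, the first call gives 0.5773503 (1/sqrt(3)) for "it", "this" and "what" and 0.7071068 (1/sqrt(2)) for "better" and "which", matching the manual, and the second call gives "it" and "was" the same value, since both occur once in a single document.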