
Cannot reproduce the results on LongBench in the paper. #1


Description

@nietzhuang

Thank you for your great work on KV cache merging; D2O's dynamic token merging method inspires me a lot.

I am currently running your source code. As stated in the paper, I set N:M = 3:1, alpha = 0.3, and beta = 0.7 under a 20% KV cache compression ratio (rho = 0.2). I have verified that the Python dependencies and environment match the requirements of your source code, and I am running on an NVIDIA RTX PRO 6000 GPU, which runs LongBench properly.
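
For clarity, here is the exact hyperparameter setting I am using, written as a plain Python dict; the key names are just my shorthand for the paper's notation, not the actual argument names in your scripts:

```python
# Hyperparameter setting used in my runs, following the paper's notation.
# NOTE: the key names below are my own shorthand, not the repo's CLI flags.
d2o_config = {
    "rho": 0.2,           # KV cache compression ratio (keep 20% of the cache)
    "n_m_ratio": (3, 1),  # N:M = 3:1
    "alpha": 0.3,         # merging weight alpha
    "beta": 0.7,          # merging weight beta
}
```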

However, the results I obtained differ considerably from those reported in the paper, and they do not show D2O's strength over H2O. If there is something I misunderstood or omitted, could you help me resolve this issue? Thank you very much.

Results reported in the paper:

| Dataset | H2O | D2O |
| --- | --- | --- |
| NarrativeQA | 13.27 | 14.43 |
| Qasper | 11.05 | 12.66 |
| MF-en | 17.72 | 19.93 |
| HotpotQA | 10.38 | 11.92 |
| 2WikiMQA | 11.23 | 12.79 |
| Musique | 6.38 | 9.88 |
| GovReport | 21.29 | 24.36 |
| QMSum | 21.33 | 23.42 |
| MultiNews | 3.38 | 3.95 |
| TREC | 66.63 | 69.72 |
| TriviaQA | 89.19 | 90.99 |
| SAMSum | 41.12 | 42.36 |
| Pcount | 5.52 | 6.61 |
| Pre | 11.11 | 14.67 |
| Lcc | 71.86 | 72.43 |
| RB-P | 58.29 | 60 |

Results I obtained from the source code:

| Dataset | H2O | D2O |
| --- | --- | --- |
| NarrativeQA | 12.43 | 12.64 |
| Qasper | 12.55 | 11.92 |
| MF-en | 19.95 | 19.87 |
| HotpotQA | 10.92 | 10.72 |
| 2WikiMQA | 12.2 | 11.95 |
| Musique | 6.65 | 6.75 |
| GovReport | 22.97 | 21.13 |
| QMSum | 23.44 | 23.13 |
| MultiNews | 3.56 | 1.93 |
| TREC | 69 | 69.67 |
| TriviaQA | 90.57 | 90.63 |
| SAMSum | 41.96 | 42.05 |
| Pcount | 5.18 | 5.9 |
| Pre | 11.58 | 13.86 |
| Lcc | 69.26 | 71.57 |
| RB-P | 55.67 | 58.43 |
