Improve performance for proofread_canonicals()#258

Merged
AA-Turner merged 2 commits into main from proofread-perf
Apr 11, 2025
Conversation

@AA-Turner
Member

Currently, proofread_canonicals() takes c. 4-5 minutes, with the vast majority of that time spent reading the files from disk. This PR reduces the run time to c. 100-120 seconds by using multiple threads to check the files. We also switch from the re module to bytes methods for a further slight improvement, as this avoids Unicode encoding/decoding.
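
To illustrate the two ideas described above (checking files from a thread pool, and searching raw bytes instead of decoding to str and using re), here is a minimal sketch. It is not the code from this PR: the names `PREFIX`, `extract_target`, and `extract_all`, the marker being searched for, and the default worker count are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical marker to search for; the actual check performed by
# proofread_canonicals() is not shown in this PR description.
PREFIX = b'<link rel="canonical" href="'


def extract_target(path: Path) -> bytes | None:
    """Return the bytes between PREFIX and the next double quote, or None."""
    data = path.read_bytes()  # raw bytes, so no Unicode decoding happens
    start = data.find(PREFIX)
    if start == -1:
        return None
    start += len(PREFIX)
    end = data.find(b'"', start)
    if end == -1:
        return None
    return data[start:end]


def extract_all(paths: list[Path], max_workers: int = 8) -> dict[Path, bytes | None]:
    """Run extract_target over all paths using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        return dict(zip(paths, pool.map(extract_target, paths)))
```

Threads suit this kind of workload because the time goes to disk reads, during which the interpreter releases the GIL, so several reads can be in flight at once.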

A

@AA-Turner AA-Turner requested a review from hugovk April 11, 2025 04:27
@AA-Turner AA-Turner merged commit e80b729 into main Apr 11, 2025
6 checks passed
@AA-Turner AA-Turner deleted the proofread-perf branch April 11, 2025 13:10
2 participants