Document Search Solution with OCR #23

@SeboCode

Description

Paperless seems like the ideal candidate:

  • Append-only
  • Can be used as a secondary store alongside the main git-based document store
  • Watches an inbox directory into which new documents can be moved (set PAPERLESS_CONSUMER_DELETE_DUPLICATES so that duplicates are deleted)
  • Is flat and compares documents only by hash, so moving documents within the main store has no effect on the index
  • Deletions in the main store are not propagated with the secondary-store approach, but they are rare, and if the index drifts too far the Paperless index can be rebuilt at any time

As a result, a script can be implemented that clones the main store and moves the documents into the inbox, where Paperless deletes anything that is a duplicate and indexes the documents that are new. Paperless effectively becomes a web-based search index for personal documents that can be destroyed and rebuilt at any time, just like a read model in CQRS.
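
A minimal sketch of such a sync script, in Python. The function names `clone_store` and `stage_documents` and the hash-prefixed file naming are assumptions made for illustration, not part of Paperless. Files are copied rather than moved because the clone is a throwaway working copy, and duplicate handling is left entirely to Paperless.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def clone_store(repo_url: str, target: Path) -> Path:
    """Shallow-clone the main git-based document store into `target`."""
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, str(target)],
        check=True,
    )
    return target


def stage_documents(store_dir: Path, inbox_dir: Path) -> list[Path]:
    """Copy every file of the cloned store into the flat inbox directory.

    No duplicate detection happens here: the idea from the issue is that
    Paperless compares documents by hash and discards what it already knows.
    """
    inbox_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(store_dir.rglob("*")):
        rel = src.relative_to(store_dir)
        # Skip directories and git's own bookkeeping files.
        if not src.is_file() or ".git" in rel.parts:
            continue
        # The inbox is flat, so two files with the same name in different
        # folders would collide. Prefix each name with a short hash of its
        # relative path (a naming scheme assumed here for illustration).
        prefix = hashlib.sha256(rel.as_posix().encode()).hexdigest()[:8]
        dest = inbox_dir / f"{prefix}_{src.name}"
        shutil.copy2(src, dest)
        staged.append(dest)
    return staged
```

A run would then look roughly like `stage_documents(clone_store(repo_url, tmp_dir), inbox_dir)`, where `repo_url`, `tmp_dir` and `inbox_dir` are placeholders for the real repository URL, a temporary directory and the directory Paperless observes.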

Points to consider:

  • Maybe use rsync to copy over only documents of a type that Paperless can understand.
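
The rsync idea could be sketched as follows, using rsync's include/exclude filter rules, where the first matching rule wins. The helper name `build_rsync_command` and the extension list are assumptions: the list should be checked against the file types the installed Paperless version actually accepts. rsync patterns are case-sensitive, so each extension is added in lower and upper case.

```python
import subprocess
from pathlib import Path

# Assumed list of extensions; verify against the Paperless installation.
ASSUMED_EXTENSIONS = ("pdf", "png", "jpg", "jpeg", "tif", "tiff", "txt")


def build_rsync_command(store_dir, inbox_dir, extensions=ASSUMED_EXTENSIONS):
    """Build an rsync call that copies only files with the given extensions."""
    cmd = [
        "rsync",
        "--archive",
        "--prune-empty-dirs",  # do not create directories that end up empty
        "--exclude=.git/",     # never descend into git metadata
        "--include=*/",        # descend into all other directories
    ]
    for ext in extensions:
        cmd.append(f"--include=*.{ext}")
        cmd.append(f"--include=*.{ext.upper()}")
    cmd.append("--exclude=*")  # everything not matched above is skipped
    # A trailing slash on the source copies its contents, not the directory.
    cmd += [f"{Path(store_dir)}/", f"{Path(inbox_dir)}/"]
    return cmd


def sync_supported_documents(store_dir, inbox_dir):
    """Run the rsync command; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(store_dir, inbox_dir), check=True)
```

Unlike a flat copy, rsync keeps the directory layout of the main store, so the Paperless consumer would have to pick up files from subdirectories (paperless-ngx has a PAPERLESS_CONSUMER_RECURSIVE setting for this).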

Labels

  • new-service: Provision a new service
  • priority-high: High priority, should be worked on before any other issues.
