Document Search Solution with OCR #23
Open
Labels
- new-service: Provision a new service
- priority-high: High priority, should be worked on before any other issues.
Description
Paperless seems like the ideal candidate:
- Append-Only
- Can be used as a secondary store to the main git based document store
- Watches an inbox directory into which new documents can be moved (set PAPERLESS_CONSUMER_DELETE_DUPLICATES so that duplicates are deleted)
- Is flat and compares documents only by hash, so moving documents within the main store has no effect on the index
- File deletions are not reflected correctly with the secondary-store approach, but they are rare anyway, and if it gets out of hand, the Paperless index can be rebuilt at any time
As a result, a script can be implemented that clones the main store, moves the documents into the inbox, and lets Paperless delete anything that is a duplicate and index the documents that are new. Paperless effectively becomes a web-based search index for personal documents that can be destroyed and rebuilt at any time, just like a read model in CQRS.
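A minimal sketch of such a script, in Python, might look like the following. It is not a finished implementation: the function names `copy_into_inbox` and `sync`, the `__` separator used to flatten paths, and the shallow clone are all assumptions made for illustration. It copies from a temporary clone rather than moving files, which has the same effect on the inbox.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def copy_into_inbox(checkout: Path, inbox: Path) -> int:
    """Copy every file of the checkout into the flat inbox; return the count."""
    inbox.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in sorted(checkout.rglob("*")):
        rel = path.relative_to(checkout)
        # Skip directories and everything that belongs to git itself.
        if not path.is_file() or ".git" in rel.parts:
            continue
        # Flatten the relative path into one file name (hypothetical naming
        # scheme) so equally named files from different folders do not clash.
        target = inbox / "__".join(rel.parts)
        shutil.copy2(path, target)
        copied += 1
    return copied


def sync(repo_url: str, inbox: Path) -> int:
    """Clone the main store into a temp dir and feed its files to the inbox."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "store"
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(checkout)],
            check=True,
        )
        return copy_into_inbox(checkout, inbox)
```

Every run hands the complete store to the inbox. The sketch relies on Paperless, with PAPERLESS_CONSUMER_DELETE_DUPLICATES enabled, to discard whatever it has already indexed.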
Points to consider:
- Maybe use rsync to only copy over documents of a type that paperless can understand.
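The rsync idea could be sketched as below, using rsync's include/exclude filters driven from Python. The extension list in `ASSUMED_EXTENSIONS` is an assumption, not a verified list of what Paperless accepts, and `build_rsync_command` and the staging directory are hypothetical names introduced for this example.

```python
import subprocess
from pathlib import Path

# Assumed set of file types; adjust to what the Paperless instance can consume.
ASSUMED_EXTENSIONS = ("pdf", "png", "jpg", "jpeg", "tif", "tiff")


def build_rsync_command(src: Path, dst: Path, extensions=ASSUMED_EXTENSIONS) -> list[str]:
    """Build an rsync call that copies only files with the given extensions."""
    # rsync applies the first matching filter rule, so the order matters:
    # skip .git, descend into all other directories, keep wanted files,
    # then exclude everything else.
    cmd = ["rsync", "-a", "--prune-empty-dirs", "--exclude=.git/", "--include=*/"]
    cmd += [f"--include=*.{ext}" for ext in extensions]
    # Trailing slashes: copy the contents of src into dst.
    cmd += ["--exclude=*", f"{src}/", f"{dst}/"]
    return cmd


def copy_supported(src: Path, dst: Path) -> None:
    """Run the filtered copy; requires rsync to be installed."""
    subprocess.run(build_rsync_command(src, dst), check=True)
```

Note that rsync patterns are case-sensitive, so a file named `SCAN.PDF` would be skipped unless `*.PDF` is included as well.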