This repository implements a retrieval pipeline for Chinese financial QA across three knowledge domains:
- insurance
- finance
- faq
The system predicts one source document id for each question, following the competition output format.
Given a question and a candidate source id list, retrieve the most relevant document id from the provided references.
Input:
- question text
- category
- candidate source ids
Output:
- one predicted retrieve id per question
- `main.py`: main entry point, retrieval selection, prediction export, local evaluation
- `retrieve.py`: data loading, PDF parsing/OCR, preprocessing, segmentation, BM25, embedding, reranker pipelines
- `requirements.txt`: Python dependencies
- `dataset/questions_example.json`: sample question input format
- `dataset/ground_truths_example.json`: sample ground truth format
- `reference/faq/pid_map_content.json`: FAQ mapping
- `reference/insurance` and `reference/finance`: source PDF documents
- `result/`: output predictions
From dataset/questions_example.json
Each item in questions contains:
- `qid`: integer question id
- `source`: list of candidate document ids
- `query`: question text
- `category`: one of `insurance`, `finance`, `faq`
Example:
```json
{
  "qid": 1,
  "source": [442, 115, 440, 196],
  "query": "匯款銀行及中間行所收取之相關費用由誰負擔?",
  "category": "insurance"
}
```
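As an illustration of consuming this schema, here is a minimal sketch that loads a question file and groups questions by category (the function name `load_questions` and the default path are hypothetical, not part of the repository):

```python
import json
from collections import defaultdict

def load_questions(path="dataset/questions_example.json"):
    """Load the question file and group items by category.

    Illustrative helper; key names follow the schema shown above.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    by_category = defaultdict(list)
    for item in data["questions"]:
        by_category[item["category"]].append(item)
    return by_category
```

Grouping by category mirrors how the pipeline selects a category-specific corpus for each question.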
From dataset/ground_truths_example.json
Each item in ground_truths contains:
- `qid`
- `retrieve`: correct document id
- `category`
From result/pred_retrieve_embedding_reranker.json
Required top-level key:
answers
Each prediction item:
- `qid`
- `retrieve`
Example:
```json
{
  "answers": [
    {"qid": 1, "retrieve": 392},
    {"qid": 2, "retrieve": 428}
  ]
}
```
- Various retrieval implementations are defined in `retrieve.py`:
  - `retrieve_BM25()`
  - `retrieve_BM25_segment()`
  - `retrieve_BM25_Embedding()`
  - `retrieve_BM25_Reranker()`
  - `retrieve_BM25_Embedding_Reranker()`
  - `retrieve_onlyEmbedding()`
  - `retrieve_onlyReranker()`
  - `retrieve_Embedding_Reranker()`
- Default selection is controlled in main.py
- Load question set.
- Load category-specific corpus dictionaries.
- Preprocess text and split into chunks.
- Retrieve candidates using BM25 and/or embedding similarity.
- Re-rank with `CrossEncoder` when enabled.
- Output answers in competition JSON format.
- Compare with ground truth for local precision (if available).
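The lexical retrieval stage above can be sketched as a self-contained BM25 scorer restricted to a question's candidate `source` ids. This is a minimal illustration, not the repository's actual implementation (which lives in `retrieve.py`); the function name and parameters are hypothetical:

```python
import math
from collections import Counter

def bm25_rank(query_tokens, candidate_docs, k1=1.5, b=0.75):
    """Score tokenized candidate docs against a tokenized query with BM25.

    candidate_docs: {doc_id: [token, ...]} restricted to the question's
    `source` list. Returns the doc id with the highest BM25 score,
    i.e. the predicted `retrieve` id.
    """
    N = len(candidate_docs)
    avgdl = sum(len(tokens) for tokens in candidate_docs.values()) / N
    df = Counter()  # document frequency per term
    for tokens in candidate_docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in candidate_docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
        scores[doc_id] = score
    return max(scores, key=scores.get)
```

In the full pipeline, this lexical score would be combined with embedding similarity and optionally re-ranked by a cross-encoder before the final id is chosen.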
- Recommended: Python 3.10
- Install command: `pip install -r requirements.txt`
- This project uses `pytesseract` in `retrieve.py`, so the Tesseract OCR engine must also be installed on your OS and available in `PATH`.
`main.py` currently runs `new_main` with hardcoded paths:

```
question_path = questions_example.json
source_path   = ./reference/
truth_path    = ground_truths_example.json
output_path   = result/pred_retrieve_embedding_reranker.json
```
Run:

```
python main.py
```
`main.py` also includes `original_main` with argparse support for:
- --question_path
- --source_path
- --output_path
If you switch the entry point to `original_main`, run:

```
python main.py --question_path <path> --source_path <path> --output_path <path>
```
In `main.py`, `new_main()` loads:
- `corpus_dict_insurance_fitz_ocr.json`
- `corpus_dict_finance_fitz_ocr.json`
- `corpus_dict_faq.json`
Those files are currently located under result/processed_dict/, so make sure source_path points to that folder when using new_main, or move/copy these files accordingly.
In `main.py`, `new_main()` computes:

```
precision = correct_count / total_count
```
and prints:
- Correct count / Total count
- Precision with 7 decimal places
This requires a local truth file, e.g. ground_truths_example.json.
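The evaluation described above can be sketched as follows. This is an illustrative helper (the function name `local_precision` is hypothetical), assuming the prediction and ground-truth files follow the schemas shown earlier:

```python
import json

def local_precision(pred_path, truth_path):
    """Compare a prediction file against a local ground-truth file.

    Assumes the prediction file has a top-level `answers` key and the
    truth file a top-level `ground_truths` key, per the formats above.
    """
    with open(pred_path, encoding="utf-8") as f:
        preds = {a["qid"]: a["retrieve"] for a in json.load(f)["answers"]}
    with open(truth_path, encoding="utf-8") as f:
        truths = {g["qid"]: g["retrieve"] for g in json.load(f)["ground_truths"]}
    correct = sum(1 for qid, doc_id in truths.items() if preds.get(qid) == doc_id)
    total = len(truths)
    print(f"Correct count / Total count: {correct} / {total}")
    print(f"Precision: {correct / total:.7f}")  # 7 decimal places, as in new_main()
    return correct / total
```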
experiment.md includes historical comments showing performance progression from BM25 baseline to reranking-based improvements.
- Prepare official question JSON with the same schema as questions_example.json.
- Select one retrieval function in main.py.
- Run inference and generate prediction JSON.
- Validate JSON format before submission.
- Keep experiment outputs under dataset/preliminary for tracking.
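The validation step above can be a lightweight schema check before submission. A minimal sketch (the helper name `validate_submission` is hypothetical, not part of the repository):

```python
import json

def validate_submission(path):
    """Check the prediction JSON against the required output format:
    a top-level `answers` list whose items each carry an integer
    `qid` and an integer `retrieve`.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data.get("answers"), list), "missing 'answers' list"
    for item in data["answers"]:
        assert isinstance(item.get("qid"), int), f"bad qid in {item}"
        assert isinstance(item.get("retrieve"), int), f"bad retrieve in {item}"
    return True
```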
This project is built for a financial QA retrieval competition and demonstrates practical hybrid retrieval design:
- lexical retrieval (BM25)
- semantic retrieval (embeddings)
- neural reranking (CrossEncoder)