This repository implements a retrieval pipeline for Chinese financial QA across three knowledge domains:
- insurance
- finance
- faq
The system predicts one source document id for each question, following the competition output format.
Given a question and a candidate source id list, retrieve the most relevant document id from the provided references.
Input:
- question text
- category
- candidate source ids
Output:
- one predicted retrieve id per question
- `main.py`: main entry point, retrieval selection, prediction export, local evaluation
- `retrieve.py`: data loading, PDF parsing/OCR, preprocessing, segmentation, BM25, embedding, reranker pipelines
- `requirements.txt`: Python dependencies
- `dataset/questions_example.json`: sample question input format
- `dataset/ground_truths_example.json`: sample ground truth format
- `reference/faq/pid_map_content.json`: FAQ mapping
- `reference/insurance` and `reference/finance`: source PDF documents
- `result/`: output predictions
From dataset/questions_example.json
Each item in questions contains:
- `qid`: integer question id
- `source`: list of candidate document ids
- `query`: question text
- `category`: one of `insurance`, `finance`, `faq`
Example:
```json
{
  "qid": 1,
  "source": [442, 115, 440, 196],
  "query": "匯款銀行及中間行所收取之相關費用由誰負擔?",
  "category": "insurance"
}
```
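As an illustration of consuming this schema, here is a minimal sketch that loads a question file and groups questions by category (the function name `load_questions` and the default path are hypothetical, not part of the repository):

```python
import json
from collections import defaultdict

def load_questions(path="dataset/questions_example.json"):
    """Load the question file and group items by category.

    Illustrative helper; key names follow the schema shown above.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    by_category = defaultdict(list)
    for item in data["questions"]:
        by_category[item["category"]].append(item)
    return by_category
```

Grouping by category mirrors how the pipeline selects a category-specific corpus for each question.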
From dataset/ground_truths_example.json
Each item in ground_truths contains:
- `qid`
- `retrieve`: correct document id
- `category`
From result/pred_retrieve_embedding_reranker.json
Required top-level key:
answers
Each prediction item:
- `qid`
- `retrieve`
Example:
```json
{
  "answers": [
    {"qid": 1, "retrieve": 392},
    {"qid": 2, "retrieve": 428}
  ]
}
```
- Various retrieval implementations are defined in `retrieve.py`:
  - `retrieve_BM25()`
  - `retrieve_BM25_segment()`
  - `retrieve_BM25_Embedding()`
  - `retrieve_BM25_Reranker()`
  - `retrieve_BM25_Embedding_Reranker()`
  - `retrieve_onlyEmbedding()`
  - `retrieve_onlyReranker()`
  - `retrieve_Embedding_Reranker()`
- Default selection is controlled in main.py
- Load question set.
- Load category-specific corpus dictionaries.
- Preprocess text and split into chunks.
- Retrieve candidates using BM25 and/or embedding similarity.
- Re-rank with `CrossEncoder` when enabled.
- Output answers in competition JSON format.
- Compare with ground truth for local precision (if available).
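The lexical retrieval stage above can be sketched as a self-contained BM25 scorer restricted to a question's candidate `source` ids. This is a minimal illustration, not the repository's actual implementation (which lives in `retrieve.py`); the function name and parameters are hypothetical:

```python
import math
from collections import Counter

def bm25_rank(query_tokens, candidate_docs, k1=1.5, b=0.75):
    """Score tokenized candidate docs against a tokenized query with BM25.

    candidate_docs: {doc_id: [token, ...]} restricted to the question's
    `source` list. Returns the doc id with the highest BM25 score,
    i.e. the predicted `retrieve` id.
    """
    N = len(candidate_docs)
    avgdl = sum(len(tokens) for tokens in candidate_docs.values()) / N
    df = Counter()  # document frequency per term
    for tokens in candidate_docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in candidate_docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
        scores[doc_id] = score
    return max(scores, key=scores.get)
```

In the full pipeline, this lexical score would be combined with embedding similarity and optionally re-ranked by a cross-encoder before the final id is chosen.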
- Recommended: Python 3.10
- Install command: `pip install -r requirements.txt`
- This project uses `pytesseract` in `retrieve.py`, so the Tesseract OCR engine must also be installed on your OS and available in `PATH`.
`main.py` currently runs `new_main` with hardcoded paths:

```
question_path = questions_example.json
source_path   = ./reference/
truth_path    = ground_truths_example.json
output_path   = result/pred_retrieve_embedding_reranker.json
```
Run:

```
python main.py
```
`main.py` also includes `original_main` with argparse support for:
- --question_path
- --source_path
- --output_path
If you switch the entry point to `original_main`, run:

```
python main.py --question_path <path> --source_path <path> --output_path <path>
```
In `main.py`, `new_main()` loads:
- `corpus_dict_insurance_fitz_ocr.json`
- `corpus_dict_finance_fitz_ocr.json`
- `corpus_dict_faq.json`
Those files are currently located under result/processed_dict/, so make sure source_path points to that folder when using new_main, or move/copy these files accordingly.
In `main.py`, `new_main()` computes:

```
precision = correct_count / total_count
```
and prints:
- Correct count / Total count
- Precision with 7 decimal places
This requires a local truth file, e.g. ground_truths_example.json.
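The evaluation described above can be sketched as follows. This is an illustrative helper (the function name `local_precision` is hypothetical), assuming the prediction and ground-truth files follow the schemas shown earlier:

```python
import json

def local_precision(pred_path, truth_path):
    """Compare a prediction file against a local ground-truth file.

    Assumes the prediction file has a top-level `answers` key and the
    truth file a top-level `ground_truths` key, per the formats above.
    """
    with open(pred_path, encoding="utf-8") as f:
        preds = {a["qid"]: a["retrieve"] for a in json.load(f)["answers"]}
    with open(truth_path, encoding="utf-8") as f:
        truths = {g["qid"]: g["retrieve"] for g in json.load(f)["ground_truths"]}
    correct = sum(1 for qid, doc_id in truths.items() if preds.get(qid) == doc_id)
    total = len(truths)
    print(f"Correct count / Total count: {correct} / {total}")
    print(f"Precision: {correct / total:.7f}")  # 7 decimal places, as in new_main()
    return correct / total
```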
experiment.md includes historical comments showing performance progression from BM25 baseline to reranking-based improvements.
- Prepare official question JSON with the same schema as questions_example.json.
- Select one retrieval function in main.py.
- Run inference and generate prediction JSON.
- Validate JSON format before submission.
- Keep experiment outputs under dataset/preliminary for tracking.
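The validation step above can be a lightweight schema check before submission. A minimal sketch (the helper name `validate_submission` is hypothetical, not part of the repository):

```python
import json

def validate_submission(path):
    """Check the prediction JSON against the required output format:
    a top-level `answers` list whose items each carry an integer
    `qid` and an integer `retrieve`.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data.get("answers"), list), "missing 'answers' list"
    for item in data["answers"]:
        assert isinstance(item.get("qid"), int), f"bad qid in {item}"
        assert isinstance(item.get("retrieve"), int), f"bad retrieve in {item}"
    return True
```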
This project is built for a financial QA retrieval competition and demonstrates practical hybrid retrieval design:
- lexical retrieval (BM25)
- semantic retrieval (embeddings)
- neural reranking (CrossEncoder)