
AI CUP 2024 - Applications of RAG and LLM in Financial Question Answering

This repository implements a retrieval pipeline for Chinese financial QA across three knowledge domains:

  • insurance
  • finance
  • faq

The system predicts one source document id for each question, following the competition output format.

Project Goal

Given a question and a candidate source id list, retrieve the most relevant document id from the provided references.

Input:

  • question text
  • category
  • candidate source ids

Output:

  • one predicted retrieve id per question

Repository Structure

  • main.py: main entry point, retrieval selection, prediction export, local evaluation
  • retrieve.py: data loading, PDF parsing/OCR, preprocessing, segmentation, BM25, embedding, reranker pipelines
  • requirements.txt: Python dependencies
  • dataset/questions_example.json: sample question input format
  • dataset/ground_truths_example.json: sample ground truth format
  • reference/faq/pid_map_content.json: FAQ mapping
  • reference/insurance and reference/finance: source PDF documents
  • result/: output predictions

Data Format

1) Question File Format

From dataset/questions_example.json

Each item in questions contains:

  • qid: integer question id
  • source: list of candidate document ids
  • query: question text
  • category: one of insurance, finance, faq

Example:

{
  "qid": 1,
  "source": [442, 115, 440, 196],
  "query": "匯款銀行及中間行所收取之相關費用由誰負擔?",
  "category": "insurance"
}

(The sample query asks: "Who bears the fees charged by the remitting bank and intermediary banks?")
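Questions in this format can be loaded with a few lines of standard-library code. This is a minimal sketch, not the repository's actual loader; it assumes only the top-level "questions" key shown in dataset/questions_example.json:

```python
import json

def load_questions(path):
    """Load the competition question file and return its question items.

    Assumes the top-level key "questions", as in dataset/questions_example.json.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["questions"]
```

Each returned item is a dict with qid, source, query, and category, ready to dispatch to a category-specific retriever.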

2) Ground Truth Format

From dataset/ground_truths_example.json

Each item in ground_truths contains:

  • qid
  • retrieve (correct doc id)
  • category

3) Submission / Prediction Format

From result/pred_retrieve_embedding_reranker.json

Required top-level key:

  • answers

Each prediction item:

  • qid
  • retrieve

Example:

{
  "answers": [
    {"qid": 1, "retrieve": 392},
    {"qid": 2, "retrieve": 428}
  ]
}
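Writing this format is straightforward. The helper below is an illustrative sketch (the function name is not from the repository); it takes (qid, retrieve) pairs and emits the required top-level "answers" key:

```python
import json

def write_predictions(answers, output_path):
    """Write predictions in the competition submission format.

    `answers` is an iterable of (qid, retrieve) pairs.
    """
    payload = {"answers": [{"qid": int(qid), "retrieve": int(doc_id)}
                           for qid, doc_id in answers]}
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps any Chinese text readable in the output file
        json.dump(payload, f, ensure_ascii=False, indent=2)
```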

Retrieval Methods Implemented

  • Various implementations defined in retrieve.py
    • retrieve_BM25()
    • retrieve_BM25_segment()
    • retrieve_BM25_Embedding()
    • retrieve_BM25_Reranker()
    • retrieve_BM25_Embedding_Reranker()
    • retrieve_onlyEmbedding()
    • retrieve_onlyReranker()
    • retrieve_Embedding_Reranker()
  • Default selection is controlled in main.py
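The BM25-based variants above presumably use a BM25 library under the hood; purely as an illustration of the scoring they share, here is a minimal self-contained BM25 ranker over pre-tokenized documents. Character-level tokens are used as a crude stand-in for Chinese word segmentation; the function name and defaults are illustrative, not the repository's API:

```python
import math
from collections import Counter

def bm25_best(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each candidate document against the query with BM25-style
    term weighting and return the index of the best-scoring document."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # document frequency of each term across the candidate set
    df = Counter(t for d in docs_tokens for t in set(d))

    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            s += idf * norm
        return s

    scores = [score(d) for d in docs_tokens]
    return max(range(n_docs), key=scores.__getitem__)
```

The embedding and reranker variants replace or refine this lexical score with dense-vector similarity and CrossEncoder scores, respectively.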

Pipeline Summary

  1. Load question set.
  2. Load category-specific corpus dictionaries.
  3. Preprocess text and split into chunks.
  4. Retrieve candidates using BM25 and/or embedding similarity.
  5. Re-rank with CrossEncoder when enabled.
  6. Output answers in competition JSON format.
  7. Compare with ground truth for local precision (if available).

Environment Setup

  • Recommended: Python 3.10
  • Install command: pip install -r requirements.txt
  • This project uses pytesseract in retrieve.py, so the Tesseract OCR engine must also be installed on your OS and available on PATH.
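A quick sanity check before a long OCR run (a small helper for illustration, not part of the repository):

```python
import shutil

def tesseract_available():
    """Return True if the `tesseract` binary is discoverable on PATH."""
    return shutil.which("tesseract") is not None

if not tesseract_available():
    print("Warning: Tesseract not found on PATH; PDF OCR in retrieve.py will fail.")
```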

Running the Project

1) Current default execution

main.py currently runs new_main with hardcoded paths:

  • question_path = questions_example.json
  • source_path = ./reference/
  • truth_path = ground_truths_example.json
  • output_path = result/pred_retrieve_embedding_reranker.json

Run: python main.py

2) CLI-style execution (baseline style)

main.py also includes original_main with argparse for:

  • --question_path
  • --source_path
  • --output_path

If you switch to original_main in the entry point, run: python main.py --question_path <path> --source_path <path> --output_path <path>
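The baseline interface amounts to a parser like the following sketch, matching the three flags listed above (the exact help strings and defaults in main.py may differ):

```python
import argparse

def parse_args(argv=None):
    """Arguments matching the original_main baseline interface."""
    parser = argparse.ArgumentParser(description="Financial QA retrieval")
    parser.add_argument("--question_path", required=True, help="path to the question JSON")
    parser.add_argument("--source_path", required=True, help="root folder of reference data")
    parser.add_argument("--output_path", required=True, help="where to write predictions")
    return parser.parse_args(argv)
```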

Important Path Note

In main.py, new_main() loads:

  • corpus_dict_insurance_fitz_ocr.json
  • corpus_dict_finance_fitz_ocr.json
  • corpus_dict_faq.json

Those files are currently located under result/processed_dict/, so make sure source_path points to that folder when using new_main, or move/copy these files accordingly.

Evaluation

In main.py, new_main() computes:

precision = correct_count / total_count

and prints:

  • Correct count / Total count
  • Precision with 7 decimal places

This requires a local truth file, e.g. ground_truths_example.json.
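The same computation can be sketched as follows, assuming the ground-truth file uses the top-level "ground_truths" key described above (the function name is illustrative):

```python
import json

def local_precision(pred_path, truth_path):
    """Compare predicted retrieve ids with ground truth; print and return precision."""
    with open(pred_path, encoding="utf-8") as f:
        preds = {a["qid"]: a["retrieve"] for a in json.load(f)["answers"]}
    with open(truth_path, encoding="utf-8") as f:
        truths = json.load(f)["ground_truths"]
    correct = sum(1 for t in truths if preds.get(t["qid"]) == t["retrieve"])
    total = len(truths)
    print(f"{correct} / {total}")
    print(f"Precision: {correct / total:.7f}")
    return correct / total
```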

Baseline and Experiment Notes

experiment.md includes historical comments showing performance progression from BM25 baseline to reranking-based improvements.

Suggested Competition Workflow

  1. Prepare official question JSON with the same schema as questions_example.json.
  2. Select one retrieval function in main.py.
  3. Run inference and generate prediction JSON.
  4. Validate JSON format before submission.
  5. Keep experiment outputs under dataset/preliminary for tracking.
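Step 4 (format validation) can be done with a small check before submitting; this is a sketch of the obvious invariants, not an official validator:

```python
def validate_submission(payload):
    """Check the competition format: a top-level "answers" list whose items
    each carry an integer qid and an integer retrieve id."""
    if not isinstance(payload, dict) or not isinstance(payload.get("answers"), list):
        return False
    return all(
        isinstance(item, dict)
        and isinstance(item.get("qid"), int)
        and isinstance(item.get("retrieve"), int)
        for item in payload["answers"]
    )
```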

Acknowledgment

This project is built for a financial QA retrieval competition and demonstrates practical hybrid retrieval design:

  • lexical retrieval (BM25)
  • semantic retrieval (embeddings)
  • neural reranking (CrossEncoder)

About

2024 AI CUP Competition on the E.SUN AI Open Challenge: Applications of RAG and LLM in Financial Question Answering
