This directory contains the code and dataset release for *DiscussLLM: Teaching Large Language Models When to Speak*.
Repository layout:

- src/discussllm: dataset loaders and collators used for generator and classifier fine-tuning
- scripts: minimal training and turn-by-turn inference entry points
- data_generation: prompts and local generation utilities
- configs/deepspeed: DeepSpeed configs used by the training scripts
- hf_dataset: Hugging Face dataset package with group discussions and split files
Setup:

mamba env create -f discussllm.yml
mamba activate discussllm
pip install -e .

Clone the dataset into the root of the repo.
The raw discussion transcripts are in:
hf_dataset/data/generated_discussion_data
Splits:
hf_dataset/data/train_data.txt
hf_dataset/data/test_data.txt
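
The split files under hf_dataset/data are plain text. As a quick way to inspect them, the sketch below assumes each split file lists one transcript identifier per line; that format is an assumption here, not something this README documents.

```python
from pathlib import Path

data_root = Path("hf_dataset/data")

def read_split(path: Path) -> list[str]:
    # Assumed format: one transcript identifier (or filename) per non-empty line.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

train_ids = read_split(data_root / "train_data.txt")
test_ids = read_split(data_root / "test_data.txt")
print(f"{len(train_ids)} train / {len(test_ids)} test entries")
```
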
Train the generator:

deepspeed scripts/train_generator.py \
--deepspeed configs/deepspeed/zero1.json \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--data_root hf_dataset/data/generated_discussion_data \
--output_dir outputs/generator \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--gradient_checkpointing \
--learning_rate 0.0002 \
--bf16 \
--tf32 \
--use_lora
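
The --use_lora flag suggests the generator is fine-tuned with LoRA adapters via the peft library rather than full-parameter updates. A minimal sketch of what that usually involves is below; the rank, alpha, and target modules are illustrative assumptions, not the values used by scripts/train_generator.py.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Illustrative LoRA hyperparameters; the repo's actual settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```
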
Train the classifier:

deepspeed scripts/train_classifier.py \
--deepspeed configs/deepspeed/zero1.json \
--model_name_or_path roberta-base \
--data_root hf_dataset/data/generated_discussion_data \
--output_dir outputs/classifier \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--learning_rate 0.00001
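
Both commands point at configs/deepspeed/zero1.json; the file shipped in the repo is authoritative. For orientation, a minimal ZeRO stage 1 config that defers batch size and precision settings to the Hugging Face Trainer arguments typically looks like this sketch:

```json
{
  "zero_optimization": { "stage": 1 },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
```
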
The same commands are available as:

bash scripts/train_generator_deepspeed.sh
bash scripts/train_classifier_deepspeed.sh

Zero-shot evaluation:
bash scripts/eval_zeroshot.sh

Inference with the fine-tuned models:

- End to end:
python scripts/infer_generator.py \
--base_model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--model_name_or_path outputs/generator

- Classifier:
python scripts/infer_classifier.py --model_name_or_path outputs/classifier
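
Both commands above wrap the underlying Hugging Face models. For a rough picture of how the two pieces fit together (the classifier decides when to speak, the generator produces the turn), here is a sketch that loads both and runs a single decide-then-respond step. The label convention, the assumption that outputs/generator is a LoRA/PEFT checkpoint, and the prompt handling are all assumptions here; the authoritative logic lives in scripts/infer_classifier.py and scripts/infer_generator.py.

```python
import torch
from peft import PeftModel  # assumes outputs/generator is a PEFT/LoRA checkpoint
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Classifier: decides whether the model should speak at this point in the discussion.
clf_tok = AutoTokenizer.from_pretrained("outputs/classifier")
clf = AutoModelForSequenceClassification.from_pretrained("outputs/classifier").eval()

# Generator: base model with the fine-tuned adapter attached.
gen_tok = AutoTokenizer.from_pretrained(BASE)
gen = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    "outputs/generator",
).eval()

def step(discussion_so_far: str) -> str | None:
    """Return a generated turn if the classifier says to speak, otherwise None."""
    enc = clf_tok(discussion_so_far, return_tensors="pt", truncation=True)
    with torch.no_grad():
        speak = clf(**enc).logits.argmax(dim=-1).item() == 1  # assumed label: 1 = speak
    if not speak:
        return None
    prompt = gen_tok(discussion_so_far, return_tensors="pt").to(gen.device)
    with torch.no_grad():
        out = gen.generate(**prompt, max_new_tokens=128, do_sample=False)
    return gen_tok.decode(out[0][prompt["input_ids"].shape[-1]:], skip_special_tokens=True)
```
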