This repository implements training of EfficientDet on multiple HPUs using Horovod, with support for synchronized batch normalization. The external source for this code is: https://github.com/itsliupeng/automl

[1] Mingxing Tan, Ruoming Pang, Quoc V. Le. EfficientDet: Scalable and Efficient Object Detection. CVPR 2020. arXiv: https://arxiv.org/abs/1911.09070
Please follow the instructions in the Gaudi Setup and Installation Guide to set up the environment, including the $PYTHON environment variable. Answer the questions in the guide according to your preferences. The guide will walk you through setting up your system to run the model on Gaudi.
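For illustration, the $PYTHON variable simply points at the interpreter used by your Gaudi setup; the exact path below is an assumption and depends on your installation:

# Illustrative only: adjust the path to match your environment
export PYTHON=/usr/bin/python3.8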
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. (Run the hl-smi utility to determine the SynapseAI version.)

git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References

Go to the EfficientDet directory:

cd Model-References/TensorFlow/computer_vision/efficientdet

Download the COCO 2017 dataset:

cd dataset
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip && rm train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip && rm val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
unzip test2017.zip && rm test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip && rm annotations_trainval2017.zip
# return to the efficientdet directory, since the conversion commands below use ./dataset paths
cd ..
mkdir tfrecord
# training data
PYTHONPATH=".:$PYTHONPATH" python3 ./dataset/create_coco_tfrecord.py \
--image_dir=./dataset/train2017 \
--caption_annotations_file=./dataset/annotations/captions_train2017.json \
--output_file_prefix=tfrecord/train \
--num_shards=32
# validation data
PYTHONPATH=".:$PYTHONPATH" python3 ./dataset/create_coco_tfrecord.py \
--image_dir=./dataset/val2017 \
--caption_annotations_file=./dataset/annotations/captions_val2017.json \
--output_file_prefix=tfrecord/val \
--num_shards=32
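As an optional sanity check, the expected shard counts follow from --num_shards=32 above:

# Each conversion should produce 32 shards
ls tfrecord/train-* | wc -l   # expected: 32
ls tfrecord/val-* | wc -l     # expected: 32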
Download the pretrained EfficientNet-B0 backbone:

cd backbone
wget https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b0.tar.gz
tar -xvzf efficientnet-b0.tar.gz
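The archive should unpack into an efficientnet-b0 directory containing the checkpoint files; a quick check (exact file names may vary by checkpoint release):

# Expect TensorFlow checkpoint data, e.g. model.ckpt.* files
ls efficientnet-b0/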
Add the Model-References path to PYTHONPATH:
export PYTHONPATH=/path/to/Model-References:$PYTHONPATH
demo_efficientdet.py is a distributed launcher for main.py. It accepts the same arguments as main.py and runs main.py [ARGS] --use_horovod via mpirun with a generated HCL config.
Run on a single HPU:
python demo_efficientdet.py [options]
Run on multiple HPUs:
python demo_efficientdet.py --use_horovod [options]
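Conceptually, the multi-HPU invocation above is roughly equivalent to the following mpirun command. This is a simplified sketch, not the exact command the launcher builds; the actual launcher also generates an HCL config and sets additional environment variables:

# Simplified illustration of what the launcher runs under the hood
mpirun -np 8 $PYTHON main.py [ARGS] --use_horovod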
Available options (none is mandatory):
demo_efficientdet.py
--use_horovod | Use Horovod for a distributed run; pass the number of Horovod workers
--hls_type {HLS1,HLS1-H} | HLS Type
--kubernetes_run KUBERNETES_RUN | Kubernetes run
In addition, demo_efficientdet.py accepts all the main.py flags listed below (a combined usage sketch follows the list).
main.py
--backbone_ckpt: Location of the backbone (e.g., EfficientNet) checkpoint to use for model initialization.
--ckpt: Start training from this EfficientDet checkpoint.
--cp_every_n_steps: Number of iterations after which checkpoint is saved.
--deterministic: Deterministic input data.
--eval_after_training: Run one eval after the training finishes.
--eval_batch_size: Evaluation batch size.
--eval_master: GRPC URL of the eval master. Set to an appropriate value when running on CPU/GPU.
--eval_samples: The number of samples for evaluation.
--eval_timeout: Maximum seconds between checkpoints before evaluation terminates.
--gcp_project: Project name for the Cloud TPU-enabled project. If not specified, we will attempt to automatically detect the GCE project from metadata.
--hparams: Comma separated k=v pairs of hyperparameters.
--input_partition_dims: A list that describes the partition dims for all the tensors.
--iterations_per_loop: Number of iterations per TPU training loop.
--log_every_n_steps: Number of iterations after which training parameters are logged.
--min_eval_interval: Minimum seconds between evaluations.
--mode: Mode to run: train or eval (default: train)
--model_dir: Location of model_dir
--model_name: Model name: retinanet or efficientdet
--no_hpu: Do not load Habana modules = train on CPU/GPU
--num_cores: Number of TPU cores for training
--num_cores_per_replica: Number of TPU cores per replica when using spatial partition.
--num_epochs: Number of epochs for training
--num_examples_per_epoch: Number of examples in one epoch
--sbs_test: Config topology run for sbs testing.
--testdev_dir: COCO test-dev directory. If set, val_json_file is ignored.
--tpu: The Cloud TPU to use for training. This should be either the name used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 url.
--tpu_zone: GCE zone where the Cloud TPU is located. If not specified, we will attempt to automatically detect the zone from metadata.
--train_batch_size: Training batch size.
--training_file_pattern: Glob for training data files (e.g., COCO train - minival set)
--use_amp: Use AMP
--use_fake_data: Use fake input.
--use_horovod: Use Horovod for distributed training
--use_spatial_partition: Use spatial partition.
--use_tpu: Use TPUs rather than CPUs/GPUs
--use_xla: Use XLA
--val_json_file: COCO validation JSON containing golden bounding boxes.
--validation_file_pattern: Glob for evaluation tfrecords (e.g., COCO val2017 set)
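As referenced above, here is a combined usage sketch of several of these flags, in this case a standalone evaluation run. This is a hedged example, not a verified command; all paths are hypothetical placeholders, and 5000 is the size of the COCO val2017 set:

# Hedged example: evaluation with explicit flags (paths are placeholders)
python3 main.py --mode=eval \
  --model_dir=/tmp/efficientdet \
  --validation_file_pattern=/data/coco2017/tf_records/val-* \
  --val_json_file=/data/coco2017/annotations/instances_val2017.json \
  --eval_batch_size 8 \
  --eval_samples=5000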
Run on a single HPU, with a batch size of 8, full training of 300 epochs, on the COCO dataset:

python demo_efficientdet.py --mode=train --train_batch_size 8 --num_epochs 300 --dataset_dir "/software/data/tf/coco2017/tf_records" --backbone_ckpt "/software/data/tf/data/efficientdet/backbones/efficientnet-b0"

Run on 8 HPUs, with a batch size of 8, full training of 300 epochs, on the COCO dataset:

python3 demo_efficientdet.py --backbone_ckpt=/root/tensorflow_datasets/efficientdet/backbones/efficientnet-b0/ --training_file_pattern=/root/tensorflow_datasets/coco2017/tf_records/train-* --model_dir /tmp/efficientdet --use_horovod 8 --keep_checkpoint_max=300

- Currently the TPC fuser is disabled.
- Currently the maximum batch size is 32. Batch size 64 will be available soon.
- Only the d0 variant is available. Variants d1 to d7 will be available in the future.
- Only single-HPU training is available. Multi-HPU training will be available soon.