For more information about training deep learning models on Gaudi, visit developer.habana.ai.
This repository provides a script to train the Single Shot Detection (SSD) model (Liu et al., 2016) with a ResNet-34 backbone on the COCO2017 dataset on Habana Gaudi (HPU). It is based on the MLPerf training 0.6 implementation by Google. The model provides output as bounding boxes. Please visit this page for performance information.
The SSD ResNet34 model is based on https://github.com/mlperf/training_results_v0.6/ (Google/benchmarks/ssd/implementations/tpu-v3-32-ssd)
The functional changes are:
- Migrated from TF 1.15 to TF 2.2
- Enabled resource variables
- TPU-specific topk_mask implementation is replaced with implementation from the reference
- Removed TPU and GCP related code
- Used argparse instead of tf.flags
- Removed mlperf 0.6 logger
- Added flags: num_steps, log_step_count_steps, save_checkpoints_steps
- Used dataset.interleave instead of tf.data.experimental.parallel_interleave in dataloader.py
- Name scopes 'concat_cls_outputs' and 'concat_box_outputs' added to concat_outputs
- Fixed issues with the COCO2017 dataset; provided a script for generating the correct dataset (117266 training examples)
- Removed weight_decay loss from total loss calculation (fwd only)
- Updated input normalization formula to use multiplication by reciprocal instead of division
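The reciprocal rewrite can be illustrated with a small sketch. The per-channel mean/std constants below are the common ImageNet values, assumed here only for illustration; they are not taken from this repository:

```python
import numpy as np

# Sketch: input normalization with division replaced by multiplication
# by a precomputed reciprocal. Mean/std values are assumed ImageNet constants.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
inv_std = 1.0 / std  # computed once; the per-pixel op becomes a multiply

image = np.random.rand(300, 300, 3).astype(np.float32)

normalized_div = (image - mean) / std       # original formulation
normalized_mul = (image - mean) * inv_std   # reciprocal formulation

# Both agree up to floating-point rounding.
assert np.allclose(normalized_div, normalized_mul, atol=1e-5)
```

Multiplication is generally cheaper than division on accelerator hardware, which is why the reciprocal is hoisted out and reused.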
- Added logging hook that calculates IPS and total training time
- Disabled Eager mode
- Added Horovod for distributed training
- Added support for HPU (load_habana_modules and Habana Horovod)
- Added demo_ssd, which allows running multi-node training with OpenMPI
- Turned off dataset caching when RAM size is not sufficient
- Added support for TF profiling
- Added inference mode
- Added support for distributed batch normalization
The performance changes are:
- Boxes and classes are transposed in the dataloader, not in the model, to improve performance
- Introduced custom softmax_cross_entropy_mme loss function that better utilizes HPU hardware (by implementing reduce_sum through conv2d which is computed on MME in parallel with other TPC operations and transposing tensors for reduce_max)
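The reduce_sum-through-conv2d trick can be sketched in NumPy: summing over the channel axis is mathematically a 1x1 convolution with an all-ones kernel, which the HPU can dispatch to the MME while the TPC handles other work. This is only an illustrative sketch of the equivalence, not the actual loss implementation:

```python
import numpy as np

# Sketch: reduce_sum over the channel axis expressed as a 1x1 convolution
# with an all-ones kernel. Shapes here are arbitrary illustration values.
x = np.random.rand(2, 8, 8, 16).astype(np.float32)  # NHWC

# Plain reduction.
reduced = x.sum(axis=-1, keepdims=True)

# Same reduction as a 1x1 conv: out[n,h,w,0] = sum_c x[n,h,w,c] * 1.
ones_kernel = np.ones((16, 1), dtype=np.float32)  # (C_in, C_out)
as_conv = (x.reshape(-1, 16) @ ones_kernel).reshape(2, 8, 8, 1)

assert np.allclose(reduced, as_conv, atol=1e-4)
```

Expressing the reduction as a matmul/convolution lets it run on the matrix engine instead of competing with the vector engine for the surrounding elementwise ops.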
The default hyperparameters are:
- Learning rate base = 3e-3
- Weight decay = 5e-4
- Epochs for learning rate Warm-up = 5
- Batch size = 128
- Epochs for training = 50
- Data type = bf16
- Loss calculation = False
- Mode = train
- Epochs at which learning rate decays = 0
- Number of samples for evaluation = 5000
- Number of examples in one epoch = 117266
- Number of training steps = 0
- Frequency of printing loss = 1
- Frequency of saving checkpoints = 5
- Maximum number of checkpoints stored = 20
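The warmup and decay defaults above can be illustrated with a small schedule sketch. The linear-warmup shape and the 10x decay factor are assumptions based on typical MLPerf SSD implementations, not verified against this repository's code:

```python
# Hypothetical sketch of the schedule implied by the defaults:
# base_lr = 3e-3, 5 warmup epochs, decay epochs [40, 50] (k = 0).
# Linear warmup and a 0.1 decay factor are assumptions for illustration.
BASE_LR = 3e-3
WARMUP_EPOCHS = 5
DECAY_EPOCHS = [40, 50]

def learning_rate(epoch: float) -> float:
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS  # linear ramp from 0
    lr = BASE_LR
    for decay_epoch in DECAY_EPOCHS:
        if epoch >= decay_epoch:
            lr *= 0.1  # assumed decay factor
    return lr

print(learning_rate(2.5))  # mid-warmup: half of base_lr
print(learning_rate(20))   # full base_lr
print(learning_rate(45))   # after the first decay
```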
Please follow the instructions given in the Gaudi Setup and Installation Guide
to set up the environment, including the $PYTHON environment variable. Please
answer the questions in the guide according to your preferences. The guide will
walk you through the process of setting up your system to run the model on
Gaudi.
Set the MPI_ROOT environment variable to the directory where OpenMPI is installed.
For example, in Habana containers, use
export MPI_ROOT=/usr/local/openmpi/
In the docker container, clone this repository and switch to the branch that
matches your SynapseAI version. (Run the
hl-smi
utility to determine the SynapseAI version.)
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References /root/Model-References
Add Model-References to PYTHONPATH:
export PYTHONPATH=/root/Model-References:$PYTHONPATH
The topology script is already configured for COCO2017 (117266 training images, 5000 validation images).
Only images with any bounding box annotations are used for training.
The dataset directory should be mounted to /data/coco2017/ssd_tf_records.
The topology uses tf-records which can be prepared in the following way:
cd /root/Model-References/TensorFlow/computer_vision/SSD_ResNet34
export TMP_DIR=$(mktemp -d)
export SSD_PATH=$(pwd)
pushd $TMP_DIR
git clone https://github.com/tensorflow/tpu
cd tpu
git checkout 0ffe1274745806c411ed3dda7e84f692e00df8af
git apply ${SSD_PATH}/coco-tf-records.patch
cd tools/datasets
bash download_and_preprocess_coco.sh /data/coco2017/ssd_tf_records
popd
rm -rf $TMP_DIR
The topology uses pretrained ResNet34 weights.
They should be mounted to /data/ssd_r34-mlperf.
mkdir /data/ssd_r34-mlperf
cd /data/ssd_r34-mlperf
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/checkpoint
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.data-00000-of-00001
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.index
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.meta
Go to the SSD ResNet34 directory:
cd /root/Model-References/TensorFlow/computer_vision/SSD_ResNet34
During single-card training, the model requires over 90 GB of host memory for
its dataset due to caching. To handle this on the Gaudi device, it is necessary
to allocate huge pages in the system configuration; otherwise an "Error mapping memory" may occur. This configuration can be done as follows:
- Back up /etc/sysctl.conf:
cp /etc/sysctl.conf /etc/sysctl.conf.bak
- Ensure the following line is in /etc/sysctl.conf, either by adding it or by editing an existing setting:
vm.nr_hugepages = 153600
- Run:
sysctl -p
If for some reason the setting cannot be applied, you will have to remove
cache() from SSDInputReader in dataloader.py.
- After completing training of this model, restore /etc/sysctl.conf to its original state:
mv /etc/sysctl.conf.bak /etc/sysctl.conf
$PYTHON demo_ssd.py -e <epoch> -b <batch_size> -d <precision> --model_dir <path/to/model_dir>
For example:
- The following command will train the topology on a single Gaudi card using batch size 128, 50 epochs, bf16 precision, and the remaining default hyperparameters, saving summary data every 10 steps.
$PYTHON demo_ssd.py -e 50 -b 128 -d bf16 --model_dir /tmp/ssd_1_hpu --save_summary_steps 10
Each epoch will take ceil(117266 / 128) = 917 steps, so the whole training will take 917 * 50 = 45850 steps. Checkpoints will be saved to /tmp/ssd_1_hpu.
- The following command will train the topology on a single Gaudi card using batch size 128, 50 epochs, fp32 precision, and the remaining default hyperparameters.
$PYTHON demo_ssd.py -e 50 -b 128 -d fp32 --model_dir /tmp/ssd_1_hpu
Each epoch will take ceil(117266 / 128) = 917 steps, so the whole training will take 917 * 50 = 45850 steps. Checkpoints will be saved to /tmp/ssd_1_hpu.
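The step arithmetic quoted above can be double-checked with a few lines of Python:

```python
import math

# Verify the single-card step counts: 117266 examples, batch size 128, 50 epochs.
num_examples = 117266
batch_size = 128
epochs = 50

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 917
print(total_steps)      # 45850
```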
$PYTHON demo_ssd.py -e <epoch> -b <batch_size> -d <precision> --model_dir <path/to/model_dir> --hvd_workers 8
For example:
- The following command will train the topology on 8 Gaudi cards using batch size 128, 50 epochs, bf16 precision, and other default hyperparameters.
$PYTHON demo_ssd.py -e 50 -b 128 -d bf16 --model_dir /tmp/ssd_8_hpus --hvd_workers 8
Each epoch will take ceil(117266 / (128 * 8)) = 115 steps, so the whole training will take 115 * 50 = 5750 steps. Checkpoints will be saved in /tmp/ssd_8_hpus.
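The 8-card step counts follow the same arithmetic, with the global batch being 128 * 8 = 1024:

```python
import math

# Verify the 8-card step counts: the global batch is 128 examples * 8 workers.
num_examples = 117266
global_batch = 128 * 8
epochs = 50

steps_per_epoch = math.ceil(num_examples / global_batch)
print(steps_per_epoch)           # 115
print(steps_per_epoch * epochs)  # 5750
```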
For example, the following mpirun command runs 8-card training with batch size 128, 50 epochs, bf16 precision, and other default hyperparameters, saving summary data every 10 steps and checkpoints every 50 epochs.
The mpirun --map-by socket:PE=<value> attribute may vary on your setup and should be calculated as:
PE = floor((number of physical cores) / (number of Gaudi devices per node))
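For instance, on a hypothetical host with 56 physical cores and 8 Gaudi devices per node (the core count is an assumption for illustration), the value works out as:

```python
import math

# Hypothetical host: 56 physical cores, 8 Gaudi devices per node.
physical_cores = 56
gaudi_devices_per_node = 8

pe = math.floor(physical_cores / gaudi_devices_per_node)
print(pe)  # 7
```

This matches the socket:PE=7 used in the example command.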
mpirun --allow-run-as-root --bind-to core --map-by socket:PE=7 --np 8 \
$PYTHON /root/Model-References/TensorFlow/computer_vision/SSD_ResNet34/ssd.py --use_horovod --epochs 50 --batch_size 128 --dtype bf16 --model_dir /tmp/ssd_8_hpus --save_summary_steps 10 --save_checkpoints_epochs 50
Note that it is required to provide the --use_horovod argument.
In order to calculate mAP for the saved checkpoints in /tmp/ssd_1_hpu:
$PYTHON ssd.py --mode eval --model_dir /tmp/ssd_1_hpu
The following section provides details of running training.
In the /root/Model-References/TensorFlow/computer_vision/SSD_ResNet34 directory, the most important files are:
- demo_ssd.py: Serves as a wrapper script for the training file ssd.py. Also preloads libjemalloc and allows running distributed training, as it contains the --hvd_workers argument.
- ssd.py: The training script of the SSD model. Contains all the other arguments.
Modify the training behavior through the various flags present in the ssd.py file. Some of the important parameters in the ssd.py script are as follows:
- -d bf16/fp32 or --dtype bf16/fp32: Data type: fp32 or bf16 (default: bf16)
- -b N or --batch_size N: Batch size (default: 128)
- -e N or --epochs N: Number of epochs for training (default: 50)
- --mode train/eval: Mode: train or eval (default: train)
- --use_horovod: Use Horovod for distributed training (default: False)
- --inference IMAGE: Path to image for inference (if set, then mode is ignored) (default: None)
- --no_hpu: Do not load Habana modules, i.e. train on CPU/GPU (default: False)
- --distributed_bn: Use distributed batch norm (default: False)
- --vis_dataloader: Visualize dataloader (default: False)
- --profiling: Turn on profiling (generate json every 20 steps) (default: False)
- --use_cocoeval_cc: Use cocoeval cc (default: False)
- -f or --use_fake_data: Use fake data to reduce the input preprocessing overhead (for unit tests) (default: False)
- --lr_warmup_epoch N: Number of epochs for learning rate warmup (default: 5.0)
- --base_lr BASE_LR: Base learning rate (default: 0.003)
- --weight_decay WD: L2 weight decay (default: 0.0005)
- --k K: K is an integer defining at which epochs the learning rate decays: [40, 50] * (1 + k/10) (default: 0)
- --model_dir <dir>: Location of model_dir (default: /tmp/ssd)
- --resnet_checkpoint <PATH>: Location of the ResNet checkpoint to use for model initialization (default: /data/ssd_r34-mlperf)
- --data_dir <PATH>: Path to dataset (default: /data/coco2017/ssd_tf_records)
- --eval_samples N: Number of samples for evaluation (default: 5000)
- --num_examples_per_epoch N: Number of examples in one epoch (default: 117266)
- -s STEPS or --steps STEPS: Number of training steps (epochs and num_examples_per_epoch are ignored when set) (default: 0)
- -v STEPS or --log_step_count_steps STEPS: How often to print global_step/sec and loss (default: 1)
- -c EPOCHS or --save_checkpoints_epochs EPOCHS: How often to save checkpoints (default: 5.0)
- --keep_ckpt_max N: Maximum number of checkpoints to keep (default: 20)
- --save_summary_steps SAVE_SUMMARY_STEPS: How often to save summary (default: 1)
- --static: Enables use of static dataloader (default: False)
- --recipe_cache RECIPE_CACHE: Path to recipe cache directory. Set to empty to disable recipe cache. An externally set "TF_RECIPE_CACHE_PATH" will override this setting. (default: /tmp/ssd_recipe_cache/)
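The effect of the --k flag on the decay schedule can be computed directly from the formula above:

```python
# Decay epochs from the --k formula: [40, 50] * (1 + k/10).
def decay_epochs(k: int) -> list:
    return [e * (1 + k / 10) for e in (40, 50)]

print(decay_epochs(0))  # the default: decays at epochs 40 and 50
print(decay_epochs(2))  # stretched schedule: decays at epochs 48 and 60
```

With the default of 50 training epochs and k = 0, only the first decay (epoch 40) is ever reached; larger k values push the decays later for longer runs.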