Commit bc8f323

Merge pull request #324 from NYU-RTS/mdweisner-dataset-paths-1
Update 04_datasets.md
2 parents aadd2a1 + b92bbe6 commit bc8f323

1 file changed

Lines changed: 24 additions & 39 deletions

docs/hpc/04_datasets/01_intro.md
@@ -2,10 +2,7 @@

 ## General
 The HPC team makes available a number of public sets that are commonly used in analysis jobs. The data sets are available Read-Only under
-- `/scratch/work/public/ml-datasets/`
-- `/vast/work/public/ml-datasets/`
-
-We recommend to use version stored at `/vast` (when available) to have better read performance
+- `/projects/work/public/ml-datasets/`

 :::note
 For some of the datasets users must provide a signed usage agreement before accessing
@@ -17,17 +14,17 @@ For example, in order to use coco dataset, one can run the following commands
 ```sh
 $ singularity exec \
 --overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
-/scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash
+--overlay /projects/work/public/ml-datasets/coco/coco-2014.sqf:ro \
+--overlay /projects/work/public/ml-datasets/coco/coco-2015.sqf:ro \
+--overlay /projects/work/public/ml-datasets/coco/coco-2017.sqf:ro \
+/projects/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash

 $ singularity exec \
 --overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
---overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
-/scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif find /coco | wc -l
+--overlay /projects/work/public/ml-datasets/coco/coco-2014.sqf:ro \
+--overlay /projects/work/public/ml-datasets/coco/coco-2015.sqf:ro \
+--overlay /projects/work/public/ml-datasets/coco/coco-2017.sqf:ro \
+/projects/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif find /coco | wc -l

 532896
 ```
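
When several squashfs images are mounted at once, as in the commands above, the repeated `--overlay` flags can be generated instead of typed by hand. The following is a minimal sketch, assuming every `.sqf` file directly under a dataset directory should be mounted read-only. The `overlay_args` function is a hypothetical helper written for illustration; it is not part of Singularity or of the cluster tooling.

```sh
#!/bin/sh
# Hypothetical helper (not part of Singularity): print one read-only
# --overlay flag per squashfs image found directly under the given directory.
overlay_args() {
    dir=$1
    for sqf in "$dir"/*.sqf; do
        # If the glob matched nothing, the pattern stays unexpanded; skip it.
        [ -e "$sqf" ] || continue
        printf '%s %s:ro ' '--overlay' "$sqf"
    done
}

# Example, using the dataset path from the commands above:
# overlay_args /projects/work/public/ml-datasets/coco
```

The output can be passed to `singularity exec` through an unquoted command substitution, for example `singularity exec $(overlay_args /projects/work/public/ml-datasets/coco) ...`. This relies on shell word splitting, so it only works for paths without spaces.
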
@@ -39,15 +36,9 @@ $ singularity exec \
 Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset.

 *Dataset is available under*
-`/scratch`
-- `/scratch/work/public/ml-datasets/coco/coco-2014.sqf`
-- `/scratch/work/public/ml-datasets/coco/coco-2015.sqf`
-- `/scratch/work/public/ml-datasets/coco/coco-2017.sqf`
-
-`/vast`
-- `/vast/work/public/ml-datasets/coco/coco-2014.sqf`
-- `/vast/work/public/ml-datasets/coco/coco-2015.sqf`
-- `/vast/work/public/ml-datasets/coco/coco-2017.sqf`
+- `/projects/work/public/ml-datasets/coco/coco-2014.sqf`
+- `/projects/work/public/ml-datasets/coco/coco-2015.sqf`
+- `/projects/work/public/ml-datasets/coco/coco-2017.sqf`

 ### ImageNet and ILSVRC
 About data set: [ImageNet (image-net.org)](https://image-net.org/)
@@ -70,8 +61,7 @@ ILSVRC uses a subset of ImageNet images for training the algorithms and some of
 - Size of data is about 150 GB (for train and validation)

 *Dataset is available under*
-- `/scratch/work/public/ml-datasets/imagenet`
-- `/vast/work/public/ml-datasets/imagenet`
+- `/projects/work/public/ml-datasets/imagenet`

 ##### Get access to Data

@@ -84,34 +74,31 @@ Please open the ImageNet site, find the terms of use ([http://image-net.org/down

 *Dataset is available under*

-- `/scratch/work/public/MillionSongDataset`
-- `/vast/work/public/ml-datasets/millionsongdataset/`
+- `/projects/work/public/ml-datasets/millionsongdataset/`

 ### ProQuest Congressional Record
 About data set: [ProQuest Congressional Record](https://guides.nyu.edu/govdocs/congressional#s-lg-box-14137380)

 The ProQuest Congressional Record text-as-data collection consists of machine-readable files capturing the full text and a small number of metadata fields for a full run of the Congressional Record between 1789 and 2005. Metadata fields include the date of publication, subjects (for issues for which such information exists in the ProQuest system), and URLs linking the full text to the canonical online record for that issue on the ProQuest Congressional platform. A total of 31,952 issues are available.

 *Dataset is available under*:
-- `/scratch/work/public/proquest/`
+- `/projects/work/public/proquest/`

 ### C4
 *About data set*: [c4 | TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/c4)

 A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: [https://commoncrawl.org](https://commoncrawl.org)

 *Dataset is available under*
-- `/scratch/work/public/ml-datasets/c4`
-- `/vast/work/public/ml-datasets/c4`
+- `/projects/work/public/ml-datasets/c4`

 ### GQA
 *About data set*: [GQA: Visual Reasoning in the Real World (stanford.edu)](https://cs.stanford.edu/people/dorarad/gqa/index.html)

 Question Answering on Image Scene Graphs

 *Dataset is available under*
-- `/scratch/work/public/ml-datasets/gqa`
-- `/vast/work/public/ml-datasets/gqa`
+- `/projects/work/public/ml-datasets/gqa`

 ### MJSynth
 *About data set*: [Visual Geometry Group - University of Oxford](https://www.robots.ox.ac.uk/~vgg/data/text/)
@@ -121,7 +108,7 @@ This is synthetically generated dataset which found to be sufficient for trainin
 This dataset consists of 9 million images covering 90k English words, and includes the training, validation and test splits used in the author's work (archived dataset is about 10 GB)

 *Dataset is available under*
-- `/vast/work/public/ml-datasets/mjsynth`
+- `/projects/work/public/ml-datasets/mjsynth`

 ### open-images-dataset
 *About data set*: [Open Images Dataset – opensource.google](https://storage.googleapis.com/openimages/web/index.html)
@@ -131,26 +118,24 @@ A dataset of ~9 million varied images with rich annotations
 The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). It contains image-level labels annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more

 *Dataset is available under*
-- `/scratch/work/public/ml-datasets/open-images-dataset`
-- `/vast/work/public/ml-datasets/open-images-dataset`
+- `/projects/work/public/ml-datasets/open-images-dataset`

 ### Pile
 *About data set*: [The Pile (eleuther.ai)](https://pile.eleuther.ai/)

 The Pile is a 825 GiB diverse, open source language modeling data set that consists of 22 smaller, high-quality datasets combined together.

 *Dataset is available under*
-- `/scratch/work/public/ml-datasets/pile`
-- `/vast/work/public/ml-datasets/pile`
+- `/projects/work/public/ml-datasets/pile`

 ### Waymo open dataset
 *About data set*: [Open Dataset – Waymo](https://waymo.com/open/)

 The field of machine learning is changing rapidly. Waymo is in a unique position to contribute to the research community with some of the largest and most diverse autonomous driving datasets ever released.

 *Dataset is available under*
-- `/vast/work/public/ml-datasets/waymo_open_dataset_scene_flow`
-- `/vast/work/public/ml-datasets/waymo_open_dataset_v_1_2_0_individual_files`
-- `/vast/work/public/ml-datasets/waymo_open_dataset_v_1_3_2_individual_files`
-- `/vast/work/public/ml-datasets/waymo_open_dataset_v_1_4_1_individual_files`
+- `/projects/work/public/ml-datasets/waymo_open_dataset_scene_flow`
+- `/projects/work/public/ml-datasets/waymo_open_dataset_v_1_2_0_individual_files`
+- `/projects/work/public/ml-datasets/waymo_open_dataset_v_1_3_2_individual_files`
+- `/projects/work/public/ml-datasets/waymo_open_dataset_v_1_4_1_individual_files`
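
Job scripts that still reference the old locations need the same substitution this commit applies to the documentation. The following is a rough sketch, assuming the directory layout below `work/public` is unchanged between the old and new locations, as most paths in this diff suggest. The `migrate_paths` function is a hypothetical helper, not an existing tool. It does not cover the one path in the diff that is also renamed (`/scratch/work/public/MillionSongDataset`, now under `ml-datasets/millionsongdataset/`).

```sh
#!/bin/sh
# Hypothetical helper: read text on stdin, rewrite the old public-data
# prefixes (/scratch and /vast) to the new /projects prefix, print to stdout.
# Assumption: everything below work/public/ keeps the same relative path.
migrate_paths() {
    sed -e 's#/scratch/work/public/#/projects/work/public/#g' \
        -e 's#/vast/work/public/#/projects/work/public/#g'
}

# Example (file names are placeholders):
# migrate_paths < job.sbatch > job.sbatch.new
```

Writing to a new file and comparing it against the original with `diff` before replacing anything keeps the change reviewable. Only paths under `work/public/` are touched, so personal `/scratch` directories are left alone.
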
