About data set: [ProQuest Congressional Record](https://guides.nyu.edu/govdocs/congressional#s-lg-box-14137380)

The ProQuest Congressional Record text-as-data collection consists of machine-readable files capturing the full text and a small number of metadata fields for a full run of the Congressional Record between 1789 and 2005. Metadata fields include the date of publication, subjects (for issues for which such information exists in the ProQuest system), and URLs linking the full text to the canonical online record for that issue on the ProQuest Congressional platform. A total of 31,952 issues are available.

*Dataset is available under*:
-- `/scratch/work/public/proquest/`
+- `/projects/work/public/proquest/`

### C4
*About data set*: [c4 | TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/c4)

A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: [https://commoncrawl.org](https://commoncrawl.org)

*Dataset is available under*
-- `/scratch/work/public/ml-datasets/c4`
-- `/vast/work/public/ml-datasets/c4`
+- `/projects/work/public/ml-datasets/c4`

### GQA
*About data set*: [GQA: Visual Reasoning in the Real World (stanford.edu)](https://cs.stanford.edu/people/dorarad/gqa/index.html)

Question Answering on Image Scene Graphs

*Dataset is available under*
-- `/scratch/work/public/ml-datasets/gqa`
-- `/vast/work/public/ml-datasets/gqa`
+- `/projects/work/public/ml-datasets/gqa`

### MJSynth
*About data set*: [Visual Geometry Group - University of Oxford](https://www.robots.ox.ac.uk/~vgg/data/text/)

@@ -121,7 +108,7 @@ This is synthetically generated dataset which found to be sufficient for trainin
This dataset consists of 9 million images covering 90k English words, and includes the training, validation and test splits used in the author's work (archived dataset is about 10 GB)

*Dataset is available under*
-- `/vast/work/public/ml-datasets/mjsynth`
+- `/projects/work/public/ml-datasets/mjsynth`

### open-images-dataset
*About data set*: [Open Images Dataset – opensource.google](https://storage.googleapis.com/openimages/web/index.html)

@@ -131,26 +118,24 @@ A dataset of ~9 million varied images with rich annotations
The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). It contains image-level labels annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more

*About data set*: [The Pile (eleuther.ai)](https://pile.eleuther.ai/)

The Pile is a 825 GiB diverse, open source language modeling data set that consists of 22 smaller, high-quality datasets combined together.

*Dataset is available under*
-- `/scratch/work/public/ml-datasets/pile`
-- `/vast/work/public/ml-datasets/pile`
+- `/projects/work/public/ml-datasets/pile`

### Waymo open dataset
*About data set*: [Open Dataset – Waymo](https://waymo.com/open/)

The field of machine learning is changing rapidly. Waymo is in a unique position to contribute to the research community with some of the largest and most diverse autonomous driving datasets ever released.
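
Every path change visible in this diff follows the same pattern: an entry under `/scratch/work/public` or `/vast/work/public` is replaced by the same relative path under `/projects/work/public`. For scripts that still reference the old locations, that remapping could be sketched in Python roughly as follows. The helper name `migrate_dataset_path` and the two constants are hypothetical and not part of this repository; the sketch only rewrites path strings and assumes, without checking the filesystem, that the layout under the new prefix mirrors the old one.

```python
from pathlib import PurePosixPath

# Old prefixes, taken from the removed ("-") lines in this diff.
OLD_PREFIXES = ("/scratch/work/public", "/vast/work/public")
# New prefix, taken from the added ("+") lines in this diff.
NEW_PREFIX = "/projects/work/public"


def migrate_dataset_path(path: str) -> str:
    """Hypothetical helper: map an old public dataset path to the new prefix.

    Paths that are not under one of the old prefixes are returned unchanged.
    Only the string is rewritten; nothing is checked on disk.
    """
    candidate = PurePosixPath(path)
    for old_prefix in OLD_PREFIXES:
        try:
            # relative_to raises ValueError when candidate is not under old_prefix.
            relative = candidate.relative_to(old_prefix)
        except ValueError:
            continue
        return str(PurePosixPath(NEW_PREFIX) / relative)
    return path


if __name__ == "__main__":
    for old in (
        "/scratch/work/public/ml-datasets/c4",
        "/vast/work/public/ml-datasets/pile",
    ):
        print(old, "->", migrate_dataset_path(old))
```

Because the helper goes through `PurePosixPath`, a trailing slash on a rewritten path is dropped, and a look-alike directory such as `/scratch/work/publicx` is left untouched rather than matched by string prefix.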