A large-scale, open-domain ASL-English parallel corpus with 11,000+ YouTube videos and 73,000+ segments (Uthus et al., 2023).
Default pose job: dataset.download → dataset.manifest → processing.video2pose → post_processing.normalize → output.webdataset
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml
Default video job: dataset.download → dataset.manifest → processing.video2crop → output.webdataset
python -m signdata run configs/jobs/youtube_asl/video.yaml
Requires dataset.source.video_ids_file pointing to the video ID list
(included at assets/youtube-asl_youtube_asl_video_ids.txt). The dataset
download stage fetches videos via yt-dlp and transcripts via
youtube-transcript-api. If transcript requests start failing with
RequestBlocked or IpBlocked, configure
dataset.source.transcript_proxy_http / dataset.source.transcript_proxy_https
or retry from a non-blocked residential IP.
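Before starting a long download, it can help to sanity-check the video ID list. The following is a minimal sketch, assuming the file holds one YouTube video ID per line and that IDs are 11 characters drawn from letters, digits, `_`, and `-`. `load_video_ids` is a hypothetical helper for illustration, not part of signdata.

```python
import re
from pathlib import Path

# Assumption: a YouTube video ID is 11 characters from [A-Za-z0-9_-].
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def load_video_ids(path):
    """Return unique, well-formed video IDs in file order (hypothetical helper)."""
    seen = set()
    ids = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        video_id = line.strip()
        if not video_id:
            continue  # skip blank lines
        if not VIDEO_ID_RE.fullmatch(video_id):
            raise ValueError(f"malformed video ID: {video_id!r}")
        if video_id not in seen:  # drop duplicates, keep first occurrence
            seen.add(video_id)
            ids.append(video_id)
    return ids
```

Calling `load_video_ids("assets/youtube-asl_youtube_asl_video_ids.txt")` would then return the de-duplicated list, or raise `ValueError` on the first malformed entry.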
80+ hours of instructional "how-to" videos with continuous ASL, recorded in a controlled environment with professional signers (Duarte et al., CVPR 2021).
Default pose job: dataset.download (validation only) → dataset.manifest → processing.video2pose → post_processing.normalize → output.webdataset
python -m signdata run configs/jobs/how2sign/mediapipe.yaml
Default video job: dataset.download (validation only) → dataset.manifest → processing.video2crop → output.webdataset
python -m signdata run configs/jobs/how2sign/video.yaml
Setup:
- Download the dataset from how2sign.github.io
- Place videos in the videos path (default: dataset/how2sign/videos/)
- Place the alignment CSV (e.g. how2sign_realigned_val.csv) at paths.manifest or dataset.source.manifest_csv
The How2Sign dataset adapter uses dataset.download as a validation step for
local files; it does not fetch remote data.
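The kind of local-file check described above can be approximated before running a job. This is a rough sketch, not the adapter's actual validation logic: `check_local_layout` is a hypothetical helper, and the `.mp4` extension is an assumption about how the downloaded videos are stored.

```python
from pathlib import Path


def check_local_layout(videos_dir, manifest_csv):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []
    videos = Path(videos_dir)
    if not videos.is_dir():
        problems.append(f"videos directory not found: {videos}")
    elif not any(videos.glob("*.mp4")):  # assumption: videos are .mp4 files
        problems.append(f"no .mp4 files in: {videos}")
    if not Path(manifest_csv).is_file():
        problems.append(f"alignment CSV not found: {manifest_csv}")
    return problems
```

For example, `check_local_layout("dataset/how2sign/videos/", "how2sign_realigned_val.csv")` returns an empty list when both the videos directory and the CSV are in place.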
All datasets must use the package layout
src/signdata/datasets/<dataset_name>/ with adapter.py, source.py, and
manifest.py as the default entry files.
See CONTRIBUTING.md for the required structure, responsibilities, and code template.
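When adding a dataset, a quick check that the package contains the three default entry files can catch layout mistakes early. A minimal sketch follows; `missing_entry_files` is a hypothetical helper, not something the project provides, and CONTRIBUTING.md remains the authoritative description of the structure.

```python
from pathlib import Path

# Default entry files named in the package layout above.
ENTRY_FILES = ("adapter.py", "source.py", "manifest.py")


def missing_entry_files(dataset_name, root="src/signdata/datasets"):
    """Return the entry files absent from <root>/<dataset_name>/ (hypothetical helper)."""
    package_dir = Path(root) / dataset_name
    return [name for name in ENTRY_FILES if not (package_dir / name).is_file()]
```

An empty list from `missing_entry_files("how2sign")` means all three files are present under `src/signdata/datasets/how2sign/`.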
- Pipeline Stages -- what each stage does and its I/O
- Configuration Reference -- full config schema and CLI overrides
- Research-Aligned Preprocessing -- paper-aligned methodology notes