feat: add person localization and video cropping to video pipeline #1

brian10420 wants to merge 2 commits into main
# PR Review — Confirmed + Additional Issues Found
---
## [Issue 1] Redundant and Duplicated Config Parameters
### 1a — frame_skip is declared twice with the same default and same semantic meaning.
The existing ProcessingConfig already has:

```python
# schema.py
class ProcessingConfig(BaseModel):
    frame_skip: int = 2  # ← already exists
```

The PR adds a second frame_skip in PersonLocalizeConfig:

```python
class PersonLocalizeConfig(BaseModel):
    frame_skip: int = 2  # ← duplicate
```

And both appear in _base/video.yaml:

```yaml
processing:
  frame_skip: 2    # ← existing
person_localize:
  frame_skip: 2    # ← new duplicate
```

These two fields can silently diverge. A user who sets processing.frame_skip: 4 to speed up pose extraction will reasonably expect localization to match — but it won't, because person_localize has its own independent value. The person_localize step should read from cfg.processing.frame_skip instead of maintaining its own copy.
### 1b — Commented-out override blocks in dataset configs repeat the exact defaults verbatim.
In configs/how2sign/video.yaml and configs/youtube_asl/video.yaml:
```yaml
# Uncomment to override:
# person_localize:
#   model: "yolov8n.pt"          # ← same as _base default
#   confidence_threshold: 0.5    # ← same as _base default
#   sample_frames: 5             # ← same as _base default
#   device: "cuda:0"             # ← same as _base default
#   min_bbox_area: 0.05          # ← same as _base default
```

Overrides should only show the fields that are actually meaningful to change per-dataset, with example non-default values. As written, a user who uncomments the block changes nothing — it gives a false impression that something has been configured.
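As a sketch, an override block that earns its place might show only fields worth changing, with clearly non-default values (the values below are illustrative, not recommendations):

```yaml
# Uncomment to override (example non-default values):
# person_localize:
#   confidence_threshold: 0.35   # lower threshold for small/distant signers
#   device: "cpu"                # this dataset is processed on CPU workers
```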
## [Issue 2] Hardcoded YOLO Backend — No Dispatch Logic for Future Model Support
In person_localize.py, the model is always loaded through ultralytics.YOLO, regardless of what string is in config.person_localize.model:

```python
# person_localize.py
model = YOLO(loc_cfg.model)  # always ultralytics — no dispatch
model.to(loc_cfg.device)
```

The model field in PersonLocalizeConfig is typed as a bare str, which suggests it could point to any model file, but the implementation has no branching to support anything other than ultralytics-format weights. Compare this to how the existing extractor pattern works — ExtractorConfig has a name field and the runner dispatches based on it to mediapipe or mmpose backends via the registry.
The fix should follow the same pattern: add a backend: Literal["ultralytics"] = "ultralytics" field (or similar), and wrap the model loading in a dispatch block so that adding a new detector backend in the future (e.g., MMDetection, Torchvision) only requires adding a new branch, not rewriting the processor:
```python
# Suggested direction
if loc_cfg.backend == "ultralytics":
    from ultralytics import YOLO
    model = YOLO(loc_cfg.model)
elif loc_cfg.backend == "mmdet":
    ...  # future support
else:
    raise ValueError(f"Unknown detector backend: {loc_cfg.backend}")
```

## [Issue 3] smoke_test_localize.py Added Without Integration into scripts/run.py
scripts/run.py is the project's documented single entry point:

```python
# scripts/run.py — unchanged by this PR
from sign_prep.__main__ import main

if __name__ == "__main__":
    main()
```

The PR adds scripts/smoke_test_localize.py as a completely standalone script with its own argument parser, its own Config construction, and its own processor invocations — none of which flow through run.py or the CLI. This creates two problems:
- Discoverability — users reading `scripts/run.py` have no way to know the smoke test exists.
- Maintenance drift — the smoke test builds a Config object manually with hardcoded fields that can go stale as the schema evolves, whereas `run.py` goes through `load_config()`, which handles validation and defaults centrally.
At minimum, the smoke test should be referenced in a comment in run.py or the README. Better yet, consider exposing it as a subcommand:

```bash
python scripts/run.py --smoke-test --video path/to/video.mp4
```

## [Issue 4 — New] ultralytics Not Added to requirements.txt
The PR introduces a hard dependency on ultralytics (YOLOv8) but does not add it to requirements.txt:
```text
# requirements.txt — ultralytics is absent
torch==2.0.1
opencv-python==4.11.0.86
mediapipe==0.10.21
...
# ultralytics — MISSING
```

The code tries to handle this gracefully with a module-level try/except:
```python
try:
    from ultralytics import YOLO
except ImportError:
    YOLO = None
```

But this only delays the error to runtime. A fresh `pip install -r requirements.txt` will produce a broken environment where person_localize silently sets `YOLO = None` and then raises ImportError only when the step actually runs. The package should be added to requirements.txt — optionally with a comment marking it as required only for the person_localize step.
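If the lazy import is kept, it could at least fail with an actionable message at the moment the step needs the model, rather than leaving a `None` sentinel around. A sketch — `load_yolo` is a hypothetical helper, not code from the PR:

```python
def load_yolo(model_path: str):
    """Lazily import ultralytics; fail with an actionable message if absent."""
    try:
        from ultralytics import YOLO
    except ImportError as exc:
        raise ImportError(
            "person_localize requires the 'ultralytics' package; "
            "install it with: pip install ultralytics"
        ) from exc
    return YOLO(model_path)
```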
## [Issue 5 — New] Resume Safety is Overstated — Manifest Written Only Once at the End
The PR description claims resume safety as a feature:
> Resume-safe: skips already-processed rows on re-run
However, looking at the implementation:
```python
# person_localize.py — manifest is saved ONCE at the very end
for idx, row in tqdm(pending.iterrows(), ...):
    ...  # process all rows
    results_map[idx] = {...}

# Written only after all rows are done
data.to_csv(manifest_path, sep="\t", index=False)
```

If the process crashes (OOM, power loss, keyboard interrupt) after processing 1,700 of 1,741 segments, results_map is lost entirely and the next run starts from zero. True resume safety requires writing to the manifest periodically (e.g., every N rows or every video), not only on successful completion.
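A sketch of periodic flushing under the names used in the excerpt (`results_map`, `manifest_path`, `pending`); `CHECKPOINT_EVERY`, `flush`, and `process_rows` are hypothetical names for illustration:

```python
import pandas as pd

# Hypothetical cadence; the PR currently writes only once at the very end.
CHECKPOINT_EVERY = 50

def flush(data: pd.DataFrame, results_map: dict, manifest_path: str) -> None:
    """Merge pending results into the manifest and persist it to disk."""
    for idx, cols in results_map.items():
        for col, val in cols.items():
            data.at[idx, col] = val
    data.to_csv(manifest_path, sep="\t", index=False)
    results_map.clear()

def process_rows(data, pending, manifest_path, process_one):
    """Process rows, checkpointing the manifest every CHECKPOINT_EVERY rows."""
    results_map = {}
    for n, (idx, row) in enumerate(pending.iterrows(), start=1):
        results_map[idx] = process_one(row)          # e.g. bbox columns
        if n % CHECKPOINT_EVERY == 0:
            flush(data, results_map, manifest_path)  # crash loses ≤ one window
    flush(data, results_map, manifest_path)          # final write
```

On restart, the existing skip-already-processed logic then genuinely resumes from the last checkpoint instead of from zero.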
## [Issue 6 — New] Possible Double _next_shard() Bug in _ShardWriter
In the new _ShardWriter, both rollover conditions are checked independently in the same write() call:
```python
def write(self, sample: dict):
    if self._count >= self.max_count:
        self._next_shard()  # opens shard N+1
    if self.max_size and self._size >= self.max_size:
        self._next_shard()  # opens shard N+2 — skips N+1 entirely!
    ...
```

If both conditions are true simultaneously (a large final sample that exactly fills both count and size), _next_shard() is called twice. The first call closes shard N and opens shard N+1 (empty). The second call immediately closes the empty shard N+1 and opens shard N+2. The result is a zero-byte shard file in the output directory, which may confuse WebDataset readers. Use elif or merge the two checks into a single condition.
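A minimal sketch of the merged-condition fix; the class shape and the `shards_opened` counter are illustrative, and only the rollover logic in `write()` mirrors the excerpt:

```python
class ShardWriterSketch:
    """Sketch of the rollover fix; only write()'s condition mirrors the PR."""

    def __init__(self, max_count, max_size=None):
        self.max_count = max_count
        self.max_size = max_size
        self._count = 0
        self._size = 0
        self.shards_opened = 0  # instrumentation for this example

    def _next_shard(self):
        # real code would close the current tar and open the next one
        self.shards_opened += 1
        self._count = 0
        self._size = 0

    def write(self, sample: bytes):
        # merged condition: at most ONE rollover per write() call
        if self._count >= self.max_count or (
            self.max_size is not None and self._size >= self.max_size
        ):
            self._next_shard()
        self._count += 1
        self._size += len(sample)
```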
## [Issue 7 — New] device: "cuda:0" Hardcoded in Base Config — No CPU Fallback
```yaml
# configs/_base/video.yaml
person_localize:
  device: "cuda:0"  # hardcoded — fails on CPU-only machines and macOS MPS
```

Every other config in this project that takes a device argument either defaults to "cpu" or uses a softer default. A user running the default config on a CPU-only machine (CI, cloud VM, macOS without CUDA) will get a CUDA device error immediately on step start. The base config should default to "cpu" for safety, with a comment suggesting "cuda:0" or "mps" for GPU environments.
## [Issue 8 — New] sample_frames Has Ambiguous Dual Meaning
```python
class PersonLocalizeConfig(BaseModel):
    sample_frames: int = 5  # uniform: exact count; skip_frame: max frame count
```

The same field name controls two fundamentally different behaviours depending on sample_strategy:

- In uniform mode: exactly 5 frames are sampled, always.
- In skip_frame mode: up to 5 frames are sampled, but the actual count depends on segment duration and frame_skip.
This makes sample_frames difficult to reason about. Tuning one strategy silently changes the cap for the other. Consider splitting into uniform_frames: int = 5 and max_frames: int = 5, or at minimum add a @field_validator that emits a warning when the field is set without specifying sample_strategy.
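A sketch of the split-field option, assuming the project's existing pydantic schema style (the names `uniform_frames`, `max_frames`, and `frame_budget` are suggestions, not PR code):

```python
from typing import Literal
from pydantic import BaseModel

class PersonLocalizeConfigSketch(BaseModel):
    sample_strategy: Literal["uniform", "skip_frame"] = "skip_frame"
    uniform_frames: int = 5  # uniform mode: exact number of frames sampled
    max_frames: int = 5      # skip_frame mode: upper bound on sampled frames

    def frame_budget(self) -> int:
        """Resolve the one field the active strategy actually uses."""
        if self.sample_strategy == "uniform":
            return self.uniform_frames
        return self.max_frames
```

Each strategy now reads exactly one unambiguous field, so tuning one mode cannot silently change the other.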
## Summary Table
| # | Source | Severity | Issue |
|---|---|---|---|
| 1a | User | Medium | frame_skip duplicated in ProcessingConfig and PersonLocalizeConfig |
| 1b | User | Low | Commented-out override blocks repeat base defaults verbatim |
| 2 | User | Medium | No backend dispatch — YOLO always used with no future extensibility |
| 3 | User | Low | smoke_test_localize.py not discoverable from scripts/run.py |
| 4 | New | High | ultralytics missing from requirements.txt |
| 5 | New | High | Manifest written once at end — crash loses all progress, resume safety overstated |
| 6 | New | Medium | _ShardWriter double _next_shard() when count and size limits fire together |
| 7 | New | Medium | device: "cuda:0" default breaks CPU-only environments |
| 8 | New | Low | sample_frames has ambiguous dual meaning across strategies |
## Summary
Adds two new pipeline processors to reduce background noise in video-mode datasets by detecting and cropping to the signer before downstream processing.
### New Processors

- `person_localize` — adds `BBOX_X1`, `BBOX_Y1`, `BBOX_X2`, `BBOX_Y2`, `PERSON_DETECTED` columns to the manifest. Sampling strategies:
  - `skip_frame` (default): mirrors the extract processor's frame sampling logic
  - `uniform`: evenly samples N frames across the segment
- `crop_video` — writes cropped clips to `paths.cropped_clips` (separate from `paths.clips`)

### Pipeline Integration
- `webdataset` automatically uses `cropped_clips` when available.

### Bug Fixes
- `clip_video` and `crop_video` now use `shutil.which("ffmpeg")` to resolve the binary path in worker processes, fixing `[WinError 2]` on Windows
- Replaced `wds.ShardWriter` with a `tarfile`-based `_ShardWriter` to bypass `gopen()` misinterpreting Windows drive letters (`D:/`) as URL schemes
- `resolve_paths()` now correctly treats `/abs/path` as absolute on Windows

### Validation
Tested on How2Sign val set (132 videos, 1741 segments):
### Tests
164 tests passing. New test classes:
- `TestSampleStrategy` — skip_frame / uniform sampling logic
- `TestPersonLocalizeManifest` — manifest column writing and resume safety
- `TestCropGeometry` — bbox padding, clamp, even dimensions
- `TestSchemaDefaults` — new config fields and validation