
feat: add person localization and video cropping to video pipeline#1

Open
brian10420 wants to merge 2 commits into main from feature/person-localize-crop-video

Conversation

@brian10420 (Collaborator) commented:

Summary

Adds two new pipeline processors to reduce background noise in video-mode datasets by detecting and cropping to the signer before downstream processing.

New Processors

person_localize

  • Uses YOLOv8-nano to detect the signer's bounding box across sampled frames
  • Writes BBOX_X1, BBOX_Y1, BBOX_X2, BBOX_Y2, PERSON_DETECTED columns to the manifest
  • Two configurable sampling strategies:
    • skip_frame (default): mirrors the extract processor's frame sampling logic
    • uniform: evenly samples N frames across the segment
  • Gracefully falls back to full-frame bbox when no person is detected
  • Resume-safe: skips already-processed rows on re-run
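The two sampling strategies can be sketched roughly as follows. This is a hypothetical `sample_indices` helper, not the PR's actual code; in particular it assumes `frame_skip` is the stride between sampled frames, which may differ from the real semantics.

```python
def sample_indices(total_frames: int, strategy: str = "skip_frame",
                   frame_skip: int = 2, sample_frames: int = 5) -> list:
    """Pick the frame indices to run person detection on."""
    if strategy == "uniform":
        # Exactly `sample_frames` indices, evenly spread over the segment.
        if total_frames <= sample_frames:
            return list(range(total_frames))
        step = (total_frames - 1) / (sample_frames - 1)
        return [round(i * step) for i in range(sample_frames)]
    # skip_frame: every `frame_skip`-th frame, capped at `sample_frames`.
    return list(range(0, total_frames, frame_skip))[:sample_frames]
```

Note how `uniform` always yields a fixed count while `skip_frame` only imposes an upper bound; this distinction matters again in the review below.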

crop_video

  • Reads bbox columns from manifest and crops each clip using ffmpeg
  • Configurable padding (default 25%) clamped to frame boundaries
  • Ensures even pixel dimensions for libx264 compatibility
  • Stream-copy (no re-encode) for fallback segments
  • Outputs to paths.cropped_clips (separate from paths.clips)
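The padding, clamping, and even-dimension rules compose like the hypothetical helper below (a sketch, assuming the 25% padding is split evenly across both sides of the bbox):

```python
def crop_rect(x1, y1, x2, y2, frame_w, frame_h, padding=0.25):
    """Expand a bbox by `padding`, clamp to the frame, force even sizes."""
    w, h = x2 - x1, y2 - y1
    px, py = w * padding / 2, h * padding / 2  # half the padding per side
    cx1 = max(0, int(x1 - px))
    cy1 = max(0, int(y1 - py))
    cx2 = min(frame_w, int(x2 + px))
    cy2 = min(frame_h, int(y2 + py))
    # libx264 requires even width/height; shrink by one pixel if odd.
    cw = (cx2 - cx1) & ~1
    ch = (cy2 - cy1) & ~1
    return cx1, cy1, cw, ch  # x, y, width, height for ffmpeg's crop filter
```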

Pipeline Integration

```yaml
# configs/how2sign/video.yaml
pipeline:
  steps: [person_localize, clip_video, crop_video, webdataset]
```

webdataset automatically uses cropped_clips when available.

Bug Fixes

  • Windows ffmpeg path: clip_video and crop_video now use shutil.which("ffmpeg") to resolve the binary path in worker processes, fixing [WinError 2] on Windows
  • Windows webdataset path: replaced wds.ShardWriter with a tarfile-based _ShardWriter to bypass gopen() misinterpreting Windows drive letters (D:/) as URL schemes
  • Windows POSIX paths: resolve_paths() now correctly treats /abs/path as absolute on Windows
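The ffmpeg fix reduces to resolving the binary to an absolute path before any worker process spawns it; a minimal sketch (the function name is hypothetical):

```python
import shutil

def resolve_binary(name: str = "ffmpeg") -> str:
    """Resolve a binary to its absolute path up front. On Windows, passing
    a bare "ffmpeg" string to a subprocess in a worker can fail with
    [WinError 2] even when the binary is on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    return path
```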

Validation

Tested on How2Sign val set (132 videos, 1741 segments):

```
person_localize : 1739/1741 detected, 2 fallback, 0 errors
clip_video      : 1741/1741 success
crop_video      : 1739 cropped + 2 stream-copied, 0 errors
webdataset      : 1741 samples → shard-000000.tar
```

Tests

164 tests passing. New test classes:

  • TestSampleStrategy — skip_frame / uniform sampling logic
  • TestPersonLocalizeManifest — manifest column writing and resume safety
  • TestCropGeometry — bbox padding, clamp, even dimensions
  • TestSchemaDefaults — new config fields and validation

@balaboom123 self-requested a review February 26, 2026 09:02

@balaboom123 (Owner) left a comment:

# PR Review — Confirmed + Additional Issues Found

---

## [Issue 1] Redundant and Duplicated Config Parameters

### 1a — frame_skip is declared twice with the same default and same semantic meaning.

The existing ProcessingConfig already has:

```python
# schema.py
class ProcessingConfig(BaseModel):
    frame_skip: int = 2   # ← already exists
```

The PR adds a second frame_skip in PersonLocalizeConfig:

```python
class PersonLocalizeConfig(BaseModel):
    frame_skip: int = 2   # ← duplicate
```

And both appear in _base/video.yaml:

```yaml
processing:
  frame_skip: 2          # ← existing

person_localize:
  frame_skip: 2          # ← new duplicate
```

These two fields can silently diverge. A user who sets processing.frame_skip: 4 to speed up pose extraction will reasonably expect localization to match — but it won't, because person_localize has its own independent value. The person_localize step should read from cfg.processing.frame_skip instead of maintaining its own copy.
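One way to de-duplicate, sketched here with plain dataclasses rather than the project's actual pydantic models (the `localize_stride` helper is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingConfig:
    frame_skip: int = 2  # single source of truth

@dataclass
class PersonLocalizeConfig:
    sample_strategy: str = "skip_frame"
    # no frame_skip here; the step reads cfg.processing.frame_skip

@dataclass
class Config:
    processing: ProcessingConfig = field(default_factory=ProcessingConfig)
    person_localize: PersonLocalizeConfig = field(default_factory=PersonLocalizeConfig)

def localize_stride(cfg: Config) -> int:
    """The localization step pulls its stride from the shared value."""
    return cfg.processing.frame_skip
```

With this shape, setting processing.frame_skip: 4 changes both pose extraction and localization together.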


### 1b — Commented-out override blocks in dataset configs repeat the exact defaults verbatim.

In configs/how2sign/video.yaml and configs/youtube_asl/video.yaml:

```yaml
# Uncomment to override:
# person_localize:
#   model: "yolov8n.pt"          ← same as _base default
#   confidence_threshold: 0.5    ← same as _base default
#   sample_frames: 5             ← same as _base default
#   device: "cuda:0"             ← same as _base default
#   min_bbox_area: 0.05          ← same as _base default
```

Overrides should only show the fields that are actually meaningful to change per-dataset, with example non-default values. As written, a user who uncomments the block changes nothing — it gives a false impression that something has been configured.


## [Issue 2] Hardcoded YOLO Backend — No Dispatch Logic for Future Model Support

In person_localize.py, the model is always loaded through ultralytics.YOLO, regardless of what string is in config.person_localize.model:

```python
# person_localize.py
model = YOLO(loc_cfg.model)   # always ultralytics — no dispatch
model.to(loc_cfg.device)
```

The model field in PersonLocalizeConfig is typed as a bare str, which suggests it could point to any model file, but the implementation has no branching to support anything other than ultralytics-format weights. Compare this to how the existing extractor pattern works — ExtractorConfig has a name field and the runner dispatches based on it to mediapipe or mmpose backends via the registry.

The fix should follow the same pattern: add a backend: Literal["ultralytics"] = "ultralytics" field (or similar), and wrap the model loading in a dispatch block so that adding a new detector backend in the future (e.g., MMDetection, Torchvision) only requires adding a new branch, not rewriting the processor:

```python
# Suggested direction
if loc_cfg.backend == "ultralytics":
    from ultralytics import YOLO
    model = YOLO(loc_cfg.model)
elif loc_cfg.backend == "mmdet":
    ...  # future support
else:
    raise ValueError(f"Unknown detector backend: {loc_cfg.backend}")
```

## [Issue 3] smoke_test_localize.py Added Without Integration into scripts/run.py

scripts/run.py is the project's documented single entry point:

```python
# scripts/run.py — unchanged by this PR
from sign_prep.__main__ import main

if __name__ == "__main__":
    main()
```

The PR adds scripts/smoke_test_localize.py as a completely standalone script with its own argument parser, its own Config construction, and its own processor invocations — none of which flow through run.py or the CLI. This creates two problems:

  1. Discoverability — users reading scripts/run.py have no way to know the smoke test exists.
  2. Maintenance drift — the smoke test builds a Config object manually with hardcoded fields that can go stale as the schema evolves, whereas run.py goes through load_config() which handles validation and defaults centrally.

At minimum, the smoke test should be referenced in a comment in run.py or the README. Better yet, consider exposing it as a subcommand:

```
python scripts/run.py --smoke-test --video path/to/video.mp4
```

## [Issue 4 — New] ultralytics Not Added to requirements.txt

The PR introduces a hard dependency on ultralytics (YOLOv8) but does not add it to requirements.txt:

```
# requirements.txt — ultralytics is absent
torch==2.0.1
opencv-python==4.11.0.86
mediapipe==0.10.21
...
# ultralytics — MISSING
```

The code tries to handle this gracefully with a module-level try/except:

```python
try:
    from ultralytics import YOLO
except ImportError:
    YOLO = None
```

But this only delays the error to runtime. A fresh pip install -r requirements.txt will produce a broken environment where person_localize silently sets YOLO = None and then raises ImportError only when the step actually runs. The package should be added to requirements.txt — optionally with a comment marking it as required only for the person_localize step.


## [Issue 5 — New] Resume Safety is Overstated — Manifest Written Only Once at the End

The PR description claims resume safety as a feature:

> Resume-safe: skips already-processed rows on re-run

However, looking at the implementation:

```python
# person_localize.py — manifest is saved ONCE at the very end
for idx, row in tqdm(pending.iterrows(), ...):
    ...  # process all rows
    results_map[idx] = {...}

# Written only after all rows are done
data.to_csv(manifest_path, sep="\t", index=False)
```

If the process crashes (OOM, power loss, keyboard interrupt) after processing 1,700 of 1,741 segments, results_map is lost entirely and the next run starts from zero. True resume safety requires writing to the manifest periodically (e.g., every N rows or every video), not only on successful completion.
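A minimal sketch of the periodic-flush pattern (the function and the `FLUSH_EVERY` cadence are hypothetical, not the PR's code):

```python
FLUSH_EVERY = 50  # assumption: checkpoint cadence, tune per workload

def localize_with_checkpoints(pending, process_row, save_manifest):
    """Process rows but flush results every FLUSH_EVERY rows, so a crash
    loses at most one batch rather than the whole run."""
    results = {}
    for i, row in enumerate(pending, start=1):
        results[i] = process_row(row)
        if i % FLUSH_EVERY == 0:
            save_manifest(results)  # periodic checkpoint
    save_manifest(results)          # final write
    return results
```

Combined with the existing skip-already-processed logic on startup, this makes the resume claim actually hold.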


## [Issue 6 — New] _ShardWriter Double _next_shard() Possible Bug

In the new _ShardWriter, both rollover conditions are checked independently in the same write() call:

```python
def write(self, sample: dict):
    if self._count >= self.max_count:
        self._next_shard()          # opens shard N+1
    if self.max_size and self._size >= self.max_size:
        self._next_shard()          # opens shard N+2 — skips N+1 entirely!
    ...
```

If both conditions are true simultaneously (a large final sample that exactly fills both count and size), _next_shard() is called twice. The first call closes shard N and opens shard N+1 (empty). The second call immediately closes the empty shard N+1 and opens shard N+2. The result is a zero-byte shard file in the output directory which may confuse WebDataset readers. Use elif or merge the two checks into a single condition.
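The merged-condition fix can be demonstrated on a toy model of the rollover logic (this `ShardRoller` class is illustrative only, not the real _ShardWriter):

```python
class ShardRoller:
    """Toy model of shard rollover; tracks only counts, not tar output."""
    def __init__(self, max_count: int, max_size: int):
        self.max_count, self.max_size = max_count, max_size
        self._count = self._size = 0
        self.shards_opened = 1  # first shard is open from the start

    def write(self, sample_size: int):
        # Single merged check: at most one rollover per write(), even when
        # the count and size limits trip on the same call.
        if self._count >= self.max_count or (
            self.max_size and self._size >= self.max_size
        ):
            self.shards_opened += 1
            self._count = self._size = 0
        self._count += 1
        self._size += sample_size
```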


## [Issue 7 — New] device: "cuda:0" Hardcoded in Base Config — No CPU Fallback

```yaml
# configs/_base/video.yaml
person_localize:
  device: "cuda:0"   # hardcoded — fails on CPU-only machines and macOS MPS
```

Every other config in this project that takes a device argument either defaults to "cpu" or uses a softer default. A user running the default config on a CPU-only machine (CI, cloud VM, macOS without CUDA) will get a CUDA device error immediately on step start. The base config should default to "cpu" for safety, with a comment suggesting "cuda:0" or "mps" for GPU environments.
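An alternative to a plain "cpu" default is an "auto" value resolved at step start; a hedged sketch (the `pick_device` helper is hypothetical):

```python
def pick_device(requested: str = "auto") -> str:
    """Resolve a safe torch device string. "auto" prefers CUDA, then Apple
    MPS, and falls back to CPU so CPU-only machines still work."""
    if requested != "auto":
        return requested  # explicit settings pass through untouched
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: plain CPU pipeline
    return "cpu"
```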


## [Issue 8 — New] sample_frames Has Ambiguous Dual Meaning

```python
class PersonLocalizeConfig(BaseModel):
    sample_frames: int = 5   # uniform: exact count; skip_frame: max frame count
```

The same field name controls two fundamentally different behaviours depending on sample_strategy:

  • In uniform mode: exactly 5 frames are sampled, always.
  • In skip_frame mode: up to 5 frames are sampled, but the actual count depends on segment duration and frame_skip.

This makes sample_frames difficult to reason about. Tuning one strategy silently changes the cap for the other. Consider splitting into uniform_frames: int = 5 and max_frames: int = 5, or at minimum add a @field_validator that emits a warning when the field is set without specifying sample_strategy.
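The split-field option could look like this, sketched with a dataclass rather than the project's pydantic schema (`frames_for_segment` is a hypothetical helper showing how the two fields would diverge):

```python
from dataclasses import dataclass

@dataclass
class PersonLocalizeConfig:
    """Hypothetical split of the ambiguous sample_frames field."""
    sample_strategy: str = "skip_frame"
    uniform_frames: int = 5   # uniform mode: exact number of frames
    max_frames: int = 5       # skip_frame mode: upper bound only

    def frames_for_segment(self, available: int, frame_skip: int) -> int:
        if self.sample_strategy == "uniform":
            return min(self.uniform_frames, available)
        # skip_frame: actual count depends on duration and stride
        return min(self.max_frames, -(-available // frame_skip))  # ceil div
```

Tuning one strategy then leaves the other's cap untouched, which resolves the ambiguity without a validator.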


## Summary Table

| # | Source | Severity | Issue |
|---|--------|----------|-------|
| 1a | User | Medium | frame_skip duplicated in ProcessingConfig and PersonLocalizeConfig |
| 1b | User | Low | Commented-out override blocks repeat base defaults verbatim |
| 2 | User | Medium | No backend dispatch — YOLO always used with no future extensibility |
| 3 | User | Low | smoke_test_localize.py not discoverable from scripts/run.py |
| 4 | New | High | ultralytics missing from requirements.txt |
| 5 | New | High | Manifest written once at end — crash loses all progress, resume safety overstated |
| 6 | New | Medium | _ShardWriter double _next_shard() when count and size limits fire together |
| 7 | New | Medium | device: "cuda:0" default breaks CPU-only environments |
| 8 | New | Low | sample_frames has ambiguous dual meaning across strategies |
