
MD Khalequzzaman Chowdhury Sayem

Researcher at UNIST Vision & Learning Lab
3D Vision · Vision-Language Models · Geometry-Grounded Multimodal Reasoning

Homepage · Google Scholar · LinkedIn · Email

CVPR 2026 · AAAI 2025 · 3D Vision · Vision-Language Models · Hand-Object Interaction

About Me

I am a researcher at the Vision & Learning Lab, UNIST, South Korea, working under the supervision of Prof. Seungryul Baek and Prof. Binod Bhattarai.

My work focuses on multimodal learning, vision-language models, and geometry-grounded reasoning in visually complex environments, especially for articulated hands and hand-object interaction.

I am interested in building multimodal systems that reason more reliably about 3D structure, spatial relationships, and fine-grained interactions, with longer-term goals in grounded world models and embodied multimodal intelligence.

Research Snapshot

Current Directions

  • Reliable multimodal reasoning with explicit geometric supervision
  • Fine-grained understanding of hands and hand-object interactions
  • Scalable benchmarks for spatial reasoning in VLMs
  • Interpretable and grounded multimodal foundation models

Research Areas

  • 3D Vision
  • Vision-Language Models
  • Multimodal Learning
  • Hand Pose and Hand-Object Interaction
  • Geometry-Grounded Reasoning
  • Embodied AI and World Models

Featured Publications

HandVQA · CVPR 2026 · 1.6M+ VQA Pairs · Spatial Reasoning

HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

  • Large-scale benchmark grounded in 3D hand geometry
  • Covers joint angles, distances, and relative spatial relations (see the data sketch below)
  • Shows that explicit 3D supervision improves reliability and cross-task generalization

Project Page
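
For illustration, here is a hypothetical sketch of what a geometry-grounded VQA record could look like. The field names (`image_path`, `joint_angles_deg`, `pairwise_dist_cm`) are assumptions made for this sketch, not the actual HandVQA schema:

```python
# Hypothetical sketch of a geometry-grounded VQA record; the real HandVQA
# schema may differ -- all field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HandVQAExample:
    image_path: str         # RGB frame the question refers to
    question: str           # natural-language spatial query
    answer: str             # ground-truth answer string
    joint_angles_deg: dict  # per-joint angles derived from the 3D hand pose
    pairwise_dist_cm: dict  # keypoint-pair distances derived from 3D geometry

example = HandVQAExample(
    image_path="frames/000123.jpg",
    question="Is the thumb tip closer to the index tip or to the wrist?",
    answer="index tip",
    joint_angles_deg={"index_pip": 42.0},
    pairwise_dist_cm={("thumb_tip", "index_tip"): 3.1,
                      ("thumb_tip", "wrist"): 9.8},
)
```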

QORT-Former · AAAI 2025 · 53.5 FPS · Real-Time 3D Pose

QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects

  • Real-time Transformer for two-hand and object 3D pose estimation (timing sketch below)
  • Balances efficiency and accuracy for practical deployment
  • Outperforms prior methods on H2O and FPHA while running in real time

Project Page · Paper · Code
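
As context for the 53.5 FPS number, here is a minimal, generic sketch of how forward-pass throughput is commonly measured in PyTorch. The `model` argument and input shape are placeholders, not the actual QORT-Former interface:

```python
# Generic FPS-measurement sketch in PyTorch; `model` and the input shape
# are stand-ins, not the actual QORT-Former code or interface.
import time
import torch

def measure_fps(model, batch, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):           # warm up kernels and caches
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()      # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters / elapsed                # forward passes per second

# Example with a stand-in model and a single 256x256 RGB frame:
# fps = measure_fps(torchvision.models.resnet18(), torch.randn(1, 3, 256, 256))
```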

Selected Repositories

Repository       Description
HandVQA          Fine-grained spatial reasoning about hands in vision-language models
QORT-Former      Real-time Transformer for understanding two hands manipulating objects
4d-editing       4D Instruct-GS2GS for extending semantic editing to dynamic 3D scenes
Parallel-bandit  Parallelized contextual bandit algorithms (LinUCB, Thompson Sampling) for news recommendation on the Yahoo! R6A dataset (minimal sketch below)
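
For reference, a minimal sketch of the standard disjoint LinUCB algorithm that the Parallel-bandit repository builds on. This is an illustrative reimplementation of the textbook update, not code taken from the repository:

```python
# Minimal disjoint LinUCB sketch (standard formulation); illustrative only,
# not code from the Parallel-bandit repository.
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # width of the exploration bonus
        # Per-arm ridge-regression state: A is a d x d Gram matrix, b a d-vector.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, contexts):
        """contexts: one d-dimensional feature vector per arm."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                  # ridge point estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)           # upper confidence bound
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Parallelizing this amounts to sharding the per-arm (A, b) statistics across workers and periodically merging them, which is the trade-off the repository analyzes.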

Connect

I am always open to research discussions, collaborations, and ideas around 3D vision, multimodal learning, and vision-language reasoning.
