Topic 2 — Computer Vision and Object Understanding
This topic covers how humanoid robots extract objects, people, and semantics from visual data. It starts with classical computer vision, transitions to deep learning–based detectors and segmenters, and culminates in a hands-on perception pipeline that runs in real time on RGB-D and LiDAR streams.
2.1 Classical Computer Vision
Before deep learning, perception systems relied on handcrafted features:
- Edge detectors (Sobel, Canny) to find object boundaries.
- Corner and feature detectors (Harris, FAST).
- Feature descriptors (SIFT, SURF, ORB) to encode local patches.
- Feature matching across images, used for:
  - Motion estimation.
  - Stereo matching.
  - Visual odometry.
Classical methods still matter because:
- They are fast and lightweight, useful on constrained hardware.
- They are more interpretable and tunable.
- Many SLAM systems (e.g., ORB-SLAM) still rely on keypoints and descriptors.
Typical pipeline:
- Detect keypoints in each image.
- Compute descriptors for each keypoint.
- Match descriptors between consecutive frames.
- Use matches to estimate camera motion and scene structure.
You will see these ideas again when we discuss SLAM in Topic 3.
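The matching step of the pipeline above can be sketched in a few lines. ORB descriptors are binary strings compared by Hamming distance; the toy 8-bit descriptors below stand in for real 256-bit ones, and Lowe's ratio test discards ambiguous matches:

```python
# Toy sketch of binary-descriptor matching (the step between descriptor
# extraction and motion estimation). Real ORB descriptors are 256-bit;
# these 8-bit values are illustrative only.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def match_descriptors(query, train, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    Keeps a (query_index, train_index) pair only when the best match is
    clearly better than the second-best, filtering ambiguous matches.
    """
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        (best, ti), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((qi, ti))
    return matches

frame_a = [0b10110010, 0b01001101, 0b11110000]
frame_b = [0b10110011, 0b00001111, 0b01001101]
matches = match_descriptors(frame_a, frame_b)
```

The surviving matches would then feed a geometric estimator (e.g., essential-matrix estimation with RANSAC) to recover camera motion.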
2.2 Deep Vision Models
Deep learning has transformed computer vision by learning features directly from data.
Convolutional Neural Networks (CNNs)
CNNs apply learned filters across images to detect patterns:
- Early layers detect edges and textures.
- Deeper layers detect parts and objects.
- Used for:
  - Image classification.
  - Object detection.
  - Semantic segmentation.
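The core operation in a CNN layer is a sliding-window filter. A minimal sketch, using a handcrafted vertical-edge kernel (a Sobel filter) where a CNN would instead learn many kernels from data:

```python
# Minimal sketch of the sliding-window filtering at the heart of a CNN
# layer. The 3x3 kernel here is handcrafted (a vertical-edge filter);
# a trained CNN learns many such kernels from data.

def conv2d(image, kernel):
    """Valid-mode 2D convolution (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw)
            )
    return out

# A bright vertical stripe on a dark background: the filter responds
# strongly (positive/negative) at its left and right edges.
img = [[1.0 if 2 <= c <= 5 else 0.0 for c in range(8)] for _ in range(8)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
response = conv2d(img, sobel_x)
```

Stacking such filtered maps, nonlinearities, and pooling is what lets deeper layers respond to parts and whole objects rather than raw edges.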
Vision Transformers (ViTs)
Vision Transformers treat images as sequences of patches:
- Use self-attention to relate all patches to each other.
- Capture long-range context effectively.
- Often combined with CNNs or used as backbones in modern detectors and segmenters.
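The first ViT step, turning an image into a patch sequence, can be sketched directly. A 2x2 patch on a tiny 4x4 single-channel "image" stands in for the 16x16 patches typical of real ViTs:

```python
# Sketch of the first ViT stage: splitting an image into fixed-size
# patches, each flattened into a vector that self-attention then treats
# as one token. Patch size 2 on a 4x4 image for illustration only.

def patchify(image, p):
    """Split an HxW image into flattened p x p patches (row-major order)."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patch = [image[i + u][j + v] for u in range(p) for v in range(p)]
            patches.append(patch)
    return patches

img = [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]]
tokens = patchify(img, 2)  # 4 patches, 4 values each
```

In a real ViT each flattened patch is then linearly projected to an embedding and given a positional encoding before entering the attention layers.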
Object Detection
Detectors take an image and produce bounding boxes + class labels:
- Single-stage models (e.g., YOLO-family) excel at real-time performance.
- Two-stage models (e.g., region-based detectors) can offer higher accuracy at higher compute cost.
For humanoid robots, detection is used to:
- Localize manipulable objects (mugs, tools, doors).
- Detect humans for interaction and safety.
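In practice the robot consumes only a filtered subset of raw detector output. A minimal sketch, where the detection format (class, score, box) is a hypothetical stand-in for whatever your detector emits:

```python
# Sketch: filtering raw detector output down to what the robot cares
# about. The detection dicts below are hypothetical examples of a
# detector's (class, score, bounding box) output.

DETECTIONS = [
    {"cls": "mug",    "score": 0.91, "box": (120, 80, 180, 150)},
    {"cls": "person", "score": 0.88, "box": (300, 40, 420, 400)},
    {"cls": "mug",    "score": 0.32, "box": (500, 90, 540, 130)},  # low confidence
    {"cls": "plant",  "score": 0.95, "box": (10, 10, 60, 90)},     # irrelevant class
]

MANIPULABLE = {"mug", "tool", "door"}   # objects the arms can act on
SAFETY = {"person"}                     # classes that gate robot motion

def filter_detections(dets, classes, min_score=0.5):
    """Keep detections of the wanted classes above a confidence floor."""
    return [d for d in dets if d["cls"] in classes and d["score"] >= min_score]

graspable = filter_detections(DETECTIONS, MANIPULABLE)
humans = filter_detections(DETECTIONS, SAFETY)
```

Keeping separate manipulation and safety streams lets the safety check run with a lower threshold than grasping, where false positives are costlier.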
Semantic and Instance Segmentation
- Semantic segmentation: assign a class label to every pixel (e.g., floor, wall, table).
- Instance segmentation: distinguish individual object instances (e.g., mug A vs mug B).
Segmentation is useful for:
- Understanding traversable vs non-traversable areas.
- Manipulation planning (precise object extents).
- Scene understanding and affordance detection.
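A small sketch of the traversability use case: reducing a per-pixel semantic mask to a scalar the navigation stack can threshold. The class ids (0 = floor, 1 = wall, 2 = table) are hypothetical labels:

```python
# Sketch: turning a semantic segmentation mask into a traversability
# measure. Class ids (0 = floor, 1 = wall, 2 = table) are made-up labels
# standing in for your segmenter's class map.

FLOOR = 0

def traversable_fraction(mask):
    """Fraction of pixels labelled as floor in a per-pixel class mask."""
    total = sum(len(row) for row in mask)
    floor = sum(row.count(FLOOR) for row in mask)
    return floor / total

mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 2, 2],
    [0, 0, 0, 0],
]
frac = traversable_fraction(mask)  # 10 of 16 pixels are floor
```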
Depth Estimation and Monocular 3D
Depth estimation models infer distance from a single RGB image:
- Useful when no hardware depth sensor is available.
- Can fill in or densify sparse readings from hardware depth sensors.
Monocular 3D reconstruction goes further by estimating:
- 3D structure of scenes.
- Coarse point clouds or meshes from single or multiple images.
Human Pose Estimation
Human pose estimators detect key human joints (e.g., shoulders, elbows, knees):
- Used for:
  - Gesture recognition (waving, pointing).
  - Safety (detecting awkward or dangerous proximity).
  - Human-robot interaction (mirroring or responding to user posture).
For humanoid robots, pose estimation enables:
- Interpreting commands like "wave back at the human" or "follow that person."
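Gesture recognition on pose output can start as simple geometry. A sketch of a wave check on 2D keypoints, where the joint names and image coordinates (y grows downward) are assumptions about your pose estimator's output format:

```python
# Sketch: a geometric gesture heuristic on 2D pose keypoints. The joint
# names and (x, y) image coordinates (y increasing downward) are
# assumptions about the pose estimator's output format.

def is_waving(kp):
    """Heuristic: a wrist raised above its shoulder suggests a wave."""
    return (kp["right_wrist"][1] < kp["right_shoulder"][1]
            or kp["left_wrist"][1] < kp["left_shoulder"][1])

pose = {
    "right_shoulder": (210, 180), "right_wrist": (250, 120),  # raised
    "left_shoulder":  (150, 180), "left_wrist":  (140, 260),  # lowered
}
waving = is_waving(pose)
```

Real systems add temporal filtering (a wave oscillates over several frames), but the per-frame check looks like this.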
2.3 Building a Real-Time Vision Pipeline
A practical perception pipeline for your humanoid will:
- Capture frames from RGB-D camera (and optionally LiDAR).
- Run object detection on the RGB stream.
- Run segmentation to get pixel-level masks.
- Associate detections with depth or point cloud data:
  - Estimate 3D positions of objects.
  - Compute approximate size and orientation.
- Publish results as ROS 2 topics for other nodes to consume.
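The depth-association step above can be sketched without any ROS machinery: given a 2D bounding box and an aligned depth image, take the median valid depth inside the box. The median is more robust than the mean to background pixels and depth holes:

```python
# Sketch: associating a 2D detection with depth data by taking the median
# valid depth inside its bounding box. Depth values are in metres; 0.0
# marks a missing reading, as is common for RGB-D sensors.

def box_depth(depth, box):
    """Median valid depth inside box = (x0, y0, x1, y1), or None."""
    x0, y0, x1, y1 = box
    values = sorted(
        depth[r][c]
        for r in range(y0, y1)
        for c in range(x0, x1)
        if depth[r][c] > 0.0  # skip holes
    )
    if not values:
        return None
    return values[len(values) // 2]

# 4x4 depth patch: an object at ~0.8 m, background at 2.0 m, one hole.
depth = [
    [2.0, 2.0, 2.0, 2.0],
    [2.0, 0.8, 0.8, 2.0],
    [2.0, 0.8, 0.0, 2.0],
    [2.0, 0.8, 0.8, 2.0],
]
d = box_depth(depth, (1, 1, 3, 4))
```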
Example ROS 2 Topics
- /camera/rgb/image_raw — Raw RGB images.
- /camera/depth/image_raw — Raw depth images.
- /perception/detections — List of detected objects (class, 2D box, score).
- /perception/segments — Masks or polygon outlines.
- /perception/objects_3d — Estimated 3D positions and extents.
Downstream nodes (planners, controllers, VLM interfaces) should not need to know how detections are produced—only how to consume them.
2.4 Lab A: Build a Real-Time Perception System
This lab turns the concepts above into a working pipeline.
Objectives
- Integrate camera, depth, and (optionally) LiDAR streams.
- Run object detection and segmentation in real time.
- Publish a clean world state into ROS 2 topics.
Tasks
- Sensor Integration
  - Subscribe to RGB-D streams from:
    - RealSense (hardware), or
    - Simulated sensors in Gazebo/Isaac Sim (from Chapter 3).
  - (Optional) Subscribe to LiDAR point clouds.
- Object Detection
  - Run a pretrained detector on RGB images.
  - Filter detections by confidence and class.
- Segmentation
  - Run semantic or instance segmentation.
  - Associate masks with bounding boxes.
- 3D Projection
  - For each detection, use depth or LiDAR to:
    - Estimate the 3D position (x, y, z) in the camera frame.
    - Transform it into the robot/world frame using TF.
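The 3D projection step is a pinhole back-projection followed by a rigid transform. A sketch in plain Python, where the intrinsics (fx, fy, cx, cy) and the 4x4 camera-to-robot transform are made-up values; on a real robot they come from the camera_info topic and TF:

```python
# Sketch of the 3D projection step: back-project a pixel + depth through
# a pinhole camera model, then apply a 4x4 camera-to-robot transform.
# Intrinsics and the transform below are illustrative values only.

def deproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) at a given depth -> (x, y, z) in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def transform(point, T):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = point
    return tuple(
        T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3)
    )

# The principal-point pixel maps to a point straight ahead of the camera.
p_cam = deproject(u=320, v=240, depth=1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Hypothetical mount: camera 1.2 m up, optical z mapped to robot x forward.
T_cam_to_robot = [
    [0.0,  0.0, 1.0, 0.0],   # robot x = camera z
    [-1.0, 0.0, 0.0, 0.0],   # robot y = -camera x
    [0.0, -1.0, 0.0, 1.2],   # robot z = -camera y + mount height
    [0.0,  0.0, 0.0, 1.0],
]
p_robot = transform(p_cam, T_cam_to_robot)
```

In ROS 2 you would not hand-write the transform: tf2 looks it up by frame name at the image timestamp.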
- ROS 2 World State Publisher
  - Aggregate detections into a message (e.g., a custom DetectedObjectArray).
  - Publish at a fixed rate (e.g., 5–10 Hz).
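Before writing a .msg file, it can help to prototype the world-state layout as plain dataclasses. The DetectedObject and DetectedObjectArray fields below are one hypothetical choice of what the custom message might carry:

```python
# Sketch of the world-state message before it becomes a real ROS 2
# custom message. These dataclasses mirror the hypothetical fields a
# DetectedObjectArray.msg definition might carry.

from dataclasses import dataclass, field

@dataclass
class DetectedObject:
    cls: str            # semantic class, e.g. "mug"
    score: float        # detector confidence
    position: tuple     # (x, y, z) in the world frame, metres
    size: tuple         # (w, h, d) approximate extents, metres

@dataclass
class DetectedObjectArray:
    stamp: float                      # acquisition time of the source frame
    frame_id: str = "map"             # frame the positions are expressed in
    objects: list = field(default_factory=list)

msg = DetectedObjectArray(stamp=12.5)
msg.objects.append(
    DetectedObject("mug", 0.91, (1.5, 0.0, 1.2), (0.10, 0.10, 0.12))
)
```

Stamping the message with the source frame's acquisition time (not publish time) is what lets downstream consumers interpolate TF correctly.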
Deliverables
- ROS 2 node(s) implementing the perception pipeline.
- Sample recordings (rosbags) with:
- Input sensors.
- Perception outputs.
- Short report:
- What objects are reliably detected?
- What failure modes did you observe (lighting, occlusion, distance)?
This lab forms the front end of your autonomous humanoid: everything downstream will depend on these perception outputs.
2.5 From Perception to Structured Representations
To support later modules, it helps to convert raw detections into more structured forms:
- Object list: each with class, 3D pose, size, and confidence.
- Semantic map overlays: occupancy grid with semantic labels.
- Scene graphs: nodes are objects, edges represent relationships (e.g., "on top of", "near").
These structures will be:
- Consumed directly by planning (Topic 5).
- Queried by Vision-Language Models (Topic 4) to answer questions like:
  - "Where is the nearest chair?"
  - "Is there a clear path to the door?"
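The scene-graph idea can be sketched from the 3D object list alone. Here "on top of" and "near" are simple geometric heuristics with made-up thresholds; real systems may use learned relation predictors instead:

```python
# Sketch: building a tiny scene graph from a 3D object list. The "near"
# and "on top of" rules are geometric heuristics with illustrative
# thresholds, not a learned relation model.

import math

def build_scene_graph(objects, near_dist=1.0):
    """objects: {name: (x, y, z)}. Returns (subject, relation, object) edges."""
    edges = []
    names = sorted(objects)
    for a in names:
        for b in names:
            if a == b:
                continue
            ax, ay, az = objects[a]
            bx, by, bz = objects[b]
            # Vertically stacked and horizontally aligned -> "on top of".
            if abs(ax - bx) < 0.2 and abs(ay - by) < 0.2 and az > bz:
                edges.append((a, "on top of", b))
            # Otherwise, close objects are "near" (emitted once per pair).
            elif a < b and math.dist(objects[a], objects[b]) < near_dist:
                edges.append((a, "near", b))
    return edges

scene = {
    "mug":   (1.5, 0.0, 0.85),
    "table": (1.5, 0.0, 0.40),
    "chair": (2.0, 0.3, 0.45),
}
edges = build_scene_graph(scene)
```

Edges like ("mug", "on top of", "table") are exactly the kind of symbolic fact a planner can precondition on, or a VLM can be prompted with.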
Design your perception outputs now with these future consumers in mind.