Physical AI & Humanoid Robotics

Topic 4 — Multimodal AI for Reasoning and Action

Perception systems convert sensor data into structured world representations, but your humanoid also needs to reason about those representations and connect them to natural language. This topic introduces multimodal AI, Vision-Language Models (VLMs), and perception-language grounding for task-driven behavior.


4.1 Perception-Language Grounding

Humans communicate tasks using language:

  • "Pick up the red mug near the laptop."
  • "Find the person by the window and wave."
  • "Is the path to the door clear?"

For a robot to follow such instructions, it must:

  1. Parse the language:
    • Understand entities (mug, laptop, person, door).
    • Understand relationships (near, by, in front of).
  2. Link entities to perception:
    • Match "red mug" to an actual detection in the scene.
    • Determine which person is "by the window."
  3. Translate into goals and constraints:
    • Produce target poses and high-level actions (reach, grasp, navigate).

This linking process is called perception-language grounding.
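As a minimal sketch of steps 2 and 3, grounding a phrase like "red mug near the laptop" can be reduced to matching attributes against detections and picking the candidate closest to the anchor object. The `Detection` type, its fields, and the 2-D robot-frame coordinates here are illustrative assumptions, not a fixed interface:

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Illustrative detection record: name, color, and (x, y) position
    in metres in the robot's base frame."""
    name: str
    color: str
    position: tuple


def ground_instruction(target_name, target_color, anchor_name, detections):
    """Ground '<color> <target> near the <anchor>' against a detection list:
    filter candidates by name and color, then return the one closest to
    the first matching anchor object. Returns None if nothing matches."""
    anchors = [d for d in detections if d.name == anchor_name]
    candidates = [d for d in detections
                  if d.name == target_name and d.color == target_color]
    if not anchors or not candidates:
        return None
    anchor = anchors[0]
    return min(candidates,
               key=lambda d: math.dist(d.position, anchor.position))
```

In a real stack the spatial relation ("near", "left of") would be parsed from the instruction rather than hard-coded as nearest-neighbor, but the shape of the problem is the same: language slots on one side, perceived entities on the other, and a scoring rule connecting them.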


4.2 Vision-Language Models (VLMs)

Vision-Language Models extend language models with visual inputs:

  • Accept images (or video frames) plus text prompts.
  • Produce:
    • Descriptions (captions).
    • Answers to questions (VQA).
    • Structured outputs (lists of objects, relationships, or affordances).

Examples of capabilities:

  • Scene captioning:
    • "A person sitting on a chair in front of a desk with a laptop and a mug."
  • Visual question answering:
    • Q: "Where is the red mug?"
      A: "On the table to the right of the laptop."
  • Affordance detection:
    • Highlighting regions that are "graspable," "walkable," or "sittable."

For humanoids, VLMs can:

  • Provide semantic context that low-level detectors do not.
  • Help interpret ambiguous or underspecified instructions.
  • Act as a bridge between human operators and the robot’s perception stack.
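One practical way to obtain the structured outputs mentioned above is to prompt the VLM to answer in JSON and parse its reply defensively, since model outputs are not guaranteed to follow the requested format. The schema below (`name`, `color`, `location` keys) is an assumption for illustration, not a standard VLM output format:

```python
import json


def parse_vlm_objects(response_text):
    """Parse a VLM reply into a list of object dicts.

    Assumes the VLM was prompted to answer with a JSON array such as:
    [{"name": "mug", "color": "red", "location": "on the table"}]

    Malformed replies yield an empty list so downstream planning code
    never sees unvalidated free text.
    """
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    objects = []
    for item in data if isinstance(data, list) else []:
        if isinstance(item, dict) and "name" in item:
            objects.append({
                "name": item["name"],
                "color": item.get("color", "unknown"),
                "location": item.get("location", "unknown"),
            })
    return objects
```

Validating at this boundary keeps VLM unreliability contained: the rest of the pipeline only ever consumes well-formed object records.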

4.3 Pipeline: Camera → VLM → Planner → Actuator

A typical multimodal pipeline looks like:

  1. Camera / Perception Frontend
    • Capture RGB image (and optionally depth).
    • Optionally overlay detection/segmentation results.
  2. VLM Interface
    • Send image plus a prompt, such as:
      • "List all objects on the table."
      • "Which object is closest to the door?"
      • "Where is the nearest free space to place a box?"
    • Parse the VLM response into a structured format:
      • Objects with attributes (name, color, approximate location).
      • Relations (left of, on top of, near).
  3. Planner
    • Convert structured perception into:
      • Goal poses.
      • Waypoints.
      • Manipulation tasks.
  4. Controller
    • Execute trajectories generated by the planner.
    • Use low-level feedback loops (from Chapter 2 and future chapters) to track commands.

The key design decision is the interface between VLM outputs and your robot’s planning/control stack. Often this takes the form of:

  • A scene graph topic in ROS 2.
  • A high-level task message (e.g., "pick object with id X at pose Y").
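A minimal sketch of such a high-level task message, assuming a quick JSON-over-`std_msgs/String` prototype rather than a custom ROS 2 interface (the field names are illustrative, not a standard message definition):

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PickTask:
    """High-level task: 'pick object with id X at pose Y'.
    Pose is (x, y, z) in metres in the robot base frame."""
    object_id: str
    pose_xyz: tuple
    action: str = field(default="pick")


def to_task_json(task):
    """Serialize a task for publishing on a planner command topic
    carrying a std_msgs/String payload."""
    return json.dumps(asdict(task))
```

For anything beyond a prototype you would define a proper `.msg` interface so the planner gets typed fields and compile-time checks, but JSON over a string topic is a common way to iterate on the VLM-to-planner contract before freezing it.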

4.4 Lab C: Vision-Language Task Execution

This lab explores end-to-end integration of perception and language.

Objectives

  • Build a system where natural language instructions trigger perception-driven actions.
  • Enable the robot to answer questions about its environment.

Tasks

  1. Scene Capture
    • Capture periodic RGB images from your humanoid’s camera.
    • Optionally include detection overlays to stabilize VLM responses.
  2. VLM Query Node
    • Implement a ROS 2 node that:
      • Subscribes to images.
      • Sends images plus prompts to a VLM API or local model.
      • Receives text responses and parses them into structured data.
  3. Instruction Handling
    • Support a small set of instructions, such as:
      • "Locate the human and wave."
      • "Point to the nearest chair."
      • "Describe what is on the table."
    • Map each instruction to:
      • Specific VLM prompts.
      • A simple action script (e.g., move arm to pointing pose).
  4. ROS 2 Integration
    • Publish:
      • Parsed scene information on a topic (e.g., /perception/scene_graph).
      • High-level tasks on a planner command topic.
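The instruction handling in Task 3 can start as a simple lookup table from normalized instruction text to a VLM prompt and an action script. The prompts and action names below are placeholders for the lab, not a fixed API:

```python
# Map the supported instruction set to VLM prompts and action scripts.
# Both the prompt wording and the action names are illustrative.
INSTRUCTION_TABLE = {
    "locate the human and wave": {
        "prompt": "Is there a person in this image? If so, where?",
        "action": "wave_at_human",
    },
    "point to the nearest chair": {
        "prompt": "List any chairs visible and their rough positions.",
        "action": "point_at_target",
    },
    "describe what is on the table": {
        "prompt": "Describe the objects on the table.",
        "action": "speak_description",
    },
}


def handle_instruction(text):
    """Normalize an instruction and look up its prompt/action pair.
    Returns None for unsupported instructions so the caller can ask
    the operator to rephrase instead of guessing."""
    return INSTRUCTION_TABLE.get(text.strip().lower().rstrip("."))
```

A closed instruction set like this is deliberately rigid: it makes failure modes explicit (unknown instruction returns None) and keeps the mapping from language to robot action auditable, which matters more in this lab than open-ended coverage.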

Deliverables

  • ROS 2 node(s) implementing VLM queries and instruction handling.
  • Demonstration sequences (simulation or hardware) of:
    • Question answering (e.g., "What objects do you see?").
    • Simple instruction execution (e.g., waving at a detected human).
  • Short report describing:
    • Prompts used.
    • Common failure modes (ambiguous language, occlusions, detection errors).

4.5 Design Considerations and Limitations

When integrating VLMs into a humanoid system:

  • Latency:
    • VLM calls can be slow compared to control loops.
    • Use them for high-level reasoning, not time-critical reflexes.
  • Reliability:
    • VLM outputs may be approximate or occasionally wrong.
    • Cross-check with traditional detectors when safety is at stake.
  • Privacy and safety:
    • Be mindful of what visual data is sent to external APIs.
    • For sensitive environments, consider on-premise or local models.
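One way to respect the latency constraint is to run VLM queries off the control thread with a timeout and a fallback, so a slow or hung call never blocks time-critical loops. This sketch uses Python's standard `concurrent.futures`; the query function is whatever API or local-model client you actually use:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class VLMAdvisor:
    """Wrap a (possibly slow) VLM query function so callers get an
    answer within a deadline or a safe fallback value."""

    def __init__(self, query_fn, timeout_s=2.0):
        # query_fn(image, prompt) -> str is supplied by the caller.
        self._query_fn = query_fn
        self._timeout_s = timeout_s
        self._pool = ThreadPoolExecutor(max_workers=1)

    def ask(self, image, prompt, fallback=None):
        """Submit the query on a worker thread; if it misses the
        deadline, return the fallback instead of stalling."""
        future = self._pool.submit(self._query_fn, image, prompt)
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            future.cancel()
            return fallback
```

Structuring the VLM as an "advisor" with a deadline matches the guidance above: control loops keep running on traditional perception, and semantic answers are folded in only when they arrive in time.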

The goal is to use VLMs as powerful, but not omniscient, advisors that augment your perception stack with semantic understanding and flexible language interfaces.