A multimodal model learns to handle several kinds of inputs — text, image, audio, sometimes actions — in a **shared representation space**, so that a caption and the photo it describes end up "in the same place." This is the foundation of assistants that "see," and ultimately it is also what lets a robot turn "put away the blue cup" into an arm motion. This article follows that thread: how modalities are fused (CLIP, ViT), how vision is wired into an LLM (projector, cross-attention, native), how they are trained and evaluated, and then how **Vision-Language-Action models (VLA)** extend all of this to embodied action.

## Fusing modalities: the core idea

The basic problem is **heterogeneity**: an image is a grid of pixels, text a sequence of discrete symbols. Fusing them means projecting each modality into one vector space where **geometric proximity reflects semantic proximity**, regardless of the original format. Once that bridge exists, an image encoder and a text encoder can "talk to each other": you query in one modality and retrieve in the other.

Two families coexist, and a third builds on the second:

- **Dual-encoder** models (CLIP-style) keep two separate towers and only align the final vectors. They are well suited to **retrieval** (text ⇄ image) and **zero-shot classification**, but they *generate* nothing: they only measure proximity.
- **Deep-fusion** models (generative VLMs) inject visual tokens *into* an LLM, which reasons jointly over text and image to **produce** language: describe, answer, reason.
- **VLAs** extend this second family all the way to producing **actions** rather than (or in addition to) text.

We speak of **early fusion** when modalities are mixed at the network's input, and **late fusion** when only separately computed representations are aligned. CLIP is the archetype of late fusion; generative VLMs lean toward early fusion, because the LLM must see image and text together to reason. Understanding this continuum helps choose the architecture per task: retrieve, generate, or act.

Beyond image and text, the same principle extends to other modalities:

- **Audio** (speech, environmental sounds), which can itself be aligned to text with a contrastive objective.
- **Video**, treated as a sequence of images with a temporal dimension.
- **Robot sensors** (depth, proprioception, forces) in VLAs.

The pattern is always the same: one encoder per modality, alignment toward a common space, then joint reasoning. It is this uniformity that makes the approach so fertile.

## CLIP: contrastive image-text alignment

CLIP (Contrastive Language-Image Pre-training, OpenAI, 2021) is the founding model of image-text fusion. Its architecture is a **dual encoder**: an image encoder (ViT or ResNet) and a text encoder (Transformer) each project their input into a **shared embedding space** where images and texts that are close in meaning end up side by side.

![Diagram of CLIP: two encoders project image and text into a shared space, and a similarity matrix maximizes the diagonal (correct pairs).](/articles/modeles-multimodaux-et-vision-language-action/clip-contrastive.svg)
*Figure: CLIP's contrastive pre-training. Original diagram.*

Training is **contrastive**. Over a batch of N (image, caption) pairs, you compute an N×N matrix of cosine similarities: the **diagonal** holds the correct pairs, the rest are negatives. The objective (a symmetric InfoNCE loss, image→text and text→image) **pulls** each image toward ITS caption and **pushes** all the others apart. CLIP was trained on ~400 million image-text pairs scraped from the web, with gigantic batches (the similarity matrix reaches 32,768 × 32,768).

A remarkable consequence: a **zero-shot classifier**. To classify an image among categories, you encode each category as a sentence ("a photo of a dog"), encode the image, and pick the highest-similarity caption — with no retraining at all. This ability made CLIP the visual building block of countless later systems (image generation, retrieval, and… VLMs).
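
As a minimal sketch of the zero-shot recipe just described, the snippet below picks the prompt closest to an image by cosine similarity. The 3-d vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(prompt_embs, key=lambda prompt: cosine(image_emb, prompt_embs[prompt]))

# Toy 3-d embeddings standing in for real text-encoder outputs (made-up values).
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of an airplane": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]  # pretend this came from the image encoder

predicted = zero_shot_classify(image_emb, prompt_embs)
print(predicted)
```

Adding a category only requires encoding one more sentence; nothing is retrained.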

A few often-misunderstood points about CLIP:

- **Normalization matters.** Embeddings are projected onto the unit sphere before computing similarity; dot product and cosine then coincide, and a learned **temperature** parameter controls how "sharp" the distribution is.
- **The contrast is symmetric.** The loss sums two terms: find the right caption for each image *and* the right image for each caption. That is what makes the space reversible.
- **CLIP is not generative.** It does not describe an image: it **measures** how well an image and a text go together. Generation comes from VLMs, which nonetheless often reuse CLIP's vision encoder.
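
To illustrate the role of the temperature mentioned above, here is a small sketch that applies a softmax to the same cosine similarities with two different scales. The similarity values and the two scales are illustrative assumptions; the point is only that cosines live in [-1, 1], so without a large scale the distribution stays almost flat.

```python
import math

def softmax(scores, scale):
    """Softmax over `scores` multiplied by `scale` (the inverse temperature)."""
    scaled = [s * scale for s in scores]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

cosines = [0.30, 0.25, 0.10]   # made-up similarities of one image to three captions

soft = softmax(cosines, scale=1.0)     # high temperature: nearly uniform
sharp = softmax(cosines, scale=100.0)  # low temperature: the best match dominates
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sharp])
```

With a scale of 1 the best caption gets barely more than a third of the probability mass; with a scale of 100 it takes almost all of it, which is what gives the contrastive loss a usable training signal.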

To make it concrete, the heart of the contrastive loss fits in a few lines:

```python
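# Sketch (PyTorch): CLIP's symmetric contrastive loss for a batch of N pairs
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, logit_scale):
    # img_feats, txt_feats: (N, d) encoder outputs; logit_scale: learned scalar tensor, log(1 / temperature)
    img_emb = F.normalize(img_feats, dim=-1)             # (N, d) on the unit sphere
    txt_emb = F.normalize(txt_feats, dim=-1)             # (N, d)
    logits = (img_emb @ txt_emb.T) * logit_scale.exp()   # N×N scaled cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)  # correct pair on the diagonal
    loss_i = F.cross_entropy(logits, labels)             # image → text
    loss_t = F.cross_entropy(logits.T, labels)           # text → image
    return (loss_i + loss_t) / 2                         # symmetric objective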
```

## Why a shared space "works"

Why is it reasonable to hope that an image and its description end up in the same place? Because meaning is **relational**. In a learned space, what matters is not the absolute value of an axis but the **relative position**: "cat" and "dog" are close to each other and far from "airplane," whether you start from a photo or a word. Contrastive training carves exactly this geometry: it **pulls** matching pairs together and **pushes** non-matching ones apart, until the **structure** of the space is consistent across modalities.

Two properties follow, valuable for what comes next:

- **Compositionality.** Related concepts share common directions; you can combine "cup" and "blue" in a partly additive way, which helps generalize to combinations never seen.
- **Transferability.** An encoder that has learned this space over hundreds of millions of web pairs carries reusable visual "common sense" — that is the capital a VLM, then a VLA, will recycle instead of relearning everything.

This is the deep reason why we **reuse** pre-trained encoders (CLIP, SigLIP, DINOv2) rather than train new ones: learning these representations is expensive, so it pays to capitalize on the work already done.

## Vision encoders: from patch to token

On the image side, the dominant tool is the **Vision Transformer (ViT)**. The key idea is simple and powerful: cut the image into **patches** (e.g. 14×14 or 16×16 pixels), flatten each patch, project it linearly into a vector — and treat that sequence of vectors **exactly like a sequence of tokens** for a Transformer. A patch becomes a "visual word."

A useful order of magnitude: a 224×224 image in 14×14 patches yields 16×16 = **256 visual tokens**; moving to 336×336 raises that to 576. This simple arithmetic explains why resolution is so costly downstream — every extra token weighs on the LLM's whole attention.
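
The arithmetic above can be checked with a short sketch. `visual_token_count` is a hypothetical helper that assumes a square image, non-overlapping patches, and no extra class token.

```python
def visual_token_count(image_size, patch_size):
    """Number of ViT patch tokens for a square image (no class token counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

tokens_224 = visual_token_count(224, 14)   # 16 x 16 patches
tokens_336 = visual_token_count(336, 14)   # 24 x 24 patches

# Self-attention compares every token with every other one, so the pairwise
# cost over the visual tokens alone grows with the square of their count.
pairwise_growth = (tokens_336 / tokens_224) ** 2
print(tokens_224, tokens_336, pairwise_growth)
```

Going from 224 to 336 pixels multiplies the token count by 2.25 and the token-to-token comparisons among visual tokens by about 5.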

The ViT then outputs a grid of vectors (one per patch) carrying local **spatial** information. That is precisely what a VLM needs: LLaVA, for instance, takes the *grid tokens* from the penultimate layer of CLIP ViT-L/14 to preserve fine detail rather than settling for a single global vector. More patches = more visual tokens = more detail, but also more compute in the downstream LLM, which is a central trade-off of modern VLMs.

Not all vision encoders are equal, and the choice strongly shapes the final VLM:

- **CLIP ViT** brings strong **semantic** alignment (it "knows" this object is a dog), thanks to its contrastive text-image pre-training.
- **SigLIP** replaces CLIP's softmax loss with a pairwise **sigmoid** loss, more stable at scale and often better at equal budget.
- **DINOv2** (self-supervised, text-free) excels at **geometry** and fine spatial detail, but with no linguistic grounding.

Hence a strong trend: **fusing two encoders** (e.g. SigLIP + DINOv2) to combine semantics and spatial precision — which is exactly OpenVLA's choice, as we will see.
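
To make the SigLIP bullet above concrete, here is a minimal sketch of a pairwise sigmoid loss in plain Python. Each (image, text) pair is treated as an independent binary decision, so no softmax over the whole batch is needed. The similarity matrices, `scale`, and `bias` values are illustrative assumptions, not SigLIP's actual settings.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def sigmoid_pair_loss(sim, scale, bias):
    """Pairwise sigmoid loss over an N x N similarity matrix `sim`.

    Label +1 on the diagonal (matching pair), -1 everywhere else.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim[i][j] + bias))
    return -total / n

aligned = [[0.9, 0.1], [0.0, 0.8]]   # made-up cosines, correct pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.0]]  # same values with the pairs swapped

loss_aligned = sigmoid_pair_loss(aligned, scale=10.0, bias=-5.0)
loss_shuffled = sigmoid_pair_loss(shuffled, scale=10.0, bias=-5.0)
print(loss_aligned, loss_shuffled)
```

The loss is small when the diagonal dominates and large when the pairs are swapped. Because every term is independent, the computation can be split across devices without first gathering a full softmax normalization.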

## Wiring vision into an LLM: three schools

How do you make a purely textual LLM "see"? Three strategies, with increasingly deep fusion.

![From VLM to VLA: the image is split into patches then encoded by a ViT, projected into the LLM's space, fused with text tokens; the LLM outputs text (VLM) or action tokens executed by a robot (VLA).](/articles/modeles-multimodaux-et-vision-language-action/vlm-to-vla.svg)
*Figure: VLM → VLA architecture and training stages. Original diagram.*

**1. Projection (LLaVA).** The simplest and most widespread approach. A **projector** — a learned matrix in LLaVA 1.0, then a **two-layer MLP** with a GELU activation in LLaVA 1.5 — converts the ViT vectors into "image tokens" that live in the LLM's embedding space. You simply **prepend** them to the text tokens: the LLM then consumes a mixed sequence, with no change to its architecture. Lightweight and efficient, this is the default recipe of most open-source VLMs.

**2. Cross-attention (Flamingo).** DeepMind inserts **gated cross-attention blocks** between the LLM's layers. Queries (text) attend to keys/values (vision); a **zero-initialized** tanh gate guarantees that the model initially behaves exactly like the original LLM, then gradually opens the door to visual signals — which stabilizes training. This route elegantly handles multiple interleaved images and long sequences.

**3. Native multimodal (GPT-4o, Gemini, Qwen2-VL).** Rather than bolting vision on afterward, you train a single model from the start that processes image and text tokens **uniformly** through the same self-attention, with special tokens to demarcate the visual part (e.g. `<|vision_start|>`/`<|vision_end|>` in Qwen2-VL). GPT-4o and Gemini are described as natively multimodal; their exact details remain unpublished. This route is generally expected to generalize best, but it is also the most expensive to train.
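
The zero-initialized gate described for Flamingo can be sketched in a few lines. This is a toy illustration on plain lists, assuming the cross-attention output has already been computed; `gated_residual` and the numeric values are hypothetical.

```python
import math

def gated_residual(text_hidden, cross_attn_out, gate_param):
    """Add vision-conditioned features to the text stream through a tanh gate."""
    gate = math.tanh(gate_param)
    return [h + gate * c for h, c in zip(text_hidden, cross_attn_out)]

text_hidden = [0.5, -1.2, 0.3]      # made-up hidden state of one text token
cross_attn_out = [2.0, 0.7, -0.4]   # made-up output of the cross-attention block

at_init = gated_residual(text_hidden, cross_attn_out, gate_param=0.0)
after_training = gated_residual(text_hidden, cross_attn_out, gate_param=1.0)
print(at_init)
print(after_training)
```

With the gate parameter at zero, `tanh(0) = 0` and the hidden state passes through unchanged, so the model starts out as the original LLM. As training moves the parameter away from zero, visual information is mixed in progressively.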

How to choose? In practice:

- **Projection** is unbeatable on the simplicity/cost ratio: you reuse an existing ViT and LLM and train almost nothing at the start. That is why nearly all open-source VLMs adopt it.
- **Cross-attention** shines on **multi-image** interleaved sequences (documents, videos, image-rich dialogues) and keeps the LLM intact at the start thanks to zero-gating.
- **Native** offers the best fusion and generalization, but requires end-to-end multimodal pre-training — meaning resources only large labs can muster.

## Tokenizing an image

In the "projection" and "native" approaches, the image ends up represented as a **sequence of tokens** homogeneous with text. That is what makes VLMs so practical: the whole LLM pipeline (attention, auto-regressive generation, KV-cache) applies as is. The flip side is the **token budget**: a high-resolution image can cost hundreds, even thousands, of tokens. Hence reduction techniques — *pooling*, a Flamingo-style *resampler* (Perceiver Resampler compressing a variable grid into a small fixed number of tokens), adaptive-resolution tiling — to contain the cost without sacrificing useful detail.
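
As a minimal sketch of the simplest reduction technique mentioned above, the snippet below average-pools a grid of visual tokens over 2×2 windows. The grid contents are toy values and `average_pool_tokens` is a hypothetical helper; real systems pool tensors on the accelerator, but the token arithmetic is the same.

```python
def average_pool_tokens(grid, window=2):
    """Average non-overlapping window x window blocks of a 2-D grid of token vectors."""
    rows, cols = len(grid), len(grid[0])
    if rows % window or cols % window:
        raise ValueError("grid sides must be multiples of the window")
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, window):
        pooled_row = []
        for c in range(0, cols, window):
            block = [grid[r + dr][c + dc] for dr in range(window) for dc in range(window)]
            pooled_row.append([sum(vec[k] for vec in block) / len(block) for k in range(dim)])
        pooled.append(pooled_row)
    return pooled

# A 24 x 24 grid (576 tokens, as for a 336-pixel image with 14-pixel patches),
# filled with toy 4-d vectors whose values encode the row index.
grid = [[[float(r)] * 4 for _ in range(24)] for r in range(24)]
pooled = average_pool_tokens(grid, window=2)
tokens_before = len(grid) * len(grid[0])
tokens_after = len(pooled) * len(pooled[0])
print(tokens_before, tokens_after)
```

A 2×2 window divides the token count by four (576 to 144 here), at the price of blurring detail inside each window. That blurring is why learned resamplers or adaptive tiling are often preferred when fine detail matters.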

## Training a VLM: align, then instruct

LLaVA's training scheme has become canonical, in **two stages**.

- **Stage 1 — feature alignment.** You **freeze** the vision encoder AND the LLM, and train **only the projector** on image-caption pairs (595,000 pairs filtered from CC3M for LLaVA). The goal: teach the projector to map the visual space into the language space. This is fast and stable.
- **Stage 2 — visual instruction.** You **unfreeze** the LLM and the projector (the vision encoder often stays frozen) and train on higher-quality **visual dialogues** (158,000 synthetic conversations for LLaVA). The model learns to follow instructions, describe, reason, and answer questions about the image.

More recent models (Qwen2-VL) add a **third stage** and progressively unfreeze components, over massive volumes (on the order of 10^12 tokens). The principle is unchanged: **first align the modalities, then teach behavior.**

```python
# Pseudo-code: forward pass of a LLaVA-style VLM
patches = vit(image)               # ViT → grid of visual tokens (spatial)
img_tokens = projector(patches)    # MLP: vision space → LLM space
txt_tokens = embed(text)           # text tokens
seq = concat([img_tokens, txt_tokens])  # PREPEND the image to the text
logits = llm(seq)                  # the LLM reasons jointly
answer = decode(logits)            # auto-regressive text generation
```
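
The two-stage freezing schedule described above can be summarized as a small table in code. This is a sketch of the bookkeeping only; the stage and component names are hypothetical labels, not identifiers from the LLaVA codebase.

```python
# Which components receive gradients in each stage (True = trainable).
STAGES = {
    "stage1_feature_alignment": {"vision_encoder": False, "projector": True, "llm": False},
    "stage2_visual_instruction": {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable_components(stage):
    """List the components that are updated during the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)

stage1 = trainable_components("stage1_feature_alignment")
stage2 = trainable_components("stage2_visual_instruction")
print(stage1)
print(stage2)
```

In a framework such as PyTorch, this table would typically translate into toggling `requires_grad` on each module's parameters before building the optimizer.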

## Evaluating VLMs — and the hallucination problem

VLMs are evaluated on visual question answering (VQA), captioning, spatial reasoning, OCR, charts and documents. But the major risk is **object hallucination**: the model describes elements **absent** from the image because the language **prior** takes over (the LLM "completes" according to what is textually plausible, not what it sees). The usual causes are a frozen vision encoder that loses detail, an imbalance between visual and textual signal, and biased instruction data. The remedies are better encoders, higher resolution, *grounding* data (explicit object ↔ region anchoring), and dedicated benchmarks (POPE for object hallucination, MMMU for demanding multimodal reasoning). A word of caution: **a fluent VLM is not a reliable VLM**. You must measure grounding, not just prose quality.
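
To show what measuring object hallucination can look like in practice, here is a sketch of POPE-style scoring. It assumes the benchmark has been reduced to yes/no questions of the form "is there a given object in the image?". `pope_metrics` is a hypothetical helper and the answers are made up.

```python
def pope_metrics(predictions, labels):
    """Score yes/no answers to object-presence questions.

    predictions, labels: lists of booleans, True meaning "yes, the object is present".
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)
    fp = sum(p and not y for p, y in pairs)
    fn = sum((not p) and y for p, y in pairs)
    tn = sum((not p) and (not y) for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),
    }

# Made-up answers: the model says "yes" to everything, a typical hallucination pattern.
labels = [True, False, True, False]
predictions = [True, True, True, True]
scores = pope_metrics(predictions, labels)
print(scores)
```

A model that always answers "yes" reaches perfect recall while its precision and accuracy collapse. The ratio of "yes" answers is worth tracking for exactly this reason: it exposes a bias toward affirming objects that are not there.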

## From VLMs to VLAs: adding action

A **Vision-Language-Action model (VLA)** is a multimodal foundation model that integrates **vision, language AND actions**. From camera images and a natural-language instruction, it **produces commands executable** by a robot. The founding intuition: a VLA is a VLM **plus an action decoder**. Perception and reasoning are handled by the VLM (which encodes images + text into tokens in a shared latent space); a final stage converts these tokens into commands representing the robot's **degrees of freedom** (translations, rotations, gripper opening).

The stroke of genius that launched the field: **treating actions as tokens**. If you discretize a continuous action (e.g. "move 50 mm forward") into a discrete identifier, then generating a trajectory is exactly like **generating text** — same Transformer architecture, same auto-regressive objective, and above all **transfer** of knowledge acquired from billions of web images and texts to motor control.

Concretely, you **discretize** each action dimension (translation in x, y, z, rotation, gripper state) into a fixed number of **bins** (often 256), each mapped to a token. A control step then becomes a short sequence of action tokens, and the model learns to **generate** it as it would generate a sentence. At inference, you **de-tokenize** these identifiers back into continuous commands (typically end-effector displacements and a gripper state), which the low-level controller turns into motor commands. The elegance of the approach is that it **invents nothing** on the architecture side: it is a standard multimodal LLM whose vocabulary has merely been repurposed or extended to cover action tokens.
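
The discretization round trip can be sketched as follows. This is a minimal version that assumes uniform bins over a fixed range; the range, the sample value, and the helper names are hypothetical, and real systems may choose bin boundaries differently.

```python
N_BINS = 256

def tokenize(value, low, high, n_bins=N_BINS):
    """Map a continuous value in [low, high] to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)   # the upper bound itself falls in the last bin

def detokenize(index, low, high, n_bins=N_BINS):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width

# Hypothetical range: displacement along x, in metres per control step.
LOW, HIGH = -0.05, 0.05
token = tokenize(0.0123, LOW, HIGH)
recovered = detokenize(token, LOW, HIGH)
print(token, recovered)
```

The round trip is lossy by construction: the recovered value can differ from the original by up to half a bin width. This is the granularity limit that continuous-output models try to avoid.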

## An end-to-end example: from instruction to motion

Let's follow "put away the blue cup" through an action-token VLA:

1. The **camera** captures the scene; the **vision encoder** cuts it into patches and produces visual tokens.
2. The **projector** brings them into the LLM's space; they are fused with the instruction tokens.
3. The **LLM** reasons jointly: it "sees" the blue cup, understands the order, plans.
4. It **generates action tokens** (Δx, Δy, Δz, rotation, gripper) step after step.
5. These tokens are **de-tokenized** into continuous commands sent to the arm's controller.
6. The robot acts; the new image returns as input, and the loop starts again.

```text
image (patches) ─▶ ViT ─▶ projector ─┐
                                      ├─▶ LLM ─▶ action tokens ─▶ de-tokenize ─▶ robot
instruction ─────▶ text tokens ───────┘            ▲                                │
                                                    └──────── new image ◀────────────┘
```

The whole practical difficulty lies in this **closed loop**: each iteration must be fast enough for the motion to stay smooth, which brings us back to the latency challenge discussed below.
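
The loop structure above can be sketched with stand-ins for every component. `ToyArm` and `toy_policy` are hypothetical placeholders: a real system would pass camera images to a VLA and send the resulting commands to a hardware controller.

```python
class ToyArm:
    """Stand-in for a robot: a gripper moving along one axis toward a target."""

    def __init__(self, position, target):
        self.position = position
        self.target = target

    def observe(self):
        # A real system would return a camera image; here, the raw state.
        return {"position": self.position, "target": self.target}

    def apply(self, delta):
        self.position += delta

def toy_policy(observation, instruction, max_step=0.02):
    """Placeholder for the VLA: returns one clipped displacement per call."""
    error = observation["target"] - observation["position"]
    return max(-max_step, min(max_step, error))

def run_episode(arm, instruction, tolerance=1e-3, max_steps=100):
    """Observe, act, repeat until the target is reached or the step budget runs out."""
    for step in range(max_steps):
        observation = arm.observe()                      # perceive the current scene
        if abs(observation["target"] - observation["position"]) <= tolerance:
            return step
        action = toy_policy(observation, instruction)    # decide on the next action
        arm.apply(action)                                # act, then loop on the new state
    return max_steps

arm = ToyArm(position=0.0, target=0.1)
steps_taken = run_episode(arm, "put away the blue cup")
print(steps_taken, arm.position)
```

Each iteration re-observes before acting, which lets the policy correct for errors. In a real deployment, the time spent inside the policy call bounds the achievable control frequency.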

## RT-2: the pioneer

RT-2 (Google DeepMind, July 2023) established the VLA paradigm. It **co-trains** a VLM (on the PaLI-X and PaLM-E backbones) on both web data and robot data, representing the robot's **actions as tokens** alongside language. Concretely, a continuous action is **discretized** into a token ("move 50 mm" becomes, say, token #1247), reusing the model's vocabulary. The result: RT-2 generalizes better than its predecessor RT-1 to **new tasks** and can chain multi-step reasoning (chain-of-thought) leveraging knowledge inherited from the web — for example "pick the object used to hammer a nail."

The decisive contribution is not a new architecture but a **representation**: by writing the action in the same alphabet as language, RT-2 turns motor control into a special case of text generation, and inherits "for free" the reasoning learned on the web.

## OpenVLA: open source at 7 billion parameters

OpenVLA (Stanford and collaborators, June 2024) made the paradigm **open and reproducible**. It is a **7-billion-parameter** model trained on **~970,000 robot episodes** from the **Open X-Embodiment** dataset (covering 22 different robot embodiments). Its architecture:

- a **fused vision encoder** combining **SigLIP** and **DINOv2** (semantics + spatial detail);
- a **projector** mapping the visual embeddings into the language space;
- a **Llama-2 7B backbone** that **predicts** discrete **action tokens**, later de-tokenized into directly executable continuous commands.

OpenVLA **outperforms** earlier generalist policies (RT-1-X, Octo) and even **RT-2-X** (a 55-billion-parameter closed VLA) on many tasks, with RT-2-X keeping the edge on fine semantic generalization that requires web knowledge. On the engineering side, OpenVLA was trained on **64 A100 GPUs for 14 days**, and it **fine-tunes** efficiently: LoRA tunes only **1.4%** of the parameters while matching full fine-tuning, a decisive asset for adapting to a new robot or a new task.

For the practitioner, OpenVLA illustrates two concrete lessons: choosing a **fused encoder** (semantic + spatial) pays off for manipulation, and **quantization** plus LoRA make adaptation realistic on modest hardware, without a cluster.
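
To see why LoRA touches so few parameters, the sketch below counts the adapter parameters added to a single weight matrix. The matrix size and rank are illustrative assumptions, not a description of OpenVLA's exact configuration.

```python
def lora_fraction(d_out, d_in, rank):
    """Trainable-parameter ratio when a d_out x d_in weight gets a rank-`rank` adapter."""
    full = d_out * d_in                 # frozen original matrix W
    adapter = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return adapter / full

# Hypothetical 4096 x 4096 projection with rank 32 (illustrative values only).
fraction = lora_fraction(4096, 4096, 32)
print(f"{fraction:.2%}")
```

For this single matrix the adapter amounts to about 1.6% of the frozen weights, the same order of magnitude as the figure quoted above. The model-wide ratio depends on which layers receive adapters.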

## π0 (pi-zero): continuous output via flow-matching

π0 (Physical Intelligence, late 2024) takes the other path for **high-frequency dexterity**. Rather than discrete tokens, it builds on a VLM (**PaliGemma**, built from SigLIP and Gemma) and **directly generates continuous actions** via a **flow-matching** network (a cousin of diffusion models). This produces smooth trajectories at **~50 Hz**, better suited to robots with many degrees of freedom and to fine gestures (folding laundry, handling delicate objects). The trade-off shapes up as follows: **discrete tokens** = simplicity and direct LLM inheritance, but limited granularity; **continuous output (diffusion/flow)** = precision and high frequency, at the cost of a more complex decoder.

A key notion appears here: *action chunks*. Instead of predicting a single step, the model generates a short sequence of future actions **at once**, which smooths the command and amortizes the cost of one pass through the large VLM — a central trick for keeping up with real-time control.
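
The idea of generating a whole action chunk by integrating a velocity field can be sketched on a toy problem. It assumes a straight-line path from noise to data and a single known target chunk, so the velocity field has a closed form. A real model replaces `toy_velocity` with a network conditioned on images and the instruction. This is not π0's implementation.

```python
import random

def toy_velocity(x, t, target):
    """Velocity field whose flow carries any start point to `target` at t = 1.

    It stands in for the learned network; a real model predicts this from
    the observation and instruction instead of being handed the target.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_chunk(target, n_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t = 0) to an action chunk (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # one Euler step
    return x

# A made-up chunk of 5 future displacements along one axis.
target_chunk = [0.010, 0.012, 0.015, 0.012, 0.010]
chunk = sample_chunk(target_chunk)
print(chunk)
```

All actions of the chunk are refined together over a handful of integration steps, rather than being emitted one token at a time. This is what makes high control frequencies reachable.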

## Generalization and embodied reasoning

The promise of VLAs is **embodied generalization**: because the trunk is a VLM pre-trained on the web, the robot inherits **visual and linguistic common sense** that no amount of robot demonstration alone could provide. It can follow novel instructions, recognize objects never seen in demonstration, and reason about the task ("*first* put away the cup, *then* wipe the table"). The Open X-Embodiment dataset pushes further: training **one** policy on **multiple embodiments** to aim at a generic controller, transferable from one robot to another.

## Open challenges

- **Latency.** A multi-billion-parameter LLM in a real-time **control loop** is a challenge: generating tokens is slow, while control demands 10–50 Hz. Hence continuous outputs (π0), distillation, quantization, *action chunks* (predicting several steps at once).
- **Data.** Robot demonstrations are scarce and costly compared to the web's terabytes of text/image; **teleoperation** does not scale. Open X-Embodiment, simulation, and learning from human videos try to bridge the gap.
- **Safety and reliability.** A hallucination in a chatbot produces a false sentence; in a VLA, it produces a **motion**. Guardrails, out-of-distribution detection, emergency stops, and torque limits are indispensable.
- **Evaluation.** Measuring a robot policy in the real world is slow, noisy, and hard to reproduce, which is a real drag on progress.
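
The latency argument can be turned into back-of-the-envelope arithmetic. The sketch below assumes each inference pass costs a fixed backbone time plus a small decoding time per action; the timings are hypothetical values, not measurements of any model.

```python
def control_rate_hz(backbone_seconds, decode_seconds_per_action, chunk_size=1):
    """Average actions per second when one backbone pass yields `chunk_size` actions."""
    seconds_per_pass = backbone_seconds + chunk_size * decode_seconds_per_action
    return chunk_size / seconds_per_pass

# Hypothetical timings: 100 ms for the backbone pass, 2 ms of decoding per action.
single = control_rate_hz(0.100, 0.002, chunk_size=1)
chunked = control_rate_hz(0.100, 0.002, chunk_size=10)
print(round(single, 1), round(chunked, 1))
```

Under these assumptions, one action per pass stays just under 10 Hz, while a chunk of ten amortizes the backbone cost and exceeds 80 Hz on average. The caveat is that actions inside a chunk are executed without fresh observations, so longer chunks trade reactivity for speed.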

## Discrete or continuous: a recap

The choice of **action representation** structures the whole system:

- **Discrete tokens (RT-2, OpenVLA).** Pros: architecture identical to an LLM, direct inheritance of web pre-training, implementation simplicity, very efficient LoRA fine-tuning. Limit: granularity depends on the number of bins, and token-by-token generation can cap the control frequency.
- **Continuous output / flow-matching (π0).** Pros: smooth trajectories, high frequency (~50 Hz), good dexterity over many degrees of freedom. Limit: a more complex decoder (diffusion/flow), training and debugging that are less "classic-LLM."

There is no universal winner: the right choice depends on the **task** (coarse grasping vs fine gestures), the required **control frequency**, and **on-board compute** constraints.

## Beyond robotics

Multimodality is not limited to robot arms. The same building blocks power:

- **visual assistants** (describe a scene, read a chart, extract a table from a screenshot);
- **agents that drive an interface** (computer use) by "seeing" the screen and clicking;
- **multimodal search** and content moderation (text + image);
- **accessibility** (image descriptions for the visually impaired) and document analysis (OCR + reasoning).

The common thread is identical: project heterogeneous modalities into a common space, then let a Transformer reason — whether the output is a sentence, a click, or a motion.

## Data and scaling

One lesson runs through the whole field: **data is the real bottleneck.** On the VLM side, there are **billions** of image-text pairs on the web, which made CLIP and its successors possible. On the VLA side, we are talking about only **hundreds of thousands** of episodes (≈970,000 in OpenVLA's training mix, drawn from Open X-Embodiment), roughly three orders of magnitude fewer. Three avenues to close the gap:

- **Pool embodiments.** Open X-Embodiment aggregates data from 22 different robots to train a single policy; each embodiment benefits from the others.
- **Co-train on the web.** RT-2 mixes robot data with general image-text data, which injects "common sense" and improves semantic generalization.
- **Learn without teleoperation.** Simulation, human videos, and self-supervision aim to produce action signal without costly human demonstration.

The implicit bet of VLAs is that the **scaling** that transformed NLP, then vision, will eventually transform motor control — provided we first solve the scarcity of action data.

## A practical checklist: choosing and building

To turn this theory into decisions, a grid of questions to ask in order:

1. **Which output?** Retrieval/classification → a dual encoder (CLIP/SigLIP) is enough. Description, Q&A, reasoning → a VLM. Robot control → a VLA.
2. **Which vision encoder?** Favor the **semantic** (CLIP/SigLIP) to describe and reason; add the **spatial** (DINOv2) when geometry matters (manipulation, fine OCR).
3. **Which fusion mode?** **Projection** by default (simple, cheap); **cross-attention** for long multi-image sequences; **native** only with a substantial pre-training budget.
4. **Which resolution / token budget?** Raising the resolution improves detail but burdens the LLM; consider *pooling* or a *resampler* if the context explodes.
5. **Discrete or continuous (VLA)?** Discrete tokens for simplicity and web transfer; continuous output (flow/diffusion) for high frequency and fine dexterity.
6. **How to evaluate?** Measure hallucination and *grounding* (POPE) alongside multimodal reasoning (MMMU), not just fluency; for a VLA, plan a real-world evaluation, slow but unavoidable.

This sequence — output, encoder, fusion, resolution, action representation, evaluation — covers the bulk of the trade-offs of a multimodal project, from a search engine to a robot policy.

## In short

The trajectory is coherent: **CLIP** learns to align image and text in a shared space; the **ViT** turns an image into tokens; a **projector** (or cross-attention, or native training) wires that vision into an **LLM**; training proceeds by **aligning** then **instructing**; and **VLAs** push the idea all the way to producing **actions as tokens** (RT-2, OpenVLA) or continuous trajectories (π0). The same principle, projecting heterogeneous modalities into a common space and letting a Transformer reason over it, connects the image caption to the robot arm. The real-world obstacles remain latency, data, safety, and evaluation, and that is where the next generation of embodied AI will be decided.