A multimodal model learns to handle several kinds of inputs — text, image, audio, sometimes actions — in a shared representation space, so that a caption and the photo it describes end up "in the same place." This is the foundation of assistants that "see," and ultimately it is also what lets a robot turn "put away the blue cup" into an arm motion. This article follows that thread: how modalities are fused (CLIP, ViT), how vision is wired into an LLM (projector, cross-attention, native), how they are trained and evaluated, and then how Vision-Language-Action models (VLA) extend all of this to embodied action.
Fusing modalities: the core idea
The basic problem is heterogeneity: an image is a grid of pixels, text a sequence of discrete symbols. Fusing them means projecting each modality into one vector space where geometric proximity reflects semantic proximity, regardless of the original format. Once that bridge exists, an image encoder and a text encoder can "talk to each other": you query in one modality and retrieve in the other.
Two families coexist, and a third builds on the second:
- Dual-encoder models (CLIP-style) keep two separate towers and only align the final vectors. They are perfect for retrieval (text ⇄ image) and zero-shot classification, but they generate nothing — they measure proximity.
- Deep-fusion models (generative VLMs) inject visual tokens into an LLM, which reasons jointly over text and image to produce language: describe, answer, reason.
- VLAs extend this second family all the way to producing actions rather than (or in addition to) text.
We speak of early fusion when modalities are mixed at the network's input, and late fusion when only separately computed representations are aligned. CLIP is the archetype of late fusion; generative VLMs lean toward early fusion, because the LLM must see image and text together to reason. Understanding this continuum helps choose the architecture per task: retrieve, generate, or act.
Beyond image and text, the same principle extends to other modalities:
- Audio (speech, environmental sounds), which can itself be aligned to text with a contrastive objective.
- Video, treated as a sequence of images with a temporal dimension.
- Robot sensors (depth, proprioception, forces) in VLAs.
The pattern is always the same: one encoder per modality, alignment toward a common space, then joint reasoning. It is this uniformity that makes the approach so fertile.
CLIP: contrastive image-text alignment
CLIP (Contrastive Language-Image Pre-training, OpenAI, 2021) is the founding model of image-text fusion. Its architecture is a dual encoder: an image encoder (ViT or ResNet) and a text encoder (Transformer) each project their input into a shared embedding space where images and texts that are close in meaning end up side by side.
Figure: CLIP's contrastive pre-training. Original diagram.
Training is contrastive. Over a batch of N (image, caption) pairs, you compute an N×N matrix of cosine similarities: the diagonal holds the correct pairs, the rest are negatives. The objective (a symmetric InfoNCE loss, image→text and text→image) pulls each image toward ITS caption and pushes all the others apart. CLIP was trained on ~400 million image-text pairs scraped from the web, with gigantic batches (the similarity matrix reaches 32,768 × 32,768).
A remarkable consequence: a zero-shot classifier. To classify an image among categories, you encode each category as a sentence ("a photo of a dog"), encode the image, and pick the highest-similarity caption — with no retraining at all. This ability made CLIP the visual building block of countless later systems (image generation, retrieval, and… VLMs).
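The zero-shot recipe described above can be sketched in a few lines of NumPy. The embeddings below are made-up stand-ins for the outputs of the image and text encoders, and `zero_shot_classify` is a hypothetical helper name, not part of any CLIP library.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Pick the class whose prompt embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = class_embs @ image_emb
    return class_names[int(np.argmax(sims))], sims

class_names = ["a photo of a dog", "a photo of a cat", "a photo of an airplane"]
# Toy stand-ins for text_encoder(prompt) and image_encoder(image); values are made up.
class_embs = np.array([[1.0, 0.2, 0.0],
                       [0.2, 1.0, 0.0],
                       [0.0, 0.1, 1.0]])
image_emb = np.array([0.9, 0.3, 0.1])

label, sims = zero_shot_classify(image_emb, class_embs, class_names)
print(label, sims.round(3))
```

Adding a category only means encoding one more prompt; nothing is retrained.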
A few often-misunderstood points about CLIP:
- Normalization matters. Embeddings are projected onto the unit sphere before computing similarity; dot product and cosine then coincide, and a learned temperature parameter controls how "sharp" the distribution is.
- The contrast is symmetric. The loss sums two terms: find the right caption for each image and the right image for each caption. That is what makes the space reversible.
- CLIP is not generative. It does not describe an image: it measures how well an image and a text go together. Generation comes from VLMs, which nonetheless often reuse CLIP's vision encoder.
To make it concrete, the heart of the contrastive loss fits in a few lines:
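# Sketch: CLIP's symmetric contrastive loss for a batch of N pairs (PyTorch)
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, log_scale):
    # image_features / text_features: encoder outputs, shape (N, d)
    # log_scale: learned scalar tensor; exp(log_scale) acts as the inverse temperature
    img_emb = F.normalize(image_features, dim=-1)                  # (N, d) on the unit sphere
    txt_emb = F.normalize(text_features, dim=-1)                   # (N, d)
    logits = (img_emb @ txt_emb.T) * log_scale.exp()               # N×N scaled similarities
    labels = torch.arange(logits.shape[0], device=logits.device)   # the correct pair is on the diagonal
    loss_i = F.cross_entropy(logits, labels)                       # image → text
    loss_t = F.cross_entropy(logits.T, labels)                     # text → image
    return (loss_i + loss_t) / 2                                   # symmetric objective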
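To see what the temperature does, here is a toy example: the same cosine similarities are passed through a softmax at two different scales. The similarity values and the two scales are illustrative assumptions, not numbers taken from CLIP's training.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Cosine similarities of one image against four captions (made-up values).
sims = np.array([0.30, 0.25, 0.10, 0.05])

soft = softmax(sims * 1.0)     # scale 1: nearly uniform distribution
sharp = softmax(sims * 100.0)  # scale 100: mass concentrates on the best match
print(soft.round(3), sharp.round(3))
```

Because cosine similarities live in [-1, 1], an unscaled softmax can barely separate candidates; a large learned scale is what lets the cross-entropy produce a confident ranking.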
Why a shared space "works"
Why is it reasonable to hope that an image and its description end up in the same place? Because meaning is relational. In a learned space, what matters is not the absolute value of an axis but the relative position: "cat" and "dog" are close to each other and far from "airplane," whether you start from a photo or a word. Contrastive training carves exactly this geometry: it pulls matching pairs together and pushes non-matching ones apart, until the structure of the space is consistent across modalities.
Two properties follow, valuable for what comes next:
- Compositionality. Related concepts share common directions; you can combine "cup" and "blue" in a partly additive way, which helps generalize to combinations never seen.
- Transferability. An encoder that has learned this space over hundreds of millions of web pairs carries reusable visual "common sense" — that is the capital a VLM, then a VLA, will recycle instead of relearning everything.
This is the deep reason why we reuse pre-trained encoders (CLIP, SigLIP, DINOv2) rather than train new ones: learning these representations is expensive, so it pays to capitalize on them.
Vision encoders: from patch to token
On the image side, the dominant tool is the Vision Transformer (ViT). The key idea is simple and powerful: cut the image into patches (e.g. 14×14 or 16×16 pixels), flatten each patch, project it linearly into a vector — and treat that sequence of vectors exactly like a sequence of tokens for a Transformer. A patch becomes a "visual word."
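The patch-to-token step can be illustrated with plain NumPy. This is a minimal sketch: it omits the class token and position embeddings that a real ViT adds, and `d_model` and the random projection matrix are hypothetical placeholders for learned weights.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, patch=16)             # 14 x 14 = 196 patches of 768 values

d_model = 64                                    # hypothetical embedding width
w_proj = rng.normal(size=(patches.shape[1], d_model)) * 0.02
tokens = patches @ w_proj                       # one "visual word" per patch
print(patches.shape, tokens.shape)
```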
A useful order of magnitude: a 224×224 image cut into 14×14-pixel patches yields 16×16 = 256 visual tokens; moving to 336×336 raises that to 24×24 = 576. This simple arithmetic explains why resolution is so costly downstream: every extra token lengthens the sequence the LLM attends over, and self-attention cost grows quadratically with sequence length.
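The arithmetic above is easy to reproduce. The sketch below counts only image tokens and the size of a full attention score matrix over them; it ignores text tokens and any implementation-level optimizations, so treat the ratio as an order of magnitude.

```python
def num_visual_tokens(image_size, patch_size):
    """Number of patches for a square image cut into square patches."""
    per_side = image_size // patch_size
    return per_side * per_side

def attention_pairs(num_tokens):
    """Entries in a full self-attention score matrix (per head, per layer)."""
    return num_tokens * num_tokens

low = num_visual_tokens(224, 14)    # 16 * 16 tokens
high = num_visual_tokens(336, 14)   # 24 * 24 tokens
growth = attention_pairs(high) / attention_pairs(low)
print(low, high, growth)
```

Going from 224 to 336 pixels multiplies the token count by 2.25 and the number of attention score entries by roughly 5.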
The ViT then outputs a grid of vectors (one per patch) carrying local spatial information. That is precisely what a VLM needs: LLaVA, for instance, takes the grid tokens from the penultimate layer of CLIP ViT-L/14 to preserve fine detail rather than settling for a single global vector. More patches = more visual tokens = more detail, but also more compute in the downstream LLM: a central trade-off of modern VLMs.
Not all vision encoders are equal, and the choice strongly shapes the final VLM:
- CLIP ViT brings strong semantic alignment (it "knows" this object is a dog), thanks to its contrastive text-image pre-training.
- SigLIP replaces CLIP's softmax loss with a pairwise sigmoid loss, more stable at scale and often better at equal budget.
- DINOv2 (self-supervised, text-free) excels at geometry and fine spatial detail, but with no linguistic grounding.
Hence a strong trend: fusing two encoders (e.g. SigLIP + DINOv2) to combine semantics and spatial precision — which is exactly OpenVLA's choice, as we will see.
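For readers curious about the pairwise sigmoid loss mentioned in the list above, here is a minimal NumPy sketch. It treats every (image, text) cell of the similarity matrix as an independent binary decision; the `scale` and `bias` values are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np

def sigmoid_pairwise_loss(img_emb, txt_emb, scale, bias):
    """Pairwise sigmoid loss over an N x N similarity matrix.

    Each (image, text) pair is an independent binary problem:
    label +1 on the diagonal (matching pairs), -1 elsewhere.
    """
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = scale * (img_emb @ txt_emb.T) + bias
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = sigmoid_pairwise_loss(emb, emb, scale=10.0, bias=-10.0)
shuffled = sigmoid_pairwise_loss(emb, emb[::-1], scale=10.0, bias=-10.0)
print(aligned, shuffled)
```

Unlike the softmax version, no normalization runs across the whole batch, which is one reason this formulation is easier to scale.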
Wiring vision into an LLM: three schools
How do you make a purely textual LLM "see"? Three strategies, with increasingly deep fusion.
Figure: VLM → VLA architecture and training stages. Original diagram.
1. Projection (LLaVA). The simplest and most widespread approach. A projector — a learned matrix in LLaVA 1.0, then a two-layer MLP with a GELU activation in LLaVA 1.5 — converts the ViT vectors into "image tokens" that live in the LLM's embedding space. You simply prepend them to the text tokens: the LLM then consumes a mixed sequence, with no change to its architecture. Lightweight and efficient, this is the default recipe of most open-source VLMs.
2. Cross-attention (Flamingo). DeepMind inserts gated cross-attention blocks between the LLM's layers. Queries (text) attend to keys/values (vision); a zero-initialized tanh gate guarantees that the model initially behaves exactly like the original LLM, then gradually opens the door to visual signals — which stabilizes training. This route elegantly handles multiple interleaved images and long sequences.
3. Native multimodal (GPT-4o, Gemini). Rather than bolting vision on afterward, you train a single model from the start that processes image and text tokens uniformly through the same self-attention. GPT-4o and Gemini are described as natively multimodal; their exact details remain unpublished. Open models such as Qwen2-VL move in this direction while still starting from a pre-trained vision encoder and LLM: visual tokens enter the same token stream as text, demarcated by special tokens (e.g. vision_start/vision_end). This route is expected to generalize best but is the most expensive to train.
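To make the projection school (item 1 above) concrete, here is a minimal NumPy sketch of a two-layer MLP projector with a GELU activation, followed by the concatenation with text embeddings. All sizes and the random weights are hypothetical; a real projector is learned and much wider.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(vision_tokens, w1, b1, w2, b2):
    """Two-layer MLP mapping ViT features to the LLM's embedding width."""
    return gelu(vision_tokens @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_vision, d_llm, n_img, n_txt = 32, 48, 256, 10     # hypothetical sizes
w1, b1 = rng.normal(size=(d_vision, d_llm)) * 0.02, np.zeros(d_llm)
w2, b2 = rng.normal(size=(d_llm, d_llm)) * 0.02, np.zeros(d_llm)

vision_tokens = rng.normal(size=(n_img, d_vision))  # stand-in for the ViT grid
text_tokens = rng.normal(size=(n_txt, d_llm))       # stand-in for embedded text
image_tokens = mlp_projector(vision_tokens, w1, b1, w2, b2)
sequence = np.concatenate([image_tokens, text_tokens], axis=0)  # image first, then text
print(sequence.shape)
```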
How to choose? In practice:
- Projection is hard to beat on the simplicity/cost ratio: you reuse an existing ViT and LLM, and in the first stage you train only the small projector. That is why most open-source VLMs adopt it.
- Cross-attention shines on multi-image interleaved sequences (documents, videos, image-rich dialogues) and keeps the LLM intact at the start thanks to zero-gating.
- Native offers the best fusion and generalization, but requires end-to-end multimodal pre-training — meaning resources only large labs can muster.
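The zero-gating idea behind the cross-attention school can be shown with a single-head toy layer. This is a simplified sketch under stated assumptions (one head, no layer norm, no feed-forward block, random weights); it only demonstrates that a gate at zero leaves the text stream untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, vision, wq, wk, wv, gate):
    """Single-head cross-attention added to the text stream through a tanh gate."""
    q, k, v = text @ wq, vision @ wk, vision @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # text queries over vision keys
    return text + np.tanh(gate) * (attn @ v)        # gate = 0 leaves text unchanged

rng = np.random.default_rng(0)
d = 16
text, vision = rng.normal(size=(5, d)), rng.normal(size=(9, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

closed = gated_cross_attention(text, vision, wq, wk, wv, gate=0.0)
opened = gated_cross_attention(text, vision, wq, wk, wv, gate=1.0)
print(np.array_equal(closed, text), np.allclose(opened, text))
```

With the gate initialized at zero, the first training steps start from the original LLM's behavior, and the visual signal is blended in only as the gate parameter moves away from zero.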
Tokenizing an image
In the "projection" and "native" approaches, the image ends up represented as a sequence of tokens homogeneous with text. That is what makes VLMs so practical: the whole LLM pipeline (attention, auto-regressive generation, KV-cache) applies as is. The flip side is the token budget: a high-resolution image can cost hundreds, even thousands, of tokens. Hence reduction techniques — pooling, a Flamingo-style resampler (Perceiver Resampler compressing a variable grid into a small fixed number of tokens), adaptive-resolution tiling — to contain the cost without sacrificing useful detail.
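The simplest of the reduction techniques listed above, pooling, can be sketched as average-pooling the token grid over small neighborhoods. The helper name and sizes are hypothetical, and the sketch assumes a square grid whose side is divisible by the window.

```python
import numpy as np

def pool_visual_tokens(tokens, grid, window):
    """Average-pool a (grid*grid, d) token sequence over window x window neighborhoods."""
    d = tokens.shape[-1]
    out = grid // window
    x = tokens.reshape(out, window, out, window, d)
    return x.mean(axis=(1, 3)).reshape(out * out, d)

tokens = np.arange(24 * 24 * 4, dtype=float).reshape(24 * 24, 4)  # 576 toy tokens
pooled = pool_visual_tokens(tokens, grid=24, window=2)            # 4x fewer tokens
print(tokens.shape, pooled.shape)
```

A 2×2 window divides the token count by four, at the price of blurring detail inside each neighborhood; a learned resampler trades that fixed averaging for a trainable compression.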
Training a VLM: align, then instruct
LLaVA's training scheme has become canonical, in two stages.
- Stage 1: feature alignment. You freeze the vision encoder AND the LLM, and train only the projector on image-caption pairs (595,000 pairs filtered from CC3M for LLaVA). The goal: teach the projector to map the visual space into the language space. This is fast and stable.
- Stage 2: visual instruction. You unfreeze the LLM and keep training the projector (the vision encoder often stays frozen), this time on higher-quality visual dialogues (158,000 synthetic conversations for LLaVA). The model learns to follow instructions, describe, reason, and answer questions about the image.
More recent models (Qwen2-VL) add a third stage and progressively unfreeze components, over massive volumes (on the order of 10^12 tokens). The principle is unchanged: first align the modalities, then teach behavior.
# Sketch: forward pass of a LLaVA-style VLM (each component is passed in as a callable)
import numpy as np

def vlm_forward(image, text, vit, projector, embed, llm, decode):
    patches = vit(image)                                     # ViT → grid of visual tokens (spatial)
    img_tokens = projector(patches)                          # MLP: vision space → LLM space
    txt_tokens = embed(text)                                 # text tokens, shape (T, d)
    seq = np.concatenate([img_tokens, txt_tokens], axis=0)   # PREPEND the image to the text
    logits = llm(seq)                                        # the LLM reasons jointly
    return decode(logits)                                    # in practice looped token by token (auto-regressive)
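The two-stage schedule described above boils down to a table of which components receive gradient updates. The sketch below encodes that table as plain data; the stage and component names are illustrative labels, not identifiers from the LLaVA codebase.

```python
# Which components receive gradient updates in each stage (True = trainable).
STAGES = {
    "stage1_alignment":   {"vision_encoder": False, "projector": True, "llm": False},
    "stage2_instruction": {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable_components(stage):
    """Return the sorted names of the components updated during a stage."""
    return sorted(name for name, on in STAGES[stage].items() if on)

stage1 = trainable_components("stage1_alignment")
stage2 = trainable_components("stage2_instruction")
print(stage1, stage2)
```

In a framework such as PyTorch, this typically translates into toggling `requires_grad` on each module's parameters before building the optimizer.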
Evaluating VLMs — and the hallucination problem
VLMs are evaluated on visual question answering (VQA), captioning, spatial reasoning, OCR, charts and documents. But the major risk is object hallucination: the model describes elements absent from the image because the language prior dominates (the LLM "completes" according to what is textually plausible, not what it sees). The usual causes: a frozen vision encoder that loses detail, an imbalance between visual and textual signal, biased instruction data. The remedies: better encoders, higher resolution, and grounding data (explicit object ↔ region anchoring); dedicated benchmarks then measure progress (POPE for object hallucination, MMMU for demanding multimodal reasoning). A word of caution: a fluent VLM is not a reliable VLM, so you must measure grounding, not just prose quality.
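Object-hallucination probes in the spirit of POPE ask yes/no questions of the form "Is there a <object> in the image?" and score the answers like a binary classifier. The sketch below computes the usual metrics on made-up answers; the function name and data are hypothetical.

```python
def yes_no_metrics(predictions, ground_truth):
    """Accuracy, precision, recall, F1 and yes-rate for binary presence questions.

    Both arguments are lists of booleans: True means "yes, the object is present".
    """
    pairs = list(zip(predictions, ground_truth))
    tp = sum(p and g for p, g in pairs)
    fp = sum(p and not g for p, g in pairs)
    fn = sum(g and not p for p, g in pairs)
    tn = sum(not p and not g for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_rate": (tp + fp) / len(pairs),
    }

# Made-up answers to six presence questions.
truth = [True, True, True, False, False, False]
model = [True, True, True, True, True, False]   # says "yes" to two absent objects
metrics = yes_no_metrics(model, truth)
print(metrics)
```

A high yes-rate combined with low precision is the typical signature of a model that agrees with whatever the question suggests rather than looking at the image.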
From VLMs to VLAs: adding action
A Vision-Language-Action model (VLA) is a multimodal foundation model that integrates vision, language AND actions. From camera images and a natural-language instruction, it produces commands executable by a robot. The founding intuition: a VLA is a VLM plus an action decoder. Perception and reasoning are handled by the VLM (which encodes images + text into tokens in a shared latent space); a final stage converts these tokens into commands representing the robot's degrees of freedom (translations, rotations, gripper opening).
The stroke of genius that launched the field: treating actions as tokens. If you discretize a continuous action (e.g. "move 50 mm forward") into a discrete identifier, then generating a trajectory is exactly like generating text — same Transformer architecture, same auto-regressive objective, and above all transfer of knowledge acquired from billions of web images and texts to motor control.
Concretely, you discretize each action dimension (translation in x, y, z, rotation, gripper state) into a fixed number of bins (often 256), each mapped to a token. A control step then becomes a short sequence of action tokens, one per dimension, and the model learns to generate it as it would generate a sentence. At inference, you de-tokenize these identifiers back into continuous commands (end-effector displacements, rotations, gripper opening) sent to the low-level controller. The elegance of the approach is that it invents nothing on the architecture side: it is a standard multimodal LLM in which part of the vocabulary has been assigned to actions.
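The discretize/de-tokenize round trip can be sketched as follows. This version assumes uniform bins between fixed per-dimension bounds and a five-dimensional action; both the bounds and the action layout are hypothetical, and real systems may choose bounds differently (for example from data statistics).

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as in the text

def tokenize_action(action, low, high):
    """Map each continuous dimension to an integer bin id in [0, N_BINS - 1]."""
    unit = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((unit * N_BINS).astype(int), N_BINS - 1)

def detokenize_action(tokens, low, high):
    """Map bin ids back to the center of each bin."""
    return low + (tokens + 0.5) / N_BINS * (high - low)

# Hypothetical per-dimension bounds: dx, dy, dz (m), rotation (rad), gripper (0..1).
low = np.array([-0.05, -0.05, -0.05, -0.5, 0.0])
high = np.array([0.05, 0.05, 0.05, 0.5, 1.0])
action = np.array([0.05, 0.0, -0.02, 0.1, 1.0])

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
max_error = np.max(np.abs(recovered - action) / (high - low))  # at most half a bin
print(tokens, recovered.round(4), max_error)
```

The round-trip error is bounded by half a bin width, which makes the granularity limit of discrete tokens explicit: with 256 bins over a 10 cm range, the resolution is about 0.4 mm per step.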
An end-to-end example: from instruction to motion
Let's follow "put away the blue cup" through an action-token VLA:
- The camera captures the scene; the vision encoder cuts it into patches and produces visual tokens.
- The projector brings them into the LLM's space; they are fused with the instruction tokens.
- The LLM reasons jointly: it "sees" the blue cup, understands the order, plans.
- It generates action tokens (Δx, Δy, Δz, rotation, gripper) step after step.
- These tokens are de-tokenized into continuous commands sent to the arm's controller.
- The robot acts; the new image returns as input, and the loop starts again.
image (patches) ─▶ ViT ─▶ projector ─┐
                                     ├─▶ LLM ─▶ action tokens ─▶ de-tokenize ─▶ robot
instruction ─────▶ text tokens ──────┘    ▲                                       │
                                          └─────────── new image ◀────────────────┘
The whole practical difficulty lies in this closed loop: each iteration must be fast enough for the motion to stay smooth, which brings us back to the latency challenge discussed below.
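The closed loop above can be sketched as a fixed-rate loop that counts how often the policy misses its time budget. The camera, policy, and robot below are trivial stand-ins (all hypothetical); only the timing structure is the point.

```python
import time

def control_loop(get_image, policy, send_command, instruction, hz, max_steps):
    """Run perceive -> decide -> act at a target rate; return steps that overran the budget."""
    budget = 1.0 / hz
    overruns = 0
    for _ in range(max_steps):
        start = time.perf_counter()
        image = get_image()                    # new camera frame
        command = policy(image, instruction)   # VLA forward pass + de-tokenization
        send_command(command)                  # hand over to the low-level controller
        elapsed = time.perf_counter() - start
        if elapsed > budget:
            overruns += 1                      # policy too slow for this control rate
        else:
            time.sleep(budget - elapsed)       # wait out the rest of the period
    return overruns

# Stand-ins for the camera, the model and the robot (all hypothetical).
sent = []
overruns = control_loop(
    get_image=lambda: "frame",
    policy=lambda image, instruction: [0.0, 0.0, 0.01],
    send_command=sent.append,
    instruction="put away the blue cup",
    hz=50,
    max_steps=5,
)
print(len(sent), overruns)
```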
RT-2: the pioneer
RT-2 (Google DeepMind, July 2023) established the VLA paradigm. It co-trains a VLM (on the PaLI-X and PaLM-E backbones) on both web data and robot data, representing the robot's actions as tokens alongside language. Concretely, a continuous action is discretized into a token ("move 50 mm" becomes, say, token #1247), reusing the model's vocabulary. The result: RT-2 generalizes better than its predecessor RT-1 to new tasks and can chain multi-step reasoning (chain-of-thought) leveraging knowledge inherited from the web — for example "pick the object used to hammer a nail."
The decisive contribution is not a new architecture but a representation: by writing the action in the same alphabet as language, RT-2 turns motor control into a special case of text generation, and inherits "for free" the reasoning learned on the web.
OpenVLA: open source at 7 billion parameters
OpenVLA (Stanford and collaborators, June 2024) made the paradigm open and reproducible. It is a 7-billion-parameter model trained on ~970,000 robot episodes from the Open X-Embodiment dataset (covering 22 different robot embodiments). Its architecture:
- a fused vision encoder combining SigLIP and DINOv2 (semantics + spatial detail);
- a projector mapping the visual embeddings into the language space;
- a Llama-2 7B backbone that predicts discrete action tokens, later de-tokenized into directly executable continuous commands.
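One simple way to fuse two vision encoders, assumed here for illustration, is to run both on the same patch grid and concatenate their features along the channel axis before the projector. The feature widths and the helper name below are hypothetical stand-ins.

```python
import numpy as np

def fuse_patch_features(semantic, spatial):
    """Concatenate two encoders' features for the same patch grid along the channel axis."""
    assert semantic.shape[0] == spatial.shape[0], "both encoders must yield the same patch count"
    return np.concatenate([semantic, spatial], axis=-1)

rng = np.random.default_rng(0)
n_patches = 256                                      # hypothetical grid of 16 x 16
siglip_like = rng.normal(size=(n_patches, 32))       # stand-in for SigLIP features
dino_like = rng.normal(size=(n_patches, 24))         # stand-in for DINOv2 features
fused = fuse_patch_features(siglip_like, dino_like)  # wider per-patch vector for the projector
print(fused.shape)
```

The token count stays the same; only the per-patch vector grows, so the extra cost lands on the projector rather than on the LLM's sequence length.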
OpenVLA outperforms earlier generalist policies (RT-1-X, Octo) and even RT-2-X (a 55-billion-parameter closed VLA) on many tasks, with RT-2-X keeping the edge on fine semantic generalization that requires web knowledge. On the engineering side, OpenVLA was trained on 64 A100 GPUs for 14 days, and it fine-tunes efficiently: LoRA tunes only 1.4% of the parameters while matching full fine-tuning, a decisive asset for adapting to a new robot or a new task.
For the practitioner, OpenVLA illustrates two concrete lessons: choosing a fused encoder (semantic + spatial) pays off for manipulation, and quantization plus LoRA make adaptation realistic on modest hardware, without a cluster.
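Why LoRA touches so few parameters is a matter of counting. The sketch below compares a full weight matrix with its low-rank update and shows that a zero-initialized update leaves the frozen layer's output unchanged. Layer size, rank, and `alpha` are hypothetical values chosen for illustration, not OpenVLA's configuration.

```python
import numpy as np

def lora_forward(x, w, a, b, alpha):
    """Frozen weight w plus a trainable low-rank update (alpha / r) * a @ b."""
    r = a.shape[1]
    return x @ w + (alpha / r) * (x @ a @ b)

d_in, d_out, rank = 4096, 4096, 32     # hypothetical layer size and rank
full_params = d_in * d_out
lora_params = rank * (d_in + d_out)
fraction = lora_params / full_params   # share of this layer's parameters that train

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))          # small frozen weight for the demo
a = rng.normal(size=(64, 4)) * 0.01    # trainable down-projection
b = np.zeros((4, 64))                  # trainable up-projection, zero at start
x = rng.normal(size=(2, 64))
y = lora_forward(x, w, a, b, alpha=8.0)
print(fraction, np.allclose(y, x @ w))
```

For this one layer the adapter holds under 2% of the weights, which is the same order of magnitude as the whole-model figure quoted above.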
π0 (pi-zero): continuous output via flow-matching
π0 (Physical Intelligence, late 2024) takes the other path for high-frequency dexterity. Rather than discrete tokens, it builds on a VLM (PaliGemma, built from SigLIP and Gemma) and directly generates continuous actions via a flow-matching network (a cousin of diffusion models). This produces smooth trajectories at ~50 Hz, better suited to robots with many degrees of freedom and to fine gestures (folding laundry, handling delicate objects). The trade-off shapes up as follows: discrete tokens = simplicity and direct LLM inheritance, but limited granularity; continuous output (diffusion/flow) = precision and high frequency, at the cost of a more complex decoder.
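Sampling from a flow-matching model amounts to integrating a learned velocity field from noise to an action. The sketch below uses plain Euler steps and replaces the network with a toy velocity field that points at a fixed target; the target, step count, and function names are assumptions for illustration and do not reproduce π0's actual model.

```python
import numpy as np

def sample_action(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy velocity field that carries any point toward one fixed target along a straight
# path. A real model would replace this with a network conditioned on the VLM's output.
target = np.array([0.02, -0.01, 0.00, 0.10, 1.00])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
action = sample_action(toy_velocity, rng.normal(size=5), num_steps=10)
print(action.round(4))
```

The output is a continuous vector, with no binning step, and the number of integration steps becomes a direct knob on the latency/quality trade-off.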
A key notion appears here: action chunks. Instead of predicting a single step, the model generates a short sequence of future actions at once, which smooths the command and amortizes the cost of one pass through the large VLM — a central trick for keeping up with real-time control.
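The amortization argument is simple arithmetic. The sketch below uses a hypothetical inference time and ignores everything else in the loop (sensing, communication, overlap between inference and execution).

```python
def effective_control_rate(inference_seconds, chunk_size):
    """Actions available per second if one model pass yields chunk_size actions."""
    return chunk_size / inference_seconds

# Hypothetical numbers: one forward pass of the large model takes 0.2 s.
single_step = effective_control_rate(0.2, chunk_size=1)
chunked = effective_control_rate(0.2, chunk_size=10)
print(single_step, chunked)
```

With one action per pass, a 0.2 s model caps the loop at about 5 Hz; predicting ten actions per pass brings the same model to about 50 actions per second, at the cost of reacting to new observations only once per chunk.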
Generalization and embodied reasoning
The promise of VLAs is embodied generalization: because the trunk is a VLM pre-trained on the web, the robot inherits visual and linguistic common sense that no amount of robot demonstration alone could provide. It can follow novel instructions, recognize objects never seen in demonstration, and reason about the task ("first put away the cup, then wipe the table"). The Open X-Embodiment dataset pushes further: training one policy on multiple embodiments to aim at a generic controller, transferable from one robot to another.
Open challenges
- Latency. A multi-billion-parameter LLM in a real-time control loop is a challenge: generating tokens is slow, while control demands 10–50 Hz. Hence continuous outputs (π0), distillation, quantization, action chunks (predicting several steps at once).
- Data. Robot demonstrations are scarce and costly compared to the web's terabytes of text/image; teleoperation does not scale. Open X-Embodiment, simulation, and learning from human videos try to bridge the gap.
- Safety and reliability. A hallucination in a chatbot produces a false sentence; in a VLA, it produces a motion. Guardrails, out-of-distribution detection, emergency stops, and torque limits are indispensable.
- Evaluation. Measuring a robot policy in the real world is slow, noisy, and hard to reproduce — a concrete brake on progress.
Discrete or continuous: a recap
The choice of action representation structures the whole system:
- Discrete tokens (RT-2, OpenVLA). Pros: architecture identical to an LLM, direct inheritance of web pre-training, implementation simplicity, very efficient LoRA fine-tuning. Limit: granularity depends on the number of bins, and token-by-token generation can cap the control frequency.
- Continuous output / flow-matching (π0). Pros: smooth trajectories, high frequency (~50 Hz), good dexterity over many degrees of freedom. Limit: a more complex decoder (diffusion/flow), training and debugging that are less "classic-LLM."
There is no universal winner: the right choice depends on the task (coarse grasping vs fine gestures), the required control frequency, and on-board compute constraints.
Beyond robotics
Multimodality is not limited to robot arms. The same building blocks power:
- visual assistants (describe a scene, read a chart, extract a table from a screenshot);
- agents that drive an interface (computer use) by "seeing" the screen and clicking;
- multimodal search and content moderation (text + image);
- accessibility (image descriptions for the visually impaired) and document analysis (OCR + reasoning).
The common thread is identical: project heterogeneous modalities into a common space, then let a Transformer reason — whether the output is a sentence, a click, or a motion.
Data and scaling
One lesson runs through the whole field: data is the real bottleneck. On the VLM side, there are billions of image-text pairs on the web, which made CLIP and its successors possible. On the VLA side, we are talking about only hundreds of thousands of episodes (≈970,000 drawn from Open X-Embodiment for OpenVLA), several orders of magnitude fewer. Three avenues to close the gap:
- Pool embodiments. Open X-Embodiment aggregates data from 22 different robots to train a single policy; each embodiment benefits from the others.
- Co-train on the web. RT-2 mixes robot data with general image-text data, which injects "common sense" and improves semantic generalization.
- Learn without teleoperation. Simulation, human videos, and self-supervision aim to produce action signal without costly human demonstration.
The implicit bet of VLAs is that the scaling that transformed NLP, then vision, will eventually transform motor control — provided we first solve the scarcity of action data.
A practical checklist: choosing and building
To turn this theory into decisions, a grid of questions to ask in order:
- Which output? Retrieval/classification → a dual encoder (CLIP/SigLIP) is enough. Description, Q&A, reasoning → a VLM. Robot control → a VLA.
- Which vision encoder? Favor the semantic (CLIP/SigLIP) to describe and reason; add the spatial (DINOv2) when geometry matters (manipulation, fine OCR).
- Which fusion mode? Projection by default (simple, cheap); cross-attention for long multi-image sequences; native only with a substantial pre-training budget.
- Which resolution / token budget? Raising the resolution improves detail but burdens the LLM; consider pooling or a resampler if the context explodes.
- Discrete or continuous (VLA)? Discrete tokens for simplicity and web transfer; continuous output (flow/diffusion) for high frequency and fine dexterity.
- How to evaluate? Measure grounding (POPE) and multimodal reasoning (MMMU), not just fluency; for a VLA, plan a real-world evaluation, slow but unavoidable.
This sequence — output, encoder, fusion, resolution, action representation, evaluation — covers the bulk of the trade-offs of a multimodal project, from a search engine to a robot policy.
In short
The trajectory is coherent: CLIP learns to align image and text in a shared space; the ViT turns an image into tokens; a projector (or cross-attention, or native training) wires that vision into an LLM; training proceeds by aligning then instructing; and VLAs push the idea all the way to producing actions as tokens (RT-2, OpenVLA) or continuous trajectories (π0). The same principle, projecting heterogeneous modalities into a common space and letting a Transformer reason over it, connects the image caption to the robot arm. The real-world obstacles remain: latency, data, safety, evaluation. That is where the next generation of embodied AI will be decided.