From CLIP and ViT to VLMs (LLaVA, Flamingo, GPT-4o) then VLAs for robotics (RT-2, OpenVLA, π0): modality fusion and actions as tokens.