At UCL RoMA Lab, we scale foundation models into Vision–Language–Action (VLA) systems for robotics, transforming multimodal perception into intelligent, goal-directed behavior. Our work builds VLA systems on vision–language and world models to support perception, reasoning, and control in embodied settings. We advance embodied AI by tackling generalization across sensors and tasks, computational efficiency on resource-constrained hardware, and trustworthy human–robot interaction, with the goal of autonomous systems that operate reliably in complex, dynamic environments.