Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Introducing MEM: A Breakthrough Multi-Scale Memory System Enhancing AI Automation in Robotics

Robotics has entered a transformative era where AI automation is revolutionizing how machines perceive, remember, and act within complex environments. A recent collaboration between the Physical Intelligence team, Stanford, UC Berkeley, and MIT has resulted in a cutting-edge memory system called Multi-Scale Embodied Memory (MEM). This novel architecture empowers Vision-Language-Action (VLA) robotic models—specifically the Gemma 3-4B VLA—to maintain up to 15 minutes of contextual memory, fundamentally improving their performance on long-horizon, intricate tasks such as kitchen cleaning and recipe preparation.

The Challenge of Memory in Robotic AI Automation

Current robotic policies typically process only a single observation or a very short history, leading to significant challenges in long-duration tasks. This “lack of memory” results in difficulties managing continuous manipulation or adapting to environmental changes:

  • Limited task context diminishes precision in dynamic settings.
  • Computational inefficiency when attempting to scale memory by conventional means.
  • Increased likelihood of failure in complex sequences like multi-step cooking or cleaning.

MEM addresses these challenges by combining short-term dense visual data with long-term semantic language memory, thereby unlocking new frontiers in AI automation and business efficiency within robotic systems.

MEM’s Dual-Scale Architecture: Balancing Precision and Context

Short-Term Video Memory: captures fine-grained spatial data across recent video frames

  • Extends Vision Transformers (ViTs) with an efficient video encoder
  • Uses Space-Time Separable Attention for reduced computation
  • Maintains real-time inference below 380 ms on an NVIDIA H100 GPU
  • Processes up to 16 observation frames (~1-minute span)

Long-Term Language Memory: maintains high-level semantic summaries over extended task durations

  • Generates compressed language summaries of past events
  • Subtask instructions from a high-level policy inform low-level control
  • Retains context for up to 15 minutes
  • Reduces distribution shift via Large Language Model (LLM)-generated summaries
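At a high level, the two scales can be pictured as a rolling buffer of dense visual observations sitting next to an append-only list of language summaries. The sketch below is illustrative only: the class and method names are assumptions, and it mirrors only the reported 16-frame window, not any actual model internals.

```python
from collections import deque

class DualScaleMemory:
    """Illustrative sketch of MEM's two memory scales (not the official API).

    Short-term: a rolling window of dense visual observations
    (the paper reports up to 16 frames, roughly one minute).
    Long-term: compressed language summaries spanning up to ~15 minutes.
    """

    def __init__(self, max_frames: int = 16):
        # Old frames fall off the window automatically once it is full.
        self.frames = deque(maxlen=max_frames)
        self.summaries: list[str] = []

    def observe(self, frame) -> None:
        """Append a dense visual observation to the short-term window."""
        self.frames.append(frame)

    def summarize(self, text: str) -> None:
        """Append an LLM-generated summary of recent events to long-term memory."""
        self.summaries.append(text)

    def context(self):
        """Both scales together, as the policy would condition on them."""
        return list(self.frames), list(self.summaries)
```

The key design point this mirrors is asymmetry: the expensive, dense representation is bounded to a short window, while only cheap text summaries accumulate over the full task.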

Technical Highlights of MEM’s Approach

  • Computational Efficiency: By interleaving intra-frame spatial attention with causal-temporal attention, MEM cuts down from O(n²K²) to O(Kn² + nK²) complexity.
  • Memory Fusion: The output of the video encoder feeds into the VLA backbone while the long-term language memory feeds task summaries and future subtasks, enabling an adaptive policy.
  • Training Robustness: Language memory is trained using LLM-generated summaries that safely compress complex state information, avoiding common pitfalls in generalizing from training to real-world deployment.
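The interleaved attention scheme can be sketched in NumPy: spatial attention runs within each of the K frames (roughly Kn² work), then causal temporal attention runs across frames at each of the n token positions (roughly nK² work), instead of full attention over all nK tokens at once (O(n²K²)). This is a minimal sketch under assumed shapes and helper names, not the paper's implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; mask=True marks positions to keep."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def space_time_separable(x):
    """x: (K frames, n tokens, d dims); queries = keys = values = x."""
    K, n, d = x.shape
    # 1) Intra-frame spatial attention: each frame attends within itself,
    #    batched over K frames -> cost scales with K * n^2.
    x = attend(x, x, x)
    # 2) Causal temporal attention: each token position attends to the same
    #    position in current and past frames -> cost scales with n * K^2.
    xt = np.swapaxes(x, 0, 1)                  # (n, K, d)
    causal = np.tril(np.ones((K, K), dtype=bool))
    xt = attend(xt, xt, xt, mask=causal)
    return np.swapaxes(xt, 0, 1)               # back to (K, n, d)
```

Because the temporal stage is causal, perturbing the latest frame leaves the outputs for all earlier frames untouched, which is what lets new observations stream in without recomputing the whole history.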

Empirical Gains: How MEM Advances AI Automation Efficiency

MEM’s integration into the pre-trained Gemma 3-4B VLA model demonstrated substantial improvements across diverse robotics challenges:

Task-level results (memory-less VLA vs. MEM-enabled VLA success rates):

  • Opening fridges with unknown hinges: ~54% → ~87% (+62%)
  • Picking up chopsticks at varying heights: ~70% → ~81% (+11%)
  • Long-horizon tasks (recipe setup, kitchen cleaning): frequent failures → successful completion
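The reported gains appear to mix two conventions: the fridge task's +62% is close to the relative improvement over the baseline ((87 − 54) / 54 ≈ 61%), while the chopsticks +11% matches the absolute percentage-point difference (81 − 70). A small helper, with an assumed name, makes the distinction explicit:

```python
def deltas(before_pct: float, after_pct: float):
    """Return (absolute percentage-point change, relative % change)."""
    abs_pp = after_pct - before_pct
    rel_pct = abs_pp / before_pct * 100
    return abs_pp, rel_pct
```

When comparing success rates across papers, it is worth checking which of the two a headline "+X%" actually refers to.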

These results mark a pivotal step forward for business efficiency in AI-driven automation, enabling robotic systems to navigate and manipulate the real world over longer periods with unprecedented adaptability and contextual awareness.

Why MEM Matters for AI-Driven Business Automation

The MEM system illustrates an elegant solution to the enduring problem of “working memory” in embodied AI. By fusing dense, real-time visual input with semi-compressed long-term semantic memory, MEM enables robotic agents to:

  • Execute complex, sequential tasks without retraining or manual resets
  • Reduce computational overhead while scaling context length
  • Adapt to dynamic environments and recover from errors autonomously

For industries investing in automation—whether in manufacturing, logistics, or service robotics—MEM’s approach paves the way for smarter machines that are both efficient and reliable over extended operational times.

Conclusion

The introduction of Multi-Scale Embodied Memory by the Physical Intelligence team and their academic collaborators marks a significant milestone in AI automation for robotics. By tackling the fundamental memory bottleneck through innovative attention mechanisms and language-guided summarization, MEM enables sophisticated long-horizon robotic behavior that was previously out of reach.

As businesses increasingly turn to automation to optimize workflows and reduce operational costs, technologies like MEM will be instrumental in delivering intelligent, adaptable robotic assistants capable of handling real-world complexity with finesse.

To explore the full technical details, access the official MEM research paper.

Looking for custom AI automation for your business? Connect with me here.