StereoPolicy: Improving Robotic Manipulation
Policies via Stereo Perception

Evans Han1,2 Yunfan Jiang1 Yingke Wang1,* Haoyue Xiao1,*
Huang Huang1 Jianwen Xie3 Jiajun Wu1 Li Fei-Fei1 Ruohan Zhang1,2
1Stanford University 2Northwestern University 3Lambda, Inc
Preprint

Abstract

Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.

Motivation of StereoPolicy

RGB-D and PCD are fragile in real-world deployment

Regular Cup Hang Task: Depth Visualization
Glass Cup Hang Task: Depth Visualization
Regular Cup Hang Task: Point Cloud (PCD) Visualization
Glass Cup Hang Task: Point Cloud (PCD) Visualization

Overview of StereoPolicy

StereoPolicy directly turns synchronized stereo images into geometry-aware policy features, bridging pretrained 2D visual representations with implicit 3D spatial reasoning.
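
As background for why a synchronized stereo pair carries depth information at all, the standard rectified-stereo relation is sketched below. This is textbook geometry, not something StereoPolicy is described as computing: the framework learns such cues implicitly and does not require calibration. The symbols Z (depth), f (focal length in pixels), B (baseline between the two cameras), and d (disparity in pixels) are notation assumed here, and the numbers are purely illustrative.

```latex
\begin{align}
Z &= \frac{f\,B}{d}, \\
\left|\frac{\partial Z}{\partial d}\right| &= \frac{f\,B}{d^{2}} = \frac{Z^{2}}{f\,B}, \\
f = 600\ \text{px},\quad B = 0.06\ \text{m},\quad d = 24\ \text{px}
  \;&\Rightarrow\; Z = \frac{600 \times 0.06}{24} = 1.5\ \text{m}.
\end{align}
```

Depth is inversely proportional to disparity, so with these illustrative values a one-pixel disparity error corresponds to roughly 0.0625 m of depth error at 1.5 m. Small disparity differences between the left and right views are therefore the signal a fusion module would need to pick up.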



Pipeline of StereoPolicy. A stereo perception module extracts geometry-aware features for robot policies. The resulting representations can be integrated into both diffusion policies and VLA models without modifying their backbone architectures.

  • StereoPolicy-DP integrates the stereo encoder into a diffusion policy, which is trained from scratch on each benchmark task.
  • StereoPolicy-VLA combines the stereo encoder with a pretrained VLA model and fine-tunes the combined system.
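
The encode-then-fuse idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the page does not specify the Stereo Transformer's architecture, so the patch-projection "encoder", the view embeddings, the single attention layer, and all names (`encode_view`, `fuse_stereo`) and sizes are assumptions chosen only to show how tokens from two independently encoded views can attend to each other.

```python
import numpy as np

def encode_view(image, patch=8, dim=24, seed=0):
    """Hypothetical stand-in for a pretrained 2D encoder: patchify, then project.
    The same seed is used for both views, i.e. the two views share weights."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    proj = np.random.default_rng(seed).standard_normal((patches.shape[1], dim))
    proj /= np.sqrt(patches.shape[1])
    return patches @ proj  # (num_patches, dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_stereo(left_tokens, right_tokens, seed=1):
    """One self-attention layer over the concatenated left and right tokens
    (an assumed, simplified stand-in for a stereo fusion transformer)."""
    rng = np.random.default_rng(seed)
    d = left_tokens.shape[1]
    # View embeddings tell the attention layer which camera a token came from.
    view_embed = rng.standard_normal((2, d)) * 0.02
    tokens = np.concatenate([left_tokens + view_embed[0],
                             right_tokens + view_embed[1]])
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))  # every token attends to both views
    fused = tokens + attn @ v             # residual connection
    return fused, attn

# Toy stereo pair: the right view is the left view shifted horizontally,
# a crude stand-in for disparity.
left = np.random.default_rng(10).random((32, 32, 3))
right = np.roll(left, shift=-2, axis=1)

fused, attn = fuse_stereo(encode_view(left), encode_view(right))
print(fused.shape, attn.shape)  # (32, 24) (32, 32)
```

In this sketch the fused token sequence would be what gets handed to the downstream policy, whether a diffusion policy or a VLA model. The off-diagonal blocks of `attn` are where left-view tokens attend to right-view tokens, which is the place a learned model could encode correspondence.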

Real-World Deployments

Bimanual Mobile Manipulation

Turn On Radio

PnP Toast

Bimanual Mobile Manipulation

Setting     Task           RGB  Stereo
Real-World  PnP Toast      70%  85%
Real-World  Turn On Radio  40%  60%
Simulation  Open Door      80%  90%
Simulation  Turn On Radio  70%  85%

StereoPolicy-VLA (Pi0.5) performance on bimanual mobile manipulation tasks in both real-world and simulation settings (success rate over 20 trials).

Tabletop Manipulation

Each tabletop task is shown with synchronized third-person and wrist stereo views, each with a left and a right camera. Videos are shown from the original 3x recordings.

PnP Banana

Performance

Tabletop Manipulation

Method           Banana PnP  Toast Insert  Plastic Cup Hang  Steel Cup Hang  Glass Cup Hang  AVG SR (%)
RGB              12/20       7/20          12/20             10/20           1/20            42.0%
RGBD             14/20       8/20          11/20             8/20            0/20            41.0%
RGBD-3DDA        13/20       9/20          13/20             10/20           0/20            45.0%
PCD-PointNet     7/20        0/20          5/20              2/20            0/20            14.0%
PCD-DP3          11/20       3/20          8/20              5/20            0/20            27.0%
MultiView        13/20       8/20          13/20             9/20            1/20            44.0%
StereoPolicy-DP  16/20       12/20         15/20             13/20           3/20            59.0%

Real-World Tabletop Task Performance. StereoPolicy-DP achieves the highest success rate on every task and the highest average (59.0%). PCD performs worst across all real-world tasks, and both RGBD and PCD fail completely (0/20) on the Glass Cup Hang task.
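
The AVG SR column can be reproduced from the per-task counts. The short check below copies the success counts from the table and assumes, as the x/20 entries indicate, 20 trials per task; the function name `average_success_rate` is illustrative only.

```python
# Successes out of 20 trials per task, in table order:
# Banana PnP, Toast Insert, Plastic Cup Hang, Steel Cup Hang, Glass Cup Hang.
TRIALS_PER_TASK = 20
successes = {
    "RGB":             [12, 7, 12, 10, 1],
    "RGBD":            [14, 8, 11, 8, 0],
    "RGBD-3DDA":       [13, 9, 13, 10, 0],
    "PCD-PointNet":    [7, 0, 5, 2, 0],
    "PCD-DP3":         [11, 3, 8, 5, 0],
    "MultiView":       [13, 8, 13, 9, 1],
    "StereoPolicy-DP": [16, 12, 15, 13, 3],
}

def average_success_rate(counts, trials=TRIALS_PER_TASK):
    """Average success rate in percent, pooling all trials across tasks."""
    return 100.0 * sum(counts) / (trials * len(counts))

avg_sr = {method: average_success_rate(c) for method, c in successes.items()}
for method, sr in avg_sr.items():
    print(f"{method:16s} {sr:5.1f}%")
```

The computed values match the reported column, and they put StereoPolicy-DP (59.0%) 14 percentage points ahead of the strongest baseline, RGBD-3DDA (45.0%).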

Baseline Modalities Failure Cases

Mobile Manipulation

Failure Case 1

Imprecise radio-handle grasp.

Failure Case 2

Imprecise button press.

Failure Case 3

Imprecise toast grasp.

Tabletop Manipulation

Tabletop failure videos are shown at 2x speed.

Failure Case 1

Toast misses slot.

Failure Case 2

Insertion misaligned.

Failure Case 3

Cup grasp misses.

Failure Case 4

Imprecise handle insertion.

Failure Case 5

Transparent cup missed.

Simulation Benchmarks

Simulation benchmarks across OmniGibson, RoboCasa, and RoboMimic

Diffusion Policy Performance

OmniGibson

Method           Strawberry      Pour Water      Open Door       Turn On Radio
Demos            100     200     100     200     200     300     200     300
RGB              59.0%   88.0%   10.0%   46.0%   26.0%   77.0%   42.0%   71.0%
RGB-D            63.0%   85.0%   16.0%   52.0%   31.0%   80.0%   47.0%   73.0%
RGBD-3DDA        74.0%   93.0%   26.0%   61.0%   48.0%   100.0%  46.0%   75.0%
PCD-DP3          45.0%   63.0%   3.0%    31.0%   30.0%   69.0%   35.0%   64.0%
MultiView        68.0%   89.0%   21.0%   52.0%   31.0%   75.0%   43.0%   71.0%
StereoPolicy-DP  82.0%   100.0%  34.0%   70.0%   57.0%   100.0%  55.0%   82.0%

RoboMimic

Method           Tool Hang       Square          Transport
Demos            100     200     100     200     100     200
RGB              53.0%   90.0%   74.0%   98.0%   92.0%   94.0%
RGB-D            56.0%   88.0%   79.0%   92.0%   94.0%   94.0%
RGBD-3DDA        84.0%   92.0%   83.0%   97.0%   94.0%   96.0%
PCD-DP3          40.0%   76.0%   69.0%   88.0%   63.0%   72.0%
MultiView        54.0%   92.0%   78.0%   96.0%   92.0%   94.0%
StereoPolicy-DP  94.0%   96.0%   88.0%   100.0%  94.0%   96.0%

Simulation Task Performance of Diffusion Policies across Different Visual Modalities. Stereo input consistently improves performance, especially in the low-data regime.
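
The low-data claim can be checked directly against the table. The sketch below copies the RGB and StereoPolicy-DP rows and compares the mean gain at each task's smaller and larger data setting. Reading the two numbers under each task as a smaller and a larger data setting is an assumption made here, and `mean_gain` is an illustrative helper name.

```python
# Success rates (%) copied from the simulation table, ordered as
# Strawberry, Pour Water, Open Door, Turn On Radio, Tool Hang, Square, Transport.
# Each tuple is (smaller data setting, larger data setting) for one task.
rgb = [(59, 88), (10, 46), (26, 77), (42, 71), (53, 90), (74, 98), (92, 94)]
stereo = [(82, 100), (34, 70), (57, 100), (55, 82), (94, 96), (88, 100), (94, 96)]

def mean_gain(stereo_rates, rgb_rates, setting):
    """Mean stereo-minus-RGB gap in percentage points for one data setting."""
    gains = [s[setting] - r[setting] for s, r in zip(stereo_rates, rgb_rates)]
    return sum(gains) / len(gains)

low = mean_gain(stereo, rgb, 0)   # smaller data setting of each task
high = mean_gain(stereo, rgb, 1)  # larger data setting of each task
print(f"mean gain over RGB, smaller data setting: {low:.1f} points")
print(f"mean gain over RGB, larger data setting:  {high:.1f} points")
```

With the table's numbers, the mean gain over RGB is about 21.1 percentage points in the smaller setting and about 11.4 in the larger one, which is consistent with the caption.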

VLA Average Performance on RoboCasa-Kitchen 24 Tasks

Average success rate (%) by number of demos

Model       Input   30     100    300
PI0.5       RGB     48.71  67.64  70.31
PI0.5       Stereo  51.72  71.54  74.40
GROOT-N1.5  RGB     44.98  63.58  64.30
GROOT-N1.5  Stereo  47.12  66.17  67.50

Average StereoPolicy-VLA Performance on RoboCasa-Kitchen 24 Tasks.

Citation

@misc{han2026stereopolicyimprovingroboticmanipulation,
      title={StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception}, 
      author={Evans Han and Yunfan Jiang and Yingke Wang and Haoyue Xiao and Huang Huang and Jianwen Xie and Jiajun Wu and Li Fei-Fei and Ruohan Zhang},
      year={2026},
      eprint={2605.09989},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.09989}, 
}