StereoPolicy: Improving Robotic Manipulation
Policies via Stereo Perception

Evans Han¹,² Yunfan Jiang¹ Yingke Wang¹,* Haoyue Xiao¹,*

Huang Huang¹ Jianwen Xie³ Jiajun Wu¹ Li Fei-Fei¹ Ruohan Zhang¹,²

¹Stanford University ²Northwestern University ³Lambda, Inc

Preprint

Abstract

Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.

RGB-D and PCD Are Fragile in Real-World Deployment

Regular Cup Hang: external-view stereo RGB reference (left and right cameras), depth visualization, and point cloud (PCD) visualization.

Glass Cup Hang: external-view stereo RGB reference (left and right cameras), depth visualization, and point cloud (PCD) visualization.

Overview of StereoPolicy


StereoPolicy directly turns synchronized stereo images into geometry-aware policy features, bridging pretrained 2D visual representations with implicit 3D spatial reasoning.
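As background on why a synchronized stereo pair carries geometric information: in a rectified pinhole stereo rig, a point at depth Z appears shifted horizontally between the left and right images by a disparity d = f * B / Z, where f is the focal length in pixels and B is the camera baseline. The sketch below only illustrates this textbook relation. The focal length, baseline, and function names are made-up assumptions, and StereoPolicy itself does not compute explicit depth; it leaves these cues implicit in the fused features.

```python
def disparity_from_depth(depth_m: float, focal_px: float, baseline_m: float) -> float:
    """Horizontal pixel shift of a point between rectified left and right images."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Inverse relation: recover metric depth from a measured disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only for illustration.
FOCAL_PX = 600.0
BASELINE_M = 0.06

for depth in (0.3, 0.5, 1.0):
    shift = disparity_from_depth(depth, FOCAL_PX, BASELINE_M)
    print(f"depth {depth:.1f} m -> disparity {shift:.1f} px")
```

Because disparity is inversely proportional to depth, nearby objects in the manipulation workspace produce large, easily distinguished shifts between the two views.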

Pipeline of StereoPolicy. A stereo perception module extracts geometry-aware features for robot policies. The resulting representations can be integrated into both diffusion policies and VLA models without modifying their backbone architectures.

StereoPolicy-DP integrates the stereo encoder into a diffusion policy that is trained from scratch on each benchmark task.
StereoPolicy-VLA combines the stereo encoder with a pretrained VLA model and fine-tunes the whole system.
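To make the left-right fusion step concrete, here is a minimal NumPy sketch of single-head cross-attention in which left-image tokens query right-image tokens. It is an illustration under assumptions, not the authors' Stereo Transformer: the token count, feature width, random projection weights, residual connection, and the function name `cross_attend` are all hypothetical, and a real module would use learned weights and likely multiple heads and layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from one view, keys/values from the other."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_query, num_context)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    fused = query_tokens + weights @ v        # residual connection (an assumption)
    return fused, weights


rng = np.random.default_rng(0)
num_tokens, dim = 16, 32                      # hypothetical patch count and width

# Stand-ins for per-image features from a pretrained 2D encoder.
left = rng.standard_normal((num_tokens, dim))
right = rng.standard_normal((num_tokens, dim))

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused_left, attn = cross_attend(left, right, w_q, w_k, w_v)
print(fused_left.shape, attn.shape)
```

Each row of `attn` is a distribution over right-image tokens for one left-image token. A soft match between a left patch and a horizontally shifted right patch is one way correspondence and disparity cues could be captured without an explicit depth map.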

Real-World Deployments

Bimanual Mobile Manipulation

Tasks: Turn On Radio and PnP Toast.

Bimanual Mobile Manipulation

Setting	Task	RGB	Stereo
Real-World	PnP Toast	70%	85%
Real-World	Turn On Radio	40%	60%
Simulation	Open Door	80%	90%
Simulation	Turn On Radio	70%	85%

StereoPolicy-VLA (Pi0.5) performance on bimanual mobile manipulation tasks in both real-world and simulation settings (success rate over 20 trials).
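Since each rate is measured over 20 trials, one trial corresponds to 5 percentage points. The short sketch below converts the reported percentages back to success counts to show how many additional trials the stereo variant completes. It assumes the two bar values for each task follow the legend order (RGB first, Stereo second); the dictionary layout and helper name are hypothetical.

```python
TRIALS = 20

# (setting, task) -> (RGB %, Stereo %), assuming legend order RGB then Stereo.
reported = {
    ("Real-World", "PnP Toast"): (70, 85),
    ("Real-World", "Turn On Radio"): (40, 60),
    ("Simulation", "Open Door"): (80, 90),
    ("Simulation", "Turn On Radio"): (70, 85),
}


def success_count(percent: int, trials: int = TRIALS) -> int:
    """Convert a success percentage into a whole number of successful trials."""
    count, remainder = divmod(percent * trials, 100)
    if remainder:
        raise ValueError(f"{percent}% is not a whole number of {trials} trials")
    return count


extra_successes = {}
for (setting, task), (rgb_pct, stereo_pct) in reported.items():
    rgb_n, stereo_n = success_count(rgb_pct), success_count(stereo_pct)
    extra_successes[(setting, task)] = stereo_n - rgb_n
    print(f"{setting:10s} {task:14s} RGB {rgb_n}/{TRIALS}  Stereo {stereo_n}/{TRIALS}")
```

With only 20 trials per task, differences of two to four trials should be read as indicative rather than precise.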

Tabletop Manipulation

Tabletop tasks: PnP Banana, Insert Toast, Plastic Cup Hang, Steel Cup Hang, and Glass Cup Hang. Each task is recorded with synchronized external (third-person) and wrist stereo views, each with a left and a right camera.

Videos are shown from the original 3x recordings.

Performance

Tabletop Manipulation

Method	Banana PnP	Toast Insert	Plastic Cup Hang	Steel Cup Hang	Glass Cup Hang	AVG SR (%)
RGB	12/20	7/20	12/20	10/20	1/20	42.0%
RGB-D	14/20	8/20	11/20	8/20	0/20	41.0%
RGBD-3DDA	13/20	9/20	13/20	10/20	0/20	45.0%
PCD-PointNet	7/20	0/20	5/20	2/20	0/20	14.0%
PCD-DP3	11/20	3/20	8/20	5/20	0/20	27.0%
MultiView	13/20	8/20	13/20	9/20	1/20	44.0%
StereoPolicy-DP	16/20	12/20	15/20	13/20	3/20	59.0%

Real-World Tabletop Task Performance (successes out of 20 trials per task). StereoPolicy-DP achieves the highest success count on every task. The point cloud (PCD) baselines have the lowest average success rates, and all RGB-D and PCD methods score 0/20 on Glass Cup Hang.
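The AVG SR column can be reproduced from the per-task counts. The sketch below recomputes it as total successes over total trials and reports the margin over the strongest baseline; the variable and function names are illustrative only.

```python
TRIALS_PER_TASK = 20

# Successes per task, in table order:
# Banana PnP, Toast Insert, Plastic Cup Hang, Steel Cup Hang, Glass Cup Hang.
success_counts = {
    "RGB": [12, 7, 12, 10, 1],
    "RGBD": [14, 8, 11, 8, 0],
    "RGBD-3DDA": [13, 9, 13, 10, 0],
    "PCD-PointNet": [7, 0, 5, 2, 0],
    "PCD-DP3": [11, 3, 8, 5, 0],
    "MultiView": [13, 8, 13, 9, 1],
    "StereoPolicy-DP": [16, 12, 15, 13, 3],
}


def average_success_rate(counts, trials=TRIALS_PER_TASK):
    """Average success rate in percent across tasks with equal trial counts."""
    return 100.0 * sum(counts) / (trials * len(counts))


avg_sr = {method: average_success_rate(c) for method, c in success_counts.items()}
baselines = [m for m in avg_sr if m != "StereoPolicy-DP"]
best_baseline = max(baselines, key=avg_sr.get)
margin = avg_sr["StereoPolicy-DP"] - avg_sr[best_baseline]

for method, rate in avg_sr.items():
    print(f"{method:16s} {rate:5.1f}%")
print(f"Margin over {best_baseline}: {margin:.1f} points")
```

The recomputed averages match the table, and the strongest baseline by average success rate is RGBD-3DDA.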

Baseline Modalities Failure Cases

Mobile Manipulation

Failure Case 1

Imprecise radio-handle grasp.

Failure Case 2

Imprecise button press.

Failure Case 3

Imprecise toast grasp.

Tabletop Manipulation

Tabletop failure videos are shown at 2x speed.

Failure Case 1

Toast misses slot.

Failure Case 2

Insertion misaligned.

Failure Case 3

Cup grasp misses.

Failure Case 4

Imprecise handle insertion.

Failure Case 5

Transparent cup missed.

Simulation Benchmarks

Diffusion Policy Performance

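Benchmark	OmniGibson	OmniGibson	OmniGibson	OmniGibson	OmniGibson	OmniGibson	OmniGibson	OmniGibson	RoboMimic	RoboMimic	RoboMimic	RoboMimic	RoboMimic	RoboMimic
Task	Strawberry	Strawberry	Pour Water	Pour Water	Open Door	Open Door	Turn On Radio	Turn On Radio	Tool Hang	Tool Hang	Square	Square	Transport	Transport
Method	100	200	100	200	200	300	200	300	100	200	100	200	100	200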
RGB	59.0%	88.0%	10.0%	46.0%	26.0%	77.0%	42.0%	71.0%	53.0%	90.0%	74.0%	98.0%	92.0%	94.0%
RGB-D	63.0%	85.0%	16.0%	52.0%	31.0%	80.0%	47.0%	73.0%	56.0%	88.0%	79.0%	92.0%	94.0%	94.0%
RGBD-3DDA	74.0%	93.0%	26.0%	61.0%	48.0%	100.0%	46.0%	75.0%	84.0%	92.0%	83.0%	97.0%	94.0%	96.0%
PCD-DP3	45.0%	63.0%	3.0%	31.0%	30.0%	69.0%	35.0%	64.0%	40.0%	76.0%	69.0%	88.0%	63.0%	72.0%
MultiView	68.0%	89.0%	21.0%	52.0%	31.0%	75.0%	43.0%	71.0%	54.0%	92.0%	78.0%	96.0%	92.0%	94.0%
StereoPolicy-DP	82.0%	100.0%	34.0%	70.0%	57.0%	100.0%	55.0%	82.0%	94.0%	96.0%	88.0%	100.0%	94.0%	96.0%

Simulation Task Performance of Diffusion Policies across Different Visual Modalities. StereoPolicy-DP matches or exceeds every baseline in every column, and stereo input improves performance most in the low-data regime.
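The low-data trend can be checked directly against the table. The sketch below compares StereoPolicy-DP with the RGB baseline, assuming that the two numeric columns under each task are training-set sizes, so that the smaller number is the low-data setting. That reading of the column headers is an assumption, as are the variable names.

```python
# Success rates (%) in table column order; each task has a (smaller, larger) setting pair:
# Strawberry, Pour Water, Open Door, Turn On Radio, Tool Hang, Square, Transport.
rgb = [59, 88, 10, 46, 26, 77, 42, 71, 53, 90, 74, 98, 92, 94]
stereo = [82, 100, 34, 70, 57, 100, 55, 82, 94, 96, 88, 100, 94, 96]

gains = [s - r for s, r in zip(stereo, rgb)]

# Even indices hold the smaller setting of each pair, odd indices the larger one.
low_data_gains = gains[0::2]
high_data_gains = gains[1::2]

mean_low = sum(low_data_gains) / len(low_data_gains)
mean_high = sum(high_data_gains) / len(high_data_gains)

print(f"Mean gain over RGB, smaller setting: {mean_low:.1f} points")
print(f"Mean gain over RGB, larger setting:  {mean_high:.1f} points")
```

Under this reading, the average improvement over RGB is roughly twice as large in the smaller setting as in the larger one.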

VLA Average Performance on RoboCasa-Kitchen 24 Tasks

Average StereoPolicy-VLA performance on the 24 RoboCasa-Kitchen tasks, shown in separate panels for Pi0.5 and GROOT-N1.5.

Citation

@misc{han2026stereopolicyimprovingroboticmanipulation,
      title={StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception}, 
      author={Evans Han and Yunfan Jiang and Yingke Wang and Haoyue Xiao and Huang Huang and Jianwen Xie and Jiajun Wu and Li Fei-Fei and Ruohan Zhang},
      year={2026},
      eprint={2605.09989},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.09989}, 
}
