Depth estimation has long been a fundamental problem in computer vision, with numerous monocular and stereo-based methods successfully applied in fields such as robotics and autonomous driving. While monocular depth estimation methods have achieved impressive results on various real-world image and video datasets, their lack of absolute scale information continues to limit their practical use. In contrast, stereo-based approaches can readily produce depth maps with scale information. However, when applied to consecutive video frames, these methods often suffer from poor temporal consistency. To address this challenge, we propose a stereo-based model tailored for video data, offering strong zero-shot inference capabilities and robust temporal coherence. The model does not require full retraining; fine-tuning on a small dataset is sufficient to significantly enhance spatiotemporal consistency. Our experiments on the Sintel dataset demonstrate the effectiveness of the proposed approach.
As shown in the figure above, our method integrates monocular priors and stereo matching in a unified structure, enabling both temporal consistency and metric depth prediction.
We present a comparison among the original input video, the disparity estimated by FoundationStereo, and the temporally consistent disparity produced by our method.
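To make the notion of temporal consistency concrete, here is a minimal proxy metric (our own sketch, not the evaluation protocol used in this project): the mean absolute frame-to-frame change in disparity. For a static scene, a temporally consistent predictor should score low; with camera or object motion the maps would first need to be warped into a common frame before comparing.

```python
import numpy as np

def temporal_consistency_error(disparities):
    """Mean absolute frame-to-frame disparity change.

    disparities: array of shape (T, H, W), one disparity map per frame.
    Lower is smoother. Note: this is an illustrative proxy only; it
    ignores scene motion, which a real metric would compensate for
    by warping frames with optical flow.
    """
    d = np.asarray(disparities, dtype=np.float64)
    # Absolute difference between consecutive frames, averaged over pixels.
    return float(np.mean(np.abs(d[1:] - d[:-1])))

# A perfectly static prediction has zero temporal error;
# adding per-frame jitter raises it.
static = np.ones((4, 8, 8))
jittery = static + np.random.default_rng(0).normal(0.0, 0.1, static.shape)
print(temporal_consistency_error(static))        # → 0.0
print(temporal_consistency_error(jittery) > 0)   # → True
```

Flickering artifacts from per-frame stereo inference show up directly in such a metric, which is why fine-tuning for spatiotemporal consistency can be measured rather than only judged visually.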
We would like to express our sincere gratitude to Haofei Xu for his patient guidance and continuous support throughout this project. His feedback during weekly meetings and technical discussions was instrumental in shaping our work.
We also thank Prof. Marc Pollefeys and Dr. Daniel Barath for designing and teaching this excellent course, which gave us the opportunity to explore and build this project.
Finally, we appreciate the great examples and inspiration provided by Video Depth Anything and FoundationStereo.