Apple’s AI research division has introduced a groundbreaking model named Depth Pro, which promises to revolutionize machine depth perception. This advancement could have far-reaching impacts across various sectors, including augmented reality and self-driving cars.
Depth Pro is an AI model designed to generate high-resolution 3D depth maps from a single 2D image in a fraction of a second. Unlike many conventional depth-estimation approaches, it does not depend on camera metadata, such as focal length, that is usually required to recover real-world scale.
The model can produce a 2.25-megapixel depth map in about 0.3 seconds on a standard GPU, showcasing both its efficiency and speed. This performance metric indicates a leap in real-time depth estimation technology.
This technology has profound implications for various industries, including augmented reality (AR), autonomous vehicles, robotics, and any field requiring real-time 3D environment understanding. For AR, in particular, being able to estimate metric depth from any given image without prior training on specific data sets (zero-shot learning) could enhance how virtual objects are integrated into the real world with accurate scale and perspective.
Depth Pro’s ability to suppress “flying pixels” at object boundaries and to provide metric depth in a zero-shot regime sets it apart. In practice, this means the model can offer real-world measurements from images without needing extensive domain-specific training, which could significantly cut down on development time and costs for new applications.
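Metric depth is what makes this especially useful for AR: combined with the camera’s focal length, every pixel of a depth map can be back-projected into a real-world 3D coordinate using the standard pinhole camera model, so virtual objects can be placed at correct physical scale. A minimal sketch of that back-projection step (the `backproject` helper and the intrinsics values here are illustrative, not part of Apple’s released API):

```python
import numpy as np

def backproject(depth_m, fx, fy, cx, cy):
    """Convert a metric depth map (in meters) into a 3D point cloud
    using the standard pinhole camera model:
      X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (h, w, 3), meters

# Toy example: a 2x2 depth map where every pixel is 2 meters away.
# fx, fy, cx, cy are made-up intrinsics for illustration only.
depth = np.full((2, 2), 2.0)
points = backproject(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```

Because the depth values are metric rather than relative, the resulting point coordinates are in meters, which is exactly what an AR renderer needs to anchor virtual content with correct scale.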
Apple has also released Depth Pro’s code and weights on GitHub, encouraging further development and application in various fields. This move could accelerate innovation in areas beyond what Apple might directly pursue.
Depth Pro could eventually be integrated into products like the iPhone or the Apple Vision Pro, enhancing their AR capabilities or even improving functionalities like computational photography or biometric recognition.
The image above shows a comparison of depth maps from Apple’s Depth Pro, Marigold, Depth Anything v2, and Metric3D v2. Depth Pro excels at capturing fine details like fur and birdcage wires, producing sharp, high-resolution depth maps in just 0.3 seconds and outperforming the other models in accuracy and detail. (credit: arxiv.org)