Apple researchers are advancing AI and ML through fundamental research, and to support the broader research community and help accelerate progress in this field, we share much of our research through publications and engagement at conferences. This week, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) will take place in Nashville, Tennessee. Apple is proud to once again participate in this important event for the community and to be an industry sponsor.
At the main conference and associated workshops, Apple researchers will present new research across a number of topics in computer vision, including vision language models, 3D photogrammetry, large multimodal models, and video diffusion models.
CVPR attendees will be able to experience demonstrations of Apple’s ML research in our booth #1217 during exhibition hours. Apple is also sponsoring and participating in a number of affinity group-hosted events that support underrepresented groups in the ML community. A comprehensive overview of Apple’s participation in and contributions to CVPR 2025 can be found here, and a selection of highlights follows below.
FastVLM: Efficient Vision Encoding for Vision Language Models
The performance of Vision Language Models (VLMs) improves as the resolution of input images increases, but popular visual encoders such as ViTs become inefficient at high resolutions because of the large number of tokens and high encoding latency. For many production use-cases, VLMs need to be both accurate and efficient to meet the low-latency demands of real-time applications and run on device for privacy-preserving AI experiences.
At CVPR 2025, Apple researchers will present FastVLM: Efficient Vision Encoding for Vision Language Models. The work shares FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Using this efficient encoder for high-resolution input, FastVLM significantly improves the accuracy-latency trade-off with a simple design. FastVLM delivers accurate, fast, and efficient visual query processing, making it suitable for powering real-time applications on device. The inference code, model checkpoints, and an iOS/macOS demo app based on MLX are available here.
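To make the resolution-scaling issue concrete, the sketch below compares the number of visual tokens a plain ViT produces at increasing input resolutions with those from a hybrid encoder that downsamples more aggressively before its transformer stages. It is illustrative only; the downsampling factor is an assumption, not the FastViTHD architecture.

```python
# Illustrative arithmetic only -- not the FastViTHD implementation.
# A standard ViT tokenizes an image into (H/p) * (W/p) patches, so the token
# count grows quadratically with resolution; a hybrid encoder with a larger
# effective downsampling factor emits far fewer tokens for the same input.

def vit_token_count(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens a plain ViT produces for a square image."""
    return (image_size // patch_size) ** 2

def hybrid_token_count(image_size: int, total_downsample: int = 64) -> int:
    """Token count for a hypothetical hybrid encoder whose convolutional
    stages and pooling give a larger overall downsampling factor."""
    return (image_size // total_downsample) ** 2

for res in (336, 672, 1024):
    print(f"{res}px: ViT={vit_token_count(res):5d} tokens, "
          f"hybrid={hybrid_token_count(res):4d} tokens")
```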
Matrix3D: Large Photogrammetry Model All-in-One
Photogrammetry allows 3D scenes to be constructed from 2D images, but the traditional approach has two limitations. First, it usually requires a dense collection of 2D images to achieve robust and accurate 3D reconstruction. Second, the pipeline generally entails a number of independent processing tasks – like feature detection, structure-from-motion, and multi-view stereo – that are not coordinated or jointly optimized with one another.
In a Highlight presentation at CVPR, Apple researchers will present a new approach that overcomes these prior limitations. The paper Matrix3D: Large Photogrammetry Model All-in-One shares a single unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The multimodal training for this approach integrates a mask learning strategy that enables full-modality training even with partially complete data, such as bi-modal image-pose and image-depth pairs, which significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks, and it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Code is available here.
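The mask learning strategy can be sketched as follows; this is a toy illustration under assumed modality names and feature sizes, not Matrix3D's code. Any modality missing from a training example is replaced by a placeholder and excluded from the loss, so partially complete pairs such as image-pose or image-depth still contribute to full-modality training.

```python
import numpy as np

# Hypothetical modalities and toy feature sizes, for illustration only.
MODALITIES = {"image": 64, "pose": 7, "depth": 64}

def build_example(available: dict) -> dict:
    """Assemble one training example from whichever modalities are present."""
    tokens, loss_mask = {}, {}
    for name, dim in MODALITIES.items():
        if name in available:
            tokens[name] = np.asarray(available[name], dtype=np.float32)
            loss_mask[name] = True          # supervise this modality
        else:
            tokens[name] = np.zeros(dim, dtype=np.float32)  # placeholder / mask token
            loss_mask[name] = False         # excluded from the loss
    return {"tokens": tokens, "loss_mask": loss_mask}

# An image-depth pair with no ground-truth pose still forms a valid example.
example = build_example({"image": np.random.rand(64), "depth": np.random.rand(64)})
print(example["loss_mask"])  # {'image': True, 'pose': False, 'depth': True}
```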
Multimodal Autoregressive Pre-Training of Large Vision Encoders
Large multimodal models are commonly trained by pairing a large language decoder with a vision encoder. These vision encoders are usually pre-trained with a discriminative objective, such as contrastive loss, but this creates a mismatch between pre-training and the generative autoregressive downstream task. Following the success of autoregressive approaches for training language models, autoregressive image models have been shown to pre-train strong and scalable vision encoders.
In a Highlight presentation at CVPR 2025, Apple ML researchers will share Multimodal Autoregressive Pre-Training of Large Vision Encoders, which describes AIMv2, a family of large, strong vision encoders pre-trained with a multimodal autoregressive objective. A multimodal decoder generates both raw patches and text tokens, leading these models to excel not only at multimodal tasks but also in visual recognition benchmarks such as localization, grounding, and classification. The work also shows that AIMv2 models are efficient to train, outperforming the current state of the art with significantly fewer samples seen during pre-training. Code and model checkpoints are available here.
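Conceptually, a multimodal autoregressive objective of this kind pairs a regression loss on predicted image patches with a next-token cross-entropy on text. The toy sketch below illustrates that combination under assumed shapes and loss choices; it is not the AIMv2 implementation.

```python
import numpy as np

# Toy sketch: the decoder predicts each element of a concatenated
# [image patches, text tokens] sequence from the elements before it.
# Patches use a regression loss, text tokens a cross-entropy loss.

def patch_regression_loss(pred_patches, target_patches):
    # e.g. mean squared error on raw pixel patches
    return float(np.mean((pred_patches - target_patches) ** 2))

def text_cross_entropy(pred_logits, target_ids):
    # standard next-token cross-entropy over the vocabulary axis
    log_probs = pred_logits - np.log(np.exp(pred_logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(target_ids)), target_ids]))

# Toy tensors: 4 predicted patches of 16 values, 5 text tokens over a 10-word vocab.
rng = np.random.default_rng(0)
pred_patches, target_patches = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
pred_logits, target_ids = rng.normal(size=(5, 10)), rng.integers(0, 10, size=5)

total = patch_regression_loss(pred_patches, target_patches) + \
        text_cross_entropy(pred_logits, target_ids)
print(f"combined autoregressive objective: {total:.3f}")
```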
World-Consistent Video Diffusion with Explicit 3D Modeling
Diffusion models have become the dominant paradigm for realistic image and video generation, but these models still struggle with efficiently and explicitly generating 3D-consistent content. Traditionally, these methods implicitly learn 3D consistency by generating only RGB frames, which can lead to artifacts and inefficiencies in training.
In a Highlight presentation at CVPR, Apple researchers will share World-Consistent Video Diffusion with Explicit 3D Modeling, which details a new approach that addresses these challenges. This technique, World-consistent Video Diffusion (WVD), trains a diffusion transformer to learn the joint distribution of both RGB (color) and XYZ (coordinates in space) frames. As a result, the model can adapt to multiple tasks with a flexible inpainting capability. For example, given ground-truth RGB, the model can estimate XYZ frames; or, it can generate novel RGB frames using XYZ projections along a specified camera trajectory. With this flexibility, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation.
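This task switching can be pictured as channel-wise inpainting over frames that stack RGB and XYZ: observed channels are held fixed while the diffusion model fills in the rest. The sketch below builds such conditioning masks under an assumed 6-channel layout and hypothetical names; it is not the WVD code.

```python
import numpy as np

# Assumed layout: each frame has 6 channels -- RGB colors plus XYZ world
# coordinates. True entries in the mask mark channels that are given as
# conditioning; False entries are generated by the diffusion model.

def make_condition_mask(num_frames: int, h: int, w: int, observed: str) -> np.ndarray:
    """Return a boolean mask over (frames, 6, H, W); True = known channel."""
    mask = np.zeros((num_frames, 6, h, w), dtype=bool)
    if observed in ("rgb", "both"):
        mask[:, :3] = True   # RGB channels are observed
    if observed in ("xyz", "both"):
        mask[:, 3:] = True   # XYZ channels are observed
    return mask

# Geometry estimation: RGB observed, XYZ generated.
geometry_mask = make_condition_mask(num_frames=8, h=32, w=32, observed="rgb")
# Camera-controlled synthesis: XYZ projections observed, RGB generated.
synthesis_mask = make_condition_mask(num_frames=8, h=32, w=32, observed="xyz")
print(geometry_mask.any(axis=(0, 2, 3)), synthesis_mask.any(axis=(0, 2, 3)))
```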
Demonstrating ML Research in the Apple Booth
During exhibition hours, CVPR attendees will be able to interact with live demos of Apple ML research in booth #1217, including FastVLM, described above.
Supporting the ML Research Community
Apple is committed to supporting underrepresented groups in the ML community. We are proud to again sponsor multiple affinity groups hosting events onsite at CVPR, including LatinX in CV (LXCV, a sub-group of LXAI), which hosts a workshop on June 11, and Women in Computer Vision (WiCV), which hosts a workshop on June 12.
Learn More about Apple ML Research at CVPR 2025
CVPR brings together the community of researchers advancing the state of the art in computer vision, and Apple is proud to again share innovative new research at the event and connect with the community attending it. This post highlights just a selection of the works Apple ML researchers will present at CVPR 2025, and a comprehensive overview and schedule of our participation can be found here.