Apple Machine Learning Research at CVPR 2025

June 13, 2025


Apple researchers are advancing AI and ML through fundamental research, and to support the broader research community and help accelerate progress in this field, we share much of our research through publications and engagement at conferences. This week, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) will take place in Nashville, Tennessee. Apple is proud to once again participate in this important event for the community and to be an industry sponsor.

At the main conference and associated workshops, Apple researchers will present new research across a number of topics in computer vision, including vision language models, 3D photogrammetry, large multimodal models, and video diffusion models.

CVPR attendees will be able to experience demonstrations of Apple's ML research at booth #1217 during exhibition hours. Apple is also sponsoring and participating in a number of affinity group-hosted events that support underrepresented groups in the ML community. A comprehensive overview of Apple's participation in and contributions to CVPR 2025 can be found here, and a selection of highlights follows below.

FastVLM: Efficient Vision Encoding for Vision Language Models

The performance of Vision Language Models (VLMs) improves as the resolution of input images increases, but popular visual encoders such as ViTs become inefficient at high resolutions because of the large number of tokens and high encoding latency. For many production use-cases, VLMs need to be both accurate and efficient to meet the low-latency demands of real-time applications and run on device for privacy-preserving AI experiences.

At CVPR 2025, Apple researchers will present FastVLM: Efficient Vision Encoding for Vision Language Models. The work introduces FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Using this efficient encoder for high-resolution input, FastVLM significantly improves accuracy-latency trade-offs with a simple design. FastVLM delivers accurate, fast, and efficient visual query processing, making it suitable for powering real-time applications on device. The inference code, model checkpoints, and an iOS/macOS demo app based on MLX are available here.

Figure 1: Demo app running FastVLM 0.5B model with MLX on iPhone 16 Pro.
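To make the encoder-efficiency point above concrete, here is a minimal back-of-the-envelope sketch, not the FastVLM implementation; the patch size and downsampling factor are illustrative assumptions. It shows why a plain ViT's visual token count, and therefore the prefill work handed to the language decoder, grows quadratically with image resolution, while an encoder that downsamples more aggressively before emitting tokens stays cheap at high resolutions.

```python
# Hypothetical comparison (not the FastVLM implementation): visual token
# count for a plain ViT vs. a hybrid encoder that downsamples more
# aggressively before emitting tokens. Patch size and downsampling factor
# are illustrative assumptions.

def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """A plain ViT emits one token per patch, so the count grows
    quadratically with image resolution."""
    return (image_size // patch_size) ** 2


def hybrid_token_count(image_size: int, downsample: int = 64) -> int:
    """A hypothetical hybrid conv/transformer encoder that reduces spatial
    resolution by `downsample` before tokenization emits far fewer tokens."""
    return (image_size // downsample) ** 2


for size in (224, 448, 896, 1344):
    print(f"{size:4d}px  ViT tokens: {vit_token_count(size):5d}  "
          f"hybrid tokens: {hybrid_token_count(size):4d}")
```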

Matrix3D: Large Photogrammetry Model All-in-One

Photogrammetry allows 3D scenes to be constructed from 2D images, but the traditional approach has two limitations. First, it usually requires a dense collection of 2D images to achieve robust and accurate 3D reconstruction. Second, the pipeline generally entails a number of independent processing tasks – like feature detection, structure-from-motion, and multi-view stereo – that are not correlated or jointly optimized with one another.

In a Highlight presentation at CVPR, Apple researchers will present a new approach that overcomes these prior limitations. The paper Matrix3D: Large Photogrammetry Model All-in-One shares a single unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The multimodal training for this approach integrates a mask learning strategy that enables full-modality training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, which significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks, and it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Code is available here.
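The mask learning strategy is easiest to see in a toy sketch. The following is a hedged illustration of the general idea, not the Matrix3D code; the modality names, keep probability, and context/target bookkeeping are assumptions made for the example. Each training example may carry only a subset of the image, camera-pose, and depth modalities, and some of the present modalities are additionally masked so the model learns to reconstruct any modality from whatever is observed.

```python
import random

# Hedged sketch of a mask-learning strategy (not the Matrix3D code):
# decide, per modality, whether it is conditioning context, a prediction
# target, or simply missing from this (possibly bi-modal) training example.

MODALITIES = ("image", "pose", "depth")


def build_training_mask(available: set, keep_prob: float = 0.5) -> dict:
    """Return, per modality, whether it is given to the model as
    conditioning ('context') or must be predicted ('target'). Missing
    modalities are excluded from both the loss and the conditioning."""
    mask = {}
    for m in MODALITIES:
        if m not in available:
            mask[m] = "missing"      # e.g. image-pose pair with no depth
        elif random.random() < keep_prob:
            mask[m] = "context"      # visible to the model
        else:
            mask[m] = "target"       # masked out and reconstructed
    # Ensure at least one present modality is a prediction target.
    if all(v != "target" for v in mask.values() if v != "missing"):
        present = [m for m in MODALITIES if m in available]
        mask[random.choice(present)] = "target"
    return mask


# Example: an image-pose pair with no ground-truth depth still yields a
# valid training signal.
print(build_training_mask({"image", "pose"}))
```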

Multimodal Autoregressive Pre-Training of Large Vision Encoders

Large multimodal models are commonly trained by pairing a large language decoder with a vision encoder. These vision encoders are usually pre-trained with a discriminative objective, such as contrastive loss, but this creates a mismatch between pre-training and the generative autoregressive downstream task. Following the success of autoregressive approaches for training language models, autoregressive image models have been shown to pre-train strong and scalable vision encoders.

In a Highlight presentation at CVPR 2025, Apple ML researchers will share Multimodal Autoregressive Pre-Training of Large Vision Encoders, which describes AIMv2, a family of large, strong vision encoders pre-trained with a multimodal autoregressive objective. A multimodal decoder generates both raw image patches and text tokens, leading these models to excel not only at multimodal tasks but also at visual recognition tasks such as localization, grounding, and classification. The work also shows that AIMv2 models are efficient to train, outperforming the current state of the art with significantly fewer samples seen during pre-training. Code and model checkpoints are available here.
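As a rough illustration of what a multimodal autoregressive objective can look like, here is a hedged sketch, not the AIMv2 training code; the loss weighting, tensor shapes, and the choice of mean-squared error for patch regression are assumptions. The decoder predicts raw image patches with a regression loss and text tokens with a standard next-token cross-entropy, and the two terms are combined into one objective.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a multimodal autoregressive objective (not the AIMv2
# code): regression on next image patches plus cross-entropy on next text
# tokens, summed with an assumed weighting.


def multimodal_ar_loss(pred_patches, target_patches,
                       text_logits, target_text_ids,
                       text_weight: float = 1.0):
    """pred_patches/target_patches: (B, N_img, patch_dim);
    text_logits: (B, N_txt, vocab); target_text_ids: (B, N_txt)."""
    # Pixel-level regression for the image part of the sequence.
    patch_loss = F.mse_loss(pred_patches, target_patches)
    # Standard next-token prediction for the text part of the sequence.
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.shape[-1]),
        target_text_ids.reshape(-1),
    )
    return patch_loss + text_weight * text_loss


# Toy shapes just to show the call.
B, N_img, D, N_txt, V = 2, 16, 768, 8, 1000
loss = multimodal_ar_loss(
    torch.randn(B, N_img, D), torch.randn(B, N_img, D),
    torch.randn(B, N_txt, V), torch.randint(0, V, (B, N_txt)),
)
print(loss.item())
```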

World-Consistent Video Diffusion with Explicit 3D Modeling

Diffusion models have become the dominant paradigm for realistic image and video generation, but these models still struggle with efficiently and explicitly generating 3D-consistent content. Traditionally, these methods implicitly learn 3D consistency by generating only RGB frames, which can lead to artifacts and inefficiencies in training.

In a Highlight presentation at CVPR, Apple researchers will share World-Consistent Video Diffusion with Explicit 3D Modeling, which details a new approach that addresses these challenges. This technique, World-consistent Video Diffusion (WVD), trains a diffusion transformer to learn the joint distribution of both RGB (color) and XYZ (coordinates in space) frames. As a result, the model can adapt to multiple tasks with a flexible inpainting capability. For example, given ground-truth RGB, the model can estimate XYZ frames; or, it can generate novel RGB frames using XYZ projections along a specified camera trajectory. With this flexibility, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation.

Figure 2: Pipeline of the proposed World-consistent Video Diffusion Model.
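The flexible inpainting capability mentioned above can be sketched with a simple conditioning mask. The following is a hedged illustration of the general mechanism, not the WVD sampler; the channel layout, tensor shapes, and the step of overwriting known entries during denoising are assumptions made for the example. Keeping the observed modality fixed while denoising only the other one is what lets a single joint RGB+XYZ model switch between tasks like geometry estimation and camera-controlled RGB generation.

```python
import torch

# Hedged sketch of inpainting-style conditioning for a joint RGB+XYZ
# diffusion model (not the WVD code): mark which modality is observed and
# overwrite those entries at each denoising step so only the unknown
# modality is actually generated.


def build_condition_mask(num_frames: int, h: int, w: int, known: str) -> torch.Tensor:
    """Boolean mask over a (frames, 6, h, w) tensor whose first 3 channels
    are RGB and last 3 are XYZ; True marks observed values."""
    mask = torch.zeros(num_frames, 6, h, w, dtype=torch.bool)
    if known == "rgb":      # e.g. estimate geometry (XYZ) from a video
        mask[:, :3] = True
    elif known == "xyz":    # e.g. generate RGB along a camera trajectory
        mask[:, 3:] = True
    return mask


def apply_conditioning(x_t: torch.Tensor, x_known: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """Keep observed entries fixed; leave the rest to the denoiser."""
    return torch.where(mask, x_known, x_t)


frames, h, w = 4, 32, 32
mask = build_condition_mask(frames, h, w, known="rgb")
x_t = torch.randn(frames, 6, h, w)       # current noisy sample
x_known = torch.randn(frames, 6, h, w)   # observed RGB (XYZ entries unused)
x_t = apply_conditioning(x_t, x_known, mask)
```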

Demonstrating ML Research in the Apple Booth

During exhibition hours, CVPR attendees will be able to interact with live demos of Apple ML research in booth #1217, including FastVLM, described above.

Supporting the ML Research Community

Apple is committed to supporting underrepresented groups in the ML community. We are proud to again sponsor multiple affinity groups hosting events onsite at CVPR, including LatinX in CV (LXCV, a sub-group of LXAI), whose workshop takes place on June 11, and Women in Computer Vision (WiCV), whose workshop takes place on June 12.

Learn More about Apple ML Research at CVPR 2025

CVPR brings together the community of researchers advancing the state of the art in computer vision, and Apple is proud to again share innovative new research at the event and connect with the community attending it. This post highlights just a selection of the works Apple ML researchers will present at CVPR 2025, and a comprehensive overview and schedule of our participation can be found here.

