Apple researchers are advancing machine learning (ML) and AI through fundamental research that improves the world’s understanding of this technology and helps to redefine what is possible with it. To support the broader research community and help accelerate progress in this field, we share much of our research through publications, open source resources, and engagement at conferences.
This week, the Thirteenth International Conference on Learning Representations (ICLR) will be held in Singapore. ICLR brings together leading experts on deep learning and the application of representation learning, and Apple is proud to once again participate in this important event for the community and to support it with our sponsorship.
At the main conference and associated workshops, Apple researchers will present innovative research across a variety of topics in ML and AI, including visual understanding, generative AI, reasoning, instruction-following and uncertainty, and efficiency, as well as fundamental topics like attention and optimization. A number of notable Apple ML research papers accepted at ICLR are detailed in the sections below.
ICLR attendees will be able to experience demonstrations of Apple’s ML research in our booth (C03) during exhibition hours, and Apple is also sponsoring and participating in a number of affinity group-hosted events that support underrepresented groups in the ML community. A comprehensive overview of Apple’s participation in and contributions to ICLR 2025 can be found here, and a selection of highlights follows below.
Estimating Metric Depth from a 2D Image
Estimating depth from a single image underpins a growing number of applications, including conditional image generation, view synthesis, advanced image editing, and augmented reality. However, accurate depth estimation has previously been limited to narrow domains or low resolutions, suffered from long runtimes, or required known metadata such as the camera intrinsics.
At ICLR, Apple ML researchers will present their work Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, which surpasses these prior limitations. From a single image, Depth Pro synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details (see Figure 1). The model delivers metric predictions with absolute scale, and it does not rely on the availability of metadata such as camera intrinsics. Attendees will be able to explore this work in a demo in the Apple booth, and code is available here.
New Methods for Text-to-Image Generation and Control
The Apple ML research work to be presented at ICLR includes two papers relating to text-to-image generation and control. One shares a new technique for fine-grained control over the output of generative text and image models, and the other presents a new approach for diffusion-based text-to-image generation.
Large generative models are becoming increasingly capable and more widely deployed to power production applications, but getting these models to produce exactly what’s desired can still be challenging. Fine-grained control over these models’ outputs is important to meet user expectations and to mitigate potential misuses, ensuring the models’ reliability and safety. In a Spotlight presentation at ICLR, Apple ML researchers will share a new technique to address these issues: Controlling Language and Diffusion Models by Transporting Activations. The work shares Activation Transport (AcT), a general framework to steer activations (see Figure 2) guided by optimal transport theory, which generalizes many previous activation-steering works. AcT is modality-agnostic and works for LLMs as well as text-to-image diffusion models, providing fine-grained control over the model’s behavior with negligible computational overhead, while minimally impacting the model’s abilities. Code is available here, and for more on this work, read the Research Highlight post here.
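To give a sense of the general idea behind steering activations via transport, the minimal sketch below fits a per-dimension affine map between the statistics of activations collected under a source and a target behavior, and then interpolates toward the transported activations with a strength parameter. The function names, shapes, and the simple Gaussian transport map are illustrative assumptions, not the implementation from the paper or its released code.

```python
import numpy as np

def fit_affine_transport(source_acts, target_acts):
    """Fit a per-dimension affine map taking source activation statistics to
    target activation statistics (the 1-D optimal-transport map between
    Gaussians: x -> mu_t + (sigma_t / sigma_s) * (x - mu_s))."""
    mu_s, sigma_s = source_acts.mean(0), source_acts.std(0) + 1e-6
    mu_t, sigma_t = target_acts.mean(0), target_acts.std(0) + 1e-6
    scale = sigma_t / sigma_s
    shift = mu_t - scale * mu_s
    return scale, shift

def steer(acts, scale, shift, strength=1.0):
    """Interpolate between the original and transported activations."""
    transported = acts * scale + shift
    return (1.0 - strength) * acts + strength * transported

# Toy usage: collect activations from prompts exhibiting the source and
# target behaviors, fit the map once, then apply it inside a forward hook.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(512, 768))   # placeholder activations
target = rng.normal(0.5, 1.5, size=(512, 768))
scale, shift = fit_affine_transport(source, target)
steered = steer(source[:4], scale, shift, strength=0.7)
```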
At ICLR, Apple ML researchers will also share work proposing an alternative to the diffusion models that have become predominant for text-to-image generation tasks. These diffusion models are trained by denoising a Markovian process that gradually adds noise to the input, and in DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation, Apple ML researchers argue that this Markovian property leads to inefficiencies during training and inference because it limits the model’s ability to fully utilize the generation trajectory. To address this limitation, the paper shares DART: a transformer-based model that unifies autoregressive modeling and diffusion within a non-Markovian framework. This approach iteratively denoises image patches spatially and spectrally using an autoregressive model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility, and it seamlessly trains with both text and image data in a unified model. DART demonstrates competitive performance (see Figure 3) on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models.
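The conceptual sketch below contrasts a standard Markovian sampler, where each denoising step sees only the current state, with a non-Markovian one, where an autoregressive model conditions on the entire trajectory generated so far. The callables and toy usage are placeholders, not DART’s architecture or sampler.

```python
import numpy as np

def markovian_generate(denoiser, x_T, timesteps):
    """Standard diffusion-style sampling: each step sees only the current state."""
    x = x_T
    for t in reversed(timesteps):
        x = denoiser(x, t)                      # models p(x_{t-1} | x_t)
    return x

def non_markovian_generate(trajectory_model, x_T, timesteps):
    """Non-Markovian sampling: the model conditions on the whole trajectory
    generated so far, not just the latest state."""
    trajectory = [x_T]
    for t in reversed(timesteps):
        x_next = trajectory_model(trajectory, t)  # p(x_{t-1} | x_t, ..., x_T)
        trajectory.append(x_next)
    return trajectory[-1]

# Toy usage with stand-in callables (real models would be transformers).
x_T = np.random.default_rng(0).normal(size=(4, 4))
steps = list(range(8))
out_markov = markovian_generate(lambda x, t: 0.9 * x, x_T, steps)
out_traj = non_markovian_generate(lambda traj, t: 0.9 * traj[-1], x_T, steps)
```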
Exploring LLMs for Sequential Decision Making
Sequential decision-making is central to many real-world challenges for AI. In these tasks, an agent interacts with a dynamic environment and, to be successful, must balance exploratory behavior with maximizing some utility function, a problem setting widely known as reinforcement learning (RL). While RL algorithms have proven effective for many sequential decision-making tasks, they often require substantial information about the environment in order to learn the optimal behavior. At ICLR, Apple ML researchers will present On the Modeling Capabilities of Large Language Models for Sequential Decision Making, which explores the capabilities of LLMs for RL across a variety of interactive domains. The work shows that LLMs’ general knowledge can be leveraged for policy learning for RL agents, and the results suggest that foregoing costly human-designed reward functions in favor of automatic annotations by generalist foundation models can be a viable and cost-efficient path to training better interactive agents. Code is available here.
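As a rough illustration of replacing a hand-designed reward with annotations from a foundation model, the sketch below queries a scoring callable with a prompt describing the task, observation, and action, and treats the returned number as a reward. The prompt wording and the `llm_score_fn` interface are hypothetical stand-ins, not the paper’s setup.

```python
def llm_reward(llm_score_fn, task_description, observation, action):
    """Ask a generalist foundation model for a scalar reward annotation
    instead of relying on a hand-designed reward function.
    `llm_score_fn` is a hypothetical callable returning a number in [0, 1]."""
    prompt = (
        f"Task: {task_description}\n"
        f"Observation: {observation}\n"
        f"Action taken: {action}\n"
        "On a scale from 0 to 1, how much progress does this action make "
        "toward completing the task? Answer with a single number."
    )
    return float(llm_score_fn(prompt))

# Toy usage with a stub scorer; the resulting rewards could then feed any
# standard RL update in place of an environment- or human-designed reward.
stub_scorer = lambda prompt: 0.5
r = llm_reward(stub_scorer, "stack the blocks", "blocks scattered", "pick up the red block")
```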
Understanding and Advancing LLMs’ Ability to Reason
Among the Apple ML research that will be presented at ICLR are two papers relating to LLMs’ ability to do mathematical reasoning.
Driven by research innovations, LLMs have grown increasingly capable, but multi-step reasoning, like that required to solve complex math and coding problems, has remained a challenge. One thing that makes this difficult is that each reasoning step is an opportunity to introduce errors, and maintaining consistency across steps is challenging, particularly for autoregressive LLMs. A promising strategy to mitigate this is verification, in which multiple candidate solutions are sampled from the LLM and then evaluated by an external verifier. The verification results are then used to adjust the weight of each solution in determining the final answer. However, current verification approaches suffer from sampling inefficiencies, requiring a large number of samples to achieve satisfactory performance. Additionally, training an effective verifier often depends on extensive process supervision, which is costly to acquire.
At ICLR, Apple ML researchers will present a new approach that addresses these limitations. The paper, Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo, shares a novel verification method based on Twisted Sequential Monte Carlo (TSMC), which sequentially refines its sampling effort to focus exploration on promising candidates, resulting in more efficient generation of high-quality solutions. TSMC is applied to LLMs by estimating the expected future rewards for partial solutions, and this approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
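A minimal sketch of the sequential-Monte-Carlo idea appears below: a population of partial solutions is extended one step at a time, weighted by an estimate of expected future reward, and resampled so that effort concentrates on promising candidates. The `extend_step` and `value_estimate` callables, the particle count, and the weighting scheme are simplifying assumptions rather than the paper’s TSMC formulation.

```python
import numpy as np

def smc_verify(extend_step, value_estimate, num_particles=8, num_steps=4, seed=0):
    """Maintain a population of partial solutions, weight them by an estimate
    of expected future reward, and resample so computation concentrates on
    promising candidates. `extend_step(partial)` appends one reasoning step;
    `value_estimate(partial)` scores a partial solution."""
    rng = np.random.default_rng(seed)
    particles = ["" for _ in range(num_particles)]
    for _ in range(num_steps):
        particles = [extend_step(p) for p in particles]
        weights = np.array([value_estimate(p) for p in particles], dtype=float)
        weights /= weights.sum()
        idx = rng.choice(num_particles, size=num_particles, p=weights)
        particles = [particles[i] for i in idx]       # resample by weight
    return particles

# Toy usage with stub callables standing in for an LLM and a value model.
toy_extend = lambda p: p + "step;"
toy_value = lambda p: 1.0 + p.count("step")
final_candidates = smc_verify(toy_extend, toy_value)
```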
Apple ML researchers will also share work at the conference that reveals the limitations of GSM8K, a popular mathematical reasoning benchmark, and introduces a more rigorous reasoning evaluation for LLMs. In GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, Apple researchers show that simple adjustments to the word problems in GSM8K, such as changing numbers or adding clauses irrelevant to the answer, result in significant drops in performance for models that had previously performed well on the benchmark. This suggests that those models were not using genuine logical reasoning to solve the problems, and instead may be replicating reasoning steps from their training data. As an improved and more rigorous benchmark for mathematical reasoning, the work introduces GSM-Symbolic, a new dataset created from symbolic templates that allow for the generation of a diverse set of questions (see Figure 4). Generated datasets are available here.
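The sketch below illustrates the general idea of a symbolic template: names and numbers are sampled fresh for each variant, and the ground-truth answer is computed from the same symbols. The template text is invented for illustration and is not drawn from the released GSM-Symbolic dataset.

```python
import random

def make_variant(rng):
    """Instantiate a simple symbolic word-problem template with fresh names
    and numbers; the answer is computed from the same symbols, so every
    variant has a known ground truth."""
    name = rng.choice(["Ava", "Noah", "Mia", "Leo"])
    a, b = rng.randint(3, 20), rng.randint(2, 9)
    question = (
        f"{name} has {a} boxes with {b} pencils in each box. "
        f"How many pencils does {name} have in total?"
    )
    answer = a * b
    return question, answer

# Generate a few variants of the same underlying problem.
rng = random.Random(42)
samples = [make_variant(rng) for _ in range(3)]
```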
Understanding LLMs’ Ability to Follow Instructions and Estimate Uncertainty
Two of the Apple ML research papers that will be presented at the conference explore capabilities of LLMs that are important for both safety and utility. In order to build safe and useful AI agents with LLMs, the models must be able to follow user-provided constraints and guidelines. However, LLMs are prone to errors, often failing to follow even simple and unambiguous instructions.
To address this, Apple ML researchers will present Do LLMs Know Internally When They Follow Instructions? at ICLR. The work explores whether LLMs encode information in their representations that correlates with instruction-following success, and identifies a specific dimension within the input embedding space that is strongly associated with instruction-following. This instruction-following dimension predicts whether a response will comply with a given instruction, and it generalizes well across unseen tasks, but not across unseen instruction types. The work shows that modifying representations along this dimension improves instruction-following success rates, without compromising response quality, suggesting a path toward more reliable LLM-based AI agents. Code and data are available here.
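As a rough illustration of working with such a direction, the sketch below derives a difference-of-means direction separating representations of compliant and non-compliant responses and nudges a representation along it. The difference-of-means estimator and the scaling factor are assumptions made for illustration; the paper’s actual procedure may differ.

```python
import numpy as np

def find_direction(success_reps, failure_reps):
    """A simple difference-of-means direction separating representations of
    responses that followed the instruction from those that did not."""
    direction = success_reps.mean(0) - failure_reps.mean(0)
    return direction / np.linalg.norm(direction)

def nudge(rep, direction, alpha=1.0):
    """Shift a representation along the instruction-following direction."""
    return rep + alpha * direction

# Toy usage with random stand-ins for hidden representations.
rng = np.random.default_rng(0)
success = rng.normal(0.2, 1.0, size=(128, 512))
failure = rng.normal(-0.2, 1.0, size=(128, 512))
d = find_direction(success, failure)
steered = nudge(failure[0], d, alpha=2.0)
```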
Because of their propensity for errors, it’s important that LLMs also be able to accurately estimate and communicate their uncertainty, particularly in high-stakes applications. In those situations, if an LLM deviates from or misinterprets a user’s instructions, but appropriately recognizes and signals high uncertainty, it could prompt an additional review or intervention to prevent harmful output. At ICLR, Apple ML researchers will present Do LLMs Estimate Uncertainty Well in Instruction-Following?, which systematically evaluates the uncertainty estimation abilities of LLMs in the context of instruction-following, using a new benchmark dataset designed to assess this capability. Using pre-generated responses to prompts in order to facilitate direct comparisons across different instruction types and models, the work shows that existing uncertainty estimation methods perform poorly, especially when the model makes subtle errors in following instructions. Code and data are available here.
New Methods Improving LLM Efficiency
Scaling model capacity and training data has been shown to improve LLM performance, but as models and training continue to grow, so do the operating and engineering challenges and costs associated with them. The work Apple ML researchers will present at ICLR includes two novel approaches to these challenges, enabling improved efficiency for LLMs without sacrificing performance.
Large-scale training typically depends on high-bandwidth communication between nodes, and inference for large models often requires low-latency communication between multiple compute nodes to distribute the model. At ICLR, Apple ML researchers will share No Need to Talk: Training Mixture of Language Models Independently, which explores strategies to mitigate the communication cost of LLMs at both training and inference time, while keeping inference efficient. The work shows that efficient training and inference can be achieved without relying on fast interconnects, and without compromising model performance in terms of either perplexity or downstream task accuracy. The paper shares an innovative method for training a mixture of language models in an almost asynchronous manner: SMALLTALK LM. With this approach, each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses only a fraction of the parameters of the overall mixture model.
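The sketch below illustrates the routing idea at inference time: a lightweight router inspects a short prefix and hands the full request to a single expert, so only that expert’s parameters are used. The dictionary of experts and the keyword-based router are stand-ins invented for this example, not the paper’s trained router.

```python
def route_and_generate(experts, router, prompt, prefix_len=32):
    """Route a request to a single expert LM based on a short prefix, so only
    that expert's parameters are used for the full generation.
    `experts` maps names to generate() callables; `router` maps a prefix to
    an expert name."""
    prefix = prompt[:prefix_len]
    expert_name = router(prefix)
    return experts[expert_name](prompt)

# Toy usage with stub experts and a keyword router.
experts = {
    "code": lambda p: "[code expert] " + p,
    "general": lambda p: "[general expert] " + p,
}
router = lambda prefix: "code" if "def " in prefix else "general"
print(route_and_generate(experts, router, "def fibonacci(n):"))
```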
Another scaling challenge is the size of LLMs’ vocabularies (the number of tokens that can be used to represent the input), which increases as models grow ever larger. This has shifted the memory footprint of LLMs during training disproportionately toward a single layer: the cross-entropy in the loss computation. In fact, cross-entropy loss is responsible for up to 90% of the memory footprint of modern LLM training, making it an important target for improved efficiency. In an oral presentation at ICLR, Apple ML researchers will share Cut Your Losses in Large-Vocabulary Language Models, which proposes Cut Cross-Entropy (CCE), a new method that computes the cross-entropy loss without materializing the logits for all tokens in global memory. CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly, resulting in a dramatic reduction in memory consumption without sacrificing training speed or convergence. Code is available here.
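To illustrate the underlying idea, the NumPy sketch below computes the correct-token logit with one dot product per position and accumulates the log-sum-exp over vocabulary chunks in a streaming fashion, so the full tokens-by-vocabulary logit matrix is never materialized at once. This is only a conceptual sketch; the released CCE implementation relies on fused GPU kernels rather than chunked NumPy.

```python
import numpy as np

def chunked_cross_entropy(hidden, classifier, targets, chunk=1024):
    """Cross-entropy loss without materializing the full (tokens x vocab)
    logit matrix: compute the correct-token logit directly and accumulate the
    log-sum-exp over vocabulary chunks on the fly."""
    # Logit of the correct token: one dot product per position.
    correct_logit = np.einsum("nd,nd->n", hidden, classifier[targets])
    # Streaming log-sum-exp over vocabulary chunks.
    running_max = np.full(hidden.shape[0], -np.inf)
    running_sum = np.zeros(hidden.shape[0])
    for start in range(0, classifier.shape[0], chunk):
        logits = hidden @ classifier[start:start + chunk].T   # (n, chunk)
        chunk_max = logits.max(axis=1)
        new_max = np.maximum(running_max, chunk_max)
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(logits - new_max[:, None]).sum(axis=1))
        running_max = new_max
    logsumexp = running_max + np.log(running_sum)
    return (logsumexp - correct_logit).mean()

# Toy usage.
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 64))          # hidden states
W = rng.normal(size=(50_000, 64))      # vocabulary classifier
y = rng.integers(0, 50_000, size=16)   # target token ids
loss = chunked_cross_entropy(h, W, y)
```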
Novel Approaches to Attention and Optimization
The Apple ML work accepted to ICLR also includes two papers that share advancements in the fundamental areas of attention and optimization.
Attention is a key part of the transformer architecture, which is ubiquitous across modern machine learning, from LLMs and speech recognition models to generative diffusion models. Attention is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The attention weights are typically obtained as the softmax of dot products between keys and queries. However, relying on the softmax function to recover token probabilities has some limitations. It can sometimes lead to a concentration of attention on just a few features, potentially neglecting other informative aspects of the input data. Additionally, because it requires a row-wise reduction along the length of the input sequence, the softmax can slow down computation in efficient, hardware-aware attention kernels.
At ICLR, Apple ML researchers will present Theory, Analysis, and Best Practices for Sigmoid Self-Attention, which explores and advances sigmoid attention as an alternative that surpasses the limitations of softmax attention, while matching its strong performance across modalities. The work proves that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. The paper is also accompanied by the release of FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. This work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers. Code and one-to-one pretrained 7B softmax and sigmoid LLM weights using a deterministic dataloader are available here.
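The sketch below contrasts single-head softmax attention with a sigmoid variant in which an element-wise sigmoid replaces the row-wise softmax, so no normalizing reduction along the sequence is required. The constant -log n bias used to keep the initial attention mass comparable is an assumption of this sketch, and the code is not the FLASHSIGMOID kernel.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_attention(q, k, v):
    """Standard attention: row-wise softmax over scaled dot products."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sigmoid_attention(q, k, v, bias=None):
    """Sigmoid attention: an element-wise sigmoid replaces the row-wise
    softmax, so no reduction along the sequence is needed to normalize.
    The -log(n) bias (an assumption in this sketch) keeps the total
    attention mass comparable to softmax at initialization."""
    n = k.shape[0]
    if bias is None:
        bias = -np.log(n)
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    return sigmoid(scores) @ v

# Toy usage on a single head.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out_soft = softmax_attention(q, k, v)
out_sig = sigmoid_attention(q, k, v)
```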
Momentum-based optimizers are common in ML training and have been shown to accelerate convergence and to result in better generalization. These optimizers typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. However, a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients.
At ICLR, Apple ML researchers will present The AdEMAMix Optimizer: Better, Faster, Older, which addresses this issue. The paper shares AdEMAMix: a simple modification of the Adam optimizer that uses a mixture of two EMAs to better take advantage of past gradients. Experiments on language modeling and image classification show that gradients can stay relevant for tens of thousands of steps, helping models converge faster and often reach lower minima. Additionally, the new method is shown to significantly slow down model forgetting during training.
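A minimal sketch of the two-EMA idea is shown below: a fast EMA and a slow EMA of the gradients are combined in an Adam-style update, with the second-moment term unchanged. The hyperparameter values, the mixing coefficient, and the bias-correction details are illustrative assumptions, not the paper’s exact recipe.

```python
import numpy as np

def ademamix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8):
    """One optimizer step combining a fast EMA (beta1) and a slow EMA (beta3)
    of the gradients in an Adam-style update."""
    state["t"] += 1
    t = state["t"]
    state["m_fast"] = beta1 * state["m_fast"] + (1 - beta1) * grad
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_fast_hat = state["m_fast"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    update = (m_fast_hat + alpha * state["m_slow"]) / (np.sqrt(v_hat) + eps)
    return theta - lr * update

# Toy usage on a simple quadratic objective.
theta = np.array([3.0, -2.0])
state = {"t": 0, "m_fast": np.zeros(2), "m_slow": np.zeros(2), "v": np.zeros(2)}
for _ in range(100):
    grad = 2 * theta            # gradient of ||theta||^2
    theta = ademamix_step(theta, grad, state)
```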
Demonstrating ML Research in the Apple Booth
During exhibition hours, ICLR attendees will be able to interact with live demos of Apple ML research in booth C03, including:
- Depth Pro: Zero-shot monocular depth estimation underpins a growing variety of applications, such as advanced image editing, view synthesis, and conditional image generation. Depth Pro is motivated in particular by novel view synthesis from a single image. It has been designed to work on every image (zero-shot), and produce accurate metric depth at high resolution with low latency. For the broadest applicability ‘in the wild’, it produces metric depth maps with absolute scale even if no camera intrinsics (such as focal length) are provided.
- FastVLM: FastVLM is a family of mobile-friendly vision language models. These models use a mix of CNN and Transformer architectures for vision encoding, designed specifically for processing high-resolution images, and achieve a strong balance between accuracy and speed.
Supporting the ML Research Community
Apple is committed to supporting underrepresented groups in the ML community. We are proud to again sponsor several affinity groups hosting events onsite at ICLR, including LatinX in AI (social on April 25), Women in Machine Learning (WiML) (social on April 25), and Queer in AI (social on April 26). In addition to sponsoring these events, Apple employees will also be participating in each of these and other affinity group gatherings.
Learn More about Apple ML Research at ICLR 2025
ICLR brings together professionals dedicated to the advancement of deep learning, and Apple is proud to again share innovative new research at the event and connect with the community attending it. This post highlights just a selection of the works Apple ML researchers will present at ICLR 2025, and a comprehensive overview and schedule of our participation can be found here.