Meet Open-Qwen2VL: A Fully Open and Compute-Efficient Multimodal Large Language Model

by softbliss
April 4, 2025
in Artificial Intelligence


Multimodal Large Language Models (MLLMs) have advanced the integration of visual and textual modalities, enabling progress in tasks such as image captioning, visual question answering, and document interpretation. However, the replication and further development of these models are often hindered by a lack of transparency. Many state-of-the-art MLLMs do not release key components, including training code, data curation methodologies, and pretraining datasets. Furthermore, the substantial computational resources required for training these models pose a significant barrier, particularly for academic researchers with limited infrastructure. This lack of accessibility impedes reproducibility and slows the dissemination of new techniques within the research community.

Researchers from UC Santa Barbara, ByteDance, and NVIDIA Research introduce Open-Qwen2VL, a 2-billion-parameter Multimodal Large Language Model pre-trained on 29 million image-text pairs using approximately 220 A100-40G GPU hours. Open-Qwen2VL is designed to address reproducibility and resource constraints in MLLM research. The project provides a complete suite of open-source resources, including the training codebase, data filtering scripts, WebDataset-formatted pretraining data, and both base and instruction-tuned model checkpoints. This comprehensive release aims to support transparent experimentation and method development in multimodal learning.
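Because the pretraining data is released in WebDataset format, it can be streamed with the standard `webdataset` Python package. The snippet below is only a minimal sketch: the shard file names and the per-sample keys ("jpg", "txt") are illustrative assumptions, not the project's actual file layout.

```python
# Minimal sketch of streaming WebDataset-formatted image-text shards with the
# `webdataset` library. Shard names and sample keys below are hypothetical.
import webdataset as wds

shard_pattern = "pretrain-{000000..000099}.tar"  # hypothetical shard names

dataset = (
    wds.WebDataset(shard_pattern)
    .decode("pil")            # decode images to PIL.Image, text to str
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for image, caption in dataset:
    # each sample is one image-caption pair ready for preprocessing
    print(image.size, caption[:60])
    break
```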

Open-Qwen2VL is based on the Qwen2.5-1.5B-Instruct LLM backbone, coupled with a SigLIP-SO-400M vision encoder. An Adaptive Average-Pooling Visual Projector reduces the number of visual tokens from 729 to 144 during pretraining, which improves computational efficiency. The token count is increased back to 729 during the supervised fine-tuning (SFT) stage. This low-to-high resolution strategy maintains image understanding capabilities while optimizing for resource usage.
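To make the token-reduction step concrete, here is a minimal PyTorch sketch of an adaptive average-pooling projector that maps a 27×27 grid of 729 SigLIP patch tokens down to a 12×12 grid of 144 tokens before projecting into the LLM's hidden size. The dimensions and the single linear projection are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of an adaptive average-pooling visual projector:
# 729 patch tokens (27x27) -> 144 tokens (12x12) -> LLM hidden size.
import torch
import torch.nn as nn

class AvgPoolProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1536, out_grid: int = 12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)   # 27x27 grid -> 12x12 grid
        self.proj = nn.Linear(vision_dim, llm_dim)   # map into the LLM's hidden size

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, 729, vision_dim)
        b, n, d = vis_tokens.shape
        side = int(n ** 0.5)                               # 27 when n == 729
        grid = vis_tokens.transpose(1, 2).reshape(b, d, side, side)
        pooled = self.pool(grid)                           # (b, d, 12, 12)
        pooled = pooled.flatten(2).transpose(1, 2)         # (b, 144, d)
        return self.proj(pooled)                           # (b, 144, llm_dim)

tokens = torch.randn(2, 729, 1152)                         # dummy SigLIP outputs
print(AvgPoolProjector()(tokens).shape)                    # torch.Size([2, 144, 1536])
```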

To further enhance training efficiency, Open-Qwen2VL implements multimodal sequence packing, allowing the concatenation of multiple image-text pairs into sequences of approximately 4096 tokens, thereby minimizing padding and computational overhead. The vision encoder parameters remain frozen during pretraining to conserve resources and are optionally unfrozen during SFT to improve downstream performance.
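The idea behind multimodal sequence packing can be illustrated with a simple greedy packer that groups samples into sequences of at most roughly 4096 tokens. The Open-Qwen2VL codebase may use a different strategy; treat this as a sketch of the general technique, with hypothetical per-sample token counts that already include each sample's pooled visual tokens.

```python
# Illustrative first-fit-decreasing packer: group samples into packed sequences
# of at most ~4096 tokens to minimize padding. Sample lengths are hypothetical.
from typing import List

MAX_LEN = 4096

def pack_sequences(sample_lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Return lists of sample indices, each group fitting within max_len tokens."""
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i], reverse=True)
    bins, loads = [], []
    for idx in order:
        length = sample_lengths[idx]
        for b, load in enumerate(loads):
            if load + length <= max_len:     # place in the first bin with room
                bins[b].append(idx)
                loads[b] += length
                break
        else:                                # no bin fits: start a new sequence
            bins.append([idx])
            loads.append(length)
    return bins

lengths = [900, 1200, 300, 2500, 700, 1600, 400]   # hypothetical token counts
print(pack_sequences(lengths))                      # [[3, 1, 2], [5, 0, 4, 6]]
```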

Open-Qwen2VL is trained on only 0.36% of the token count used in Qwen2-VL, yet demonstrates comparable or superior performance across several benchmarks. The model achieves a score of 80.9 on MMBench, and performs competitively on SEEDBench (72.5), MMStar (49.7), and MathVista (53.1). Ablation studies indicate that integrating a small subset (5M samples) of high-quality image-text pairs filtered using MLM-based techniques can result in measurable performance improvements, highlighting the importance of data quality over volume.

In addition, Open-Qwen2VL exhibits robust few-shot multimodal in-context learning capabilities. When evaluated on datasets such as GQA and TextVQA, the model shows 3% to 12% accuracy gains from 0-shot to 8-shot scenarios. Fine-tuning performance scales predictably with the size of the instruction tuning dataset, with performance gains plateauing around 8M examples from the MAmmoTH-VL-10M dataset.
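For readers unfamiliar with multimodal in-context learning, the sketch below shows what an "8-shot" prompt conceptually looks like: eight worked (image, question, answer) demonstrations interleaved before the query image. The placeholder syntax and layout are assumptions for illustration, not the model's actual chat template.

```python
# Hedged sketch of assembling a k-shot interleaved image-text prompt.
# The <image:...> placeholder format is hypothetical.
from typing import List, Tuple

def build_fewshot_prompt(examples: List[Tuple[str, str, str]], query_question: str) -> str:
    parts = []
    for image_path, question, answer in examples:   # k in-context demonstrations
        parts.append(f"<image:{image_path}>\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"<image:query.jpg>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)

shots = [(f"demo_{i}.jpg", f"What is shown in image {i}?", f"object_{i}") for i in range(8)]
prompt = build_fewshot_prompt(shots, "What color is the sign?")
```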

Open-Qwen2VL introduces a reproducible and resource-efficient pipeline for training multimodal large language models. By systematically addressing the limitations of prior models in terms of openness and compute requirements, it enables broader participation in MLLM research. The model’s design choices—including efficient visual token handling, multimodal sequence packing, and judicious data selection—illustrate a viable path forward for academic institutions aiming to contribute to the field. Open-Qwen2VL establishes a reproducible baseline and provides a foundation for future work on scalable, high-performance MLLMs within constrained computational environments.


Check out the Paper, Model, Data and Code. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
