- Updated: June 21, 2025
- 5 min read
Introducing WINGS: A Dual-Learner Architecture for Multimodal Language Models
In the ever-evolving landscape of artificial intelligence, the introduction of innovative architectures like WINGS marks a significant milestone. Developed by researchers from Alibaba Group and Nanjing University, WINGS addresses a critical challenge in multimodal language models (MLLMs): text-only forgetting. This breakthrough architecture enhances the capability of large language models (LLMs) to handle both text and images, paving the way for more interactive and intuitive AI systems.
The Challenge of Text-Only Forgetting in Multimodal Language Models
Multimodal LLMs have expanded the horizons of AI by enabling systems to interpret visuals, answer questions about images, and engage in dialogues that seamlessly integrate text and pictures. However, this integration comes with a significant challenge known as text-only forgetting. When MLLMs are trained on datasets combining images and text, they often lose their ability to handle purely textual tasks. This phenomenon occurs because visual tokens inserted into the language sequence divert the model's attention away from the text, leading to a degradation in performance on tasks that require language understanding.
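To make this dilution effect concrete, here is a toy illustration (an assumption for exposition, not an experiment from the paper): with random, untrained projections, a text token's softmax attention spreads roughly evenly over all keys, so inserting 40 visual tokens ahead of 10 text tokens leaves each text query with only about a fifth of its attention mass on text.

```python
import torch

torch.manual_seed(0)
d, n_text, n_img = 64, 10, 40

# Toy sequence: 40 visual tokens inserted before 10 text tokens.
text = torch.randn(n_text, d)
image = torch.randn(n_img, d)
seq = torch.cat([image, text], dim=0)

# Random (untrained) query/key projections, scaled for unit-variance logits.
Wq = torch.randn(d, d) / d ** 0.5
Wk = torch.randn(d, d) / d ** 0.5
q = text @ Wq                 # queries from the text tokens only
k = seq @ Wk                  # keys over the full mixed sequence
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)

# Share of each text query's attention mass that lands on text keys.
text_share = attn[:, n_img:].sum(dim=-1).mean().item()
print(f"attention mass on text tokens: {text_share:.2f}")  # ~ n_text / (n_text + n_img) = 0.20
```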
Existing Mitigation Strategies and Their Limitations
Several strategies have been employed to mitigate text-only forgetting, such as reintroducing large amounts of text-only data during training or alternating between text-only and multimodal fine-tuning. While these approaches aim to remind the model of its original language capabilities, they often increase training costs and require complex switching logic during inference. Moreover, they do not fully restore text comprehension, largely because of how the model's attention shifts once image tokens are introduced into the sequence.
WINGS: A Dual-Learner Approach to Balance Text and Visual Learning
The WINGS architecture introduces a novel solution by incorporating two new modules, a visual learner and a textual learner, into each layer of the MLLM. These learners work in parallel with the model's core attention mechanism, resembling "wings" attached to either side of the attention layers. A routing component dynamically controls how much weight each learner's output receives based on the current token mix, allowing the model to balance its focus between visual and textual information efficiently.
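As a rough sketch of the idea (not the authors' released code), a WINGS-style layer can be expressed in PyTorch along the following lines. The class name, the use of plain linear layers as learners, and the per-token linear gate standing in for the paper's attention-based routing are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class WingsLayerSketch(nn.Module):
    """Hypothetical sketch: core attention flanked by a visual and a
    textual learner, with a soft router mixing the two wings."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.core_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The two "wings": lightweight learners, one per modality.
        self.visual_learner = nn.Linear(d_model, d_model)
        self.textual_learner = nn.Linear(d_model, d_model)
        # Router: per-token weights over the two learners (a simplification
        # of the paper's routing, which is driven by attention weights).
        self.router = nn.Linear(d_model, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        core_out, _ = self.core_attn(x, x, x)
        w = torch.softmax(self.router(x), dim=-1)        # (B, T, 2)
        wing_out = (w[..., 0:1] * self.visual_learner(x)
                    + w[..., 1:2] * self.textual_learner(x))
        # Learner outputs join the main branch residually.
        return core_out + wing_out
```

In this sketch the wings see the same token stream as the core attention; training can bias the router so that text-heavy inputs lean on the textual learner and image-heavy inputs on the visual one.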
Low-Rank Residual Attention: Enhancing Efficiency and Modality Awareness
At the heart of WINGS lies a mechanism called Low-Rank Residual Attention (LoRRA), which maintains computational efficiency while enabling the learners to capture essential modality-specific information. During the first stage of training, only the visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses the model's attention weights to decide how much each learner contributes. This ensures that visual attention does not overwhelm textual understanding, maintaining balanced performance across modalities.
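The paper describes LoRRA at a high level; a minimal sketch of a low-rank residual attention learner, assuming a rank of 16 and with all names chosen for illustration, might look like this:

```python
import torch
import torch.nn as nn

class LoRRALearnerSketch(nn.Module):
    """Hypothetical learner in the spirit of LoRRA: rank-r projections
    keep the extra parameters and compute small, and the output is
    added residually to the main branch by the caller."""

    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()
        # Low-rank (down-project, then up-project) query and key/value maps.
        self.q_proj = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_model, bias=False),
        )
        self.kv_proj = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_model, bias=False),
        )
        self.scale = d_model ** -0.5

    def forward(self, hidden: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # Queries from the layer's hidden states; keys/values from the
        # modality-specific features (e.g. visual tokens for the visual wing).
        q = self.q_proj(hidden)                     # (B, T, d_model)
        k = v = self.kv_proj(features)              # (B, S, d_model)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```

Under the two-stage schedule described above, stage one would train only the visual learners with the base model frozen, and stage two would unfreeze the textual learners and the router so both wings are co-trained.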

Performance Benchmarks and Impact on AI Research
WINGS has demonstrated impressive performance across various benchmarks, underscoring its potential to reshape multimodal learning. Reported gains over a comparable baseline model include:
- MMLU (text-only): 60.53, an improvement of 9.70 points over the baseline
- CMMLU: 69.82, 9.36 points higher than the baseline
- Race-High (reasoning): an 11.9-point gain
- WSC: an 11.12-point improvement
- MMMU-VAL (multimodal): a 4.78-point improvement

These results highlight WINGS' ability to handle mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale. Its innovative architecture not only enhances the performance of multimodal models but also sets a new standard for balancing visual and textual learning.
Exploring Related AI Advancements
The introduction of WINGS aligns with broader trends in AI research, where the focus is on creating more generalizable and efficient models. For instance, integrations such as the OpenAI ChatGPT integration and the ChatGPT and Telegram integration are continuously evolving to offer seamless AI solutions that span multiple modalities. These advancements are crucial for applications in education, content generation, and interactive assistants, where understanding and generating both text and images are essential.
Moreover, the role of AI in transforming industries is evident in areas like marketing, where AI agents for enterprises are driving innovation. The Enterprise AI platform by UBOS exemplifies how AI can be leveraged to enhance business processes and decision-making.
Conclusion: The Future of Multimodal Learning with WINGS
The introduction of WINGS represents a significant leap forward in the field of multimodal learning. By addressing the challenge of text-only forgetting, this dual-learner architecture offers a more balanced and efficient approach to integrating visual and textual information. Its robust performance across benchmarks underscores its potential to set new standards in AI research and applications.
As AI continues to evolve, innovations like WINGS will play a pivotal role in shaping the future of intelligent systems. By enhancing the capabilities of multimodal models, WINGS not only improves the performance of AI systems but also opens new avenues for research and development. For those interested in exploring the potential of AI in their organizations, platforms like UBOS homepage offer a range of solutions to harness the power of AI effectively.
For more insights into AI advancements and their impact on various industries, visit our blog and explore our UBOS portfolio examples for real-world applications of AI technology.