Deep LearningTuna 2 Shatters Multimodal Benchmarks by Replacing Vision Encoders with Pixel Embeddings
Tuna-2 completely eliminates the need for pretrained vision encoders by mapping raw pixel embeddings directly into the language model. This unified approach achieves new state-of-the-art results in multimodal understanding and generation.







