Cross-Model Validation: MIVPG's Efficacy on Encoder-Decoder vs. Decoder-Only LLMs

19 Nov 2025

Table of Links

Abstract and 1 Introduction

Related Work

2.1. Multimodal Learning

2.2. Multiple Instance Learning

Methodology

3.1. Preliminaries and Notations

3.2. Relations between Attention-based VPG and MIL

3.3. MIVPG for Multiple Visual Inputs

3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

Experiments and 4.1. General Setup

4.2. Scenario 1: Samples with Single Image

4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

We implemented the proposed method on NVIDIA A100 GPUs with BFloat16. Except for the number of training epochs mentioned in the main paper, we kept all other hyperparameters the same as in BLIP2[22]. For PatchGastricADC22[36] and ABO[7], we trained the model for 40 epochs.

Figure 8. Experiment results on MSCOCO with or without freezing the visual encoder. We adopt the metrics used in [22].

C.1. Frozen Visual Models

In the original BLIP2[22], images are upscaled to 364 × 364 and, consequently, the ViT is unfrozen during fine-tuning. When training on the entire COCO training set, this approach yields slightly better performance, albeit at a higher computational cost.

In this section, we validate the performance of fine-tuning while keeping the ViT frozen and the image size unchanged. The results are shown in Figure 8. With limited data, such as 50K samples, models perform comparably whether or not the visual encoder (ViT) is frozen. However, as the number of training epochs increases, the performance gap varies: in some cases unfreezing the ViT improves performance, while in others the opposite holds. Considering that many real-world applications may not have access to massive training data, freezing the ViT can be a more efficient approach while still maintaining similar performance levels.
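As a minimal sketch of what freezing means in practice, the following dependency-free Python snippet marks every parameter under a hypothetical `visual_encoder.` prefix as non-trainable and counts what is left for the optimizer. The `Param` class, the parameter names, and the sizes are invented for illustration and are not taken from BLIP2 or MIVPG. In PyTorch the same effect is usually achieved by setting `requires_grad = False` on the encoder's parameters.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a model parameter (hypothetical, for illustration only)."""
    name: str
    numel: int              # number of scalar weights
    trainable: bool = True


def freeze(params, prefix):
    """Mark every parameter whose name starts with `prefix` as frozen."""
    for p in params:
        if p.name.startswith(prefix):
            p.trainable = False
    return params


def trainable_count(params):
    """Total number of weights that would still receive gradient updates."""
    return sum(p.numel for p in params if p.trainable)


def build_toy_model():
    # Invented names and sizes; a real ViT and QFormer are far larger.
    return [
        Param("visual_encoder.patch_embed", 1_000),
        Param("visual_encoder.block0", 5_000),
        Param("qformer.query_tokens", 300),
        Param("qformer.cross_attention", 2_000),
    ]


unfrozen = build_toy_model()
frozen = freeze(build_toy_model(), prefix="visual_encoder.")

print(trainable_count(unfrozen))  # every parameter is updated
print(trainable_count(frozen))    # only the non-encoder parameters are updated
```

Freezing the encoder removes its weights from the set the optimizer updates, which is where the reduction in training cost comes from.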

C.2. Case Study

In the main paper, we employ FLAN-T5-XL as the language model. Existing large language models can be broadly categorized into two types: encoder-decoder models and decoder-only models. FLAN-T5-XL falls into the former category. Decoder-only models are more computationally efficient, whereas encoder-decoder models can handle more sophisticated tasks. In this section, we assess the performance of MIVPG with a decoder-only model. Specifically, we use BLIP2[22] with OPT-2.7b[47] as the base LLM and validate the performance on the PatchGastricADC22 dataset. In these experiments, we replace only the LLM and keep all other hyperparameters unchanged.
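The controlled swap described above can be sketched as a configuration change in which only one field differs. The `RunConfig` class and its field names are hypothetical and are not part of the BLIP2 codebase. The values for epochs and precision follow the setup stated in this supplementary material; everything else is an assumption made for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical experiment configuration; field names are assumptions."""
    llm: str
    dataset: str
    epochs: int
    precision: str
    use_csa: bool


# Encoder-decoder setting used in the main paper.
flan_t5_run = RunConfig(
    llm="FLAN-T5-XL",
    dataset="PatchGastricADC22",
    epochs=40,             # stated for PatchGastricADC22 and ABO
    precision="bfloat16",  # stated in the general setup
    use_csa=True,
)

# Decoder-only setting: copy the configuration and change only the LLM.
opt_run = replace(flan_t5_run, llm="OPT-2.7b")

# Collect the fields that differ between the two runs.
changed = {
    f.name
    for f in fields(RunConfig)
    if getattr(flan_t5_run, f.name) != getattr(opt_run, f.name)
}
print(changed)
```

Deriving the second configuration from the first, rather than writing it out again, makes it harder for an unintended hyperparameter difference to confound the comparison.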

Table 4. Experiments on the PatchGastricADC22 dataset [36] with OPT-2.7b as the language model

The experiment results on PatchGastricADC22 using OPT-2.7b as the language model are presented in Table 4. Overall, the model continues to outperform the baselines shown in Table 1, emphasizing the advantages of integrating MLLMs into the WSI captioning task. Notably, the model with CSA performs better than the one without it, reaffirming the effectiveness of CSA. It is also worth noting that OPT-2.7b does not outperform FLAN-T5-XL. This could be attributed, in part, to insufficient training data: since OPT-2.7b is relatively less sophisticated, more training data may be required to train a more powerful model.

C.3. More Visualization

This section provides additional visualization results on the ABO dataset, including both patch-level and image-level attention weights. The patch-level attention weights show that the model excels at detecting the shapes of objects, as a significant portion of the weight is assigned to edges and contours. For the image-level attention weights, we display maps for all twelve heads. Each row in a map represents a query, while each column represents an image. Different heads and queries exhibit varying attention patterns towards the images, demonstrating the diversity in how the model processes and attends to the input images.
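To make the layout of the image-level maps concrete, here is a small sketch that builds one map per head, with one row per query and one column per image. It assumes standard softmax cross-attention, so each row is a distribution over the images. The random scores stand in for learned attention logits, and the query and image counts are assumptions chosen for illustration; only the number of heads (twelve) comes from the text.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_attention(scores):
    """Normalize raw scores[head][query][image] over the image axis."""
    return [[softmax(query_row) for query_row in head] for head in scores]


NUM_HEADS = 12    # stated in the text
NUM_QUERIES = 32  # assumption, for illustration
NUM_IMAGES = 4    # assumption, for illustration

rng = random.Random(0)
# Random logits in place of the model's learned query-image scores.
scores = [
    [[rng.gauss(0.0, 1.0) for _ in range(NUM_IMAGES)] for _ in range(NUM_QUERIES)]
    for _ in range(NUM_HEADS)
]
maps = image_level_attention(scores)

# One map per head; each row (query) holds one weight per image.
first_row = maps[0][0]
print(len(maps), len(maps[0]), len(first_row))
print(round(sum(first_row), 6))
```

Reading a map row by row shows which images a given query attends to, and comparing the same row across heads shows how the heads differ.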

References: ['a short bar with weights at each end that is used usually in pairs for exercise.', 'gymnastic weight for dogs.', 'a six sided barbell.', 'a hexagonal shaped dumbbell', 'pesa de mano con exterior hexagonal']
Inference: a six sided barbell.

References: ['a stool with raised legs and a curved seat.', 'a rectangular top stool with four standing legs.', 'gray chair without arms or back with four legs that support it.', 'a stool with a rectangular cushion and has four long legs', 'a rectangular top stool .']
Inference: a rectangular top stool with four standing legs.

References: ['single seater tufted sofa with seat attached to long back and side arms having for wooden legs.', 'a one seater chair with hole pattern on the backrest and having low armrests.']
Inference: a one seater chair with hole pattern on the backrest and having four thin standing legs.

References: ['durable spongy waterproof small size pillow.', 'a structure, observed in certain extrusive igneous rocks, that is characterized by discontinuous pillow-shaped masses ranging in size from a few centimeters to a meter or more in greatest dimension (commonly between 30 cm and 60 cm).', 'square pillow with cushions', 'sofa,bed pillow, square in shape']
Inference: a square shaped pillow

References: ['a two seater sofa with four thin standing legs, cylindrical pillow at each armrest and square pattern on the seat.']
Inference: a two seater sofa with square pillow at each armrest and having four thin standing legs.

References: ['a cubical shaped ottoman seat.', 'a seat with a square shaped cushion', 'a cube shaped ottoman seat.']
Inference: a cube shaped ottoman seat with square shaped legs.

References: ['a one seater sofa with broad seat pillow.', 'padded gray sofa.', 'a one seater sofa with broad seat and four small standing legs.']
Inference: a one seater sofa with huge back pillow and seat pillow and having four short standing legs.

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.