Table of Links
Related Work
Methodology
3.1. Preliminaries and Notations
3.2. Relations between Attention-based VPG and MIL
3.3. MIVPG for Multiple Visual Inputs
3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
Experiments and 4.1. General Setup
4.2. Scenario 1: Samples with Single Image
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
Supplementary Material
A. Detailed Architecture of QFormer
C. More Experiments
We implemented the proposed method on NVIDIA A100 GPUs using BFloat16 precision. Apart from the number of training epochs specified in the main paper, we kept all other hyperparameters the same as in BLIP2[22]. For PatchGastricADC22[36] and ABO[7], we trained the model for 40 epochs.
![Figure 8. Experiment results on MSCOCO with or without freezing the visual encoder. We adopt the metrics used in [22].](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-5f033fg.png)
C.1. Frozen Visual Models
In the original BLIP2[22], image sizes are upscaled to 364 × 364, and consequently, the ViT is unfrozen during the fine-tuning process. This approach yields slightly better performance, albeit at a higher computational cost when training on the entire COCO training set.
In this section, we validate the performance of fine-tuning while keeping the ViT frozen and image sizes unchanged. The experiment results are shown in Figure 8. We observed that when working with limited data, such as 50K samples, models exhibit comparable performance whether or not the visual encoder (ViT) is frozen. However, as the number of training epochs increases, the performance gap varies: in some cases unfreezing the ViT improves performance, while in others it hurts. Considering that many real-world applications may not have access to massive training data, freezing the ViT can be a more efficient approach while still maintaining similar performance levels.
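As an illustration of what keeping the visual encoder frozen means in practice, here is a minimal PyTorch sketch. It is not the authors' training code: `visual_encoder` and `qformer` are tiny hypothetical stand-ins for the ViT and the trainable QFormer, and the learning rate is an arbitrary placeholder.

```python
import torch
from torch import nn

# Hypothetical stand-ins: tiny linear layers in place of the ViT and the QFormer.
visual_encoder = nn.Linear(16, 8)
qformer = nn.Linear(8, 4)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the module acts as a fixed feature extractor."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


freeze(visual_encoder)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in qformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # placeholder learning rate

images = torch.randn(2, 16)  # a toy batch standing in for image inputs
with torch.no_grad():  # no activations are stored for the frozen encoder
    features = visual_encoder(images)

loss = qformer(features).pow(2).mean()  # dummy loss, only to drive one update
loss.backward()
optimizer.step()

frozen_count = sum(p.numel() for p in visual_encoder.parameters())
trainable_count = sum(p.numel() for p in trainable)
```

Because the frozen encoder runs under `torch.no_grad()`, no gradients or intermediate activations are kept for it, which is where the savings in memory and compute come from.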
C.2. Case Study
In the main paper, we employ FLAN-T5-XL as the language model. Existing large language models can be broadly categorized into two types: encoder-decoder models and decoder-only models. FLAN-T5-XL falls into the former category. Decoder-only models are more computationally efficient, whereas encoder-decoder models can handle more sophisticated tasks. In this section, we assess the performance of MIVPG on a model from the decoder-only category. Specifically, we use BLIP2[22] with OPT-2.7b[47] as the base LLM and validate the performance on the PatchGastricADC22 dataset. In the experiments, we only replace the LLM while keeping other hyperparameters unchanged.
![Table 4. Experiments on the PatchGastricADC22 dataset [36] with OPT-2.7b as the language model](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-sl133j6.png)
The experiment results on PatchGastricADC22 using OPT-2.7b as the language model are presented in Table 4. Overall, the model continues to outperform the baselines shown in Table 1, emphasizing the advantages of integrating MLLMs into the WSI captioning task. Notably, the model with CSA performs better than the one without it, reaffirming the effectiveness of CSA. It is also worth noting that OPT-2.7b does not outperform FLAN-T5-XL. This could be attributed, in part, to the insufficiency of training data: since OPT-2.7b is relatively less sophisticated, more training data may be required to train a more powerful model.
C.3. More Visualization
This section provides additional visualization results on the ABO dataset, including both patch-level attention weights and image-level attention weights. In the patch-level attention weights, it is evident that the model excels in detecting the shapes of objects, as a significant portion of the patch-level weights is assigned to edges and contours. The image-level attention weights are displayed as one map for each of the twelve heads. Each row in a map represents a query, while each column represents an image. Different heads and queries exhibit varying attention patterns towards the images, demonstrating the diversity in how the model processes and attends to the input images.
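To make the layout of the image-level maps concrete, the following sketch builds one map per head, with one row per query and one column per image. It is a simplified illustration rather than the MIVPG implementation: it uses plain scaled dot-product attention without learned projections, random vectors in place of real embeddings, and toy sizes (only the twelve heads come from the text).

```python
import math
import random


def attention_maps(queries, images, num_heads):
    """Return maps[h][q][i]: the weight head h assigns to image i for query q."""
    dim = len(queries[0])
    assert dim % num_heads == 0, "embedding size must divide evenly across heads"
    head_dim = dim // num_heads
    maps = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_map = []
        for q in queries:
            # Scaled dot-product score between this query and every image embedding.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], img[lo:hi])) / math.sqrt(head_dim)
                for img in images
            ]
            # Numerically stable softmax over the image axis (one row of the map).
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            head_map.append([e / total for e in exps])
        maps.append(head_map)
    return maps


rng = random.Random(0)
NUM_HEADS, NUM_QUERIES, NUM_IMAGES, DIM = 12, 32, 4, 48  # toy sizes (assumed)
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
images = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_IMAGES)]
maps = attention_maps(queries, images, NUM_HEADS)
```

Each row of `maps[h]` sums to one, so a row can be read as how a single query distributes its attention across the images of one sample.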
![References: ['a short bar with weights at each end that is used usually in pairs for exercise.', 'gymnastic weight for dogs.', 'a six sided barbell.', 'a hexagonal shaped dumbbell', 'pesa de mano con exterior hexagonal']Inference: a six sided barbell.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-cx2336t.png)
![References: ['a chair with a metal square like right forming the armrests and the legs and also having lines on the backrest and the seat.', 'the chair is composed of a seat and a square backrest with two armrests and two square legs on each side', 'a one seater chair with flat metal armrests extended to form the legs.', 'a one seater chair with flat metal armrests extended to form the legs and having rows patterns on the backrest and seat pillow.']Inference: a one seater chair with flat armrests and having four thin standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-ik333jz.png)
![References: ['a stool with raised legs and a curved seat.', 'a rectangular top stool with four standing legs.', 'gray chair without arms or back with four legs that support it.', 'a stool with a rectangular cushion and has four long legs', 'a rectangular top stool .']Inference: a rectangular top stool with four standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-cl43321.png)
![References: ['single seater tufted sofa with seat attached to long back and side arms having for wooden legs.', 'a one seater chair with hole pattern on the backrest and having low armrests.']Inference: a one seater chair with hole pattern on the backrest and having four thin standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-54533zt.png)
![References: ['a one seater sofa with huge back pillow and broad seat pillow and having four thin standing legs.', 'a broad one seater sofa with four thin standing legs.', 'one seater sofa with armrests on each side and four short legs. the seat and backrest have rounded rectangular cushions', 'one seater sofas wide with medium legs']Inference: a one seater sofa with huge back pillow and having four thin standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-f9633js.png)
![References: ['durable spongy waterproof small size pillow.', 'a structure, observed in certain extrusive igneous rocks, that is characterized by discontinuous pillow-shaped masses ranging in size from a few centimeters to a meter or more in greatest dimension (commonly between 30 cm and 60 cm).', 'square pillow with cushions', 'sofa,bed pillow, square in shape']Inference: a square shaped pillow](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-c5733zj.png)
![References: ['a two seater sofa with four thin standing legs, cylindrical pillow at each armrest and square pattern on the seat.']Inference: a two seater sofa with square pillow at each armrest and having four thin standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-i3833ap.png)
![References: ['a cubical shaped ottoman seat.', 'a seat with a square shaped cushion', 'a cube shaped ottoman seat.']Inference: a cube shaped ottoman seat with square shaped legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-qb93355.png)
![References: ['the sofa consists of two huge pieces with armrests. it has two big size pillows and four little legs.', 'the couch has square set and back consists of square pillow held up by four curved legs.', 'a two seater sofa with huge back pillows and four thin standing legs.', 'two seater sofa with armrests on each side. it has rounded rectangular cushions on the seats and backrest']Inference: a two seater sofa with huge back pillows and having four thin standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-n8a33rw.png)
![References: ['a one seater sofa with broad seat pillow.', 'padded gray sofa.', 'a one seater sofa with broad seat and four small standing legs.']Inference: a one seater sofa with huge back pillow and seat pillow and having four short standing legs.](https://cdn.hackernoon.com/images/fWZa4tUiBGemnqQfBGgCPf9594N2-l4b33kc.png)
Authors:
(1) Wenliang Zhong, The University of Texas at Arlington;
(2) Wenyi Wu, Amazon;
(3) Qi Li, Amazon;
(4) Rob Barton, Amazon;
(5) Boxin Du, Amazon;
(6) Shioulin Sam, Amazon;
(7) Karim Bouyarmane, Amazon;
(8) Ismail Tutar, Amazon;
(9) Junzhou Huang, The University of Texas at Arlington.
