Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs),
such as the Flamingo model and its subsequent competitors, have started to emerge as natural
steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that
are hardly captured by the current evaluation benchmarks. Indeed, task performance (\emph{e.g.}, VQA accuracy)
alone does not provide enough clues to understand their real capabilities and limitations, or to what extent
such models are aligned with human expectations.
To refine our understanding of those flaws, we deviate from
the current evaluation paradigm and propose the EvALign-ICL framework
(Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning), in which (1) we evaluate 10 recent open-source LMMs, ranging from 3B to 80B parameters, on 5 different axes: hallucinations,
abstention, compositionality, explainability, and instruction following. Our evaluation on these axes
reveals major flaws in LMMs.
While the current go-to solutions to align these models are training-based, such as instruction tuning or RLHF,
(2) we instead explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL approaches:
Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
Our findings are as follows.
(1) Despite their success, LMMs have flaws that remain unsolved with scaling alone.
(2) The effect of ICL on LMMs' flaws is nuanced: while it is effective at improving explainability and answer abstention,
ICL only slightly improves instruction following, does not improve compositional abilities, and
actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to
efficiently tackle some of those flaws.
We consider 10 different models from OpenFlamingo (OF) and IDEFICS (up to 80B parameters), as described in the table below.
Hallucination in text is the tendency of LLMs to generate coherent, plausible responses over factual ones.
By analogy, when considering multiple modalities, we are also concerned with object
hallucinations (OH), wherein the textual description generated by a multimodal model describes objects not present in the input image.
Addressing OH is critical to avoid harm, especially in safety-critical applications (\emph{e.g.}, autonomous
driving or medical imaging). We evaluate the various LMMs for captioning on the COCO dataset. Overall captioning performance is measured
with CIDEr. In addition, to capture OH, we report the CHAIRs metric,
which compares the objects mentioned in the generated caption to those actually present in the image.
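To make the metric concrete, here is a minimal sketch of how CHAIRs can be computed, assuming a simple exact word match against an object vocabulary (the reference implementation additionally handles synonyms and plural forms); the function name and data layout are illustrative, not the evaluation code itself.

```python
from typing import Dict, Set

def chair_s(captions: Dict[str, str],
            gt_objects: Dict[str, Set[str]],
            vocab: Set[str]) -> float:
    """CHAIRs: fraction of generated captions that mention at least one object
    not present in the image.

    captions:   image_id -> generated caption
    gt_objects: image_id -> objects actually present in the image
    vocab:      object vocabulary (e.g., the 80 COCO categories)
    """
    hallucinated = 0
    for img_id, caption in captions.items():
        mentioned = {w for w in caption.lower().split() if w in vocab}
        if mentioned - gt_objects[img_id]:  # objects mentioned but not in the image
            hallucinated += 1
    return hallucinated / max(len(captions), 1)

# Toy example: the second caption hallucinates a "car", so CHAIRs = 0.5.
vocab = {"dog", "cat", "frisbee", "car"}
captions = {"1": "a dog catching a frisbee", "2": "a cat sitting on a car"}
gt_objects = {"1": {"dog", "frisbee"}, "2": {"cat"}}
print(chair_s(captions, gt_objects, vocab))
```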
Finding 1. LMMs suffer from severe hallucinations. A small number of ICL shots partially alleviates them, while increasing the number of shots exacerbates the problem, especially for small models (fewer than 9B parameters).
Pretraining on more high-quality data and unfreezing the LLM weights help reduce hallucinations.
LMMs should know when they do not know, and abstain instead of providing incorrect answers.
Here we study a scenario where the question cannot be answered from the image. We evaluate on TDIUC,
a VQA dataset containing absurd questions ($\sim22\%$ of the total number of questions) that are unrelated to
the image and thus should not be answered. When abstaining, the model should generate a specific keyword.
We report the overall accuracy in addition to the abstention F1-score (classifying each question as absurd or not).
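As a rough illustration of the abstention metric, the sketch below computes the F1-score of the abstention decision, treating a prediction as an abstention when it contains an abstention keyword; the keyword string and data layout are assumptions for the example, not the exact setup used in the evaluation.

```python
from typing import List

def abstention_f1(predictions: List[str], is_absurd: List[bool],
                  keyword: str = "unanswerable") -> float:
    """F1 over the abstention decision: a prediction counts as abstaining when it
    contains the keyword; the ground-truth positive class is 'absurd question'.
    The keyword is a placeholder for whatever abstention token the model is prompted to emit."""
    abstained = [keyword in p.lower() for p in predictions]
    tp = sum(a and g for a, g in zip(abstained, is_absurd))
    fp = sum(a and not g for a, g in zip(abstained, is_absurd))
    fn = sum((not a) and g for a, g in zip(abstained, is_absurd))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Toy example: two absurd questions, the model abstains on one of them.
preds = ["unanswerable", "a red ball", "two dogs"]
labels = [True, True, False]
print(abstention_f1(preds, labels))  # precision 1.0, recall 0.5 -> F1 ~0.67
```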
Finding 2. LMMs are more likely to give incorrect answers than to abstain.
ICL helps them abstain. Larger models, higher-quality data, and unfreezing the LM weights improve abstention.
Compositionality holds when the meaning of a sentence is determined by its elements and the rules used to compose them.
To study this, we evaluate whether LMMs' understanding of a caption changes when its constituents are changed.
We evaluate on the CREPE benchmark, an image-text retrieval dataset with hard negatives constructed by
changing the composition of the ground-truth captions. Instead of retrieval, we cast the task as
Image-Text Matching (ITM): the model is given one caption and asked to decide whether it describes the
image or not. We use the positive and negative captions provided by the benchmark.
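For reference, here is a minimal sketch of how a few-shot ITM prompt could be assembled for an interleaved image-text LMM; the `<image>` placeholder, the yes/no template, and the data class are illustrative assumptions (the real formatting depends on the model's processor, e.g., OpenFlamingo or IDEFICS).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ITMExample:
    image: str    # path or identifier of the image, resolved by the model's processor
    caption: str
    label: str    # "yes" for the positive caption, "no" for a hard negative

def build_itm_prompt(demos: List[ITMExample], query: ITMExample,
                     image_token: str = "<image>") -> str:
    """Interleave few-shot ITM demonstrations with the query image and caption."""
    parts = []
    for d in demos:
        parts.append(f'{image_token}Does this caption describe the image: "{d.caption}"? '
                     f'Answer: {d.label}')
    parts.append(f'{image_token}Does this caption describe the image: "{query.caption}"? '
                 f'Answer:')
    return "\n".join(parts)

# Toy usage with one positive and one hard-negative demonstration.
demos = [ITMExample("img1.jpg", "a dog chasing a ball", "yes"),
         ITMExample("img2.jpg", "a ball chasing a dog", "no")]
query = ITMExample("img3.jpg", "a child riding a bike", "")
print(build_itm_prompt(demos, query))
```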
Finding 3. LMMs lack compositional abilities and struggle to acquire them even with ICL.
Despite the impressive abilities of LMMs, it is still unclear whether their generations result from underlying
complex reasoning grounded in the input image, or rather from memorization or bias exploitation. Instead of looking at
internal activations and learned features as a means of explaining outputs,
we take another, more explicit approach: asking the model itself for an explanation. We consider VQA-X,
a VQA dataset with human-annotated explanations for each image-question-answer triplet, and use CIDEr as
the metric to measure the syntactic similarity between the generated explanations and the ground truths.
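Concretely, the explanation can be elicited with a few-shot prompt that appends the ground-truth answer and explanation for each demonstration and lets the model continue for the query; the template below (answer followed by "because ...") and the `<image>` placeholder are illustrative assumptions, not the exact prompt used in the paper.

```python
from typing import List, Tuple

def build_explanation_prompt(demos: List[Tuple[str, str, str]], query_question: str,
                             image_token: str = "<image>") -> str:
    """Few-shot prompt for VQA with explanations.

    demos: list of (question, answer, explanation) triplets, one per in-context image.
    The model is expected to complete the answer and its justification for the query.
    """
    parts = []
    for question, answer, explanation in demos:
        parts.append(f"{image_token}Question: {question} "
                     f"Answer: {answer} because {explanation}")
    parts.append(f"{image_token}Question: {query_question} Answer:")
    return "\n".join(parts)

# Toy usage with a single demonstration.
demos = [("Is the man skiing?", "no", "he is snowboarding on a single board")]
print(build_explanation_prompt(demos, "What sport is shown?"))
```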
Finding 4. LMMs still fail to provide good explanations, yet ICL can improve performance. Bigger models explain better.
Existing multimodal models are trained to solve relatively simple tasks, such as providing shallow image
descriptions or answering questions with one or two words. These capabilities are not enough to build general
assistants that can engage in conversation with humans. Helpful assistants should help humans answer complex questions,
precisely follow specific instructions, and engage in conversations. Current approaches
to integrating such abilities are based on instruction tuning,
wherein the model is fine-tuned on curated instruction datasets. We evaluate LMMs on the LLaVA dataset,
which contains 3 types of instructions: detailed image descriptions, complex questions, and
conversations. These instructions are generated with GPT-4 (text-only). For ICL, the demonstrations
are selected randomly from the dataset with the same instruction type as the query.
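A minimal sketch of this demonstration selection, assuming each example carries an instruction-type field and a unique id (the field names are illustrative):

```python
import random
from typing import Dict, List

def sample_demonstrations(dataset: List[Dict], query: Dict, k: int = 4,
                          seed: int = 0) -> List[Dict]:
    """Randomly pick k demonstrations that share the query's instruction type
    (detailed description, complex question, or conversation), excluding the query itself."""
    rng = random.Random(seed)
    pool = [ex for ex in dataset
            if ex["instruction_type"] == query["instruction_type"]
            and ex["id"] != query["id"]]
    return rng.sample(pool, min(k, len(pool)))
```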
Finding 5. LMMs do not precisely follow user instructions, and a small number of ICL demonstrations makes them slightly more helpful, especially for models that are not instruction-tuned.
We push ICL further and propose new, improved variants to address some of LMMs' limitations.
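As an example of the spirit of these variants, the sketch below builds a Chain-of-Hindsight-style ICL prompt in which each demonstration pairs a good and a bad answer, explicitly labeled, so the model can contrast them; the wording of the template is an assumption for illustration, not the paper's exact format (see the paper for the precise Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL setups).

```python
from typing import List, Tuple

def build_coh_icl_prompt(demos: List[Tuple[str, str, str]], query_question: str,
                         image_token: str = "<image>") -> str:
    """Chain-of-Hindsight-style ICL: each demonstration exposes both a good and a
    bad answer before the model is asked for a good answer to the query.

    demos: list of (question, good_answer, bad_answer), one per in-context image.
    """
    parts = []
    for question, good, bad in demos:
        parts.append(f"{image_token}Question: {question} "
                     f"Good answer: {good} Bad answer: {bad}")
    parts.append(f"{image_token}Question: {query_question} Good answer:")
    return "\n".join(parts)

# Toy usage with a single demonstration.
demos = [("How many dogs are there?", "two", "several animals")]
print(build_coh_icl_prompt(demos, "What color is the car?"))
```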
This work was partly supported by ANR grant VISA DEEP (ANR-20-CHIA-0022), and HPC resources of IDRIS under the allocation 2022-[AD011013415] and 2023-[AD011013415R1] made by GENCI. The authors would like to thank Hugo Laurençon for fruitful discussions.
@article{shukor2023beyond,
title={Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning},
author={Shukor, Mustafa and Rame, Alexandre and Dancette, Corentin and Cord, Matthieu},
journal={arXiv preprint arXiv:2310.00647},
year={2023}
}