Comparing image captioning in LLMs with Gemini Vision Pro, BLIP, BLIP2, and LLaVA

Tiya Vaj
May 19, 2024
1. Gemini Vision Pro:
  • Type: Large Multimodal Model (LMM)
  • Focus: Text-centric Multimodality
  • Description: Gemini Vision Pro is a powerful LMM trained on a massive dataset of text, code, and images. Its primary focus is on text generation and manipulation tasks that leverage information from various modalities. It can perform actions like:
  • Summarization of factual topics.
  • Code generation based on natural language descriptions.
  • Answering open-ended, challenging, or strange questions in an informative way.
  • Image Description: Gemini Vision Pro can process and describe images, but image description is not its primary focus. It’s better suited for tasks involving text generation and manipulation based on a combination of modalities.

2. BLIP (Bootstrapping Language-Image Pre-training) and BLIP2:

  • Type: Vision-Language Models (VLMs)
  • Focus: Image-to-Text Generation
  • Description: BLIP and BLIP2 are VLMs designed to understand and describe visual content. They are trained on large datasets of image-text pairs. These models can:
  • Generate captions for images.
  • Answer questions about the content of an image.
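  • Match images with text (image-text retrieval).
  • Key Differences: BLIP2 builds on BLIP by connecting a frozen image encoder to a frozen large language model through a lightweight Querying Transformer (Q-Former), which generally gives better performance on image captioning tasks.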
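
To make the captioning workflow concrete, here is a minimal sketch of generating a caption with BLIP through the Hugging Face transformers library. The checkpoint name `Salesforce/blip-image-captioning-base`, the helper name `caption_image`, and the `max_new_tokens` value are assumptions for illustration, not necessarily the setup used for the captions in this article. Running it requires `transformers`, `torch`, and `Pillow`, and downloads the model on first use.

```python
# Assumed checkpoint for illustration; other BLIP captioning checkpoints work the same way.
DEFAULT_BLIP_CHECKPOINT = "Salesforce/blip-image-captioning-base"


def caption_image(image_path, checkpoint=DEFAULT_BLIP_CHECKPOINT, max_new_tokens=30):
    """Return a caption for the image at image_path (hypothetical helper)."""
    # Imported here so the heavy dependencies are only needed when captioning.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    # BLIP expects RGB input; the processor resizes and normalizes the image.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption token IDs, then decode them back into text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


if __name__ == "__main__":
    import sys

    # Usage: python caption.py path/to/image.jpg
    if len(sys.argv) > 1:
        print(caption_image(sys.argv[1]))
```

Loading the processor and model inside the function keeps the sketch short. For captioning many images, it would be better to load them once and reuse them.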

3. LLaVA (Large Language and Vision Assistant):

  • Type: VLMs
  • Focus: Image-to-Text Generation and Understanding
  • Description: LLaVA is a family of VLMs with different variations like LLaVA-1.5 and LLaVA-NeXT. These models excel at both image understanding and text generation tasks related to images. They can:
  • Generate detailed and informative captions for images.
  • Classify and categorize images based on their content.
  • Perform visual question answering (VQA).

Choosing the Right Model:

The best model for your needs depends on your specific task.

  • If you need text generation and manipulation with some image processing, consider Gemini Vision Pro.
  • If your focus is on image captioning or image-to-text tasks, BLIP, BLIP2, LLaVA, or LLaVA++ are better choices depending on the desired level of performance and available resources.

-LLaVA caption

1st pic : “A car accident captured from an overhead perspective, showing the aftermath of a collision on a city street.”

2nd pic : “Traffic accident on a city street with debris scattered across the road.”

LLaVA does a great job on both images.

-BLIP2 caption

1st pic : a car is driving down the road in the rain

2nd pic : a car is on the side of the road and a person is laying on the ground

As we can see from the BLIP2 captions, the 1st image’s caption is wrong, while the 2nd image’s caption is right.

-BLIP caption

1st pic : a car is driving down the road in the rain

2nd pic : a car is seen on the side of a road

As we can see from the BLIP captions, both tend to be short. The 1st caption is wrong, while the 2nd caption is correct but does not give much information.

-Gemini Vision Pro caption

1st pic : A car accident has occurred at a crosswalk. A black car has been t-boned by a white car. The black car is facing in the direction the white car was traveling. The white car appears to have run a red light. There is a bus stop to the left of the intersection.

2nd pic : A car accident. A white van and a gray car collided in the middle of an intersection. The gray car flipped over and landed on its roof. A motorcyclist is lying on the ground near the van. There are people standing around the accident, including a woman holding her head in her hands.

As we can see from the Gemini Vision Pro captions, the first is long and only partially correct, while the second is also long but describes details that do not exist in the image.
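
To put the four sets of captions side by side, here is a small sketch that records the captions quoted in this article together with my manual verdicts and computes the average caption length per model. The structure (`RESULTS`, `summarize`) is only an illustrative assumption. The verdict strings paraphrase the comments above and are not an automatic metric.

```python
# Captions quoted in this article, paired with the manual verdicts given above.
RESULTS = {
    "LLaVA": [
        ("A car accident captured from an overhead perspective, showing the "
         "aftermath of a collision on a city street.", "correct"),
        ("Traffic accident on a city street with debris scattered across the road.",
         "correct"),
    ],
    "BLIP2": [
        ("a car is driving down the road in the rain", "wrong"),
        ("a car is on the side of the road and a person is laying on the ground",
         "correct"),
    ],
    "BLIP": [
        ("a car is driving down the road in the rain", "wrong"),
        ("a car is seen on the side of a road", "correct but sparse"),
    ],
    "Gemini Vision Pro": [
        ("A car accident has occurred at a crosswalk. A black car has been t-boned "
         "by a white car. The black car is facing in the direction the white car "
         "was traveling. The white car appears to have run a red light. There is "
         "a bus stop to the left of the intersection.", "partially correct"),
        ("A car accident. A white van and a gray car collided in the middle of an "
         "intersection. The gray car flipped over and landed on its roof. A "
         "motorcyclist is lying on the ground near the van. There are people "
         "standing around the accident, including a woman holding her head in "
         "her hands.", "describes details not in the image"),
    ],
}


def word_count(caption):
    """Count whitespace-separated words in a caption."""
    return len(caption.split())


def summarize(results):
    """Return the mean caption length and the list of verdicts for each model."""
    summary = {}
    for model, items in results.items():
        lengths = [word_count(caption) for caption, _ in items]
        summary[model] = {
            "mean_words": sum(lengths) / len(lengths),
            "verdicts": [verdict for _, verdict in items],
        }
    return summary


for model, row in summarize(RESULTS).items():
    print(f"{model:18s} {row['mean_words']:5.1f} words  {row['verdicts']}")
```

The summary makes one point from the comparison easy to see: longer captions are not automatically more accurate.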

After investigating two images, I found that LLaVA tends to perform well. However, two images are too few for a thorough verification, which would require a much larger set of photos. The conclusion is that while LLaVA looks effective here, confidence in that result will only grow as it is tested on more visual data.
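
One possible way to scale this check beyond two images is to score each generated caption against a human-written reference and average the scores over many photos. The sketch below uses a simple word-overlap F1 as a crude stand-in for established captioning metrics such as BLEU or CIDEr. The function names and the reference caption in the usage example are hypothetical.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def overlap_f1(candidate, reference):
    """Word-overlap F1 between a generated caption and a reference caption."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    common = sum((cand & ref).values())  # words shared by both, with multiplicity
    if common == 0:
        return 0.0
    precision = common / sum(cand.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_score(pairs):
    """Average overlap F1 over (generated, reference) caption pairs."""
    scores = [overlap_f1(generated, reference) for generated, reference in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical reference caption, written only for this example.
reference = "a car accident on a city street"
print(overlap_f1("Traffic accident on a city street with debris scattered across the road.", reference))
print(overlap_f1("a car is driving down the road in the rain", reference))
```

Word overlap rewards shared vocabulary rather than factual correctness, so a score like this complements manual inspection and does not replace it.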



