Querying Transformer (Q-Former) in BLIP-2 improves Image-Text Generation in E-Commerce Applications

In 2023, there was an incredible craze for text-to-image models.

However, in 2024, this trend is expected to reverse, with attention shifting towards image-to-text models.

BLIP-2 is here. (It launched in 2023; I have only just begun exploring its capabilities 😂)

In this post, I am not reviewing the paper but rather sharing my recent experience with BLIP-2, which I’ve fine-tuned and integrated with a popular e-commerce application.

To give a small glimpse of this app: it allows users to upload any image, ask questions about it, and get answers.

Salesforce AI Research has introduced BLIP-2, the successor to BLIP, which addresses the challenges BLIP faced. It revamps the pre-training objectives of BLIP (Bootstrapping Language-Image Pre-training) for vision-and-language representation learning.

BLIP-2, from the Salesforce AI Research team, enables zero-shot image-to-text generation and paves the way for multimodal ChatGPT-like models. It is a new technique for pre-training vision-language models in a cost-efficient manner with significantly fewer trainable parameters. The key challenge in using a frozen LLM is aligning visual features to the text space.

Motivation:

Bridging the gap between the vision and language modalities matters precisely because the LLM remains frozen and has never been exposed to any images during its language-only pre-training.

BLIP-2 primarily consists of two stages:

Stage 1: Bootstraps vision-language representation learning from a frozen image encoder.

In stage 1 of the pre-training strategy, BLIP-2 connects the lightweight Querying Transformer (Q-Former) to a frozen image encoder 🥶. Here, the Q-Former learns to extract the visual features that are most relevant to the text (see the Image-Text Matching block).

BLIP 2 — Stage 1 Bootstraps vision-language representation learning from a frozen image encoder.

Stage 2: Bootstraps vision-to-language generative learning from a frozen language model.

In stage 2 of the pre-training strategy, BLIP-2 connects the output of the Q-Former to a frozen LLM 🥶: the Q-Former’s output query embeddings are used directly as soft prompts for the frozen LLM. In this way, BLIP-2 effectively and efficiently leverages both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computational cost.

BLIP 2 — Stage 2 Bootstraps vision-to-language generative learning from a frozen language model
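
To picture the "soft prompt" step, here is a toy sketch (my own illustration with dummy tensors, not the actual BLIP-2 code); the 32 queries and 768-dimensional Q-Former output follow the paper, and 2560 is the hidden size of OPT-2.7B:

import torch
import torch.nn as nn

query_output = torch.randn(1, 32, 768)        # dummy Q-Former output: 32 query embeddings of size 768
proj = nn.Linear(768, 2560)                   # project to the frozen LLM's hidden size (OPT-2.7B uses 2560)
soft_prompts = proj(query_output)             # (1, 32, 2560)
text_embeds = torch.randn(1, 10, 2560)        # dummy embeddings of the text prompt tokens
llm_inputs = torch.cat([soft_prompts, text_embeds], dim=1)  # soft prompts prepended to the text, fed to the frozen LLM
print(llm_inputs.shape)                       # torch.Size([1, 42, 2560])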

For comparison, Flamingo (Alayrac et al., 2022) inserts new cross-attention layers into the LLM to inject visual features and pre-trains these new layers on billions of image-text pairs. Like BLIP-2, it adopts a language-modelling loss, where the language model generates text conditioned on the image.

Q-Former (188M parameters):

Source : BLIP-2

The queries interact with each other through self-attention layers and interact with frozen image features through cross-attention layers (inserted every other transformer block).

The queries are learnable model parameters. The Q-Former is initialised with pre-trained BERT-base weights, while the cross-attention layers are randomly initialised.

Note: the image transformer and the text transformer of the Q-Former share the same self-attention layers.
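
As a rough mental model (not the real LAVIS implementation), a single Q-Former-style block with learned queries, query self-attention, and cross-attention into frozen image features could be sketched like this:

import torch
import torch.nn as nn

class ToyQFormerBlock(nn.Module):
    """Illustrative only: 32 learned query tokens, self-attention among the queries,
    then cross-attention into frozen image features."""
    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))          # learnable query tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):                          # image_feats: (B, N_patches, dim) from a frozen ViT
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q, _ = self.self_attn(q, q, q)                       # queries interact with each other
        q, _ = self.cross_attn(q, image_feats, image_feats)  # queries attend to frozen image features
        return q                                             # (B, 32, dim), later projected into soft prompts

frozen_feats = torch.randn(2, 257, 768)                      # dummy ViT output (CLS token + patches)
print(ToyQFormerBlock()(frozen_feats).shape)                 # torch.Size([2, 32, 768])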

Visual-Language Tasks: BLIP-2 can handle various image-to-text tasks, including:

  • Image Captioning: Generating descriptive captions for images.
  • Prompted Image Captioning: Providing captions based on specific prompts.
  • Visual Question Answering (VQA): Answering questions related to images.
  • Chat-based Prompting: Engaging in conversations with users based on visual input.

BLIP-2 Image-Text Contrastive Learning

Now, let’s imagine an e-commerce platform that integrates BLIP-2, where users can upload images of clothing, accessories, or footwear, and receive tailored recommendations and answers to their fashion-related queries. This is precisely what our e-commerce store aims to achieve.

  1. Virtual Clothing Try-On: Customers can upload images of themselves or select pre-existing models to visualise how different outfits would look on them, giving a realistic representation of fit, fabric, and style.
  2. Sales Boost: When customers can see themselves in the clothes before making a purchase, they are more likely to buy, which can increase sales.
  3. Reduced Returns: Nearly half of online shoppers are disappointed when clothes don’t look as expected; letting customers virtually try on items helps minimise this.

BLIP-2 in Action:

  • Image-Driven Queries: Users can upload an image and ask questions like:
    “What fabric is this dress made of?”
    “Does this shirt come in other colors?”
    “Is this shoe suitable for outdoor activities?”
  • Semantic Understanding: BLIP-2 analyses the image, extracts relevant features, and understands the context of the query.
  • Precise Answers: The model provides accurate and concise answers based on the image content. For instance:
    “The dress is made of silk.”
    “The shirt is available in blue, white, and black.”
    “Yes, the shoe is designed for outdoor use.”
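
To make this concrete, such a fashion query maps onto the same prompted-generation call used in the demo later in this post; the image path and the example answer below are purely illustrative:

# hypothetical query, reusing the LAVIS objects (model, vis_processors, device) loaded in the demo below
raw_image = Image.open("PATH/TO/A/PRODUCT/IMG.png").convert("RGB")   # e.g. a photo of a dress
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

model.generate({
    "image": image,
    "prompt": "Question: what fabric is this dress made of? Answer:",
})
# returns a list with one short answer string, e.g. ['silk'] (illustrative, not a guaranteed output)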

Basic Implementation Steps:

  • Data Collection and Preprocessing:
    → Gather a diverse dataset of fashion images.
    → Annotate images with relevant information (fabric, color, style, etc.).
  • Model Training:
    → Fine-tune the BLIP-2 model on the annotated dataset.
    → Optimise for image-based queries.
  • Integration with E-Commerce Platform:
    → Develop APIs for image upload and question submission (see the API sketch after this list).
    → Connect the BLIP-2 model to handle queries.
  • User Interface:
    → Design an intuitive interface for image upload.
    → Display answers alongside product details.
  • Testing and Refinement:
    → Test the system with real users.
    → Continuously improve the model’s accuracy.
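
For the integration step above, here is a minimal, illustrative API sketch. It assumes a FastAPI service and the LAVIS objects (model, vis_processors, device) loaded as in the demo below; the /ask endpoint and response format are made up for illustration:

# minimal sketch, assuming FastAPI and the LAVIS objects (model, vis_processors, device)
# loaded as in the demo below; endpoint name and response fields are hypothetical.
import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

app = FastAPI()

@app.post("/ask")
async def ask(image: UploadFile = File(...), question: str = Form(...)):
    raw_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    pixels = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    answer = model.generate({
        "image": pixels,
        "prompt": f"Question: {question} Answer:",
    })
    return {"question": question, "answer": answer[0]}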

BLIP-2 Model Zoo

# ==================================================
# Architectures    Types
# ==================================================
# blip2_opt        pretrain_opt2.7b, caption_coco_opt2.7b, pretrain_opt6.7b, caption_coco_opt6.7b
# blip2_t5         pretrain_flant5xl, caption_coco_flant5xl, pretrain_flant5xxl
# blip2            pretrain, coco
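
With LAVIS installed, the registered architectures and model types can also be printed at runtime (this uses the model_zoo helper shown in the LAVIS README):

# list every architecture and model type registered in LAVIS
from lavis.models import model_zoo
print(model_zoo)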

Demo:

Now, let's walk through a hands-on image-to-text generation example.

The other checkpoints are too large and need a lot of RAM to load; running on a GPU can speed up inference.

Here we will try out:

"Salesforce/blip2-opt-2.7b"  # the other models are too large to load onto the free Inference API
  • Install LAVIS (via pip, or from source by following the LAVIS repo)
pip install salesforce-lavis
  • Import the necessary libraries
import torch
import requests
from PIL import Image
from IPython.display import display  # display() is built into notebooks; import it explicitly when running as a script
from lavis.models import load_model_and_preprocess
  • Load the input image
# load sample image
raw_image = Image.open("PATH/TO/THE/IMG.png").convert("RGB")
display(raw_image.resize((596, 437)))
  • Caution: Large RAM is required to load the larger models. Running on GPU can optimise inference speed.
# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

##################################
#### output: device(type='cuda') ####
##################################
  • Load pre-trained/fine-tuned BLIP-2 captioning model
# we associate a model with its preprocessors to make it easier for inference.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device,
)
# Other available model/type combinations:
#
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_opt", model_type="pretrain_opt6.7b", is_eval=True, device=device
# )
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
# )
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
# )
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
# )
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
# )
# model, vis_processors, _ = load_model_and_preprocess(
#     name="blip2_t5", model_type="caption_coco_flant5xl", is_eval=True, device=device
# )

vis_processors.keys()
  • Preparing an image as model input using the associated processors
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
  • Generate caption using beam search
model.generate({"image": image})
  • Generating multiple captions using nucleus sampling
# due to the non-deterministic nature of nucleus sampling, you may get different captions.
model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
  • Instructed zero-shot vision-to-language generation
model.generate({
    "image": image,
    "prompt": "Question: which city is this? Answer:",
})

##################################
#### output: ['singapore'] ####
##################################



model.generate({
    "image": image,
    "prompt": "Question: which city is this? Answer: singapore. Question: why?",
})


#########################################
#### output: ['it has a statue of a merlion'] ####
#########################################
# conversation context: previous question/answer pairs
context = [
    ("which city is this?", "singapore"),
    ("why?", "it has a statue of a merlion"),
]

question = "where is the name merlion coming from?"
template = "Question: {} Answer: {}."

# stitch the previous turns together and append the new question
prompt = " ".join([template.format(q, a) for q, a in context]) + " Question: " + question + " Answer:"


#########################################
#### output: Question: which city is this?
#### Answer: singapore.
#### Question: why?
#### Answer: it has a statue of a merlion.
#### Question: where is the name merlion coming from?
#### Answer: sea town in javanese ####
#########################################
model.generate(
    {"image": image, "prompt": prompt},
    use_nucleus_sampling=False,
)

##################################
#### output: ['merlion is a portmanteau of mermaid and lion'] ####
##################################
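
A side note on the RAM caution above: if you prefer the Hugging Face transformers stack for the same "Salesforce/blip2-opt-2.7b" checkpoint, a rough sketch of loading BLIP-2 in half precision to reduce memory looks like this (the dtype and device choices are assumptions for a single-GPU setup):

# rough sketch, using the Hugging Face transformers BLIP-2 integration
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,   # half precision roughly halves GPU memory
)
model.to("cuda")

raw_image = Image.open("PATH/TO/THE/IMG.png").convert("RGB")
inputs = processor(images=raw_image, text="Question: which city is this? Answer:",
                   return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())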

Conclusion

By combining virtual try-on technology with the BLIP-2 model, e-commerce stores can revolutionise the way users shop for fashion online. Whether shoppers are curious about fabric composition or seeking style advice, the platform can provide personalised and reliable answers.

Get ready to transform the fashion industry — one image at a time!

Don’t forget to 👏 if you liked the article.

Thank you for taking the time to read! Your feedback is most welcome!

Hope this helps! Feel free to let me know if this post was useful. 😃

Hungry for AI? Follow, bite-sized brilliance awaits! ⚡

🔔 Follow Me: LinkedIn | GitHub | Twitter

