Visual design tools and vision language models have widespread applications in the multimedia industry, yet despite significant advancements in recent years, operating these tools still requires considerable expertise. To improve accessibility and controllability, the multimedia industry is increasingly adopting text-guided or instruction-based image editing techniques. These techniques use natural language commands instead of traditional regional masks or elaborate descriptions, allowing for more flexible and controlled image manipulation. However, the instructions humans provide are often brief, and existing models can struggle to fully capture and execute them. Meanwhile, diffusion models, known for their ability to create realistic images, are in high demand within the image editing sector.
Moreover, Multimodal Large Language Models (MLLMs) have shown impressive performance in tasks involving visual-aware response generation and cross-modal understanding. MLLM Guided Image Editing (MGIE) is a study inspired by MLLMs that evaluates their capabilities and analyzes how they support editing through text or guided instructions. The approach learns to derive expressive instructions and provide explicit guidance, and the resulting editing model comprehends visual information and executes edits through end-to-end training. In this article, we will delve deeply into MGIE, assessing its impact on global image optimization, Photoshop-style modifications, and local editing. We will also discuss the significance of MGIE in instruction-based image editing tasks that rely on expressive instructions. Let’s begin our exploration.
Multimodal Large Language Models and Diffusion Models are two of the most widely used AI and ML frameworks today, owing to their remarkable generative capabilities. On one hand, you have diffusion models, best known for producing highly realistic and visually appealing images; on the other, you have Multimodal Large Language Models, renowned for their exceptional prowess in generating a wide variety of content including text, speech, and images/videos.
Diffusion models can manipulate latent cross-modal attention maps to perform visual edits that reflect the changes in a goal caption, and they can also use a guided mask to edit a specific region of an image. But the primary reason diffusion models are so widely used for multimedia applications is that, instead of relying on elaborate descriptions or regional masks, they support instruction-based editing approaches that let users express how to edit an image directly through text instructions or commands. Moving along, Large Language Models need no introduction, since they have demonstrated significant advancements across an array of diverse language tasks including text summarization, machine translation, text generation, and question answering. LLMs are usually trained on large and diverse amounts of data that equip them with knowledge and visual creativity, allowing them to perform several vision-language tasks as well. Building upon LLMs, MLLMs or Multimodal Large Language Models can take images as natural inputs and provide appropriate visually aware responses.
With that being said, although Diffusion Models and MLLM frameworks are widely used for image editing tasks, there exist guidance issues with text-based instructions that hamper overall performance. This motivated the development of MGIE or MLLM Guided Image Editing, an AI-powered framework consisting of a diffusion model and an MLLM, as demonstrated in the following image.
Within the MGIE architecture, the diffusion model is trained end-to-end to perform image editing with the latent imagination of the intended goal, whereas the MLLM learns to predict precise expressive instructions. Together, the diffusion model and the MLLM take advantage of the inherent visual derivation, allowing the framework to address ambiguous human commands and produce realistic image edits, as demonstrated in the following image.
The MGIE framework draws heavy inspiration from two existing approaches: Instruction-based Image Editing and Vision Large Language Models.
Instruction-based image editing can significantly improve the accessibility and controllability of visual manipulation by adhering to human commands. There are two main frameworks utilized for instruction-based image editing: GAN frameworks and diffusion models. GANs or Generative Adversarial Networks are capable of altering images but are either limited to specific domains or produce unrealistic results. Diffusion models with large-scale training, on the other hand, can manipulate cross-modal attention maps to achieve global image editing and transformation. Instruction-based editing takes direct commands as input, without requiring regional masks or elaborate descriptions. However, the provided instructions are often ambiguous or not precise enough to properly guide the editing task.
Vision Large Language Models are renowned for their text-generation and generalization capabilities across various tasks: they have robust textual understanding and can even produce executable programs or pseudo-code. Building on these capabilities of large language models, MLLMs can perceive images and provide adequate responses using visual feature alignment with instruction tuning, and recent models have adopted MLLMs to generate images related to a chat or input text. However, what separates MGIE from such MLLMs or VLLMs is that while the latter produce images distinct from the inputs from scratch, MGIE leverages the abilities of MLLMs to enhance image editing through derived instructions.
MGIE: Architecture and Methodology
Traditionally, large language models have been used for natural language processing and generative tasks. But ever since MLLMs went mainstream, LLMs have been empowered to provide reasonable responses by perceiving image inputs. Conventionally, a Multimodal Large Language Model is initialized from a pre-trained LLM and contains a visual encoder to extract visual features and an adapter to project those features into the language modality. Owing to this, the MLLM framework is capable of perceiving visual inputs, although its output is still limited to text.
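The encoder-plus-adapter wiring described above can be sketched in a few lines. This is a minimal illustration, not MGIE's actual implementation: the dimensions, the random stand-in encoder, and the single linear adapter are all assumptions chosen for clarity.

```python
import numpy as np

# Hypothetical dimensions (illustrative, not from the paper's code): a vision
# encoder emitting 257 patch features of size 1024, and an LLM whose token
# embeddings have size 4096.
VIS_DIM, LLM_DIM, N_PATCHES = 1024, 4096, 257
rng = np.random.default_rng(0)

def visual_encoder(image):
    """Stand-in for a frozen vision encoder (e.g. a ViT): image -> patch features."""
    return rng.standard_normal((N_PATCHES, VIS_DIM))

# The adapter is typically a small learned projection that maps visual
# features into the LLM's embedding space, so they can sit alongside
# ordinary word embeddings in the input sequence.
W_adapter = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.02

def adapt(patch_features):
    return patch_features @ W_adapter

image = np.zeros((224, 224, 3))
visual_tokens = adapt(visual_encoder(image))
assert visual_tokens.shape == (N_PATCHES, LLM_DIM)
```

Once projected, these visual tokens are consumed by the LLM exactly like text tokens, which is what lets the model "see" the image while still only emitting text.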
The proposed MGIE framework aims to resolve this limitation and enable an MLLM to edit an input image into an output image on the basis of a given textual instruction. To achieve this, MGIE houses an MLLM trained to derive concise and explicit expressive text instructions. Furthermore, the framework adds special image tokens to its architecture to bridge the gap between the vision and language modalities, and adopts an edit head to transform those tokens across modalities. The transformed tokens serve as the latent visual imagination from the Multimodal Large Language Model and guide the diffusion model to carry out the editing task. The MGIE framework is then capable of drawing on visual perception for reasonable image editing.
Concise Expressive Instruction
Traditionally, Multimodal Large Language Models can offer visual-related responses with their cross-modal perception, owing to instruction tuning and feature alignment. To edit images, the MGIE framework uses the editing prompt as the primary language input alongside the image and derives a detailed explanation of the editing command. However, these explanations are often too lengthy or contain repetitive descriptions that misrepresent the intention, so MGIE applies a pre-trained summarizer to obtain succinct narrations and trains the MLLM to generate the summarized outputs. The framework treats this concise yet explicit guidance as an expressive instruction and applies a cross-entropy loss to train the Multimodal Large Language Model with teacher forcing.
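The instruction loss above is a standard next-token cross-entropy computed under teacher forcing: at each step the model is conditioned on the ground-truth prefix of the summarized instruction, and we penalize the log-probability it assigns to the true next token. A minimal numpy sketch (toy vocabulary and random logits, purely illustrative):

```python
import numpy as np

def cross_entropy_teacher_forcing(logits, targets):
    """Average next-token cross-entropy.

    logits:  (seq_len, vocab) scores produced while conditioned on the
             ground-truth prefix at every step (teacher forcing).
    targets: (seq_len,) token ids of the summarized expressive instruction.
    """
    # log-softmax, computed stably by subtracting the row max first
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(1)
logits = rng.standard_normal((3, 5))        # 3 steps, vocab of 5
targets = np.array([2, 0, 4])
loss = cross_entropy_teacher_forcing(logits, targets)
assert loss > 0.0

# A model that is nearly certain of every target token incurs near-zero loss.
confident = np.zeros((3, 5))
confident[np.arange(3), targets] = 100.0
assert cross_entropy_teacher_forcing(confident, targets) < 1e-6
```

In practice the same loss drives the MLLM toward emitting the concise summarized instruction rather than a lengthy free-form explanation.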
An expressive instruction provides a more concrete idea than the raw text instruction, bridging the gap for reasonable image editing and further enhancing the efficiency of the framework. Moreover, during inference the MGIE framework derives concise expressive instructions directly, instead of producing lengthy narrations and relying on external summarization. This lets the framework capture the visual imagination of the editing intention, but it is still limited to the language modality. To overcome this hurdle, the MGIE model appends a fixed number of visual tokens after the expressive instruction, with trainable word embeddings, allowing the MLLM to generate them through its LM (Language Model) head.
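The token-appending step can be pictured as follows. This is a sketch under stated assumptions: the placeholder names, the count of eight visual tokens, and the embedding size are illustrative, not taken from MGIE's released code.

```python
import numpy as np

# Illustrative assumptions: 8 special visual tokens, 4096-dim embeddings.
N_VISUAL_TOKENS, EMBED_DIM = 8, 4096
rng = np.random.default_rng(0)

# The expressive instruction produced by the MLLM, as a token sequence...
instruction_tokens = ["make", "the", "sky", "look", "like", "a", "sunset"]
# ...followed by special visual placeholders appended to the sequence.
visual_placeholders = [f"[IMG{i}]" for i in range(N_VISUAL_TOKENS)]
sequence = instruction_tokens + visual_placeholders

# Each [IMGi] has its own trainable embedding; the LM head learns to generate
# these tokens, and their hidden states carry the latent visual imagination.
visual_embeddings = rng.standard_normal((N_VISUAL_TOKENS, EMBED_DIM)) * 0.02
assert len(sequence) == len(instruction_tokens) + N_VISUAL_TOKENS
```

The key design choice is that the visual tokens are generated by the same LM head as ordinary words, so no separate decoding machinery is needed on the language side.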
Image Editing with Latent Imagination
In the next step, the MGIE framework adopts the edit head to transform the visual tokens into actual visual guidance. The edit head is a sequence-to-sequence model that maps the sequential visual tokens from the MLLM to a semantically meaningful latent that serves as the editing guidance. More specifically, the transformation over the word embeddings can be interpreted as a general representation in the visual modality, acting as an instance-aware visual imagination of the editing intention. Furthermore, to guide image editing with this visual imagination, the MGIE framework embeds a latent diffusion model in its architecture that includes a variational autoencoder and performs denoising diffusion in the latent space. The goal of the latent diffusion model is to generate the latent goal that preserves the latent input while following the editing guidance. The forward diffusion process adds noise to the latent goal over a sequence of timesteps, and the noise level increases with each timestep.
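The forward noising process mentioned above is the standard diffusion formulation: the latent at timestep t mixes the clean latent with Gaussian noise according to a cumulative schedule, so later timesteps keep less and less signal. A small sketch with a common linear beta schedule (schedule values and latent shape are illustrative assumptions):

```python
import numpy as np

def add_noise(z0, t, alphas_cumprod, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # a common linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 32, 32))   # latent goal from the VAE encoder
z_late = add_noise(z0, T - 1, alphas_cumprod, rng)

# By the final timestep almost no signal remains: the latent is nearly pure noise.
assert alphas_cumprod[-1] < 1e-3
```

Training then teaches a denoiser to reverse this process, conditioned on both the preserved input latent and the edit head's guidance.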
Learning of MGIE
The following figure summarizes the algorithm of the learning process of the proposed MGIE framework.
As can be observed, the MLLM learns to derive concise expressive instructions through the instruction loss. Using the latent imagination from the input image and instruction, the framework transforms the modality via the edit head, guides the latent diffusion model to synthesize the resulting image, and applies the editing loss for diffusion training. Finally, the framework freezes the majority of the weights, resulting in parameter-efficient end-to-end training.
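The freezing step and the joint objective can be sketched as below. This is a hypothetical illustration: the parameter-group names, the prefixes marked trainable, and the loss weighting factor are assumptions for clarity, not MGIE's actual configuration.

```python
# Only the small added modules receive gradients; the large pre-trained
# backbones (LLM, VAE, base diffusion weights) stay frozen.
def freeze_for_peft(param_names):
    """Return {name: requires_grad} given illustrative parameter-group names."""
    trainable_prefixes = ("adapter.", "img_embed.", "edit_head.", "lora.")
    return {name: name.startswith(trainable_prefixes) for name in param_names}

def total_loss(instruction_loss, editing_loss, lam=1.0):
    """Joint objective: instruction cross-entropy plus weighted editing loss."""
    return instruction_loss + lam * editing_loss

grads = freeze_for_peft(["llm.layer0.w", "adapter.proj", "edit_head.w", "vae.enc"])
assert grads == {"llm.layer0.w": False, "adapter.proj": True,
                 "edit_head.w": True, "vae.enc": False}
```

Because gradients only flow into these small modules, the end-to-end training stays tractable even though both a large language model and a diffusion model sit in the loop.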
MGIE: Results and Evaluation
The MGIE framework uses the IPr2Pr dataset as its primary pre-training data; it contains over 1 million CLIP-filtered examples, with instructions generated by a GPT-3 model and images synthesized by a Prompt-to-Prompt model. The MGIE framework treats the InsPix2Pix framework, built upon the CLIP text encoder with a diffusion model, as its baseline for instruction-based image editing tasks. The evaluation also considers LGIE, an LLM-guided image editing model adapted to derive expressive instructions from instruction-only inputs, without visual perception.
Quantitative Analysis
The following figure summarizes the editing results in a zero-shot setting, with the models trained only on the IPr2Pr dataset. For the GIER and EVR data involving Photoshop-style modifications, the expressive instructions can reveal concrete goals instead of ambiguous commands, which allows the editing results to better resemble the editing intentions.
Although both LGIE and MGIE are trained on the same data as the InsPix2Pix model, they can offer detailed explanations via learning with the large language model; still, LGIE is confined to a single modality. The MGIE framework, by contrast, provides a significant performance boost because it has access to images and can use them to derive explicit instructions.
To evaluate the performance on instruction-based image editing tasks for specific purposes, the developers fine-tune several models on each dataset, as summarized in the following table.
As can be observed, after adapting to the Photoshop-style editing tasks of EVR and GIER, the models demonstrate a boost in performance. However, it is worth noting that since fine-tuning also makes the expressive instructions more domain-specific, the MGIE framework sees a massive boost in performance as it learns domain-related guidance, allowing the diffusion model to render concrete edited scenes from the fine-tuned large language model, benefitting both local modification and global optimization. Furthermore, since the visual-aware guidance is more aligned with the intended editing goals, the MGIE framework consistently delivers superior results compared to LGIE.
The following figure demonstrates the CLIP-S score between the input or ground-truth goal images and the expressive instruction. A higher CLIP-S score indicates greater relevance of the instruction to the editing source, and as can be observed, MGIE achieves a higher CLIP-S score than the LGIE model for both the input and the goal images.
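For intuition, a CLIP-S-style score is essentially a scaled cosine similarity between a CLIP image embedding and a CLIP text embedding. The sketch below uses toy 3-dimensional vectors in place of real CLIP embeddings; the scaling factor w = 2.5 follows the common CLIPScore convention, and everything else is illustrative.

```python
import numpy as np

def clip_s(image_emb, text_emb, w=2.5):
    """CLIPScore-style relevance: w * max(cos(image, text), 0).

    image_emb / text_emb stand in for CLIP image and text embeddings.
    """
    cos = float(image_emb @ text_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Toy vectors: an "image", an instruction aligned with it, and an unrelated one.
img = np.array([1.0, 0.0, 0.0])
aligned = np.array([0.9, 0.1, 0.0])
unrelated = np.array([0.0, 1.0, 0.0])
assert clip_s(img, aligned) > clip_s(img, unrelated)
```

A higher score for the aligned instruction is exactly the behavior the figure reports: MGIE's expressive instructions sit closer to the editing source in CLIP space than LGIE's.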
Qualitative Results
The following image perfectly summarizes the qualitative analysis of the MGIE framework.
As we know, the LGIE framework is limited to a single modality, leaving it with a single language-based insight and making it prone to deriving wrong or irrelevant explanations for editing the image. The MGIE framework, being multimodal with access to images, completes the editing tasks and provides explicit visual imagination that aligns well with the goal.
Final Thoughts
In this article, we have talked about MGIE or MLLM Guided Image Editing, an MLLM-inspired study that evaluates Multimodal Large Language Models and analyzes how they facilitate editing using text or guided instructions, while simultaneously learning to provide explicit guidance by deriving expressive instructions. The MGIE editing model captures visual information and performs editing or manipulation using end-to-end training. Instead of relying on ambiguous and brief guidance, the MGIE framework produces explicit visual-aware instructions that result in reasonable image editing.