Problem: Current large multimodal models (LMMs) struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as comparing the lengths of lines or navigating simple mazes. In particular, this failure mode persists in tasks about vector graphics—images composed purely of 2D objects and shapes.
Method: To address this challenge, we propose the Visually Descriptive Language Model (VDLM), a text-based framework for visual reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) for a more precise initial perception of the visual input. Since existing language models cannot understand raw SVG in a zero-shot setting, VDLM bridges SVG and pretrained language models with a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), which comprises primitive attributes (e.g., shape, position, measurement). By casting an image into a text-based representation, we can leverage the power of language models to learn the alignment from SVG to visual primitives and to generalize to unseen tasks and domains.
Performance: VDLM achieves stronger zero-shot performance than state-of-the-art LMMs, such as GPT-4V, across various low-level multimodal perception and reasoning tasks on vector graphics. VDLM also offers better interpretability thanks to its disentangled perception and reasoning processes.
We leverage an off-the-shelf, rule-based image-to-SVG parsing algorithm, VTracer, to convert any image into SVG without learning. This gives us an accurate initial perception of the input vector graphics. However, we observe two key challenges when working with the raw SVG representation. First, off-the-shelf LLMs, e.g., GPT-4, have limited zero-shot reasoning ability over raw SVG. Second, fine-tuning on task-specific ⟨SVG, question, answer⟩ pairs limits generalization to unseen tasks and domains. We describe our approach for extracting an intermediate representation below.
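As a minimal sketch of this first step, the snippet below calls the VTracer command-line tool to trace a raster image into SVG. It assumes the `vtracer` binary is installed and on the PATH; the exact parameter settings used in VDLM may differ, and the input filename is hypothetical.

```python
import subprocess
from pathlib import Path

def image_to_svg(image_path: str, svg_path: str) -> str:
    """Convert a raster image to SVG with the rule-based VTracer tool (no learning involved).

    Assumes the `vtracer` CLI is available; --input/--output are its standard flags.
    """
    subprocess.run(
        ["vtracer", "--input", image_path, "--output", svg_path],
        check=True,
    )
    return Path(svg_path).read_text()

svg_text = image_to_svg("maze.png", "maze.svg")  # hypothetical input image
print(svg_text[:200])  # raw SVG paths, which are hard for an LLM to reason over directly
```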
We propose Primal Visual Description (PVD), a higher level scene representation that bridges low-level SVG paths to more structured primitives required for reasoning. PVD is a text-based visual description that consists of a set of primitive geometry objects, e.g., circles, line segments. Each PVD element contains the primitives' attributes (e.g., color, shape, position, size) with corresponding predicted values (e.g., blue, circle, pixel coordinates of the center, length of the radius). See Figure 2 for the ontology we defined. Notably, unlike raw SVG, PVD is directly interpretable by state-of-the-art LLMs, enabling zero-shot reasoning on downstream tasks.
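To make the representation concrete, here are two illustrative PVD elements. The field names and value formats are a simplified assumption for illustration; the exact ontology is the one defined in Figure 2.

```python
# Illustrative PVD elements (hypothetical schema; see Figure 2 for the actual ontology).
pvd_circle = {
    "type": "circle",
    "center": [112, 96],     # pixel coordinates of the center
    "radius": 30,            # radius in pixels
    "color": [0, 0, 255],    # RGB, i.e., blue
    "style": "filled shape",
}

pvd_segment = {
    "type": "line_segment",
    "vertices": [[20, 40], [180, 40]],  # endpoints in pixel coordinates
    "color": [0, 0, 0],
    "length": 160,
}
```

Because each element is plain structured text, a state-of-the-art LLM can read it directly and, for example, compare the two objects' sizes without any SVG-specific training.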
Since SVG is text-based, we can effectively learn an SVG-to-PVD model by fine-tuning a pretrained language model (Mistral-7B-v0.1). To obtain the training data, we develop a data generator based on PIL.ImageDraw and VTracer, which creates a large-scale ⟨SVG, PVD⟩ paired dataset without any human annotation. See Figure 3 above for an input/output example. During inference, as shown in the Maze Solving example video, we first decompose the input image into single SVG paths and then feed each path individually into the SVG-to-PVD model.
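The sketch below illustrates, under simplifying assumptions, how such ⟨SVG, PVD⟩ pairs can be generated procedurally: render a random primitive with PIL.ImageDraw, trace it with VTracer (reusing the `image_to_svg` helper sketched above), and pair the resulting SVG with the known ground-truth PVD. The attribute ranges and output schema are illustrative only; the released generator covers the full PVD ontology, not just circles.

```python
import json
import random
from PIL import Image, ImageDraw

def generate_circle_example(canvas_size: int = 224) -> dict:
    """Render one random circle, trace it to SVG, and emit the paired PVD ground truth.

    A simplified sketch of the annotation-free data generation idea; not the exact
    generator used to build PVD-160K.
    """
    cx = random.randint(40, canvas_size - 40)
    cy = random.randint(40, canvas_size - 40)
    r = random.randint(10, 35)
    color = tuple(random.randint(0, 255) for _ in range(3))

    # Draw the primitive on a blank canvas.
    img = Image.new("RGB", (canvas_size, canvas_size), "white")
    ImageDraw.Draw(img).ellipse([cx - r, cy - r, cx + r, cy + r], fill=color)
    img.save("sample.png")

    # Trace to SVG and pair it with the known ground-truth PVD.
    svg = image_to_svg("sample.png", "sample.svg")
    pvd = {"type": "circle", "center": [cx, cy], "radius": r, "color": list(color)}
    return {"input": svg, "output": json.dumps(pvd)}  # one ⟨SVG, PVD⟩ training pair
```

Fine-tuning the language model on many such pairs then amounts to standard supervised sequence-to-sequence training, with the SVG string as input and the PVD string as target; the released PVD-160K dataset and PVD-160k-Mistral-7b checkpoint (linked below) are the artifacts of this step.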
Given an unseen task, we first use the aforementioned visual perception modules to generate a fully text-based visual description of the input vector graphics. We then leverage an off-the-shelf LLM, e.g., GPT-4, to interpret the perception results and reason about the task query. An example of the full input prompt and GPT-4 response for the 2×2 Maze Solving example can be viewed below.
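As a rough sketch of this reasoning step, the snippet below serializes the perceived PVD objects into a prompt and queries an off-the-shelf LLM through the OpenAI Python client. The prompt wording and model name are illustrative assumptions, not the exact prompt used in VDLM.

```python
import json
from openai import OpenAI  # assumes the `openai` package is installed and an API key is configured

client = OpenAI()

def reason_over_pvd(pvd_objects: list, question: str) -> str:
    """Serialize the perceived primitives into text and ask an off-the-shelf LLM
    to answer the task query. Prompt wording is illustrative only."""
    prompt = (
        "You are given a description of an image as a list of primitive objects:\n"
        f"{json.dumps(pvd_objects, indent=2)}\n\n"
        f"Question: {question}\n"
        "Think step by step, then give the final answer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with the illustrative PVD elements from above:
# answer = reason_over_pvd([pvd_circle, pvd_segment], "Which object is longer, the segment or the circle's diameter?")
```

Because perception and reasoning are disentangled in this way, the intermediate PVD can be inspected directly when the final answer is wrong, which is what gives VDLM its interpretability.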
We construct an evaluation benchmark comprising 9 tasks that cover important aspects of low-level visual perception and vision-language reasoning, including measurements, spatial relations, counting, logical reasoning, and complex reasoning problems. See Figure 4 for task examples.
VDLM outperforms both open-source and closed-source state-of-the-art Large Multimodal Models, including LLaVA-1.5 and GPT-4V, demonstrating the effectiveness of its text-based, disentangled framework in achieving precise low-level perception and reasoning. VDLM also outperforms previous visual programming methods such as ViperGPT, indicating that those methods are limited by the capability of their underlying vision-language processors, such as GLIP and BLIP-2, especially in perceiving low-level primitives such as angles and shapes.
💻 Code: VDLM Code
🍉 Demo (Jupyter Notebook): VDLM Demo
🤗 Pretrained SVG-to-PVD Model: PVD-160k-Mistral-7b
🤗 SVG-to-PVD Dataset: PVD-160K
@misc{wang2024textbased,
      title={Text-Based Reasoning About Vector Graphics},
      author={Zhenhailong Wang and Joy Hsu and Xingyao Wang and Kuan-Hao Huang and Manling Li and Jiajun Wu and Heng Ji},
      year={2024},
      eprint={2404.06479},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}