Text-Based Reasoning About Vector Graphics

1University of Illinois Urbana-Champaign, 2Stanford University

Abstract

Problem: Current large multimodal models (LMMs) struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as comparing the lengths of lines or navigating simple mazes. In particular, this failure mode persists in tasks about vector graphics—images composed purely of 2D objects and shapes.

Method: To address this challenge, we propose the Visually Descriptive Language Model (VDLM), a text-based framework for visual reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) as a more precise initial perception of visual inputs. Because existing language models cannot understand raw SVG in a zero-shot setting, VDLM bridges SVG with pretrained language models through a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), which comprises primitive attributes (e.g., shape, position, measurement). By casting an image into a text-based representation, we can leverage the power of language models to learn the alignment from SVG to visual primitives and generalize to unseen tasks and domains.

Performance: VDLM achieves stronger zero-shot performance than state-of-the-art LMMs, such as GPT-4V, on various low-level multimodal perception and reasoning tasks involving vector graphics. VDLM also offers better interpretability thanks to its disentangled perception and reasoning processes.

Figure 1: Comparison of an existing monolithic LMM and VDLM. The example is from the NLVR dataset.

Method

VDLM zero-shot inference example on the 2×2 Maze Solving task.



Encoding images into SVG to preserve low-level details

We leverage an off-the-shelf, rule-based image-to-SVG parsing algorithm, VTracer, to convert any image into SVG without learning. This gives us an accurate initial perception of the input vector graphics. However, we observe two key challenges when working with the raw SVG representation. First, off-the-shelf LLMs, e.g., GPT-4, have limited zero-shot reasoning ability over raw SVG. Second, fine-tuning on task-specific ⟨SVG, question, answer⟩ pairs limits generalization to unseen tasks and domains. We discuss our approach for extracting intermediate representations below.
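
For concreteness, the snippet below is a minimal sketch of this conversion step, assuming the vtracer Python bindings are installed; the file names and settings are illustrative rather than the exact configuration used by VDLM.

    # Minimal sketch: raster image -> SVG with the vtracer Python bindings.
    # Settings are illustrative; VDLM's exact VTracer configuration may differ.
    import vtracer

    vtracer.convert_image_to_svg_py(
        "scene.png",         # input raster image (hypothetical file name)
        "scene.svg",         # output SVG file
        colormode="color",   # keep color so PVD attributes can be recovered later
    )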

Learning an intermediate symbolic representation to enable text-based reasoning

Figure 2: Ontology of the PVD primitives.

We propose Primal Visual Description (PVD), a higher-level scene representation that bridges low-level SVG paths to the more structured primitives required for reasoning. PVD is a text-based visual description consisting of a set of primitive geometric objects, e.g., circles and line segments. Each PVD element contains a primitive's attributes (e.g., color, shape, position, size) with corresponding predicted values (e.g., blue, circle, the pixel coordinates of the center, the length of the radius). See Figure 2 for the ontology we defined. Notably, unlike raw SVG, PVD is directly interpretable by state-of-the-art LLMs, enabling zero-shot reasoning on downstream tasks.
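
To make this concrete, below is what a single PVD element might look like when written out as a Python dictionary. The field names and serialization are illustrative; the released SVG-to-PVD model defines its own exact output schema.

    # Illustrative PVD element for one primitive (a blue circle).
    # Field names are illustrative; the released model's output schema may differ.
    pvd_circle = {
        "type": "circle",
        "center": [112, 87],    # pixel coordinates (x, y)
        "radius": 24,           # in pixels
        "color": [0, 0, 255],   # RGB value (blue)
        "style": "filled shape",
    }

    # A full PVD scene is simply a list of such primitives,
    # one per decomposed SVG path.
    pvd_scene = [pvd_circle]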

Figure 3: An example of the input and output of the SVG-to-PVD model.

Since SVG is text-based, we can effectively learn an SVG-to-PVD model by fine-tuning a pretrained language model (Mistral-7B-v0.1). To obtain the training data, we develop a data generator built on PIL.ImageDraw and VTracer, which creates a large-scale ⟨SVG, PVD⟩ paired dataset without any human annotation. See Figure 3 above for an input/output example. During inference, as shown in the Maze Solving example video, we first decompose the input image into individual SVG paths and then feed each path into the SVG-to-PVD model.
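
The sketch below illustrates how one such ⟨SVG, PVD⟩ pair could be produced: render a known primitive with PIL.ImageDraw, parse the rendering back into SVG with VTracer, and pair it with the ground-truth description. This is a simplified, assumed version of the generator; the actual pipeline randomizes shapes, colors, positions, and sizes over the full PVD ontology.

    # Simplified sketch of generating one <SVG, PVD> training pair.
    # The real generator covers the full PVD ontology with randomized attributes;
    # this fixed blue circle is for illustration only.
    import json
    from PIL import Image, ImageDraw
    import vtracer

    # 1. Render a primitive whose attributes we control, so the PVD label is known.
    img = Image.new("RGB", (224, 224), "white")
    draw = ImageDraw.Draw(img)
    (cx, cy), radius, color = (112, 87), 24, (0, 0, 255)
    draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius], fill=color)
    img.save("sample.png")

    # 2. Parse the rendering into SVG with VTracer (no learning involved).
    vtracer.convert_image_to_svg_py("sample.png", "sample.svg")
    svg_text = open("sample.svg").read()

    # 3. Pair the SVG with the ground-truth PVD; no human annotation is needed.
    pvd = [{"type": "circle", "center": [cx, cy], "radius": radius, "color": list(color)}]
    training_pair = {"input": svg_text, "output": json.dumps(pvd)}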

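At inference time, each decomposed SVG path is fed to the fine-tuned model individually. The sketch below shows one assumed way to do this with the Hugging Face transformers library; the repository id is a placeholder and the prompt template is illustrative, so consult the released PVD-160k-Mistral-7b model card for the exact format.

    # Assumed sketch: querying the fine-tuned SVG-to-PVD model on one SVG path.
    # "<org>/PVD-160k-Mistral-7b" is a placeholder repo id, and the prompt below
    # is illustrative; follow the released model card for the real template.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "<org>/PVD-160k-Mistral-7b"  # placeholder; see Resources below
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    def svg_path_to_pvd(svg_path_str: str) -> str:
        # One decomposed <path> element goes in; a PVD object (as text) comes out.
        prompt = f"Describe the following SVG path as a PVD object:\n{svg_path_str}\n"
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=256)
        return tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
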
Zero-shot task generalization with off-the-shelf LLMs

Given an unseen task, we first use our visual perception modules, as described above, to generate a fully text-based visual description of the input vector graphics. We then leverage an off-the-shelf LLM, e.g., GPT-4, to interpret the perception results and reason about the task query. The full input prompt and GPT-4 response for the 2×2 Maze Solving example can be viewed below.

Maze Example Step 1

Maze Example Step 2

Maze Example Step 3
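
To show how the pieces fit together, the snippet below sketches how the PVD perception results could be packed into a prompt for an off-the-shelf LLM. The prompt wording is illustrative rather than VDLM's exact template, and query_llm is a hypothetical helper standing in for whichever chat-completion API is used.

    # Sketch of the reasoning step: serialize the PVD scene into a prompt and
    # hand it to an off-the-shelf LLM. The template is illustrative, not the
    # exact VDLM prompt, and query_llm is a hypothetical helper.
    import json

    def build_reasoning_prompt(pvd_scene, task_question):
        scene_text = json.dumps(pvd_scene, indent=2)
        return (
            "You are given a description of an image as a list of primitive "
            "geometric objects with their attributes:\n"
            f"{scene_text}\n\n"
            f"Question: {task_question}\n"
            "Reason step by step, then give the final answer."
        )

    prompt = build_reasoning_prompt(
        pvd_scene=[{"type": "line_segment",
                    "endpoints": [[16, 16], [16, 112]],
                    "color": [0, 0, 0]}],
        task_question="Starting from the entrance, can you reach the exit of the maze?",
    )
    # answer = query_llm(prompt)  # hypothetical call to GPT-4 or another LLM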

Performance

Tasks

Figure 4: Our full evaluation benchmark, composed of 9 zero-shot tasks on vector graphics.

We construct an evaluation benchmark comprising 9 tasks that cover important aspects of low-level visual perception and vision-language reasoning, including measurements, spatial relations, counting, logical reasoning, and complex reasoning problems. See Figure 4 for task examples.

Results

Figure 5: Zero-shot accuracy on the 9 tasks.

VDLM outperforms both open- and closed-source state-of-the-art LMMs, including LLaVA-1.5 and GPT-4V, demonstrating the effectiveness of its text-based, disentangled framework in achieving precise low-level perception and reasoning. VDLM also outperforms previous visual programming methods such as ViperGPT, indicating that those methods are limited by the capabilities of their underlying vision-language processors, such as GLIP and BLIP-2, especially when processing low-level primitives such as angles and shapes.

Resources

💻 Code: VDLM Code

🍉 Demo (Jupyter Notebook): VDLM Demo

🤗 Pretrained SVG-to-PVD Model: PVD-160k-Mistral-7b

🤗 SVG-to-PVD Dataset: PVD-160K

BibTeX


      @misc{wang2024textbased,
        title={Text-Based Reasoning About Vector Graphics}, 
        author={Zhenhailong Wang and Joy Hsu and Xingyao Wang and Kuan-Hao Huang and Manling Li and Jiajun Wu and Heng Ji},
        year={2024},
        eprint={2404.06479},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
      }