From Captions to Queries: A VLM Pipeline for Queryable Metadata in Synthetic Environments

May C, Shaji AR, Franke J, Reitelshöfer S (2026)

Publication Type: Conference contribution

Publication year: 2026

Publisher: SciTePress

Pages Range: 538-547

Conference Proceedings Title: Proceedings of the 21st International Conference on Computer Vision Theory and Applications - Volume 3: VISAPP

Event location: Marbella

DOI: 10.5220/0014417600004084

Abstract

The growing demand for detailed simulated environments in robotics has increased the need for automated methods to manage 3D assets. However, current Vision-Language Model (VLM) annotation methods are designed to generate descriptive text captions, primarily for training generative models. This output is insufficient for automated simulation workflows, which cannot query assets by specific, structured attributes like object class, size, or material. Here we present a fully automated pipeline that addresses this gap by generating both natural-language descriptions and this essential, structured, queryable metadata. Our pipeline uses standardized multi-view rendering and a multi-stage VLM process to extract and consolidate asset attributes. Evaluations on 1,000 Objaverse and AI-generated assets show that our pipeline's semantic descriptions are comparable to existing captioning-focused methods, while additionally extracting structured attributes with an overall accuracy of ~76%. By embedding this structured metadata in a vector database, our pipeline enables the hybrid, similarity-based, and attribute-filtered retrieval required for scalable robotics simulation.
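
The hybrid retrieval mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not name the vector database, the embedding model, or the metadata schema, so the `Asset` fields, the `hybrid_search` helper, and the three-dimensional toy embeddings below are all hypothetical. The sketch only shows the general idea of filtering on exact structured attributes first and then ranking the remaining candidates by cosine similarity.

```python
import math
from dataclasses import dataclass


@dataclass
class Asset:
    # Hypothetical schema: the abstract mentions attributes such as
    # object class, size, and material, but gives no field names.
    name: str
    object_class: str
    material: str
    embedding: list[float]  # toy vector standing in for a real text embedding


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(assets, query_embedding, filters, top_k=3):
    """Keep assets matching every attribute filter, then rank by similarity."""
    candidates = [
        a for a in assets
        if all(getattr(a, key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda a: cosine(a.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Illustrative data only; none of these assets come from the paper.
ASSETS = [
    Asset("oak_chair", "chair", "wood", [0.9, 0.1, 0.0]),
    Asset("steel_chair", "chair", "metal", [0.8, 0.2, 0.1]),
    Asset("pine_table", "table", "wood", [0.1, 0.9, 0.0]),
    Asset("walnut_stool", "chair", "wood", [0.5, 0.5, 0.2]),
]

results = hybrid_search(
    ASSETS,
    query_embedding=[1.0, 0.0, 0.0],
    filters={"object_class": "chair", "material": "wood"},
)
print([a.name for a in results])  # ['oak_chair', 'walnut_stool']
```

In a real deployment the filtering and ranking would typically be delegated to the vector database itself rather than done in application code; the in-memory list here is only a stand-in.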

Authors with CRIS profile

Christopher May, Lehrstuhl für Fertigungsautomatisierung und Produktionssystematik (FAPS)
Jörg Franke, Lehrstuhl für Fertigungsautomatisierung und Produktionssystematik (FAPS)
Sebastian Reitelshöfer, Lehrstuhl für Fertigungsautomatisierung und Produktionssystematik (FAPS)

How to cite

APA:

May, C., Shaji, A.R., Franke, J., & Reitelshöfer, S. (2026). From Captions to Queries: A VLM Pipeline for Queryable Metadata in Synthetic Environments. In Proceedings of the 21st International Conference on Computer Vision Theory and Applications - Volume 3: VISAPP (pp. 538-547). Marbella, ES: SciTePress.

MLA:

May, Christopher, et al. "From Captions to Queries: A VLM Pipeline for Queryable Metadata in Synthetic Environments." Proceedings of the 21st International Conference on Computer Vision Theory and Applications, Marbella: SciTePress, 2026. 538-547.
