Vision Language Model Guided Zero-shot Classification
Abstract
Among the core tasks in Computer Vision, 2D image and 3D object classification are fundamental, serving as the foundation for numerous applications including scene understanding, robotics, and autonomous navigation. Vision-Language Models (VLMs) are deep learning architectures designed to process and understand visual and textual information simultaneously. This thesis takes a close look at Vision-Language Models in classification tasks, with a particular emphasis on zero-shot settings in both 2D and 3D scenarios. We provide a comprehensive overview of Vision-Language Models, focusing on their pretraining datasets, architectural components, learning strategies, and representative models. Comparing against supervised 2D approaches, including few-shot learning, and conventional 3D classification methods, we conduct in-depth experiments and analyses from multiple perspectives, including classification performance, semantic clustering, and computational efficiency.
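To make the zero-shot setting concrete: a VLM such as CLIP classifies an image by embedding it alongside natural-language prompts for each candidate class and picking the class whose prompt is most similar, with no task-specific training. The sketch below illustrates this with the Hugging Face transformers library; the model checkpoint, label prompts, and image path are illustrative assumptions, not the thesis's exact experimental setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the thesis may evaluate different VLMs.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical class prompts and input image.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores scaled by CLIP's
# learned temperature; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

The same recipe extends to 3D objects by replacing the image encoder with a point-cloud or multi-view encoder aligned to the same text embedding space, which is the kind of pipeline the thesis's 3D zero-shot experiments examine.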