
Released

Paper

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

MPS-Authors

Kortylewski, Adam
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society

Fulltext (public)

arXiv:2406.09613.pdf
(Preprint), 9MB

Citation

Ma, W., Zeng, G., Zhang, G., Liu, Q., Zhang, L., Kortylewski, A., et al. (2024). ImageNet3D: Towards General-Purpose Object-Level 3D Understanding. Retrieved from https://arxiv.org/abs/2406.09613.


Cite as: https://hdl.handle.net/21.11116/0000-0010-2932-8
Abstract
A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D information (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and, most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, and 3D location annotations, as well as image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we can (i) analyze the object-level 3D awareness of visual foundation models, (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning. In addition to standard classification and pose estimation, we consider two new tasks: probing of object-level 3D awareness and open-vocabulary pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset for building vision models with stronger general-purpose object-level 3D understanding.
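
To make the annotation types listed in the abstract concrete, the following is a minimal Python sketch of what a single object-level record could contain, assuming a viewpoint parameterization by azimuth, elevation, in-plane rotation, and distance (a common convention in object pose estimation). All class and field names here are illustrative assumptions, not the actual ImageNet3D schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch of one object-level annotation record.
# Field names and conventions are assumptions for illustration;
# the released dataset defines its own schema.
@dataclass
class ObjectAnnotation3D:
    category: str                                # one of the 200 ImageNet categories
    bbox_2d: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    azimuth: float                               # 3D viewpoint angles, in radians
    elevation: float
    theta: float                                 # in-plane rotation
    distance: float                              # distance from the camera
    location_3d: Tuple[float, float, float]      # 3D object location
    caption: str                                 # image caption interleaved with 3D information

# Example instance with made-up values.
ann = ObjectAnnotation3D(
    category="car",
    bbox_2d=(120.0, 80.0, 340.0, 210.0),
    azimuth=1.05,
    elevation=0.12,
    theta=0.0,
    distance=6.5,
    location_3d=(0.4, -0.1, 6.5),
    caption="A car facing roughly 60 degrees to the left of the camera.",
)
```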