
Released

Conference Paper

Decoupling Zero-Shot Semantic Segmentation

MPS-Authors

Dai, Dengxin
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society

Citation

Ding, J., Xue, N., Xia, G.-S., & Dai, D. (2022). Decoupling Zero-Shot Semantic Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11573-11582). Piscataway, NJ: IEEE. doi:10.1109/CVPR52688.2022.01129.


Cite as: https://hdl.handle.net/21.11116/0000-000A-16BD-9
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that
have not been seen during training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only on text. While simple, this pixel-level formulation has limited capability
to integrate vision-language models, which are typically pre-trained on
image-text pairs and currently show great potential for vision tasks. Inspired
by the observation that humans often perform segment-level semantic labeling,
we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping
task that groups pixels into segments, and 2) a zero-shot classification task
on those segments. The former sub-task involves no category information and can
be directly transferred to group pixels of unseen classes. The latter sub-task
operates at the segment level and provides a natural way to leverage
large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP)
for ZS3. Based on this decoupled formulation, we propose a simple and effective
zero-shot semantic segmentation model, called ZegFormer, which outperforms
previous methods on standard ZS3 benchmarks by large margins, e.g., 35 mIoU
points on PASCAL VOC and 3 mIoU points on COCO-Stuff for unseen classes. Code
will be released at https://github.com/dingjiansw101/ZegFormer.
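
To make the decoupling concrete, below is a minimal, illustrative sketch of the
second sub-task only: segment-level zero-shot classification. It assumes the
class-agnostic masks already come from some grouping model (e.g., a mask
proposal network), pools them into segment embeddings, and matches those
against text embeddings of class names in CLIP style. This is not the
ZegFormer implementation; the function name, tensor shapes, mask-average
pooling, and temperature value are assumptions made for illustration. See the
official repository linked above for the actual code.

# Conceptual sketch of segment-level zero-shot classification (not ZegFormer).
import torch
import torch.nn.functional as F

def zero_shot_segment(image_feats, segment_masks, text_embeds, temperature=0.01):
    """Classify class-agnostic segments against class-name text embeddings.

    image_feats:   (C, H, W) dense visual features for one image.
    segment_masks: (N, H, W) binary masks from a class-agnostic grouping model.
    text_embeds:   (K, C) L2-normalized embeddings of K class names
                   (seen + unseen), e.g. from a pre-trained text encoder.
    Returns an (H, W) map of predicted class indices.
    """
    C, H, W = image_feats.shape
    N = segment_masks.shape[0]

    # 1) Pool one embedding per segment (mask-average pooling; an assumption).
    masks = segment_masks.float().view(N, -1)                  # (N, H*W)
    feats = image_feats.view(C, -1).t()                        # (H*W, C)
    seg_embeds = masks @ feats / masks.sum(dim=1, keepdim=True).clamp(min=1.0)

    # 2) Zero-shot classification of each segment by cosine similarity
    #    to the text embeddings, CLIP-style.
    seg_embeds = F.normalize(seg_embeds, dim=-1)               # (N, C)
    logits = seg_embeds @ text_embeds.t() / temperature        # (N, K)
    seg_classes = logits.argmax(dim=-1)                        # (N,)

    # 3) Paste segment-level labels back onto pixels.
    pred = torch.zeros(H, W, dtype=torch.long)
    for mask, cls in zip(segment_masks, seg_classes):
        pred[mask.bool()] = cls
    return pred

# Example call with random tensors (shapes only, no meaningful prediction):
#   feats = torch.randn(512, 64, 64)
#   masks = torch.rand(8, 64, 64) > 0.5
#   texts = F.normalize(torch.randn(21, 512), dim=-1)
#   pred  = zero_shot_segment(feats, masks, texts)   # (64, 64) class map

Because step 1 never uses category labels, the grouping model can be trained on
seen classes and applied unchanged to unseen ones, which is the point of the
decoupled formulation described in the abstract.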