
Released

Paper

Multimodal Image Synthesis and Editing: A Survey

MPS-Authors

Zhan, Fangneng
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society

Liu, Lingjie
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society

Kortylewski, Adam
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society

Theobalt, Christian
Visual Computing and Artificial Intelligence, MPI for Informatics, Max Planck Society

Fulltext (public)

arXiv:2112.13592.pdf (Preprint), 7 MB

Supplementary Material (public)
There is no public supplementary material available
Citation

Zhan, F., Yu, Y., Wu, R., Zhang, J., Lu, S., Liu, L., et al. (2022). Multimodal Image Synthesis and Editing: A Survey. Retrieved from https://arxiv.org/abs/2112.13592.


Cite as: https://hdl.handle.net/21.11116/0000-000C-72BF-D
Abstract
Information in the real world exists in diverse modalities, so effective interaction and fusion of multimodal information play a key role in the creation and perception of multimodal data in computer vision and deep learning research. Owing to its power in modelling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Rather than providing explicit guidance for network training, multimodal guidance offers an intuitive and flexible means of image synthesis and editing. On the other hand, the field also faces several challenges, such as aligning features across inherent modality gaps, synthesizing high-resolution images, and devising faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model architectures. We start with an introduction to the different types of guidance modalities used in image synthesis and editing. We then describe multimodal image synthesis and editing approaches in detail, covering frameworks based on Generative Adversarial Networks (GANs), auto-regressive models, diffusion models, Neural Radiance Fields (NeRF), and other methods. This is followed by a comprehensive description of the benchmark datasets and evaluation metrics widely adopted in multimodal image synthesis and editing, together with detailed comparisons of the various synthesis methods and analyses of their respective advantages and limitations. Finally, we provide insights into current research challenges and possible directions for future work. We hope this survey lays a sound and valuable foundation for the future development of multimodal image synthesis and editing. A project associated with this survey is available at https://github.com/fnzhan/MISE.