From Pixels to People

Omran, Mohamed

doi:10.22028/D291-36605

Released

Thesis

MPS-Authors

Omran, Mohamed
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;
International Max Planck Research School, MPI for Informatics, Max Planck Society;

External Resource

https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/33466
(Any fulltext)

Citation

Omran, M. (2021). From Pixels to People. PhD Thesis, Universität des Saarlandes, Saarbrücken. doi:10.22028/D291-36605.

Cite as: https://hdl.handle.net/21.11116/0000-000A-CDBF-9

Abstract

Humans are at the centre of a significant amount of research in computer vision.
Endowing machines with the ability to perceive people from visual data is an immense
scientific challenge with a high degree of direct practical relevance. Success in automatic
perception can be measured at different levels of abstraction, and this will depend on
which intelligent behaviour we are trying to replicate: the ability to localise persons in
an image or in the environment, understanding how persons are moving at the skeleton
and at the surface level, interpreting their interactions with the environment including
with other people, and perhaps even anticipating future actions. In this thesis we tackle
different sub-problems of the broad research area referred to as "looking at people",
aiming to perceive humans in images at different levels of granularity.
We start with bounding box-level pedestrian detection: We present a retrospective
analysis of methods published in the decade preceding our work, identifying various
strands of research that have advanced the state of the art. With quantitative experiments,
we demonstrate the critical role of developing better feature representations
and having the right training distribution. We then contribute two methods based
on the insights derived from our analysis: one that combines the strongest aspects of
past detectors and another that focuses purely on learning representations. The latter
method outperforms more complicated approaches, especially those based on hand-crafted
features. We conclude our work on pedestrian detection with a forward-looking
analysis that maps out potential avenues for future research.
We then turn to pixel-level methods: Perceiving humans requires us to both separate
them precisely from the background and identify their surroundings. To this end, we
introduce Cityscapes, a large-scale dataset for street scene understanding. This has since
established itself as a go-to benchmark for segmentation and detection. We additionally
develop methods that relax the requirement for expensive pixel-level annotations, focusing
on the task of boundary detection, i.e. identifying the outlines of relevant objects and
surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and
labelling to fine-grained spatial understanding. We contribute a method for recovering
3D human shape and pose, which marries the advantages of learning-based and model-based
approaches.
We conclude the thesis with a detailed discussion of benchmarking practices in
computer vision. Among other things, we argue that the design of future datasets
should be driven by the general goal of combinatorial robustness besides task-specific
considerations.