Free keywords:
Computer Science, Computer Vision and Pattern Recognition, cs.CV
Abstract:
We present an approach for real-time, robust, and accurate hand pose
estimation from moving egocentric RGB-D cameras in cluttered real environments.
Existing methods typically fail for hand-object interactions in cluttered
scenes imaged from egocentric viewpoints, which are common in virtual or
augmented reality applications. Our approach uses two Convolutional Neural
Networks (CNNs), applied in sequence, to localize the hand and regress 3D joint locations.
Hand localization is achieved by using a CNN to estimate the 2D position of the
hand center in the input, even in the presence of clutter and occlusions. The
localized hand position, together with the corresponding input depth value, is
used to generate a normalized cropped image that is fed into a second CNN to
regress relative 3D hand joint locations in real time. For added accuracy,
robustness, and temporal stability, we refine the pose estimates using a
kinematic pose tracking energy. To train the CNNs, we introduce a new
photorealistic dataset that uses a merged reality approach to capture and
synthesize large amounts of annotated data of natural hand interaction in
cluttered scenes. Through quantitative and qualitative evaluation, we show that
our method is robust to self-occlusions and occlusions by objects, particularly
from moving egocentric viewpoints.
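
To make the two-stage pipeline concrete, the sketch below shows how the first CNN's 2D hand-center estimate and its depth value can be turned into a normalized crop for the second CNN, and how the regressed hand-relative 3D joints are shifted back into camera coordinates. This is an illustrative reconstruction from the abstract, not the authors' code; the function names, the 0.3 m crop extent, the 128-pixel input resolution, and the pinhole intrinsics (fx, fy, cx, cy) are assumptions.

import numpy as np

def normalized_crop(depth_m, center_uv, center_z, fx, fy,
                    crop_size_m=0.3, out_px=128):
    # Cut a depth patch around the 2D hand center predicted by the first CNN
    # and normalize it for the second (joint-regression) CNN.
    u, v = center_uv
    # Project the assumed metric crop extent into pixels at the hand's depth.
    half_u = max(1, int(round(0.5 * crop_size_m * fx / center_z)))
    half_v = max(1, int(round(0.5 * crop_size_m * fy / center_z)))
    h, w = depth_m.shape
    u0, u1 = max(0, u - half_u), min(w, u + half_u)
    v0, v1 = max(0, v - half_v), min(h, v + half_v)
    patch = depth_m[v0:v1, u0:u1].astype(np.float32)

    # Express depth relative to the hand center and clip to the crop cube so the
    # second CNN sees a translation-invariant input, scaled to [-1, 1].
    patch = np.clip(patch - center_z, -0.5 * crop_size_m, 0.5 * crop_size_m)
    patch /= 0.5 * crop_size_m

    # Resize to the network's fixed input resolution (nearest-neighbour sampling
    # keeps this example dependency-free).
    rows = np.linspace(0, patch.shape[0] - 1, out_px).astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, out_px).astype(int)
    return patch[np.ix_(rows, cols)]

def joints_to_camera(rel_joints_m, center_uv, center_z, fx, fy, cx, cy):
    # Shift joint positions predicted relative to the hand center back into
    # absolute 3D camera coordinates via pinhole back-projection of the center.
    u, v = center_uv
    center_xyz = np.array([(u - cx) * center_z / fx,
                           (v - cy) * center_z / fy,
                           center_z], dtype=np.float32)
    return rel_joints_m + center_xyz  # (num_joints, 3) + (3,)

Cropping in metric units at the hand's depth, rather than at a fixed pixel size, keeps the second CNN's input roughly invariant to how far the hand is from the camera, which is what the depth-based normalization in the abstract is for.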
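
The kinematic refinement is only named in the abstract; a generic per-frame tracking energy of the following shape illustrates what such a stage typically minimizes (the exact terms and weights used in the paper may differ). Here \theta denotes the kinematic pose parameters of a hand model, p_j(\theta) the joint positions obtained by forward kinematics, \hat{p}_j the joints regressed by the second CNN, and \theta^{t-1} the pose from the previous frame:

E(\theta) = \sum_{j} \left\| p_j(\theta) - \hat{p}_j \right\|_2^2 + w_{\mathrm{limit}}\, E_{\mathrm{limit}}(\theta) + w_{\mathrm{temp}}\, \left\| \theta - \theta^{t-1} \right\|_2^2

Fitting a kinematic hand model to the per-frame CNN predictions enforces valid joint limits and bone lengths, and the temporal term ties consecutive frames together, which accounts for the added accuracy, robustness, and temporal stability claimed above.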