Hand pose estimation with convolutional networks using RGB-D data (Master's thesis)
In this work, we study the problem of 3D articulated hand pose estimation from RGB-D images, which consists of estimating all the kinematic parameters of a hand, expressed as joint angles or joint positions. Hand pose estimation is a very challenging problem due to the articulated nature of the human hand, which exhibits self-occlusions and large viewpoint variations. The popularization of RGB-D sensors has motivated the interest of the computer vision community in pose estimation, as depth images have significantly improved the performance of the related methods. Moreover, advances in deep learning have spurred this interest, and most recent approaches propose convolutional-network-based methods. The architecture of a convolutional network, its depth, and its training all play a crucial role in its performance. In the first part of our work, we design and evaluate several different convolutional network architectures. Our experiments show that the depth of the network plays a crucial role in performance, as our deepest convolutional network outperforms the state of the art. Most methods use single depth images for 3D hand pose estimation. Depth images are noisy and suffer from quantization errors that result in missing parts around the hand boundaries. We conjecture that combining depth images with RGB images, which provide a more accurate description of the hand surface through color and texture information, can further improve the performance of a convolutional network. Based on these observations, in the second part of our work we propose methods for fusing RGB and depth information using convolutional networks. We propose three different approaches: input fusion, score-level fusion, and double-stream architecture fusion.
Input-level fusion aggregates the RGB-D data and trains a convolutional network on images containing both RGB and depth channels, while score-level fusion trains two separate convolutional networks on RGB and depth images respectively and fuses their predictions. Finally, double-stream architecture fusion trains two separate convolutional networks in parallel and fuses their feature maps at an arbitrary layer of the network using given feature-map fusion functions. We employ fusion functions proposed in state-of-the-art activity recognition methods. The performance of input fusion and score-level fusion is limited, as they are applied at a very early and a very late stage of the network, respectively. We employed double-stream fusion to mitigate this problem, since the fusion takes place inside the network and lets subsequent learning stages define correspondences between RGB and depth features. Indeed, double-stream fusion outperforms input fusion and score-level fusion. Double-stream fusion achieves performance comparable to the state of the art; nevertheless, our deep convolutional network trained only with depth images outperforms double-stream fusion, providing state-of-the-art performance. From our experiments we conclude that RGB-D fusion does not contribute further useful information towards more accurate 3D hand pose estimation.
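The feature-map fusion functions used in double-stream fusion, as well as score-level fusion of predictions, can be illustrated with a minimal NumPy sketch. This is not the thesis's actual implementation; the tensor shapes, function names, and the choice of concatenation, sum, and max as fusion functions are illustrative assumptions (such functions appear in the activity-recognition fusion literature the thesis draws on):

```python
import numpy as np

# Hypothetical intermediate feature maps from the RGB and depth streams
# (channels x height x width); the shapes are purely illustrative.
rgb_feat = np.random.rand(64, 12, 12)
depth_feat = np.random.rand(64, 12, 12)

def concat_fusion(a, b):
    # Stack the channels of the two streams; subsequent layers can then
    # learn correspondences between RGB and depth features.
    return np.concatenate([a, b], axis=0)

def sum_fusion(a, b):
    # Element-wise sum; assumes the channels of the two streams correspond.
    return a + b

def max_fusion(a, b):
    # Element-wise maximum over the two streams.
    return np.maximum(a, b)

def score_level_fusion(pred_rgb, pred_depth):
    # Score-level fusion sketched as averaging the joint-position
    # predictions of two separately trained networks (an assumption).
    return 0.5 * (pred_rgb + pred_depth)

print(concat_fusion(rgb_feat, depth_feat).shape)  # (128, 12, 12)
print(sum_fusion(rgb_feat, depth_feat).shape)     # (64, 12, 12)
```

Concatenation doubles the channel count and leaves it to later layers to relate the two modalities, whereas sum and max fusion keep the channel count fixed but assume a channel-wise correspondence between the streams.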
|Institution and School/Department of submitter:||University of Ioannina. School of Sciences. Department of Computer Science & Engineering|
|Subject classification:||Computer science|
|Keywords:||Hand pose,Neural networks,Convolutional networks,Machine learning,Hand pose estimation,Deep learning,Convolutional networks,RGB-D|
|Appears in Collections:||Master's Research Theses (Masters)|
Files in This Item:
|Μ.Ε. ΚΑΖΑΚΟΣ ΕΥΑΓΓΕΛΟΣ 2017.pdf||2.3 MB||Adobe PDF||View/Open|
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.