
Human pose estimation is a popular Computer Vision task with more than 20 years of history. This domain focuses on localizing human body joints (e.g. knees and wrists), also known as key points, in images or videos.
The entire human body can be represented with three different types of approaches:
- skeleton-based (kinematic) models, which describe the body as a set of joint positions and the connections between them,
- contour-based (planar) models, which capture the rough 2D shape and outline of the body and its parts,
- volume-based (volumetric) models, which represent the body as 3D geometry, e.g. shapes or meshes.
The human pose estimation task aims to first form a skeleton-based representation and then process it according to the needs of the final application. From now on, we will discuss skeleton-based datasets, metrics, and solutions. We will focus on 2D human pose estimation, because 3D methods often start from 2D predictions and then lift them into 3-dimensional space.
Main problems in Human Pose Estimation circle around:
- occlusion of body parts by objects, clothing, or other people,
- crowded scenes with many overlapping person instances,
- rare or extreme poses that are underrepresented in training data,
- variability in lighting, clothing, and viewpoint.
To mitigate these issues, practitioners either develop new methods or build challenging datasets in order to train models that are robust to such problems. Later on, we will explore some solutions that tackle these problems via more elaborate architectural designs.
With the latest developments in Computer Vision and gyms closed for months on end, the popularity of applications that assist people with their home workouts has increased. In the background, each of these applications uses a human pose estimation model to track body movement, count repetitions, and suggest improvements to your technique.
As collecting data for Augmented Reality and CGI with motion capture technology is expensive and time-consuming, human pose estimation models are often used as a cheaper, yet still accurate, way to create avatars for movies or Augmented Reality applications. Disney Research is exploring avatar generation from human pose estimation for a more lifelike experience in AR applications.
To make the gaming experience more immersive, various body-tracking cameras have been proposed. Behind the scenes, Microsoft Kinect uses 3D human pose estimation to track the player's movements and render the actions of the in-game character. In the picture below you can see a toddler playing a bird-flying game.
The rise of cashier-less shops like Amazon Go is possible thanks to advancements in Human Pose Estimation. By analyzing and tracking customers, it is possible to detect which products, and in what quantity, were placed in a customer's basket, so they can leave the shop without waiting in a long line to pay.
Thanks to Human Pose Estimation, it is possible to detect violent and dangerous acts in cities. The cooperation between the Karlsruhe Institute of Technology and the Mannheim Police Headquarters will result in real-life tests of their solution in 2023.
The MPII dataset consists of 40k person instances, each labeled with 16 joints. The train and validation sets contain 22k and 3k person instances, respectively.
The COCO dataset consists of 200k images with 250k person instances labeled with 17 key points.
The difference between the labeling of body joints for MPII and COCO is presented below.
This dataset focuses on hard and occluded examples and was introduced in 2019 with the paper “Pose2Seg: Detection Free Human Instance Segmentation”. It consists of 5081 images with 10375 person instances, each suffering from heavy occlusion (MaxIoU > 0.5).
This metric is used in the MPII dataset: a detected joint is considered correct when the distance between the predicted location and the true location is within a certain threshold. To make the threshold relative to body size, it is defined as a fraction of the head segment length (hence the common name PCKh). The metric primarily used for MPII is PCKh@0.5, where only joints within a distance of 0.5 * head_bone_length are considered correctly detected.
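To make the definition concrete, here is a minimal NumPy sketch of PCKh (the array shapes and argument names are our own illustrative choices):

```python
import numpy as np

def pckh(pred, gt, head_bone_length, visible, alpha=0.5):
    """PCKh: fraction of visible joints whose prediction lands within
    alpha * head_bone_length of the ground-truth location.

    pred, gt: (num_people, num_joints, 2) arrays of (x, y) coordinates
    head_bone_length: (num_people,) per-person head segment lengths
    visible: (num_people, num_joints) boolean mask of annotated joints
    """
    # Euclidean distance between prediction and ground truth per joint
    dist = np.linalg.norm(pred - gt, axis=-1)      # (num_people, num_joints)
    # The threshold is relative to each person's head bone length
    threshold = alpha * head_bone_length[:, None]  # (num_people, 1)
    correct = (dist <= threshold) & visible
    return correct.sum() / visible.sum()
```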
$$OKS = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
OKS is the main metric of the COCO dataset, where \( d_i \) is the distance between the predicted and true location of keypoint i, \( v_i \) is the visibility flag of keypoint i, s is the object scale, and \( k_i \) is a per-keypoint constant controlling falloff (calculated by the COCO dataset researchers).
In simple words, OKS plays the same role for keypoints that IoU plays for Object Detection and Image Segmentation: it measures the similarity between a predicted and a ground-truth pose. Typically this metric is reported via Average Precision (AP@50, AP@75, and the average across 10 thresholds between @50 and @95) and Average Recall (with the same thresholds as for AP).
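A minimal NumPy sketch of OKS for a single pose, using the per-keypoint constants shipped with the COCO evaluation code:

```python
import numpy as np

# Per-keypoint sigmas published with the COCO evaluation code
# (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles);
# in the formula above, k_i = 2 * sigma_i.
COCO_SIGMAS = np.array([
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
])

def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between one predicted and one true pose.

    pred, gt: (17, 2) arrays of (x, y) keypoint coordinates
    visibility: (17,) array; v_i > 0 means the keypoint is labeled
    area: object segment area, so s^2 = area in the formula
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)  # squared distances d_i^2
    k = 2 * COCO_SIGMAS                     # per-keypoint falloff constants
    labeled = visibility > 0
    similarity = np.exp(-d2 / (2 * area * k ** 2))
    return similarity[labeled].sum() / labeled.sum()
```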
Throughout the history of Human Pose Estimation, there were multiple solutions based on classical Computer Vision, with a focus on body parts and changes in color and contrast. In the past few years, this area has been dominated by deep learning solutions, so in the following part we will focus on them.
Deep learning solutions can be distinguished into two branches:
- top-down approaches, which first detect every person in the image and then estimate keypoints within each bounding box separately,
- bottom-up approaches, which first detect all keypoints in the image and then group them into individual skeletons.
As stated before, within the top-down approaches there are multiple examples with favorable results. All of them learn to predict heatmaps of keypoint locations rather than regressing the coordinates directly, which has proved to give significantly better and more robust results. In this blog post, we will limit ourselves to the 3 most influential pose estimation architectures, which shaped the landscape of this branch of approaches.
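To make the heatmap representation concrete, here is a short NumPy sketch of the training target such networks regress (the heatmap size and sigma are arbitrary example values):

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=2.0):
    """Render a training target: a 2D Gaussian bump centered on a keypoint.

    center: (x, y) ground-truth keypoint location in heatmap coordinates
    """
    xs = np.arange(width)[None, :]   # (1, W)
    ys = np.arange(height)[:, None]  # (H, 1)
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# One heatmap per joint; the network is typically trained with an MSE loss
# between predicted and target heatmaps.
target = gaussian_heatmap(64, 48, center=(20, 31))
```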
The hourglass network stacks multiple modules with the very same structure. Each module first downsamples and then upsamples its input (which is what makes it look like an hourglass). Such an architecture captures both local context (e.g. where the wrist is) and global context (e.g. the orientation of the body). To make the learning process more successful, intermediate supervision is applied after each module, comparing the predicted heatmaps to the ground truth.
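A minimal PyTorch sketch of one such module, with the residual blocks of the original replaced by plain convolutions for brevity:

```python
from torch import nn

class Hourglass(nn.Module):
    """Sketch of one hourglass module: process at the current resolution
    (skip branch), recurse at half resolution, then merge both paths."""

    def __init__(self, depth, channels):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, padding=1)
        # Recurse until the lowest resolution is reached
        self.inner = (Hourglass(depth - 1, channels) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                               # same-resolution branch
        x = nn.functional.max_pool2d(x, 2)                # downsample
        x = self.up(self.inner(self.down(x)))             # process at lower scale
        x = nn.functional.interpolate(x, scale_factor=2)  # upsample back
        return x + skip                                   # merge local and global context
```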
Previous solutions used to go from high to low to high resolution, while this architecture maintains high resolution throughout the whole process, as you can see above. It starts with a single high-resolution branch, but with each depth step it adds further parallel scales, each receiving information from the higher, same, and lower resolutions of the previous step. With access to high-resolution features at every stage, HRNet managed to stay on top of the majority of HPE leaderboards.
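The key operation is the repeated exchange of information between the parallel branches. A simplified sketch of that exchange (the real HRNet uses strided and 1x1 convolutions for resampling and doubles channels as resolution halves; plain interpolation and equal channel counts are used here for brevity):

```python
from torch import nn

def exchange(features):
    """Multi-resolution exchange: every branch receives the (resampled)
    features of every other branch and sums them with its own.

    features: list of tensors, one per branch, each at a different
    spatial resolution but with the same channel count in this sketch.
    """
    fused = []
    for i, target in enumerate(features):
        acc = target.clone()
        for j, source in enumerate(features):
            if j == i:
                continue
            # Resize the source branch to the target branch's resolution
            acc = acc + nn.functional.interpolate(
                source, size=target.shape[-2:],
                mode="bilinear", align_corners=False)
        fused.append(acc)
    return fused
```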
With the advent of vision transformers and their increased popularity in Computer Vision, it was only a matter of time before a “Transformer for Pose Estimation” was proposed. The solution consists of an encoder built from a stack of transformer blocks (each a combination of Layer Normalization, Multi-Headed Self-Attention, and a Feed-Forward Network) and a decoder module. After the encoder extracts features, quite a simple decoder is used: two repetitions of a deconvolution layer followed by Batch Normalization and ReLU, and a final linear prediction layer. This network is simple to scale and does not require careful construction of convolutional layers with a calculated number of parameters, yet it still produces very strong results.
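As a rough illustration, the decoder described above can be sketched in PyTorch as follows (the channel sizes and kernel parameters here are our own assumptions, not the paper's exact configuration):

```python
from torch import nn

class SimpleDecoder(nn.Module):
    """Sketch of the simple decoder: two (deconv, BN, ReLU) blocks that
    upsample the transformer features 4x, followed by a 1x1 convolution
    acting as the per-pixel linear predictor of K joint heatmaps."""

    def __init__(self, embed_dim=768, num_joints=17):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_joints, kernel_size=1),  # heatmap predictor
        )

    def forward(self, tokens, h, w):
        # Reshape the (B, h*w, C) patch tokens back into a (B, C, h, w) map
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        return self.decoder(x)
```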
This solution also seems to work well for the multi-person pose estimation task with severe occlusions (it is the current leader of the OCHuman dataset).
As mentioned before, bottom-up approaches produce multiple skeletons at once, so they are often faster and more suitable for real-time use, and they also perform better in crowded scenes for multi-person pose estimation.
OpenPose is probably the most popular bottom-up model there is, as it was released as an open-source library in 2018. Its popularity is also due to the fact that it was one of the first reliable, widely available real-time human pose estimation solutions.
The architecture works as follows:
- a CNN backbone (the first layers of VGG-19) extracts feature maps from the whole image,
- multi-stage branches then iteratively predict confidence maps for body parts and Part Affinity Fields (PAFs), 2D vector fields that encode which detected parts belong to the same limb,
- finally, greedy bipartite matching over the PAF scores assembles the detected parts into per-person skeletons.
These characteristics make this solution suitable for real-time multi-person pose estimation.
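To give a feel for how PAFs drive the grouping step, here is a hedged NumPy sketch of scoring one candidate limb; the sampling scheme is simplified relative to the actual OpenPose implementation:

```python
import numpy as np

def limb_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Score a candidate limb between two detected keypoints by integrating
    the Part Affinity Field along the segment connecting them.

    paf_x, paf_y: (H, W) arrays, the 2D vector field for this limb type
    p1, p2: (x, y) coordinates of the two keypoint candidates
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    direction = direction / (np.linalg.norm(direction) + 1e-8)  # unit vector
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        # Sample a point along the segment between the two candidates
        x, y = (p1 + t * (p2 - p1)).round().astype(int)
        # Dot product between the PAF vector and the limb direction:
        # high when the field consistently points along the limb
        score += paf_x[y, x] * direction[0] + paf_y[y, x] * direction[1]
    return score / num_samples
```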
OmniPose is currently the best-performing bottom-up architecture, with quite a simple structure. It starts with two 3x3 convolutions followed by a ResNet bottleneck block. After that, 3 HRNet blocks follow, each enriched with Gaussian heatmap modulation (proposed in the DARK paper). This improvement changes how the final keypoint location is decoded: instead of taking the maximum-value pixel (as previous solutions did), it assumes that the heatmap of a keypoint follows a Gaussian distribution and estimates the center of that distribution.
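A NumPy sketch of this distribution-aware decoding idea, following the Taylor-expansion formulation of the DARK paper (the finite-difference details below are our own simplification):

```python
import numpy as np

def decode_subpixel(heatmap):
    """Refine the argmax of a heatmap to sub-pixel precision by assuming
    the heatmap is Gaussian: a second-order Taylor expansion of its log
    around the maximum pixel yields the offset to the Gaussian center."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)  # no room for finite differences
    logh = np.log(np.maximum(heatmap, 1e-10))
    # First derivatives (gradient) at the maximum pixel
    dx = 0.5 * (logh[y, x + 1] - logh[y, x - 1])
    dy = 0.5 * (logh[y + 1, x] - logh[y - 1, x])
    # Second derivatives (Hessian)
    dxx = logh[y, x + 1] - 2 * logh[y, x] + logh[y, x - 1]
    dyy = logh[y + 1, x] - 2 * logh[y, x] + logh[y - 1, x]
    dxy = 0.25 * (logh[y + 1, x + 1] - logh[y + 1, x - 1]
                  - logh[y - 1, x + 1] + logh[y - 1, x - 1])
    hessian = np.array([[dxx, dxy], [dxy, dyy]])
    if np.linalg.det(hessian) == 0:
        return float(x), float(y)
    # Offset to the Gaussian's center: -H^{-1} * gradient
    ox, oy = -np.linalg.solve(hessian, np.array([dx, dy]))
    return x + ox, y + oy
```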
The main contribution of the OmniPose architecture is the Waterfall Atrous Spatial Pyramid module (WASPv2), which can be seen below.
This module builds on the WASPv1 module (originally proposed in the UniPose paper), extending it with the idea of combining backbone features with low-level features. Thanks to all these improvements, this is one of the best-performing architectures today.
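To illustrate the waterfall idea in isolation (the dilation rates and channel counts below are illustrative assumptions, not the exact WASPv2 configuration):

```python
import torch
from torch import nn

class Waterfall(nn.Module):
    """Sketch of a waterfall of atrous convolutions: dilated convolutions
    are chained (each feeds the next) and all of their outputs are
    concatenated, giving large receptive fields without losing resolution."""

    def __init__(self, channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        outputs = []
        for conv in self.branches:
            x = torch.relu(conv(x))  # each branch feeds the next (the "waterfall")
            outputs.append(x)
        return self.project(torch.cat(outputs, dim=1))
```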
MoveNet is a popular bottom-up model, which uses heatmaps to accurately localize human key points. The architecture consists of:
- a MobileNetV2 feature extractor with an attached feature pyramid network, producing a high-resolution, semantically rich feature map,
- a set of CenterNet-style prediction heads attached to that feature map.
There are 4 prediction heads:
- a person center heatmap, predicting the geometric center of each person instance,
- a keypoint regression field, regressing a full set of keypoints for each person,
- a person keypoint heatmap, predicting the location of all keypoints independently of person instances,
- a 2D per-keypoint offset field, refining the coarse heatmap locations to sub-pixel precision.
This architecture was trained on the COCO dataset enriched with Google's private dataset, Activity, specializing in challenging fitness and yoga poses. Thanks to the novel architecture and a robust training dataset, this solution is quite stable under occlusions and unfamiliar poses. It was released by the Google team as an out-of-the-box solution for tf-js. The model itself comes in two variants: “Lightning”, with a focus on speed, and “Thunder”, aiming for higher accuracy (while still maintaining 30+ FPS). You can explore the live demo on their website.
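For a quick start from Python, here is a minimal example using the TF Hub release of the model (the Python counterpart of the tf-js package; the model URL is the one published on TF Hub at the time of writing):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the "Lightning" (speed-focused) variant from TF Hub
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

# Lightning expects a 192x192 int32 RGB image; a black frame stands in
# for a real video frame here
frame = tf.zeros((1, 480, 640, 3), dtype=tf.float32)
inputs = tf.cast(tf.image.resize_with_pad(frame, 192, 192), tf.int32)

# Output: a (1, 1, 17, 3) tensor of (y, x, confidence) per COCO keypoint,
# with coordinates normalized to [0, 1]
keypoints = movenet(inputs)["output_0"].numpy()
print(keypoints.shape)  # (1, 1, 17, 3)
```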
As you can see, there are multiple solutions available to choose from. From our side, we can suggest the following:
- if accuracy is your priority, especially under heavy occlusion, reach for a top-down transformer-based model,
- if you need real-time multi-person estimation, pick a bottom-up approach such as OpenPose or OmniPose,
- if you target mobile or in-browser fitness applications, MoveNet is an out-of-the-box choice that maintains 30+ FPS.
The Human Pose Estimation task is a challenging but interesting field, widely used in sports and gaming. In this blog post, we have covered a wide variety of information: from basic definitions and difficulties, through use cases, metrics, and datasets for evaluating models, to the most influential top-down and bottom-up approaches, including the current SOTA.