January 4, 2023

Human Pose Estimation - 2023 guide

What is Human Pose Estimation?

Human pose estimation is a popular Computer Vision task with more than 20 years of history. This domain focuses on localizing human body joints (e.g. knees and wrists), also known as key points, in images or videos. 

The entire human body can be represented with three different types of approaches: 

  1. skeleton-based model (also called a kinematic model)
  2. contour-based model (also called a planar model)
  3. volume-based model (also called a volumetric model)
a) skeleton-based, b) contour-based, c) volume-based | Source: "A comprehensive survey on human pose estimation"

The human pose estimation task aims to first form a skeleton-based representation and then process it according to the needs of the final application. From now on we will discuss skeleton-based datasets, metrics, and solutions. We will focus on 2D human pose estimation, because 3D approaches often run 2D algorithms first and then lift the resulting key points into 3-dimensional space.

Why is it hard? 

The main problems in Human Pose Estimation center around:

  • occlusions,
  • unusual poses of humans (check the image below), 
  • missing key points,
  • changes in lighting and clothing.

To mitigate these issues, practitioners either develop new methods or build challenging datasets to train models that are robust to such problems. Later on we will explore some solutions that tackle these problems via more elaborate architectural designs.

Mispredicting right arm and left leg key points due to unusual pose, Source: TU Delft 

Use Cases

AI-powered personal trainers

With the latest developments in Computer Vision, and with gyms closed for many months, applications that assist people in their home workouts have grown in popularity. In the background, each of these applications uses a Human Pose Estimation model to track your movement, count repetitions, and suggest improvements to your technique.

Kaia health mobile app: kaiahealth.de

Augmented Reality & CGI

As collecting data for Augmented Reality and CGI is expensive and time-consuming (i.e. using motion capture technology), human pose estimation models are often used to obtain cheaper, yet still accurate, avatars for movies or Augmented Reality applications. Disney Research is exploring avatar generation from Human Pose Estimation for a more lifelike experience in AR applications.


Source: AR Poser

Interactive Gaming Experience

To make the gaming experience more immersive, various body-tracking cameras were proposed. Behind the scenes, Microsoft Kinect uses 3D human pose estimation to track human movements and render the actions of the hero in the game. In the picture below you can see a toddler playing a bird-flying game.

Toddler playing bird flying game, Source: Towards Data Science 

Cashier-less shops

The rise of cashier-less shops like Amazon Go is thanks to advancements in Human Pose Estimation. By analyzing and tracking customers, it is possible to detect which products, and in what quantity, end up in a customer's basket. Customers can then leave the shop without spending a long time in line to pay.

Prototype cashier-less shop - Amazon Go in Seattle. Source: Wikipedia 

Monitoring aggressive behaviors

Thanks to Human Pose Estimation it is possible to detect violent and dangerous acts in the cities. The cooperation between Karlsruhe Institute of Technology and Mannheim Police Headquarters will result in real-life tests in 2023 of their solution. 

Human Pose Estimation for street fight detection | Source: “Where are we with Human Pose Estimation in Real-World Surveillance?”



Datasets

MPII

The MPII dataset consists of 40k person instances, each labeled with 16 joints. The train and validation sets contain 22k and 3k person instances, respectively.


COCO

The COCO dataset consists of 200k images with 250k person instances labeled with 17 key points.

The difference between the labeling of body joints for MPII and COCO is presented below.

Source: TU Delft 


OCHuman

This dataset focuses on hard, occluded examples and was introduced in 2019 with the paper "Pose2Seg: Detection Free Human Instance Segmentation". It consists of 5081 images with 10375 person instances, each suffering from heavy occlusion (MaxIoU > 0.5).


Metrics

PCK - Percentage of Correct Keypoints

This metric is used on the MPII dataset: a detected joint is considered correct when the distance between the predicted and the true location is within a certain threshold. To make the threshold relative to body size, it is usually defined as a fraction of the head segment length (this variant is often called PCKh). On MPII the primary metric is PCKh@0.5, so only joints within a distance of 0.5 * head_bone_length are counted as correctly detected.
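
To make the definition concrete, here is a minimal NumPy sketch of the metric above (the function name `pckh` and the array layout are our choices, not a standard API):

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCK with a head-normalized threshold (PCKh@alpha).

    pred, gt: (N, K, 2) predicted / ground-truth joint coordinates
    head_sizes: (N,) head segment length per person instance
    A joint counts as correct when its error is within alpha * head size.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)   # (N, K) per-joint errors
    thresholds = alpha * head_sizes[:, None]     # (N, 1) per-person cutoffs
    return (dists <= thresholds).mean()
```

For one person with a head segment of 10 px, joints predicted 3 px and 6 px away from the truth score 0.5, since only the first lands inside the default 0.5 * 10 = 5 px threshold.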

OKS - Object Keypoint Similarity

$$OKS = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

OKS is the main metric of the COCO dataset, where \( d_i \) is the distance between the predicted and true location of key point i, \( v_i \) is the visibility flag of key point i, s is the object scale, and \( k_i \) is a per-keypoint constant controlling falloff (calculated by the COCO dataset researchers).

In simple words, OKS plays the same role for keypoints that IoU plays for bounding boxes in Object Detection. Typically this metric is analyzed via Average Precision (AP@50, AP@75, and the average across 10 thresholds between @50 and @95) and Average Recall (same thresholds as for AP).
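
The formula above translates almost line by line into NumPy; a sketch for a single person instance (the function name `oks` is ours):

```python
import numpy as np

def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity for a single person instance.

    pred, gt: (K, 2) predicted / ground-truth keypoint locations
    visibility: (K,) ground-truth visibility flags (v_i > 0 means labeled)
    scale: object scale s
    k: (K,) per-keypoint falloff constants from the COCO authors
    """
    d2 = ((pred - gt) ** 2).sum(axis=-1)              # squared distances d_i^2
    labeled = visibility > 0                          # delta(v_i > 0)
    similarity = np.exp(-d2 / (2 * scale ** 2 * k ** 2))
    return similarity[labeled].sum() / labeled.sum()
```

A perfect prediction gives OKS = 1, and the score decays toward 0 as predicted keypoints drift away from the labeled locations.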

Deep Learning models

Throughout the history of Human Pose Estimation, there were multiple solutions based on classical Computer Vision, with a focus on parts and changes in colors and contrast. In the past few years, this area has been dominated by deep learning solutions, so in the following part, we will focus on them. 

Deep learning solutions can be distinguished into two branches:

  • top-down: first perform person detection, then regress key points within each detected bounding box.
  • bottom-up: first localize identity-free key points, then group them into person instances.
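
The two branches can be sketched as follows; every model function here is a hypothetical stand-in passed in as an argument, not a real library call:

```python
def top_down(image, detect_people, estimate_keypoints):
    """Top-down: detect person boxes first, then run a
    single-person pose model on each cropped box."""
    poses = []
    for box in detect_people(image):              # 1. person detection
        crop = image.crop(box)                    # 2. cut out one person
        poses.append(estimate_keypoints(crop))    # 3. per-person keypoints
    return poses

def bottom_up(image, localize_keypoints, group_keypoints):
    """Bottom-up: find all identity-free keypoints in one pass,
    then group them into person instances."""
    keypoints = localize_keypoints(image)         # 1. all keypoints at once
    return group_keypoints(keypoints)             # 2. assemble skeletons
```

Note that the top-down cost grows with the number of people in the image (one pose pass per box), while the bottom-up cost does not, which is why bottom-up methods tend to suit crowded, real-time scenes.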

Top-down approaches 

As stated before, among the top-down approaches there are multiple examples with favorable results. All of them learn to predict heatmaps of key-point locations rather than regressing the coordinates directly, which proved to give significantly better and more robust results. In this blog post we will limit ourselves to the 3 most influential pose estimation architectures, which shaped the landscape of this branch of approaches.
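
The heatmap targets these models train against are typically rendered as a 2D Gaussian bump around each annotated joint; a minimal sketch (the default sigma is an arbitrary example, not a value from any specific paper):

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=2.0):
    """Training target for one keypoint: a Gaussian bump centered on the
    annotated (x, y) location instead of a single hot pixel."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

The soft target tolerates small annotation noise and gives the network a smooth gradient everywhere, which is a large part of why heatmap regression beats direct coordinate regression.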

Hourglass [paper, code]

High-level view of the Hourglass structure: a sequence of identical Hourglass modules, repeatedly reducing resolution (via strided convolutions) and then increasing it again (nearest-neighbor upsampling). Source: original paper.

The Hourglass approach stacks multiple modules with the very same structure. Each module first downsamples and then upsamples the feature maps (hence the hourglass shape). Such an architecture captures both local context (e.g. where the wrist is) and global context (e.g. the orientation of the body). To make the learning process more successful, intermediate supervision is applied after each module, comparing the predicted heatmaps to the ground truth.
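
The intermediate supervision described above amounts to summing a heatmap loss over every module's output; a minimal sketch with MSE (the function name and loss choice are ours for illustration):

```python
import numpy as np

def intermediate_supervision_loss(per_stack_predictions, target_heatmaps):
    """Sum the heatmap loss over every hourglass module's output, so each
    stacked module receives a direct training signal rather than only the
    last one."""
    return sum(np.mean((pred - target_heatmaps) ** 2)
               for pred in per_stack_predictions)
```

Because every stack is graded against the same targets, gradients reach the early modules directly instead of flowing through the entire stack first.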

HRNet [paper, code]

HRNet architecture, where a horizontal arrow is a convolution, an arrow down is a strided convolution, and an arrow up is upsampling. Source: original paper

Previous solutions used to go from high to low and back to high resolution, while this architecture maintains a high-resolution representation throughout the whole process, as you can see above. It starts with a single high-resolution stream, but with each depth step it adds further parallel scales, each receiving information from the higher, same, and lower resolutions of the previous step. With access to high-resolution features at every step, HRNet managed to stay on top of the majority of HPE leaderboards.
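
The cross-resolution exchange can be illustrated with a toy sketch, where naive 2x pooling and nearest-neighbor repeats stand in for HRNet's learned strided convolutions and upsampling (feature maps are assumed square with power-of-two size ratios):

```python
import numpy as np

def exchange_unit(branches):
    """Fuse parallel resolution branches: every branch receives the sum of
    all branches, each resized to that branch's own resolution."""
    def resize(x, target):
        while x.shape[0] > target[0]:                # downscale by 2x pooling
            x = x[::2, ::2]
        while x.shape[0] < target[0]:                # upscale by 2x repeat
            x = x.repeat(2, axis=0).repeat(2, axis=1)
        return x
    return [sum(resize(b, out.shape) for b in branches) for out in branches]
```

After the exchange, even the coarsest branch carries information from the full-resolution stream, and vice versa, which is the core idea behind keeping precise keypoint localization.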

VITPose [paper, code]

(a) The framework of ViTPose. (b) The transformer block. (c) The classic decoder. (d) The simple decoder. (e) The decoders for multiple datasets, Source: original paper

With the advent of vision transformers and their growing popularity in Computer Vision, it was only a matter of time before a "Transformer for Pose Estimation" was proposed. The solution consists of a stack of transformer blocks (each a combination of Layer Normalization, Multi-Headed Self-Attention, and a Feed-Forward Network) and a decoder module. After the encoder extracts features, quite a simple decoder is used: two blocks of a deconvolution layer followed by Batch Normalization and ReLU, topped with a linear prediction layer. This network is simple to scale, does not require careful construction of convolutional layers with a hand-tuned number of parameters, and still produces very strong results.

This solution also works well for the multi-person pose estimation task with severe occlusions (it is the current leader on the OCHuman dataset).
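
The decoder described above can be sketched in PyTorch; the channel widths and kernel settings here are illustrative, not the paper's exact configuration:

```python
import torch.nn as nn

def heatmap_decoder(in_channels=256, num_keypoints=17):
    """Two deconvolution blocks (each upsampling 2x) followed by a 1x1
    prediction layer that emits one heatmap per keypoint."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, num_keypoints, kernel_size=1),
    )
```

Applied to a 16x16 feature map from the encoder, the two stride-2 deconvolutions produce 17 heatmaps at 64x64 resolution.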

Bottom-up approaches

As mentioned before, bottom-up approaches produce multiple skeletons at once, so they are often faster, more suitable for real-time solutions, and better on crowded scenes in multi-person pose estimation.

Open-Pose [paper, code]

Pipeline of how to produce human skeletons on given image via OpenPose, Source: original paper

OpenPose is probably the most popular bottom-up model there is, as it was released in the form of an open-source library in 2018. Its popularity is also due to the fact that it was one of the first reliable real-time solutions for human pose estimation.

The architecture works as follows: 

  1. Features are initially extracted by the first few layers of the network.
  2. Two branches of convolutional layers follow: the first predicts 18 confidence maps, one for each part of the human pose skeleton, and the second predicts 38 Part Affinity Fields (PAFs), representing the level of association between parts (a bipartite graph of keypoint-to-keypoint connections).
  3. Keypoint-to-keypoint connections with low confidence are pruned: thanks to PAFs it is possible to discard connections between keypoints that are unlikely to come from the very same person instance.
  4. After the pruning step, multiple human poses are assembled.

These characteristics make the solution suitable for real-time multi-person pose estimation.
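
The connection-scoring step can be sketched by sampling the PAF along a candidate limb and comparing it with the limb's direction; this is a simplified version of the paper's line-integral score, and the function name is ours:

```python
import numpy as np

def limb_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average alignment between the PAF vectors sampled along the segment
    p1 -> p2 and the segment's unit direction. Low scores mean the two
    keypoints likely belong to different people, so the connection is pruned."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    direction /= length
    scores = []
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * (p2 - p1)).astype(int)      # sample point on the segment
        scores.append(paf_x[y, x] * direction[0] + paf_y[y, x] * direction[1])
    return float(np.mean(scores))
```

A limb lying along the field direction scores near 1, while a perpendicular or misplaced candidate scores near 0 and gets discarded.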

Omnipose [paper, code]

Architecture of Omnipose, Source: original paper

OmniPose is currently the best-performing bottom-up architecture, with quite a simple structure. It starts with two 3x3 convolutions followed by a ResNet bottleneck block. After that, 3 HRNet blocks follow, each enriched with Gaussian heatmap modulation (proposed in the DARK paper). This improvement changes the way the keypoint location is decoded: instead of taking the maximum value (as previous solutions did), it assumes that the heatmap of a keypoint follows a Gaussian distribution and tries to find the center of that distribution.
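
The Gaussian assumption is what makes sub-pixel decoding possible: the log of a Gaussian is quadratic, so the true mode can be recovered from the gradient and curvature of the log-heatmap at the argmax pixel. Below is a simplified per-axis sketch of this idea (the actual DARK method uses the full 2D Taylor expansion and also smooths the heatmap first):

```python
import numpy as np

def refine_peak(heatmap):
    """Decode a keypoint with sub-pixel precision: take the argmax pixel,
    then shift it by -gradient/curvature of the log-heatmap along each axis."""
    heatmap = np.maximum(heatmap, 1e-10)          # keep log() finite
    y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)
    log_h = np.log(heatmap)
    coords = []
    for idx, line in ((x, log_h[y, :]), (y, log_h[:, x])):
        if 0 < idx < len(line) - 1:
            grad = 0.5 * (line[idx + 1] - line[idx - 1])
            curv = line[idx + 1] - 2 * line[idx] + line[idx - 1]
            if curv != 0:
                idx = idx - grad / curv
        coords.append(float(idx))
    return tuple(coords)                          # (x, y) with sub-pixel precision
```

For a heatmap that really is a sampled Gaussian, this recovers the continuous center exactly, even though the argmax pixel alone is off by up to half a pixel.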

The main contribution of this Omnipose architecture is the Waterfall Atrous Spatial Pyramid module (WASPv2), which can be seen below. 

Waterfall Atrous Spatial Pyramid module, Source: original paper

This module was partly inspired by the WASPv1 module (originally proposed in the UniPose paper), with the added idea of combining backbone features with low-level features. Thanks to all these improvements, this is one of the best-performing architectures today.

MoveNet [paper, code]

MoveNet architecture, Source: Tensorflow blog 

MoveNet is a popular bottom-up model which uses heatmaps to accurately localize human key points. The architecture consists of:

  • a feature extractor - essentially MobileNetV2 with a feature pyramid network (FPN)
  • a set of prediction heads - following a CenterNet-inspired scheme with small adjustments to improve speed and accuracy.

There are 4 prediction heads:

  • person center heatmap
  • keypoint regression field, which predicts keypoints for a person
  • person key point heatmap, which predicts all keypoints regardless of which person instance they belong to.
  • 2D per-keypoint offset field, predicting offset from each output feature map pixel to location of each keypoint. 

Images from the Activity dataset. Source: Tensorflow blog

This architecture was trained on the COCO dataset enriched with Google's private dataset, Activity, which specializes in challenging fitness and yoga poses. Thanks to the novel architecture and robust training data, this solution is quite stable under occlusions and unfamiliar poses. It was released by the Google team as an out-of-the-box solution with TensorFlow.js. The model itself comes in two variants: "Lightning", with a focus on speed, and "Thunder", aiming for higher accuracy (while still maintaining 30+ FPS). You can explore the live demo on their website.
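
A rough sketch of how the four heads listed above could be combined to decode one pose; this is our simplification of the description, using a naive inverse-distance weighting rather than MoveNet's actual trained post-processing:

```python
import numpy as np

def decode_single_pose(center_heatmap, keypoint_heatmaps, offsets):
    """Pick the strongest person center, down-weight each keypoint heatmap by
    distance to that center (so the nearest person's joints win), then refine
    the chosen pixels with the 2D offset field.

    center_heatmap: (H, W); keypoint_heatmaps: (K, H, W); offsets: (K, H, W, 2).
    """
    h, w = center_heatmap.shape
    cy, cx = np.unravel_index(center_heatmap.argmax(), center_heatmap.shape)
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2) + 1.0   # avoid div by zero
    pose = []
    for k, hm in enumerate(keypoint_heatmaps):
        ky, kx = np.unravel_index((hm / dist).argmax(), hm.shape)
        dx, dy = offsets[k, ky, kx]
        pose.append((kx + dx, ky + dy))                     # sub-pixel (x, y)
    return pose
```

The distance weighting is what lets the model pick the right joints when several people overlap, and the offset field restores the precision lost to the coarse output grid.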

ReasonField Lab model suggestions

As you can see there are multiple solutions available to choose from. From our side we can suggest the following:

  • If you have a lot of data and can operate in an offline fashion, ViTPose, OmniPose, or HRNet should be your go-to models.
  • When working on real-time or crowd applications, consider either MoveNet or OpenPose.
  • For data-deficient scenarios, smaller models will probably perform better (e.g. MoveNet "Lightning", HRNet-32, or OmniPose-Lite), or larger versions with careful and extensive augmentations.


Summary

Human Pose Estimation is a challenging but interesting field, widely used in sports and gaming. In this blog post we covered a wide range of topics: from basic definitions and difficulties, through use cases, metrics, and datasets for evaluating models, to the most influential top-down and bottom-up approaches, including the current SOTA.