# Hand tracking with DepthAI
Running Google Mediapipe Hand Tracking models on [Luxonis DepthAI](https://docs.luxonis.com/projects/hardware/en/latest/) hardware (OAK-D, OAK-D Lite, OAK-1, ...)



|Mode|Image|Comment|
|:-:|:-:|:-|
|No BPF|[![](img/output_no_bpf.jpg)](img/output_no_bpf.jpg)|Because of the padding, hands get very small and palm detection gives a poor result (right hand not detected, left hand detection inaccurate)|
|No BPF, with cropping|[![](img/output_no_bpf_crop.jpg)](img/output_no_bpf_crop.jpg)|Cropping the image along its shortest side is an easy and inexpensive way to improve the detection, on condition that the person stays in the center of the image|
|body_pre_focusing=group|[![](img/output_bpf_group_all_hands.jpg)](img/output_bpf_group_all_hands.jpg)|The BPF algorithm finds a zone that contains both hands, which are correctly detected|
|body_pre_focusing=group|[![](img/output_bpf_group.jpg)](img/output_bpf_group.jpg)|With "hands_up_only" set to True, the left hand is not taken into consideration since its wrist keypoint is below its elbow keypoint|
|body_pre_focusing=right|[![](img/output_bpf_group.jpg)](img/output_bpf_group.jpg)|The right hand is correctly detected, whatever the value of "hands_up_only"|
|body_pre_focusing=left|[![](img/output_bpf_left_all_hands.jpg)](img/output_bpf_left_all_hands.jpg)|The left hand is correctly detected|
|body_pre_focusing=left|[![](img/output_bpf_left.jpg)](img/output_bpf_left.jpg)|Because the left hand is not raised, it is not taken into consideration, so we fall back to the case where BPF is not used|
|body_pre_focusing=higher|[![](img/output_bpf_higher.jpg)](img/output_bpf_higher.jpg)|Same result as for "body_pre_focusing=right", whatever the value of "hands_up_only"|
## Frames per second (FPS)
You will quickly notice that **the FPS can vary a lot**.
Of course, it depends on the modes chosen:
- Edge mode is faster than Host mode,
- Solo mode is faster than Duo mode,
- Body Pre Focusing introduces a delay in the palm detection, but once a hand is detected you get the same speed as without BPF.
The FPS also depends heavily on the number of hands currently in the image. It may sound counter-intuitive, but **the FPS is significantly higher when a hand is present than when none is**. Why? Because the palm detection inference is slower than the landmark regression inference. When no hand is visible, the palm detection (or the body detection when using Body Pre Focusing) is run on every frame until a hand is found. Once a hand is detected, only the landmark model runs on the following frames, until it loses the hand's track. In the best-case scenario, the palm detection runs only once: on the first frame!
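The detect-then-track alternation described above can be sketched as follows. This is only an illustration of the scheduling logic (the `track` function and the boolean frame representation are made up, not the actual API):

```python
# Minimal sketch of the detect-then-track alternation: the slow palm
# detection runs only while no hand is tracked; once a hand is found,
# only the faster landmark model runs, until the track is lost.
def track(frames):
    """Count model runs over a sequence of frames.

    Each frame is True if a hand is visible, False otherwise."""
    detections = landmark_runs = 0
    tracking = False
    for hand_visible in frames:
        if not tracking:
            detections += 1          # slow palm (or body) detection
            tracking = hand_visible  # start tracking if a hand was found
        else:
            landmark_runs += 1       # fast landmark regression only
            tracking = hand_visible  # lose the track when the hand leaves
    return detections, landmark_runs

# Best case: a hand is visible from the first frame,
# so palm detection runs exactly once.
print(track([True] * 10))  # (1, 9)
```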
Important recommendation: **tune the internal FPS!** By default, the internal camera FPS is set to a value that depends on the chosen modes and on the use of depth (`-xyz`). These default values are based on my own observations.
When starting the demo, you will see a line like this:
```
Internal camera FPS set to: 36
```
Don't hesitate to play with the `internal_fps` parameter (the `--internal_fps` argument in the demos) to find the optimal value for your use case. If the observed FPS is well below the default value, lower the internal FPS with this parameter until the set FPS is just above the observed FPS.
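As a rough illustration of that tuning rule (the +2 margin here is an arbitrary choice for this sketch, not something the demos compute):

```python
# Pick an internal camera FPS just above the FPS actually observed.
# The +2 margin is arbitrary; adjust to taste.
def suggested_internal_fps(observed_fps, margin=2):
    return int(observed_fps) + margin

# E.g. if the demo reports ~23.6 FPS while the default internal FPS is 36,
# restart it with: ./demo.py --internal_fps 25
print(suggested_internal_fps(23.6))  # 25
```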
## Install
Install the python packages (depthai, opencv) with the following command:
```
python3 -m pip install -r requirements.txt
```
## Run
**Usage:**
Use `demo.py` or `demo_bpf.py` depending on whether or not you need Body Pre Focusing. `demo_bpf.py` has the same arguments as `demo.py` plus 2 more, related to BPF: `--body_pre_focusing` and `--all_hands`.
```
> ./demo_bpf.py -h
usage: demo_bpf.py [-h] [-e] [-i INPUT] [--pd_model PD_MODEL] [--no_lm]
                   [--lm_model LM_MODEL] [--use_world_landmarks] [-s] [-xyz]
                   [-g] [-c] [-f INTERNAL_FPS] [-r {full,ultra}]
                   [--internal_frame_height INTERNAL_FRAME_HEIGHT]
                   [-bpf {right,left,group,higher}] [-ah]
                   [--single_hand_tolerance_thresh SINGLE_HAND_TOLERANCE_THRESH]
                   [--dont_force_same_image] [-lmt {1,2}] [-t [TRACE]]
                   [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -e, --edge            Use Edge mode (postprocessing runs on the device)

Tracker arguments:
  -i INPUT, --input INPUT
                        Path to video or image file to use as input (if not
                        specified, use OAK color camera)
  --pd_model PD_MODEL   Path to a blob file for palm detection model
  --no_lm               Only the palm detection model is run (no hand landmark
                        model)
  --lm_model LM_MODEL   Landmark model 'full', 'lite', 'sparse' or path to a
                        blob file
  --use_world_landmarks
                        Fetch landmark 3D coordinates in meter
  -s, --solo            Solo mode: detect one hand max. If not used, detect 2
                        hands max (Duo mode)
  -xyz, --xyz           Enable spatial location measure of palm centers
  -g, --gesture         Enable gesture recognition
  -c, --crop            Center crop frames to a square shape
  -f INTERNAL_FPS, --internal_fps INTERNAL_FPS
                        Fps of internal color camera. Too high value lower NN
                        fps (default= depends on the model)
  -r {full,ultra}, --resolution {full,ultra}
                        Sensor resolution: 'full' (1920x1080) or 'ultra'
                        (3840x2160) (default=full)
  --internal_frame_height INTERNAL_FRAME_HEIGHT
                        Internal color camera frame height in pixels
  -bpf {right,left,group,higher}, --body_pre_focusing {right,left,group,higher}
                        Enable Body Pre Focusing
  -ah, --all_hands      In Body Pre Focusing mode, consider all hands (not
                        only the hands up)
  --single_hand_tolerance_thresh SINGLE_HAND_TOLERANCE_THRESH
                        (Duo mode only) Number of frames after only one hand
                        is detected before calling palm detection (default=10)
  --dont_force_same_image
                        (Edge Duo mode only) Don't force the use of the same
                        image when inferring the landmarks of the 2 hands
                        (slower but skeleton less shifted)
  -lmt {1,2}, --lm_nb_threads {1,2}
                        Number of the landmark model inference threads
                        (default=2)
  -t [TRACE], --trace [TRACE]
                        Print some debug infos. The type of info depends on
                        the optional argument.

Renderer arguments:
  -o OUTPUT, --output OUTPUT
                        Path to output video file
```
**Some examples:**
Whenever you see `demo.py` below, you can replace it with `demo_bpf.py`.
- To use the color camera as input in Host mode (by default, the `lite` version of the landmark model is used):
```./demo.py```
- Same as above but with the `full` version of the landmark model:
```./demo.py --lm_model full```
or with the older but faster version 0.8.0:
```./demo.py --lm_model models/hand_landmark_080_sh4.blob```
- To use the color camera as input in Edge mode (recommended for Solo mode):
```./demo.py -e```
- To use the color camera as input in Edge mode when you don't need to retrieve the video frame (only the landmark information is transferred to the host):
```./demo.py -e -i rgb_laconic```
- To use a file (video or image) as input (Host mode only):
```./demo.py -i filename```
- To enable gesture recognition:
```./demo.py [-e] -g```

- Recommended options for gesture recognition when the person can move a few meters from the camera:
```./demo_bpf.py -e -g -bpf higher```
or
```./demo_bpf.py -e -g -bpf higher --resolution ultra``` (a bit slower but better image definition)
- To measure hand spatial location in the camera coordinate system (only for depth-capable devices like the OAK-D):
```./demo.py [-e] -xyz```

The measure is made on the wrist keypoint (or on the palm box center if '--no_lm' is used).
- To run only the palm detection model (without hand landmarks, Host mode only):
```./demo.py --no_lm```

Of course, gesture recognition is not possible in this mode.
|Keypress|Function|
|-|-|
|*Esc*|Exit|
|*space*|Pause|
|1|Show/hide the palm bounding box (only in non solo mode)|
|2|Show/hide the palm detection keypoints (only in non solo mode)|
|3|Show/hide the rotated bounding box around the hand|
|4|Show/hide landmarks|
|5|Show/hide handedness (several display modes are available)|
|6|Show/hide scores|
|7|Show/hide recognized gestures (-g or --gesture)|
|8|Show/hide hand spatial location (-xyz)|
|9|Show/hide the zone used to measure the spatial location (-xyz)|
|f|Show/hide FPS|
|b|Show/hide body keypoints, smart cropping zone and focus zone if body pre focusing is used (only in Host mode)|
## Mediapipe models
You can find the models *palm_detector.blob* and *hand_landmark_\*.blob* under the 'models' directory, but below I describe how to get the files.
1) Clone this github repository in a local directory (DEST_DIR)
2) In DEST_DIR/models directory, download the source tflite models from [this archive](https://drive.google.com/file/d/1wmle8tvga6S_m0uEogB-Z1sRxOibdoFc/view?usp=sharing). The archive contains:
* Palm Detection model: palm_detection.tflite (mediapipe tag 0.8.0, 04/11/2020, size: 3877888)
* Hand Landmarks models:
  - hand_landmark_full.tflite (mediapipe, 18/10/2021, size: 5478688)
  - hand_landmark_lite.tflite (mediapipe, 18/10/2021, size: 2071408)
  - hand_landmark_sparse.tflite (mediapipe, 18/10/2021, size: 2175808)
  - hand_landmark_080.tflite (mediapipe tag 0.8.0, 04/11/2020, size: 3792620; this older version is still provided as it is a bit faster than the more recent full, lite and sparse versions)
3) Install the amazing [PINTO's tflite2tensorflow tool](https://github.com/PINTO0309/tflite2tensorflow). Use the docker installation, which bundles many packages, among them a recent version of OpenVINO.
4) From DEST_DIR, run the tflite2tensorflow container: ```./docker_tflite2tensorflow.sh```
5) From the running container:
```
cd models
./convert_models.sh
```
The *convert_models.sh* script converts the tflite models to TensorFlow (.pb), then converts the .pb files to OpenVINO IR format (.xml and .bin), and finally converts the IR files to MyriadX format (.blob).
By default, the blob files are compiled for 4 SHAVE cores. If you want to generate new blobs for a different number of shaves, use the script *gen_blob_shave.sh*:
```
# Example: to generate blobs for 6 shaves
./gen_blob_shave.sh -m palm_detection.xml -n 6 # will generate palm_detection_sh6.blob
./gen_blob_shave.sh -m hand_landmark_full.xml -n 6 # will generate hand_landmark_full_sh6.blob
```
**Explanation of the Model Optimizer parameters:**
- The preview of the OAK-* color camera outputs BGR [0, 255] frames. The original tflite palm detection model expects RGB [-1, 1] frames. ```--reverse_input_channels``` converts BGR to RGB, and ```--mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5]``` normalizes the frames to [-1, 1].
- The images fed to the hand landmark model are built on the host in a format similar to the OAK-* cameras (BGR [0, 255]). The original hand landmark model expects RGB [0, 1] frames, hence the arguments ```--reverse_input_channels --scale_values [255.0, 255.0, 255.0]```.
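The effect of those flags on pixel values can be checked with a quick sketch (plain Python, applied element-wise here; the real normalization happens inside the compiled model):

```python
# Illustration of the normalization expressed by the Model Optimizer flags:
# out = (pixel - mean) / scale, applied per channel.
def normalize(pixels, mean, scale):
    return [(p - mean) / scale for p in pixels]

bgr = [0, 127.5, 255]

# Palm detection: --mean_values 127.5 --scale_values 127.5 maps [0, 255] to [-1, 1]
print(normalize(bgr, 127.5, 127.5))  # [-1.0, 0.0, 1.0]

# Hand landmarks: --scale_values 255 (no mean) maps [0, 255] to [0, 1]
print(normalize(bgr, 0.0, 255.0))    # [0.0, 0.5, 1.0]
```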
**Blob models vs tflite models**
The palm detection blob does not exactly give the same results as the tflite version, because the tflite ResizeBilinear instruction is converted into IR Interpolate-1. Yet the difference is almost imperceptible thanks to the great help of PINTO (see [issue](https://github.com/PINTO0309/tflite2tensorflow/issues/4)).
**Movenet models :**
The 'lightning' and 'thunder' Movenet models come from the repository [geaxgx/depthai_movenet](https://github.com/geaxgx/depthai_movenet/tree/main/models).
## Custom model
The `custom_models` directory contains the code to build the custom model *PDPostProcessing_top2_sh1.blob*. This model processes the outputs of the palm detection network (a 1x896x1 tensor for the scores and a 1x896x18 for the regressors) and yields the 2 best detections. For more details, please read [this](custom_models/README.md).
## Code
There are 2 classes:
- **HandTracker**, responsible for computing the hand landmarks. There are 4 implementations, each in its own Python file, listed in the table below, depending on whether Body Pre Focusing is used and on where the tasks are executed:
||Host mode|Edge mode|
|:-:|:-:|:-:|
|Do not use Body Pre Focusing|HandTracker.py|HandTrackerEdge.py|
|Use Body Pre Focusing|HandTrackerBpf.py|HandTrackerBpfEdge.py|
- **HandTrackerRenderer**, an example of rendering code, implemented in HandTrackerRenderer.py.
You can replace the renderer from this repository with your own, personalized renderer (for some projects, you may not even need a renderer).
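The table above can be summarized by a hypothetical helper like the one below. It is purely illustrative; the demos simply import the module they need directly:

```python
# Hypothetical helper mapping the (Edge, BPF) choice to the HandTracker
# implementation listed in the table above.
def tracker_class_name(edge_mode: bool, use_bpf: bool) -> str:
    return {
        (False, False): "HandTracker",
        (True,  False): "HandTrackerEdge",
        (False, True):  "HandTrackerBpf",
        (True,  True):  "HandTrackerBpfEdge",
    }[(edge_mode, use_bpf)]

print(tracker_class_name(edge_mode=True, use_bpf=False))  # HandTrackerEdge
```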
The files ```demo.py``` or ```demo_bpf.py``` are representative examples of how to use these classes.
```python
from HandTrackerRenderer import HandTrackerRenderer
from HandTrackerEdge import HandTracker

tracker = HandTracker(
        # Your own arguments
        ...
        )
renderer = HandTrackerRenderer(tracker=tracker)

while True:
    # Run hand tracker on next frame
    # 'bag' is some information common to the frame and to the hands
    frame, hands, bag = tracker.next_frame()
    if frame is None: break
    # Draw hands
    frame = renderer.draw(frame, hands, bag)
    key = renderer.waitKey(delay=1)
    if key == 27 or key == ord('q'):
        break

renderer.exit()
tracker.exit()
```
The `hands` list returned by `tracker.next_frame()` contains `HandRegion` instances.
For more information on:
- the arguments of the tracker, please refer to the docstring of the class `HandTracker`, `HandTrackerEdge`, `HandTrackerBpf` or `HandTrackerBpfEdge` in its respective file;
- the attributes of a `HandRegion` element, which stores all the detected hand information, please refer to the docstring of the class `HandRegion` in `mediapipe_utils.py`.
## Landmarks
When accessing individual landmarks in the arrays `hand.landmarks` or `hand.norm_landmarks`, the following schema ([source](https://google.github.io/mediapipe/images/mobile/hand_landmarks.png)) references the landmarks' indexes:
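In that schema, the wrist sits at index 0 and the fingertips at indices 4, 8, 12, 16 and 20. A small illustrative access (the coordinates below are dummies; in practice they come from `hand.landmarks`):

```python
# A few indices from the Mediapipe hand landmark schema referenced above.
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_TIP, RING_TIP, PINKY_TIP = 0, 4, 8, 12, 16, 20

# hand.landmarks holds 21 keypoints; dummy (x, y) values stand in here:
landmarks = [(i * 10, i * 10) for i in range(21)]

x, y = landmarks[INDEX_TIP]
print(x, y)  # 80 80
```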

## Examples
- [3D visualization](examples/3d_visualization)
- [Remote control with hand poses](examples/remote_control)
## Credits
* [Google Mediapipe](https://github.com/google/mediapipe)
* Katsuya Hyodo a.k.a [Pinto](https://github.com/PINTO0309), the Wizard of Model Conversion !