Action Recognition Datasets: "NTU RGB+D" Dataset and "NTU RGB+D 120" Dataset

(This page also covers the 4-class NTU subset used by the AUTH UAV Gesture Dataset; see Section 8.2.)

This page introduces two datasets: "NTU RGB+D" and "NTU RGB+D 120".
"NTU RGB+D" contains 60 action classes and 56,880 video samples.
"NTU RGB+D 120" extends "NTU RGB+D" by adding another 60 classes and another 57,600 video samples, i.e., "NTU RGB+D 120" has 120 classes and 114,480 samples in total.
Both datasets provide RGB videos, depth map sequences, 3D skeletal data, and infrared (IR) videos for each sample, captured concurrently by three Kinect V2 cameras.
The RGB videos have a resolution of 1920x1080, the depth maps and IR videos are both 512x424, and the 3D skeletal data contains the 3D coordinates of 25 body joints in each frame.
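
As a starting point for working with the skeletal modality, here is a minimal Python sketch that reads the 3D joint coordinates from one of the plain-text skeleton files. The field layout assumed below follows commonly used community readers rather than this page itself, so verify it against the official sample codes (see Section 4) before relying on it.

```python
# Minimal sketch of a reader for the plain-text skeleton files.
# The field layout is an assumption; check the official sample codes.

def read_skeleton_file(path):
    """Return frames -> bodies -> list of 25 (x, y, z) joint tuples."""
    with open(path) as f:
        tokens = iter(f.read().split())

    frames = []
    for _ in range(int(next(tokens))):          # number of frames
        bodies = []
        for _ in range(int(next(tokens))):      # bodies in this frame
            for _ in range(10):                 # per-body header fields
                next(tokens)                    # (ID, hand states, lean, ...)
            joints = []
            for _ in range(int(next(tokens))):  # joint count (25)
                xyz = tuple(float(next(tokens)) for _ in range(3))
                for _ in range(9):              # skip 2D depth/RGB coords,
                    next(tokens)                # orientation, tracking state
                joints.append(xyz)
            bodies.append(joints)
        frames.append(bodies)
    return frames

# Hypothetical usage:
# frames = read_skeleton_file("S001C001P001R001A001.skeleton")
# print(len(frames), "frames;", len(frames[0][0]), "joints in first body")
```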


1. Action Classes

The actions in these two datasets fall into three major categories: daily actions, medical conditions, and mutual actions, as shown in the tables below.
Note: actions labelled A1 to A60 are contained in "NTU RGB+D", and actions labelled A1 to A120 are contained in "NTU RGB+D 120". (A small helper that maps sample names and action IDs to these categories is sketched after the tables.)


1.1 Daily Actions (82)

A1: drink water A2: eat meal A3: brush teeth A4: brush hair
A5: drop A6: pick up A7: throw A8: sit down
A9: stand up A10: clapping A11: reading A12: writing
A13: tear up paper A14: put on jacket A15: take off jacket A16: put on a shoe
A17: take off a shoe A18: put on glasses A19: take off glasses A20: put on a hat/cap
A21: take off a hat/cap A22: cheer up A23: hand waving A24: kicking something
A25: reach into pocket A26: hopping A27: jump up A28: phone call
A29: play with phone/tablet A30: type on a keyboard A31: point to something A32: taking a selfie
A33: check time (from watch) A34: rub two hands A35: nod head/bow A36: shake head
A37: wipe face A38: salute A39: put palms together A40: cross hands in front
A61: put on headphone A62: take off headphone A63: shoot at basket A64: bounce ball
A65: tennis bat swing A66: juggle table tennis ball A67: hush A68: flick hair
A69: thumb up A70: thumb down A71: make OK sign A72: make victory sign
A73: staple book A74: counting money A75: cutting nails A76: cutting paper
A77: snap fingers A78: open bottle A79: sniff/smell A80: squat down
A81: toss a coin A82: fold paper A83: ball up paper A84: play magic cube
A85: apply cream on face A86: apply cream on hand A87: put on bag A88: take off bag
A89: put object into bag A90: take object out of bag A91: open a box A92: move heavy objects
A93: shake fist A94: throw up cap/hat A95: capitulate A96: cross arms
A97: arm circles A98: arm swings A99: run on the spot A100: butt kicks
A101: cross toe touch A102: side kick


1.2 Medical Conditions (12)

A41: sneeze/cough A42: staggering A43: falling down A44: headache
A45: chest pain A46: back pain A47: neck pain A48: nausea/vomiting
A49: fan self A103: yawn A104: stretch oneself A105: blow nose


1.3 Mutual Actions / Two Person Interactions (26)

A50: punch/slap A51: kicking A52: pushing A53: pat on back
A54: point finger A55: hugging A56: giving object A57: touch pocket
A58: shaking hands A59: walking towards A60: walking apart A106: hit with object
A107: wield knife A108: knock over A109: grab stuff A110: shoot with gun
A111: step on foot A112: high-five A113: cheers and drink A114: carry object
A115: take a photo A116: follow A117: whisper A118: exchange things
A119: support somebody A120: rock-paper-scissors
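
Sample files in both datasets are commonly named with the pattern SsssCcccPpppRrrrAaaa (setup, camera, performer, replication, action), e.g. S001C002P003R002A013. This naming convention is taken from community readers and the dataset FAQ rather than from this page, so treat it as an assumption. The Python sketch below parses such a name and maps the action ID to the categories above:

```python
import re

# Assumed sample-name pattern, e.g. "S001C002P003R002A013":
# setup (S), camera (C), performer (P), replication (R), action (A).
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

# Action-ID ranges taken from the tables above.
DAILY   = set(range(1, 41)) | set(range(61, 103))    # 82 daily actions
MEDICAL = set(range(41, 50)) | set(range(103, 106))  # 12 medical conditions
MUTUAL  = set(range(50, 61)) | set(range(106, 121))  # 26 mutual actions

def describe(sample_name):
    """Parse a sample name and report the action's category and which
    dataset(s) contain it."""
    m = NAME_RE.search(sample_name)
    if m is None:
        raise ValueError(f"unrecognized sample name: {sample_name!r}")
    setup, camera, performer, replication, action = map(int, m.groups())
    category = ("daily" if action in DAILY
                else "medical" if action in MEDICAL
                else "mutual")
    datasets = ('both "NTU RGB+D" and "NTU RGB+D 120"'
                if action <= 60 else '"NTU RGB+D 120" only')
    return dict(setup=setup, camera=camera, performer=performer,
                replication=replication, action=action,
                category=category, datasets=datasets)

# Example: describe("S001C002P003R002A050") reports action A50 (punch/slap),
# category "mutual", contained in both datasets.
```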


2. Size of Datasets

To ease downloading, we separate the modalities of the samples into different files. The size of each modality is shown in the table below:

Data Modality              | "NTU RGB+D" | "NTU RGB+D 120"
3D skeletons (body joints) | 5.8 GB      | 5.8 + 4.5 GB
Masked depth maps*         | 83 GB       | 83 + 64 GB
Full depth maps            | 886 GB      | 886 + 549 GB
RGB videos                 | 136 GB      | 136 + 124 GB
IR data                    | 221 GB      | 221 + 168 GB
Total                      | 1.3 TB      | 2.3 TB


*Masked depth maps are foreground-masked versions of the full depth maps. Masking is done based on the locations of the detected body joints; it removes the background and less important parts of the depth maps and improves the compression rate.
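
To make the masking idea concrete, the sketch below shows one plausible scheme, keeping only a padded bounding box around the detected 2D joint positions (the per-joint depth-map coordinates stored with the skeleton data) and zeroing everything else. This is an illustrative assumption, not the lab's actual masking procedure.

```python
import numpy as np

def mask_depth_frame(depth, joints_2d, margin=40):
    """Illustrative foreground masking: keep depth pixels inside a padded
    bounding box around the 2D joint positions and zero out the rest.

    depth:     (H, W) array of depth values, e.g. 424 x 512.
    joints_2d: sequence of (col, row) joint coordinates in depth-map pixels.
    margin:    padding in pixels around the joints' bounding box (assumed).
    """
    joints_2d = list(joints_2d)
    h, w = depth.shape
    cols = [c for c, r in joints_2d]
    rows = [r for c, r in joints_2d]
    c0, c1 = max(int(min(cols)) - margin, 0), min(int(max(cols)) + margin, w)
    r0, r1 = max(int(min(rows)) - margin, 0), min(int(max(rows)) + margin, h)

    masked = np.zeros_like(depth)
    masked[r0:r1, c0:c1] = depth[r0:r1, c0:c1]
    return masked
```

Whatever the exact scheme, zeroing the background leaves long runs of identical values, which is why the masked depth maps compress so much better than the full ones in the table above (83 GB versus 886 GB for "NTU RGB+D").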


3. How to Download Datasets

Researchers can register for an account, submit the request form, and accept the Release Agreement. We will validate your request and grant approval for downloading the datasets. The same LoginID can be used for both "NTU RGB+D" and "NTU RGB+D 120".


4. More Information (FAQs and Sample Codes)

We provide more information about the data, answers to frequently asked questions, sample codes to read the data, and the latest published results on our datasets here.


5. Sample Videos


6. Terms & Conditions of Use

The datasets are released for academic research only, and are free to researchers from educational or research institutes for non-commercial purposes.

The use of these two datasets is governed by the following terms and conditions:
• Without the express permission of the ROSE Lab, any of the following will be considered illegal: redistribution, derivation or generation of a new dataset from these datasets, and commercial usage of any of these datasets in any way or form, either partially or in their entirety.
• For the sake of privacy, images of the subjects in these datasets may only be shown for demonstration purposes in academic publications and presentations.
• All users of "NTU RGB+D" and "NTU RGB+D 120" action recognition datasets agree to indemnify, defend and hold harmless, the ROSE Lab and its officers, employees, and agents, individually and collectively, from any and all losses, expenses, and damages.

If interested, researchers can register for an account, submit the request form, and accept the Release Agreement as described in Section 3. The same LoginID can be used for both "NTU RGB+D" and "NTU RGB+D 120".

7. Related Publications

All publications using the "NTU RGB+D" or "NTU RGB+D 120" Action Recognition Database, or any of the derived datasets (see Section 8), should include the following acknowledgement: "(Portions of) the research in this paper used the NTU RGB+D (or NTU RGB+D 120) Action Recognition Dataset made available by the ROSE Lab at the Nanyang Technological University, Singapore."

Furthermore, these publications should cite the following papers:

  • Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [PDF]
  • Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, Alex C. Kot, "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019. [PDF]


Some related works on RGB+D action recognition:

  • Amir Shahroudy, Tian-Tsong Ng, Qingxiong Yang, Gang Wang, "Multimodal Multipart Learning for Action Recognition in Depth Videos", TPAMI, 2016.
  • Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, Gang Wang, "Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos", TPAMI, 2018.
  • Amir Shahroudy, Gang Wang, Tian-Tsong Ng, "Multi-modal Feature Fusion for Action Recognition in RGB-D Sequences", ISCCSP, 2014.
  • Jun Liu, Amir Shahroudy, Dong Xu, Gang Wang, "Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition", ECCV, 2016.
  • Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, Alex C. Kot, "Global Context-Aware Attention LSTM Networks for 3D Action Recognition", CVPR, 2017.
  • Jun Liu, Amir Shahroudy, Dong Xu, Alex C. Kot, Gang Wang, "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates", TPAMI, 2018.
  • Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, Alex C. Kot, "Skeleton-Based Human Action Recognition with Global Context-aware Attention LSTM Networks", TIP, 2018.
  • Jun Liu, Amir Shahroudy, Gang Wang, Ling-Yu Duan, Alex C. Kot, "Skeleton-Based Online Action Prediction Using Scale Selection Network", TPAMI, 2019.
  • Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, and Alex Kot, "Collaborative Learning of Gesture Recognition and 3D Hand Pose Estimation with Multi-Order Feature Analysis", ECCV 2020.
  • Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, and Alex Kot, "Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning", ICCV 2021.

8. Derived Works Based on NTU RGB+D Dataset

Below are some datasets that are derived from, or make partial use of, the NTU RGB+D dataset:

8.1.  LSMB19: A Large-Scale Motion Benchmark for Searching and Annotating in Continuous Motion Data Streams (http://mocap.fi.muni.cz/LSMB).

  • J. Sedmidubsky, P. Elias, P. Zezula, "Benchmarking Search and Annotation in Continuous Human Skeleton Sequences", ICMR, 2019.


8.2.  AUTH UAV Gesture Dataset (https://aiia.csd.auth.gr/auth-uav-gesture-dataset/).

  • F. Patrona, I. Mademlis, I. Pitas, "An Overview of Hand Gesture Languages for Autonomous UAV Handling", in Proceedings of the Workshop on Aerial Robotic Systems Physically Interacting with the Environment (AIRPHARO), 2021.

You can request the relevant 4-class sub-dataset of the NTU RGB+D dataset using the same request form, and download it from Section 3.0 of the Download Page.



Note: users of these derived works should also cite the corresponding papers listed in Section 7 (Related Publications).