Action Recognition Datasets: "NTU RGB+D" Dataset and "NTU RGB+D 120" Dataset

This page introduces two datasets: "NTU RGB+D" and "NTU RGB+D 120".
"NTU RGB+D" contains 60 action classes and 56,880 video samples.
"NTU RGB+D 120" extends "NTU RGB+D" by adding another 60 classes and another 57,600 video samples, i.e., "NTU RGB+D 120" has 120 classes and 114,480 samples in total.
These two datasets both contain RGB videos, depth map sequences, 3D skeletal data, and infrared (IR) videos for each sample. Each dataset is captured by three Kinect V2 cameras concurrently.
The resolutions of RGB videos are 1920x1080, depth maps and IR videos are all in 512x424, and 3D skeletal data contains the 3D coordinates of 25 body joints at each frame.

 

1. Action Classes

The actions in these two datasets fall into three major categories: daily actions, mutual actions, and medical conditions, as shown in the tables below.
Note: actions A1 to A60 are contained in "NTU RGB+D", and actions A1 to A120 are in "NTU RGB+D 120". (A short sketch for reading these labels from sample filenames follows the tables.)

 

1.1 Daily Actions (82)

A1: drink water A2: eat meal A3: brush teeth A4: brush hair
A5: drop A6: pick up A7: throw A8: sit down
A9: stand up A10: clapping A11: reading A12: writing
A13: tear up paper A14: put on jacket A15: take off jacket A16: put on a shoe
A17: take off a shoe A18: put on glasses A19: take off glasses A20: put on a hat/cap
A21: take off a hat/cap A22: cheer up A23: hand waving A24: kicking something
A25: reach into pocket A26: hopping A27: jump up A28: phone call
A29: play with phone/tablet A30: type on a keyboard A31: point to something A32: taking a selfie
A33: check time (from watch) A34: rub two hands A35: nod head/bow A36: shake head
A37: wipe face A38: salute A39: put palms together A40: cross hands in front
A61: put on headphone A62: take off headphone A63: shoot at basket A64: bounce ball
A65: tennis bat swing A66: juggle table tennis ball A67: hush A68: flick hair
A69: thumb up A70: thumb down A71: make OK sign A72: make victory sign
A73: staple book A74: counting money A75: cutting nails A76: cutting paper
A77: snap fingers A78: open bottle A79: sniff/smell A80: squat down
A81: toss a coin A82: fold paper A83: ball up paper A84: play magic cube
A85: apply cream on face A86: apply cream on hand A87: put on bag A88: take off bag
A89: put object into bag A90: take object out of bag A91: open a box A92: move heavy objects
A93: shake fist A94: throw up cap/hat A95: capitulate A96: cross arms
A97: arm circles A98: arm swings A99: run on the spot A100: butt kicks
A101: cross toe touch A102: side kick

 

1.2 Medical Conditions (12)

A41: sneeze/cough A42: staggering A43: falling down A44: headache
A45: chest pain A46: back pain A47: neck pain A48: nausea/vomiting
A49: fan self A103: yawn A104: stretch oneself A105: blow nose

 

1.3 Mutual Actions / Two Person Interactions (26)

A50: punch/slap A51: kicking A52: pushing A53: pat on back
A54: point finger A55: hugging A56: giving object A57: touch pocket
A58: shaking hands A59: walking towards A60: walking apart A106: hit with object
A107: wield knife A108: knock over A109: grab stuff A110: shoot with gun
A111: step on foot A112: high-five A113: cheers and drink A114: carry object
A115: take a photo A116: follow A117: whisper A118: exchange things
A119: support somebody A120: rock-paper-scissors
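
Each sample's filename encodes its setup, camera, performer, replication, and action label in the form SsssCcccPpppRrrrAaaa (e.g., S001C001P001R001A001). The minimal Python sketch below recovers these labels from a filename; parse_name and NAME are illustrative helpers, not part of the released code:

    import re

    # Assumes the standard SsssCcccPpppRrrrAaaa naming of the released samples.
    NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

    def parse_name(filename):
        # Returns the five identifiers encoded in a sample filename.
        m = NAME.search(filename)
        setup, camera, performer, replication, action = (int(g) for g in m.groups())
        return {"setup": setup, "camera": camera, "performer": performer,
                "replication": replication, "action": action}

    print(parse_name("S018C003P042R002A120.skeleton")["action"])   # -> 120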

 

 

2. Size of Datasets

To ease downloading, we separate the modalities of the samples into different files. The size of each modality is shown in the table below; for "NTU RGB+D 120", sizes are given as the "NTU RGB+D" portion plus the additional files.

Data Modality                 "NTU RGB+D"   "NTU RGB+D 120"
3D skeletons (body joints)    5.8 GB        5.8+4.5 GB
Masked depth maps*            83 GB         83+64 GB
Full depth maps               886 GB        886+549 GB
RGB videos                    136 GB        136+124 GB
IR data                       221 GB        221+168 GB
Total                         1.3 TB        2.3 TB

 

*Masked depth maps are the foreground-masked version of the depth maps. Masking is done based on the locations of the detected body joints, to remove the background and less important parts of the depth maps and to improve the compression rate.
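
The released masked depth maps were generated by the dataset authors; the sketch below only illustrates the idea under the assumption that masking keeps a padded bounding region around the detected joints (mask_depth and its margin parameter are hypothetical, not the authors' exact procedure):

    import numpy as np

    def mask_depth(depth_map, joints_2d, margin=40):
        # depth_map: 424x512 depth frame; joints_2d: (25, 2) array of
        # (depthX, depthY) joint pixel positions from the skeleton data.
        # margin is a hypothetical padding (in pixels) around the body.
        h, w = depth_map.shape
        x0 = max(int(joints_2d[:, 0].min()) - margin, 0)
        x1 = min(int(joints_2d[:, 0].max()) + margin, w)
        y0 = max(int(joints_2d[:, 1].min()) - margin, 0)
        y1 = min(int(joints_2d[:, 1].max()) + margin, h)
        masked = np.zeros_like(depth_map)        # background set to zero
        masked[y0:y1, x0:x1] = depth_map[y0:y1, x0:x1]
        return masked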

 

3. How to Download Datasets

Researchers can register an account, submit the request form, and accept the Release Agreement. We will validate your request and grant approval for downloading the datasets. The LoginID can be used for both "NTU RGB+D" and "NTU RGB+D 120".

 

4. More Information (FAQs and Sample Codes)

We provide more information about the data, answers to FAQs, sample codes to read the data, and the latest published results on our datasets here.
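
As an example of working with the skeleton modality, below is a minimal Python sketch that reads the 3D joint coordinates from a .skeleton file. It assumes the plain-text layout used by the official sample code (a frame count, then per-frame body blocks consisting of a 10-value body-info header, a joint count, and 12 values per joint); consult the official readers for the authoritative format.

    def read_skeleton(path):
        # Returns a list of frames; each frame is a list of bodies; each body
        # is a list of 25 (x, y, z) joint coordinates in camera space.
        with open(path) as f:
            it = iter(f.read().split())
        frames = []
        for _ in range(int(next(it))):            # number of frames
            bodies = []
            for _ in range(int(next(it))):        # number of bodies in this frame
                for _ in range(10):               # skip body info: ID, clipped
                    next(it)                      # edges, hand states, lean, ...
                num_joints = int(next(it))        # 25 for Kinect V2
                joints = []
                for _ in range(num_joints):
                    vals = [next(it) for _ in range(12)]
                    joints.append(tuple(float(v) for v in vals[:3]))  # keep x, y, z
                bodies.append(joints)
            frames.append(bodies)
        return frames

    frames = read_skeleton("S001C001P001R001A001.skeleton")
    print(len(frames), "frames,", len(frames[0]), "body/bodies in the first frame")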

 

5. Sample Videos

 


6. Terms & Conditions of Use

The datasets are released for academic research only, and are free to researchers from educational or research institutes for non-commercial purposes.

The use of these two datasets is governed by the following terms and conditions:
• Without the express permission of the ROSE Lab, any of the following will be considered illegal: redistribution, derivation or generation of a new dataset from these datasets, and commercial usage of any of these datasets in any way or form, either partially or in their entirety.
• For the sake of privacy, images of the subjects in these datasets may only be used for demonstration in academic publications and presentations.
• All users of the "NTU RGB+D" and "NTU RGB+D 120" action recognition datasets agree to indemnify, defend and hold harmless the ROSE Lab and its officers, employees, and agents, individually and collectively, from any and all losses, expenses, and damages.


7. Related Publications

All publications using the "NTU RGB+D" or "NTU RGB+D 120" Action Recognition Dataset or any of the derived datasets (see Section 8) should include the following acknowledgement: "(Portions of) the research in this paper used the NTU RGB+D (or NTU RGB+D 120) Action Recognition Dataset made available by the ROSE Lab at the Nanyang Technological University, Singapore."

Furthermore, these publications should cite the following papers:

  • Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [PDF]
  • Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, Alex C. Kot, "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019. [PDF]

 

Some related works on RGB+D action recognition:

  • Amir Shahroudy, Tian-Tsong Ng, Qingxiong Yang, Gang Wang, "Multimodal Multipart Learning for Action Recognition in Depth Videos", TPAMI, 2016.
  • Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, Gang Wang, "Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos", TPAMI, 2018.
  • Amir Shahroudy, Gang Wang, Tian-Tsong Ng, "Multi-modal Feature Fusion for Action Recognition in RGB-D Sequences", ISCCSP, 2014.
  • Jun Liu, Amir Shahroudy, Dong Xu, Gang Wang, "Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition", ECCV, 2016.
  • Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, Alex C. Kot, "Global Context-Aware Attention LSTM Networks for 3D Action Recognition", CVPR, 2017.
  • Jun Liu, Amir Shahroudy, Dong Xu, Alex C. Kot, Gang Wang, "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates", TPAMI, 2018.
  • Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, Alex C. Kot, "Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks", TIP, 2018.
  • Jun Liu, Amir Shahroudy, Gang Wang, Ling-Yu Duan, Alex C. Kot, "Skeleton-Based Online Action Prediction Using Scale Selection Network", TPAMI, 2019.
  • Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, Alex C. Kot, "Collaborative Learning of Gesture Recognition and 3D Hand Pose Estimation with Multi-Order Feature Analysis", ECCV, 2020.
  • Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, Alex C. Kot, "Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning", ICCV, 2021.

8. Derived Works Based on NTU RGB+D Dataset

Below are some datasets that are derived from, or make partial use of, NTU RGB+D dataset:

8.1.  LSMB19: A Large-Scale Motion Benchmark for Searching and Annotating in Continuous Motion Data Streams (http://mocap.fi.muni.cz/LSMB).

  • J. Sedmidubsky, P. Elias, P. Zezula, "Benchmarking Search and Annotation in Continuous Human Skeleton Sequences", ICMR, 2019.

 

8.2.  AUTH UAV Gesture Dataset (https://aiia.csd.auth.gr/auth-uav-gesture-dataset/).

  • F. Patrona, I. Mademlis, I. Pitas, "An Overview of Hand Gesture Languages for Autonomous UAV Handling", in Proceedings of the Workshop on Aerial Robotic Systems Physically Interacting with the Environment (AIRPHARO), 2021.

You can request the relevant 4-class sub-dataset of the NTU RGB+D dataset using the same request form, and download it from Section 3.0 of the Download Page.

8.3.  TVBench (https://huggingface.co/datasets/FunAILab/TVBench)

You can request the relevant 320 videos of the NTU RGB+D dataset using the same request form, and download them from Section 3.0 of the Download Page.

 



Note: Users of these derived works should also cite the corresponding papers listed in Section 7 (Related Publications).