Classification of human actions using 3D skeleton data: A performance comparison between classical machine learning and deep learning models
Korean J Appl Stat 2024;37(5):643-661
Published online October 31, 2024
© 2024 The Korean Statistical Society.

Juhwan Kim(a), Jongchan Kim(a), Sungim Lee(1,b)

(a) Department of Applied Statistics, Dankook University
(b) Department of Statistics and Data Science, Dankook University
(1) Corresponding author: Department of Statistics, 152 Jukjeon-ro, Suji-gu, Yongin-si, Gyeonggi-do 16890, Korea. E-mail: silee@dankook.ac.kr
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C1003257).
Received July 31, 2024; Revised August 24, 2024; Accepted August 29, 2024.
Abstract
This study investigates the effectiveness of 3D skeleton data for human action recognition by comparing the classification performance of machine learning and deep learning models. We use a subset of the NTU RGB+D dataset containing only frontal-view recordings of 40 individuals performing 60 different actions. Our study uses linear discriminant analysis (LDA), support vector machine (SVM), and random forest (RF) as machine learning models, while the deep learning models are the hierarchical bidirectional RNN (HBRNN) and the semantics-guided neural network (SGN). To evaluate model performance, cross-subject cross-validation is conducted over the 40 subjects. Our analysis demonstrates that action type significantly affects model performance. Cluster analysis by action category shows no significant difference in classification performance between machine learning and deep learning models for easily recognizable, large-scale actions. However, for actions that are hard to distinguish from frontal-view joint coordinates alone, such as 'clapping' or 'rubbing hands', deep learning models outperform machine learning models in capturing subtle joint movements.
Keywords : skeleton data, machine learning models, deep learning models, cross-subject cross-validation
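The cross-subject cross-validation protocol and classical models named in the abstract can be sketched with scikit-learn: each fold holds out every clip from one subject, so no person appears in both training and test data. This is a hypothetical illustration on synthetic data, not the paper's actual pipeline — the feature construction, subject counts, and hyperparameters below are assumptions.

```python
# Sketch of cross-subject cross-validation (hypothetical; synthetic data).
# Classifiers are trained on some subjects and tested on held-out subjects.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, clips_per_subject, n_features, n_actions = 8, 30, 75, 5

# Toy stand-in for per-clip skeleton features (e.g., flattened joint coordinates).
X = rng.normal(size=(n_subjects * clips_per_subject, n_features))
y = rng.integers(0, n_actions, size=len(X))                   # action labels
groups = np.repeat(np.arange(n_subjects), clips_per_subject)  # subject IDs

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv = LeaveOneGroupOut()  # each fold holds out all clips of one subject
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, groups=groups)
    print(f"{name}: mean accuracy {scores.mean():.3f} "
          f"over {cv.get_n_splits(groups=groups)} folds")
```

With random features the accuracies hover near chance (1/5 here); the point of the sketch is the grouping by subject, which prevents a model from scoring well merely by memorizing a particular person's body proportions or movement style.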
References
  1. Amor BB, Su J, and Srivastava A (2015). Action recognition using rate-invariant analysis of skeletal shape trajectories, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 1-13.
  2. Cao C, Lan C, Zhang Y, Zeng W, Lu H, and Zhang Y (2018). Skeleton-based action recognition with gated convolutional neural networks, IEEE Transactions on Circuits and Systems for Video Technology, 29, 3247-3257.
  3. Chaaraoui AA, Padilla-Lopez JR, and Florez-Revuelta F (2015). Abnormal gait detection with RGB-D devices using joint motion history features, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 7, 1-6.
  4. Cho K and Chen X (2014). Classifying and visualizing motion capture sequences using deep neural networks, 2014 International Conference on Computer Vision Theory and Applications, 2, 122-130.
  5. Du G, Zhang P, Mai J, and Li Z (2012). Markerless kinect-based hand tracking for robot teleoperation, International Journal of Advanced Robotic Systems, 9, 36.
  6. Du Y, Fu Y, and Wang L (2015). Skeleton based action recognition with convolutional neural network, In Proceedings of 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, 579-583.
  7. Du Y, Wang W, and Wang L (2015). Hierarchical recurrent neural network for skeleton based action recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1110-1118.
  8. Ghazal S, Khan US, Mubasher Saleem M, Rashid N, and Iqbal J (2019). Human activity recognition using 2D skeleton data and supervised machine learning, IET Image Processing, 13, 2572-2578.
  9. Gregor K, Danihelka I, Graves A, Rezende D, and Wierstra D (2015). DRAW: A recurrent neural network for image generation, International Conference on Machine Learning, 37, 1462-1471.
  10. Grushin A, Monner DD, Reggia JA, and Mishra A (2013). Robust human action recognition via long short-term memory, In Proceedings of The 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, 1-8.
  11. Hochreiter S and Schmidhuber J (1997). Long short-term memory, Neural Computation, 9, 1735-1780.
  12. Izenman AJ (2008). Modern Multivariate Statistical Techniques, Springer, New York.
  13. Jalal A, Uddin MZ, and Kim TS (2012). Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home, IEEE Transactions on Consumer Electronics, 58, 863-871.
  14. Jeong H and Lim C (2019). A review of artificial intelligence based demand forecasting techniques, The Korean Journal of Applied Statistics, 32, 795-835.
  15. Jeong YS and Park JH (2018). 3D skeleton animation learning using CNN, Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology, 8, 281-288.
  16. Jin X, Yao Y, Jiang Q, Huang X, Zhang J, Zhang X, and Zhang K (2015). Virtual personal trainer via the kinect sensor, In Proceedings of 2015 IEEE 16th International Conference on Communication Technology, Hangzhou, 406-463.
  17. Kang YK, Kang HY, and Weon DS (2021). Human skeleton keypoints based fall detection using GRU, Journal of the Korea Academia-Industrial Cooperation Society, 22, 127-133.
  18. Ke Q, Bennamoun M, An S, Sohel F, and Boussaid F (2017). A new representation of skeleton sequences for 3D action recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3288-3297.
  19. Kim W, Kim D, Park KS, and Lee S (2023). Motion classification using distributional features of 3D skeleton data, Communications for Statistical Applications and Methods, 30, 551-560.
  20. Kipf TN and Welling M (2016). Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.
  21. Lee I, Kim D, Kang S, and Lee S (2017). Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks, In Proceedings of the IEEE International Conference on Computer Vision, 1012-1020.
  22. Lee J, Lee M, Lee D, and Lee S (2023). Hierarchically decomposed graph convolutional networks for skeleton-based action recognition, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10444-10453.
  23. Lefebvre G, Berlemont S, Mamalet F, and Garcia C (2013). BLSTM-RNN based 3D gesture classification, Artificial Neural Networks and Machine Learning-ICANN 2013: 23rd International Conference on Artificial Neural Networks, Sofia, Bulgaria, 23, 381-388.
  24. Li C, Zhong Q, Xie D, and Pu S (2017). Skeleton-based action recognition with convolutional neural networks, In 2017 IEEE International Conference on Multimedia & Expo Workshops, 597-600.
  25. Li C, Zhong Q, Xie D, and Pu S (2018). Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation, arXiv preprint arXiv:1804.06055.
  26. Lin BS, Wang LY, Hwang YT, Chiang PY, and Chou WJ (2018). Depth camera based system for estimating energy expenditure of physical activities in gyms, IEEE Journal of Biomedical and Health Informatics, 23, 1086-1095.
  27. Liu J, Shahroudy A, Xu D, and Wang G (2016). Spatio-temporal LSTM with trust gates for 3D human action recognition, Computer Vision-ECCV 2016: 14th European Conference, 14, 816-833.
  28. Reddy VR and Chattopadhyay T (2014). Human activity recognition from kinect captured data using stick model, International Conference on Human-Computer Interaction, 305-315.
  29. Rumelhart DE, Hinton GE, and Williams RJ (1986). Learning representations by back-propagating errors, Nature, 323, 533-536.
  30. Sandra M (2020). Clustering Gestures using Multiple Techniques, Digital Sciences Tilburg University, Tilburg, The Netherlands.
  31. Schuster M and Paliwal KK (1997). Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, 45, 2673-2681.
  32. Shahroudy A, Liu J, Ng TT, and Wang G (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1010-1019.
  33. Shan J and Akella S (2014). 3D human action segmentation and recognition using pose kinetic energy, In Proceedings of 2014 IEEE International Workshop on Advanced Robotics and Its Social Impacts, Evanston, IL, 69-75.
  34. Shin BG, Kim UH, Lee SW, Yang JY, and Kim W (2021). Fall detection based on 2-stacked Bi-LSTM and human-skeleton keypoints of RGBD camera, KIPS Transactions on Software and Data Engineering, 10, 491-500.
  35. Taha A, Zayed HH, Khalifa ME, and El-Horbaty ESM (2015). Human activity recognition for surveillance applications, In Proceedings of the 7th International Conference on Information Technology, 577-586.
  36. Tao W, Liu T, Zheng R, and Feng H (2012). Gait analysis using wearable sensors, Sensors, 12, 2255-2283.
  37. Xu H, Gao Y, Hui Z, Li J, and Gao X (2023). Language knowledge-assisted representation learning for skeleton-based action recognition.
  38. Veeriah V, Zhuang N, and Qi GJ (2015). Differential recurrent neural networks for action recognition, In Proceedings of the IEEE International Conference on Computer Vision, 4041-4049.
  39. Yang Y, Yan H, Dehghan M, and Ang MH (2015). Real-time human-robot interaction in complex environment using kinect v2 image recognition, In 2015 IEEE 7th International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, 112-117.
  40. Zhang P, Lan C, Zeng W, Xing J, Xue J, and Zheng N (2020). Semantics-guided neural networks for efficient skeleton-based human action recognition, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1112-1121.
  41. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, and Xie X (2016). Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, Proceedings of the AAAI Conference on Artificial Intelligence, 30.

