Ensemble model through mixed projections useful for big data analytics
Korean J Appl Stat 2024;37(5):691-702
Published online October 31, 2024
© 2024 The Korean Statistical Society.

Hyejoon Park (a), Hyunjoong Kim (1, a), Yung-Seop Lee (2, b)

(a) Department of Applied Statistics, Yonsei University; (b) Department of Statistics, Dongguk University
(1) Department of Applied Statistics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Korea. E-mail: hkim@yonsei.ac.kr
(2) Department of Statistics, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea. E-mail: yung@dongguk.edu
Hyunjoong Kim’s work was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICAN (ICT Challenge and Advanced Network of HRD) support program (IITP-2023-00259934) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) and by a National Research Foundation of Korea (NRF) grant funded by the Korean government (No. 2016R1D1A1B02011696). Yung-Seop Lee’s work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C1007095) and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2020-0-01789) supervised by the IITP.
Received July 30, 2024; Revised August 16, 2024; Accepted August 19, 2024.
Abstract
In this paper, we propose mixed projection forest (MPF), a new classification ensemble method suited to big data analytics. When training each classifier in the ensemble, MPF applies a rotation matrix that combines two data projection techniques, principal component analysis (PCA) and canonical linear discriminant analysis (CLDA). The rotation lets each classifier use oblique hyperplanes, which improves its accuracy. In addition, randomly partitioning the input variable set yields a variety of rotation matrices, which increases the diversity of the individual classifiers. Together, these properties improve classification performance, making the approach effective for big data analyses that demand precision. We compared MPF with existing classification ensemble models on 30 real or simulated datasets. The results indicate that MPF is competitive in terms of both classification accuracy and classifier diversity.
Keywords: classification, ensemble, rotation forest, canonical forest, random rotation ensemble
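
The abstract describes the rotation step only in outline, so the following is a minimal sketch of how one mixed rotation matrix might be assembled, not the authors' implementation. It assumes that the variables are randomly partitioned into `n_subsets` groups and that each group is projected by either PCA or CLDA, chosen at random with probability `p_pca`. The function names, both parameters, the ridge term, and the coin-flip rule are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_axes(X):
    """Principal axes (as columns), ordered by decreasing variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / max(len(X) - 1, 1)
    _, vecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    return vecs[:, ::-1]


def clda_axes(X, y, ridge=1e-6):
    """Canonical discriminant axes: generalized eigenvectors of (S_b, S_w)."""
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    Sw = np.zeros((p, p))                  # within-class scatter
    Sb = np.zeros((p, p))                  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - grand_mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Assumption of this sketch: a small ridge keeps S_w positive definite.
    Sw += ridge * (np.trace(Sw) / p + 1.0) * np.eye(p)
    L = np.linalg.cholesky(Sw)             # S_w = L L^T
    Linv = np.linalg.inv(L)
    _, U = np.linalg.eigh(Linv @ Sb @ Linv.T)
    return (Linv.T @ U)[:, ::-1]           # most discriminative axis first


def mixed_rotation(X, y, n_subsets=3, p_pca=0.5, rng=None):
    """Build one block-structured rotation matrix from a random variable partition."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    subsets = np.array_split(rng.permutation(p), n_subsets)
    R = np.zeros((p, p))
    for idx in subsets:
        if len(idx) == 0:
            continue
        # Hypothetical mixing rule: pick PCA or CLDA for each subset by coin flip.
        if rng.random() < p_pca:
            axes = pca_axes(X[:, idx])
        else:
            axes = clda_axes(X[:, idx], y)
        R[np.ix_(idx, idx)] = axes         # rows: original variables, cols: new axes
    return R, subsets
```

Under these assumptions, each ensemble member would be trained on the rotated data `X @ R`, with a different random seed per member so that the partitions, and hence the rotation matrices, differ. Axis-parallel splits on the rotated variables then correspond to oblique hyperplanes in the original variable space.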