Data augmentation methods for classifying Korean texts
Korean J Appl Stat 2024;37(5):599-613
Published online October 31, 2024
© 2024 The Korean Statistical Society.

Jihyun Jeona, Yoonsuh Jung1,b

a2nd Credit Bureau division, NICE Information Service; bDepartment of Statistics, Korea University
1Department of Statistics, Korea University, 145 Anam-ro, Seongbuk-Gu, Seoul 02841, Korea. E-mail: yoons77@korea.ac.kr
Jung’s work has been partially supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) 2022R1F1A1071126 and by a Korea University Grant (K2305251).
Received April 13, 2024; Revised July 15, 2024; Accepted July 17, 2024.
Abstract
Data augmentation increases the size and diversity of training data by transforming existing examples and is widely used as a regularization tool against overfitting. While it has been studied extensively in computer vision, research on data augmentation for natural language processing remains limited, and work on Korean text is particularly scarce. We propose three augmentation methods tailored to Korean, adapted from existing augmentation techniques for English text, to improve the classification of small-scale Korean text data: 1) data augmentation with spelling correction (DA-SC), 2) easy data augmentation based on morphological analysis with part-of-speech tagging (EDA-POS), and 3) data augmentation with conditional masked language modeling (DA-cMLM). Experiments on real data show that the proposed methods can improve classification accuracy and, in some cases, regularize the language models to reduce overfitting. Due to the limit of computing facilities, we consider rather small-scale Korean texts only.
Keywords : data augmentation, natural language processing, Korean text classification, masked language modeling, BERT
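The abstract only names the three augmentation methods, so the sketches below illustrate the general idea behind each one; they are rough approximations under stated assumptions, not the authors' implementations. The first method, DA-SC, augments a corpus by adding a spelling- and spacing-corrected copy of each training sentence, so that noisy user-generated text and its normalized form both appear with the same label. A minimal sketch, assuming the py-hanspell wrapper around the Naver spell checker is used (the underlying web service changes over time, so the call may fail):

```python
from hanspell import spell_checker  # py-hanspell, an assumed tool choice

def spelling_correction_augment(sentences, labels):
    """Augment a corpus with spell/spacing-corrected copies of each sentence.

    Each corrected copy keeps its original label, so the class distribution
    is unchanged while the data size roughly doubles.
    """
    aug_sentences, aug_labels = list(sentences), list(labels)
    for sent, label in zip(sentences, labels):
        corrected = spell_checker.check(sent).checked  # corrected string
        if corrected and corrected != sent:            # keep only genuinely new text
            aug_sentences.append(corrected)
            aug_labels.append(label)
    return aug_sentences, aug_labels

# Example with a deliberately noisy review sentence (label 1 = positive).
texts, ys = spelling_correction_augment(["이 영화 넘 재밌어요ㅋㅋ 꼭 보세요"], [1])
print(texts, ys)
```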
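The second method, EDA-POS, adapts the easy data augmentation operations of Wei and Zou (2019), such as random swap and random deletion, to Korean by applying them at the morpheme level guided by part-of-speech tags. A minimal sketch of two operations, assuming the KoNLPy Okt analyzer and a hypothetical set of "content" tags (the paper's exact tag set and operations may differ):

```python
import random
from konlpy.tag import Okt  # assumed morpheme analyzer; the paper may use a different tagger

okt = Okt()
CONTENT_TAGS = {"Noun", "Verb", "Adjective", "Adverb"}  # hypothetical choice of content tags

def eda_pos_augment(sentence, p_delete=0.1, n_swaps=1, seed=None):
    """Rough EDA-style augmentation at the morpheme level.

    Content morphemes are randomly deleted with probability p_delete, and
    n_swaps random pairs of morphemes are swapped.  Particles and endings,
    which carry most of the Korean grammar, are never deleted.
    """
    rng = random.Random(seed)
    morphs = okt.pos(sentence)  # list of (morpheme, POS tag) pairs

    # Random deletion, restricted to content morphemes.
    kept = [(m, t) for (m, t) in morphs
            if t not in CONTENT_TAGS or rng.random() > p_delete]

    # Random swap of two positions, repeated n_swaps times.
    tokens = [m for m, _ in kept]
    for _ in range(n_swaps):
        if len(tokens) > 1:
            i, j = rng.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

# Generate one augmented variant of a training sentence.
print(eda_pos_augment("이 영화는 정말 재미있고 배우들의 연기도 훌륭했다", seed=0))
```

Joining the surviving morphemes with spaces is a crude way to rebuild the sentence; a real implementation would restore the original spacing and keep particles attached to their host words.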


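The third method, DA-cMLM, follows the conditional masked language modeling idea of Wu et al. (2019): tokens of a training sentence are masked and a masked language model, conditioned on the class label, fills them in to produce label-preserving paraphrases. The sketch below drops the label conditioning and simply uses the Hugging Face fill-mask pipeline with a Korean BERT checkpoint (klue/bert-base is an assumed model choice), so it is a simplified approximation rather than the paper's method:

```python
import random
from transformers import pipeline  # assumes the Hugging Face transformers package

# A pretrained Korean masked language model; the paper may use a different checkpoint.
fill_mask = pipeline("fill-mask", model="klue/bert-base")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models

def mlm_augment(sentence, n_aug=3, seed=None):
    """Replace one randomly chosen word with a masked-LM prediction.

    Unlike conditional MLM augmentation, this simplified version does not
    condition on the class label, so it can occasionally alter the label.
    """
    rng = random.Random(seed)
    words = sentence.split()
    augmented = []
    for _ in range(n_aug):
        i = rng.randrange(len(words))
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        # Take the single highest-scoring completion for the masked slot.
        best = fill_mask(masked, top_k=1)[0]
        augmented.append(best["sequence"])
    return augmented

for s in mlm_augment("배우들의 연기가 정말 인상적인 영화였다", seed=0):
    print(s)
```

Because a plain MLM is not conditioned on the label, a sentiment-bearing word can occasionally be replaced by one of opposite polarity; the conditional variant mitigates this by feeding the label through the segment embedding.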