A statistical journey to DNN, the third trip: Language model and transformer
Korean J Appl Stat 2024;37(5):567-582
Published online October 31, 2024
© 2024 The Korean Statistical Society.

Yu Jin Kim, In Jun Hwang, Kisuk Jang, Yoon Dong Lee

Business School, Sogang University
Corresponding author: Yoon Dong Lee, Business School, Sogang University, PA 804, BaekBumRo, Mapo, Seoul 04107, Korea. E-mail: widylee@sogang.ac.kr
Received July 31, 2024; Revised August 9, 2024; Accepted August 12, 2024.
Abstract
Over the past decade, the remarkable advances in deep neural networks have gone hand in hand with the development of language models. Language models were initially built as encoder-decoder models based on early RNNs; with the introduction of attention in 2015 and the emergence of the Transformer in 2017, they grew into a revolutionary technology. This study briefly reviews the development of language models and examines in detail the working mechanism and technical elements of the Transformer. It also discusses the statistical models and methodologies related to language models and the Transformer.
Keywords: language model, transformer, multi-head attention, encoder-decoder, positional encoding
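
To make two of the keyword concepts concrete, the short NumPy sketch below implements scaled dot-product attention (the core operation inside multi-head attention) and the sinusoidal positional encoding of Vaswani et al. (2017). It is a minimal illustration only; the array shapes, the toy self-attention call, and the assumption of an even model dimension are ours, not code from the paper.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_query, n_key) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    # Assumes d_model is even (illustrative choice).
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Toy usage: a sequence of 4 tokens with model dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(X, X, X)           # self-attention over the sequence
print(out.shape)                                      # (4, 8)
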
References
  1. Bahdanau D, Cho K, and Bengio Y (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations (ICLR 2015), San Diego, CA.
  2. Ba JL, Kiros JR, and Hinton GE (2016). Layer normalization, arXiv preprint arXiv:1607.06450.
  3. Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 1724-1734.
  4. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, and Bengio Y (2015). Attention-based models for speech recognition, Advances in Neural Information Processing Systems, 28 (NIPS 2015), 577-585.
  5. He K, Zhang X, Ren S, and Sun J (2016). Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 770-778.
  6. Hwang IJ, Kim HJ, Kim YJ, and Lee YD (2024). Generalized neural collaborative filtering, The Korean Journal of Applied Statistics, 37, 311-322.
  7. Ioffe S and Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, 448-456.
  8. Kim HJ, Kim YJ, Jang K, and Lee YD (2024a). A statistical journey to DNN, the second trip: Architecture of RNN and image classification, The Korean Journal of Applied Statistics, 37, 553-563.
  9. Kim HJ, Hwang IJ, Kim YJ, and Lee YD (2024b). A statistical journey to DNN, the first trip: From regression to deep neural network, The Korean Journal of Applied Statistics, 37, 541-551.
  10. Li Y, Si S, Li G, Hsieh CJ, and Bengio S (2021). Learnable Fourier features for multi-dimensional spatial positional encoding, Advances in Neural Information Processing Systems, 34 (NeurIPS 2021), 15816-15829.
  11. Luong MT, Pham H, and Manning CD (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, 1412-1421.
  12. Mikolov T, Chen K, Corrado G, and Dean J (2013a). Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
  13. Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J (2013b). Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 26 (NIPS 2013), 3111-3119.
  14. Park C, Na I, Jo Y, et al. (2019). SANVis: Visual analytics for understanding self-attention networks. In Proceedings of the 2019 IEEE Visualization Conference (VIS), Vancouver, BC, 146.
  15. Parmar N, Vaswani A, Uszkoreit J, Kaiser Ł, Shazeer N, Ku A, and Tran D (2018). Image transformer. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, 4052-4061.
  16. Shaw P, Uszkoreit J, and Vaswani A (2018). Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, Louisiana, 464-468.
  17. Siu C (2019). Residual networks behave like boosting algorithms. In Proceedings of 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington DC, 31-40.
  18. Su J, Ahmed M, Lu Y, Pan S, Bo W, and Liu Y (2024). Roformer: Enhanced transformer with rotary position embedding, Neurocomputing, 568, 127063.
  19. Sutskever I, Vinyals O, and Le QV (2014). Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems, 27 (NIPS 2014), 3104-3112.
  20. Vaswani A, Shazeer N, Parmar N, et al. (2017). Attention is all you need, Advances in Neural Information Processing Systems, 30 (NIPS 2017), 5998-6008.
  21. Veit A, Wilber MJ, and Belongie S (2016). Residual networks behave like ensembles of relatively shallow networks, Advances in Neural Information Processing Systems, 29 (NIPS 2016), 550-558.
  22. Wang X, Tu Z, Wang L, and Shi S (2019). Self-attention with structural position representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, 1403-1409.
  23. Zhou X, Ren Z, Zhou S, Jiang Z, Yu TZ, and Luo H (2024). Rethinking position embedding methods in the transformer architecture, Neural Processing Letters, 56, 41.

