Bahdanau D, Cho K, and Bengio Y (2016). Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations (ICLR 2015), San Diego, CA.
Ba JL, Kiros JR, and Hinton GE (2016). Layer normalization, arXiv preprint arXiv:1607.06450.
Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 1724-1734.
Chorowski JK, Bahdanau D, Serdyuk D, Cho K, and Bengio Y (2015). Attention-based models for speech recognition, Advances in Neural Information Processing Systems, 28 (NIPS 2015), 577-585.
He K, Zhang X, Ren S, and Sun J (2016). Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 770-778.
Hwang IJ, Kim HJ, Kim YJ, and Lee YD (2024). Generalized neural collaborative filtering, The Korean Journal of Applied Statistics, 37, 311-322.
Ioffe S and Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, 448-456.
Kim HJ, Kim YJ, Jang K, and Lee YD (2024a). A statistical journey to DNN, the second trip: Architecture of RNN and image classification, The Korean Journal of Applied Statistics, 37, 553-563.
Kim HJ, Hwang IJ, Kim YJ, and Lee YD (2024b). A statistical journey to DNN, the first trip: From regression to deep neural network, The Korean Journal of Applied Statistics, 37, 541-551.
Li Y, Si S, Li G, Hsieh CJ, and Bengio S (2021). Learnable fourier features for multi-dimensional spatial positional encoding, Advances in Neural Information Processing Systems, 34 (NeurIPS 2021), 15816-15829.
Luong MT, Pham H, and Manning CD (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, 1412-1421.
Mikolov T, Chen K, Corrado G, and Dean J (2013a). Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J (2013b). Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 26 (NIPS 2013), 3111-3119.
Park C, Na I, Jo Y et al. (2019). SANVis: Visual analytics for understanding self-attention networks. In Proceedings of the 2019 IEEE Visualization Conference (VIS), Vancouver, BC, 146.
Parmar N, Vaswani A, Uszkoreit J, Kaiser Ł, Shazeer N, Ku A, and Tran D (2018). Image transformer. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, 4052-4061.
Shaw P, Uszkoreit J, and Vaswani A (2018). Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, LA, 464-468.
Siu C (2019). Residual networks behave like boosting algorithms. In Proceedings of the 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington, DC, 31-40.
Su J, Ahmed M, Lu Y, Pan S, Bo W, and Liu Y (2024). Roformer: Enhanced transformer with rotary position embedding, Neurocomputing, 568, 127063.
Sutskever I, Vinyals O, and Le QV (2014). Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems, 27 (NIPS 2014), 3104-3112.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, and Polosukhin I (2017). Attention is all you need, Advances in Neural Information Processing Systems, 30 (NIPS 2017), 5998-6008.
Veit A, Wilber MJ, and Belongie S (2016). Residual networks behave like ensembles of relatively shallow networks, Advances in Neural Information Processing Systems, 29 (NIPS 2016), 550-558.
Wang X, Tu Z, Wang L, and Shi S (2019). Self-attention with structural position representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, 1403-1409.
Zhou X, Ren Z, Zhou S, Jiang Z, Yu TZ, and Luo H (2024). Rethinking position embedding methods in the transformer architecture, Neural Processing Letters, 56, 41.