
Deep Learning Approximation of Matrix Functions: From Feedforward Neural Networks to Transformers


Padmanabhan, Rahul (2025) Deep Learning Approximation of Matrix Functions: From Feedforward Neural Networks to Transformers. Masters thesis, Concordia University.

Padmanabhan_MSc_S2025.pdf - Accepted Version (PDF, 11MB)
Available under License Spectrum Terms of Access.

Abstract

Deep Neural Networks (DNNs) have been at the forefront of Artificial Intelligence (AI) over the last decade. Transformers, a type of DNN, have revolutionized Natural Language Processing (NLP) through models like ChatGPT, Llama, and, more recently, DeepSeek. While transformers are mostly used for NLP tasks, their potential for advanced numerical computation remains largely unexplored. This presents opportunities in areas such as surrogate modeling and raises fundamental questions about AI's mathematical capabilities.

We investigate the use of transformers for approximating matrix functions, which are mappings that extend scalar functions to matrices. These functions are ubiquitous in scientific applications, from continuous-time Markov chains (matrix exponential) to stability analysis of dynamical systems (matrix sign function). Our work makes two main contributions. First, we prove theoretical bounds on the depth and width requirements for ReLU DNNs to approximate the matrix exponential. Second, we use transformers with encoded matrix data to approximate general matrix functions and compare their performance to feedforward DNNs. Through extensive numerical experiments, we demonstrate that the choice of matrix encoding scheme significantly impacts transformer performance. Our results show strong accuracy in approximating the matrix sign function, suggesting transformers' potential for advanced mathematical computations.
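As context for the abstract (and not part of the thesis record), the short Python sketch below computes the two matrix functions named above with standard numerical routines. The SciPy calls expm and signm are chosen here purely for illustration and are not necessarily the routines used in the thesis; they stand in as reference implementations of the mappings that the networks are trained to approximate.

# Illustration only: the two matrix functions named in the abstract,
# computed with standard SciPy routines rather than a neural network.
import numpy as np
from scipy.linalg import expm, signm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # generic real test matrix

# Matrix exponential: exp(A) = sum_{k>=0} A^k / k!. For a continuous-time
# Markov chain with generator Q, the transition matrix at time t is exp(t*Q).
expA = expm(A)

# Matrix sign function: eigenvalues in the right half-plane map to +1 and
# those in the left half-plane to -1, with eigenvectors preserved; it is
# used, e.g., in stability analysis and spectral splitting.
signA = signm(A)

# Sanity check: sign(A)^2 = I up to rounding error.
print(np.linalg.norm(signA @ signA - np.eye(4)))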

Divisions: Concordia University > Faculty of Arts and Science > Mathematics and Statistics
Item Type: Thesis (Masters)
Authors: Padmanabhan, Rahul
Institution: Concordia University
Degree Name: M.Sc.
Program: Mathematics
Date: 19 February 2025
Thesis Supervisor(s): Brugiapaglia, Simone
Keywords: Transformers, Deep Learning, Matrix Functions, Matrix Sign, Matrix Exponential, Machine Learning, Scientific Computing
ID Code: 995212
Deposited By: Rahul Harikashyap Padmanabhan
Deposited On: 17 Jun 2025 17:41
Last Modified: 17 Jun 2025 17:41
