
Tong Zhao (赵曈)

Assistant Professor
Leading a stochastic optimization and reinforcement learning team
at the Institute of Computing Technology, Chinese Academy of Sciences.


📰News

Oct 30, 2024

I am seeking a postdoctoral position in an AI-related field. I have a multidisciplinary background in mathematics, AI, and high-performance computing, and offer the following advantages:

  1. Curiosity, Adaptability, and Self-Motivation. I am capable of working independently, eager to learn, and willing to keep a doctoral student's mindset. I aim to contribute to highly original research. I am also open to discussing ideas with group members and to helping guide PhD students.
  2. Computational Resources. Through my professional experience at the High Performance Computer Research Center, I have established good collaborative relationships with leading supercomputing companies, internet companies, and distributed training centers at universities and research institutes. If necessary, I can provide the computational resources our group requires.
  3. Relevant Theoretical Foundation. I obtained my Ph.D. from the School of Mathematical Sciences, with strong algorithm-analysis skills and the ability to learn new theories. My background covers deep learning optimizers, reinforcement learning, stochastic analysis, control and game theory, and PDEs.

In addition, almost all the personal information relevant to a postdoctoral application (referees, publications, etc.) can be found on this website. If you have any interest or questions, please feel free to contact me or refer to the FAQ section.


🎓Research

Primary: Optimization in Deep Learning

  • I have studied how momentum affects generalization and its relationship with the sharpness of the landscape.
  • I explored why the training performance of deep neural networks deteriorates with larger batch sizes, and how momentum can be introduced to improve the performance. I designed adaptive momentum methods to enhance training with large batch sizes.
  • Many optimizers perform well on small models (<100M parameters), but lag behind SGD on large models (>100M parameters), especially under memory and communication constraints. How can we design efficient and practical optimizers for large-scale models?
  • Kalman filtering is a classical algorithm in the control domain; I adapted it into a novel optimizer, which has shown promising results in fitting molecular potential energy surfaces. This algorithm has been incorporated into the DeePMD library for molecular dynamics simulations.
  • I have also dabbled in training neural networks for PDEs (partial differential equations). During my PhD in the Mathematics Department at Fudan University, I researched stochastic control and operations, focusing on stochastic partial differential equations and on solving forward-backward stochastic differential equations and the corresponding PDEs with deep learning.
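For readers unfamiliar with the momentum mechanism studied above, it can be sketched as classic heavy-ball SGD on a toy quadratic (a minimal illustration with arbitrary step size and momentum coefficient; this is not the adaptive momentum method from my papers):

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr=0.1, beta=0.9):
    """One step of SGD with heavy-ball momentum:
    v_{t+1} = beta * v_t - lr * grad(w_t)
    w_{t+1} = w_t + v_{t+1}
    The velocity v accumulates past gradients, smoothing the trajectory.
    """
    v = beta * v - lr * grad(w)
    return w + v, v

# Minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
w, v = np.ones(2), np.zeros(2)
for _ in range(200):
    w, v = heavy_ball_step(w, v, grad=lambda x: x)
# The iterate spirals in toward the minimizer at the origin.
```

Adaptive variants tune `beta` (and `lr`) during training rather than fixing them, which is one way to counteract the degradation seen at large batch sizes.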

Secondary: Reinforcement Learning

  • I designed a co-running scheduler for AI training tasks using reinforcement learning algorithms, fully utilizing the large scheduling window to achieve better performance than traditional (dynamic) schedulers.
  • I developed a distributed I/O scheduler for training tasks based on reinforcement learning. It is an interference-aware scheduler for concurrent reads and writes, using a neural network to predict the effective bandwidth under interference.
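The interference-aware idea can be sketched as follows (a toy illustration: the task names, solo bandwidths, and the stub interference model below are hypothetical stand-ins for the learned neural predictor in the actual scheduler):

```python
from itertools import combinations

def predicted_bandwidth(solo_bw_a, solo_bw_b):
    # Stub for the learned model: co-running I/O tasks each lose a
    # fraction of their solo bandwidth to interference (assumed 30%).
    interference = 0.3
    return (solo_bw_a + solo_bw_b) * (1 - interference)

# Hypothetical pending I/O tasks and their solo bandwidths in GB/s.
tasks = {"read_ckpt": 5.0, "write_log": 1.0, "read_data": 4.0}

# Greedily pick the pair of tasks whose co-run maximizes predicted
# aggregate bandwidth; an RL policy would learn this choice instead.
best = max(combinations(tasks, 2),
           key=lambda p: predicted_bandwidth(tasks[p[0]], tasks[p[1]]))
```

The real scheduler replaces the stub with a neural network trained on observed bandwidths and makes the pairing decision with a learned policy rather than exhaustive search.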

📑Selected Publications [full list]

    (*) denotes equal contribution, (†) denotes corresponding author

    2025

    1. IPDPS
      Large scale finite-temperature rt-TDDFT simulation with hybrid functional
      Rongrong Liu, Zhuoqiang Guo, Qiuchen Sha, Tong Zhao, Haibo Li, Wei Hu, Lijun Liu, Guangming Tan, and Weile Jia
      In IEEE International Parallel and Distributed Processing Symposium, 2025
    2. JCAM
      Backward error analysis of the Lanczos bidiagonalization with reorthogonalization
      Haibo Li, Guangming Tan, and Tong Zhao
      Journal of Computational and Applied Mathematics, 2025

    2024

    1. JCST
      10-million atoms simulation of first-principle package LS3DF
      Yujin Yan, Haibo Li, Tong Zhao, Lin-Wang Wang, Lin Shi, Tao Liu, Guangming Tan, Weile Jia, and Ninghui Sun
      Journal of Computer Science and Technology, 2024
    2. PPoPP
      Training one DeePMD model in minutes: A step towards online learning
      Siyu Hu, Tong Zhao, Qiuchen Sha, Enji Li, Xiangyu Meng, Lijun Liu, Lin-Wang Wang, Guangming Tan, and Weile Jia
      In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024
    3. EuroPar
      Accelerating large-scale sparse LU factorization for RF circuit simulation
      Guofeng Feng, Hongyu Wang, Zhuoqiang Guo, Mingzhen Li, Tong Zhao, Zhou Jin, Weile Jia, Guangming Tan, and Ninghui Sun
      In European Conference on Parallel Processing, 2024

    2023

    1. SC
      Enhance the strong scaling of LAMMPS on Fugaku
      Jianxiong Li, Tong Zhao, Zhuoqiang Guo, Shunchen Shi, Lijun Liu, Guangming Tan, Weile Jia, Guojun Yuan, and Zhan Wang
      In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023
    2. AAAI (Oral)
      RLEKF: An optimizer for deep potential with ab initio accuracy
      Siyu Hu*, Wentao Zhang*, Qiuchen Sha, Feng Pan, Lin-Wang Wang, Weile Jia, Guangming Tan, and Tong Zhao
      In Proceedings of the AAAI Conference on Artificial Intelligence, 2023

    2022

    1. CAM
      Limits of one-dimensional interacting particle systems with two-scale interaction
      Tong Zhao
      Chinese Annals of Mathematics, Series B, 2022