A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization

Yuchen Zhu (Georgia Institute of Technology), Yufeng Zhang (Northwestern University), Zhaoran Wang (Northwestern University), Zhuoran Yang (Yale University), Xiaohong Chen (Yale University).

Abstract

This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks. In particular, we consider the minimax optimization problem stemming from estimating linear functional equations defined by conditional expectations, where the objective functions are quadratic in the functional spaces. We address (i) the convergence of the stochastic gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. We establish convergence under the mean-field regime by considering the continuous-time and infinite-width limit of the optimization dynamics. Under this regime, the stochastic gradient descent-ascent corresponds to a Wasserstein gradient flow over the space of probability measures defined over the space of neural network parameters. We prove that the Wasserstein gradient flow converges globally to a stationary point of the minimax objective at an $\mathcal{O}(T^{-1}+\alpha^{-1})$ sublinear rate, and additionally finds the solution to the functional equation when the regularizer of the minimax objective is strongly convex. Here $T$ denotes the time and $\alpha$ is a scaling parameter of the neural networks. In terms of representation learning, our results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $\mathcal{O}(\alpha^{-1})$, measured in terms of the Wasserstein distance. Finally, we apply our general results to concrete examples including policy evaluation, nonparametric instrumental variable regression, asset pricing, and adversarial Riesz representer estimation.

1 Introduction

Minimax optimization problems are ubiquitous in machine learning, statistics, economics, and other fields. Examples include generative adversarial networks (GANs) (Goodfellow et al., 2020; Salimans et al., 2016), adversarial training (Ganin et al., 2016; Madry et al., 2017), robust optimization (Ben-Tal et al., 2009; Levy et al., 2020), and zero-sum games (Xie et al., 2020b; Zhao et al., 2022). The goal in minimax optimization is to find a solution $(f^{*},g^{*})$ to the problem $\min_{f\in\mathcal{F}}\max_{g\in\mathcal{G}}\mathcal{L}(f,g)$, where $\mathcal{L}$ is a bivariate objective function, and $\mathcal{F}$ and $\mathcal{G}$ are the feasible sets of the decision variables $f$ and $g$. In modern machine learning applications, $\mathcal{F}$ and $\mathcal{G}$ are often function classes flexibly parameterized by neural networks, and the objective $\mathcal{L}(f,g)$ can be approximated using data. The minimax optimization problem is often solved using first-order optimization algorithms. Despite their huge success in diverse applications, there is not yet a global convergence theory for the popular first-order algorithms that solve general minimax optimization problems using neural networks.

In this work, we study the convergence of first-order algorithms for solving minimax optimization problems where $\mathcal{F}$ and $\mathcal{G}$ are both flexibly parameterized by two-layer neural networks, and the objective functional is quadratic in $f$ and $g$ up to regularization:

$$
\min_{f\in\mathcal{F}}\max_{g\in\mathcal{G}}\mathcal{L}(f,g),\qquad\mathcal{L}(f,g)=\mathbb{E}\bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\mathtt{Reg}(f)\bigr],
\tag{1.1}
$$

where $\mathtt{Reg}(f)$ is a convex regularizer that penalizes the complexity of $f\in\mathcal{F}$. Here the expectation is taken with respect to the joint distribution of random variables $(X,Z)$, $g$ is a function of $Z$, and $\Phi$ takes $(X,Z)$ and a function $f$ as its input and is linear in $f$. The objective function (1.1) arises from solving a linear functional conditional moment equation of the form $\mathbb{E}[\Phi(X,Z;f)\,|\,Z=\cdot]=0$, which holds if and only if $f=f^{*}\in\mathcal{F}$. Here $X$ is a vector containing all the endogenous variables and $Z$ contains all the exogenous (predetermined) variables. This problem has ample applications, including policy evaluation (Cai et al., 2019; Duan et al., 2020; Jin et al., 2021; Chen and Qi, 2022; Ramprasad et al., 2022), nonparametric instrumental variable regression (Blundell et al., 2007; Chen and Pouzo, 2012; Chen and Christensen, 2018; Xu et al., 2020), and asset pricing (Chen and Ludvigson, 2009; Chen et al., 2014, 2024). The minimax objective in (1.1) arises when we solve the conditional moment equation via adversarial estimation (Uehara et al., 2020; Duan et al., 2021; Chernozhukov et al., 2020; Liao et al., 2020; Wai et al., 2020; Bennett et al., 2019), which introduces a dual function and transforms equation solving into a minimax optimization.

We study the infinite-dimensional minimax optimization in (1.1) over the space of overparameterized two-layer neural networks. Specifically, a neural network is represented by $f_{\mathtt{NN}}(\cdot;\bm{\theta})=\alpha/N\sum_{i=1}^{N}\phi(\cdot;\theta^{i})$, where $N$ is the number of neurons, $\phi(\cdot;\theta^{i})$ denotes the $i$-th neuron, $\{\theta^{i}\}_{i\in[N]}$ are the network parameters, and $\alpha$ is a scaling parameter. We aim to solve the minimax optimization in (1.1) where both $f$ and $g$ are represented by overparameterized two-layer neural networks, which is favorable especially when $Z$ is a high-dimensional vector. To solve this problem, we consider arguably the simplest first-order algorithm, stochastic gradient descent-ascent (SGDA), where the parameters of $f$ and $g$ are simultaneously updated using stochastic gradients of the objective functional. Specifically, we aim to address the following two questions:

• Does SGDA with overparameterized neural networks converge to some solution?
• Does SGDA learn data-dependent features that yield a statistically accurate solution?

Answering these questions involves two intricate challenges in terms of optimization and representation learning using neural networks. First, the minimax objective is nonconvex-nonconcave with respect to the neural network parameters of $f$ and $g$, so it is unclear whether first-order algorithms converge. Second, the representation of the neural network evolves during the course of optimization, and it is unclear how to track and assess the data-dependent features learned by the neural networks. While there are some existing works on neural network optimization using the technique of neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2018; Cai et al., 2019; Xu and Gu, 2020; Wang et al., 2022), such an approach suggests that the feature representation of the neural networks is fixed throughout training and is only determined by the initialization of the network parameters. Despite being an elegant theoretical framework, the NTK approach is limited in its ability to capture the representation learning aspect of neural network optimization. To show that the neural network optimization algorithms learn useful data-dependent features, we need to go beyond establishing convergence and show that (i) the algorithm approximately finds a proper solution concept, e.g., a stationary point or a local or global optimizer of the minimax objective function, and (ii) the representation of the neural networks moves from the initialization by a considerable amount.

In this paper, we tackle both challenges by leveraging the framework of mean-field analysis of overparameterized neural networks (Chizat and Bach, 2018; Mei et al., 2018, 2019; Zhang et al., 2020; Lu et al., 2020b; Zhang et al., 2021b; Sirignano and Spiliopoulos, 2020b, a, 2022; Chen et al., 2020b; Fang et al., 2021b). In particular, we focus on the continuous-time and infinite-width limit of the SGDA algorithm, where the stepsize goes to zero and the width $N$ goes to infinity. From the mean-field lens, a neural network $f(\cdot;\bm{\theta})$ can be identified with a probability measure $\mu$ by writing $f(\cdot;\bm{\theta})=\alpha\cdot\int_{\theta}\phi(\cdot;\theta)\,\mu(\mathrm{d}\theta)$, where $\mu$ is the empirical distribution of $\{\theta^{i}\}_{i\in[N]}$ and $\alpha$ is the scaling parameter of the neural network. Thus, parameter updates of SGDA can be regarded as updates of the probability measure $\mu$. From this perspective, we prove that in the continuous-time and infinite-width limit, SGDA corresponds to a gradient flow of the minimax objective $\mathcal{L}$ in the Wasserstein space, i.e., the space of probability measures over the parameter space equipped with the Wasserstein-2 distance. Besides, by defining a proper potential function that characterizes the stationary point of the minimax objective, we prove that the Wasserstein gradient flow converges to a stationary point at a sublinear rate of $\mathcal{O}(1/T+1/\alpha)$, where $T$ is the time horizon and $\alpha$ is a scaling parameter of the neural network. Moreover, we prove that the Wasserstein distance between the parameter distribution found by SGDA and its initialization is $\mathcal{O}(\alpha^{-1})$, which shows that the representation of the neural networks is allowed to move from the initialization by a considerable amount. Such a behavior is not captured by the NTK analysis, in which the representation is shown to be fixed at the initialization. Furthermore, when the regularization on $f$ satisfies a version of strong convexity, we prove that the Wasserstein gradient flow converges to the global optimizer $f^{*}$ at a sublinear $\mathcal{O}(1/T+1/\alpha)$ rate.
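To make the mean-field identification concrete, the following minimal sketch evaluates a finite-width network both as the scaled sum $\alpha/N\sum_{i}\phi(\cdot;\theta^{i})$ and as $\alpha$ times an integral against the empirical measure $\mu$. The neuron form $\phi(x;\theta)=b\tanh(w^{\top}x)$, the Gaussian initialization, and the sizes `d`, `N`, `alpha` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def neuron(x, theta):
    """Hypothetical neuron phi(x; theta) = b * tanh(w . x), with theta = (w, b)."""
    w, b = theta[:-1], theta[-1]
    return b * np.tanh(x @ w)

def f_nn(x, thetas, alpha):
    """Finite-width network f_NN(x) = (alpha / N) * sum_i phi(x; theta_i)."""
    n = thetas.shape[0]
    return alpha / n * sum(neuron(x, th) for th in thetas)

def f_mean_field(x, thetas, alpha):
    """Same network written as alpha * integral of phi(x; theta) d mu(theta),
    where mu is the empirical distribution of the N parameter vectors."""
    n = thetas.shape[0]
    weights = np.full(n, 1.0 / n)                        # mu puts mass 1/N on each theta_i
    values = np.stack([neuron(x, th) for th in thetas])  # phi(x; theta_i) for each atom
    return alpha * (weights @ values)

rng = np.random.default_rng(0)
d, N, alpha = 3, 512, 10.0
thetas = rng.normal(size=(N, d + 1))   # illustrative initialization theta_i ~ N(0, I)
x = rng.normal(size=(5, d))            # five input points

out_sum = f_nn(x, thetas, alpha)
out_int = f_mean_field(x, thetas, alpha)
```

The two evaluations agree up to floating-point error, which is all the identification claims at finite width; the analysis in the paper concerns the limit in which $\mu$ becomes a general probability measure.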

To the best of our knowledge, our work provides the first theoretical analysis of an optimization algorithm solving functional conditional moment equations using neural networks with representation learning. We apply our general theory to four important examples: policy evaluation, nonparametric instrumental variables regression, asset pricing, and adversarial Riesz representer estimation. In these examples, we prove that the SGDA algorithm finds the global solution with overparameterized neural networks. Moreover, SGDA learns data-dependent features that enable these statistically accurate estimators.

1.1 Related Works

Minimax Optimization. Our work is closely related to the literature on first-order methods for solving minimax optimization problems. These works establish the convergence rate or iteration complexity of first-order methods under various assumptions on the objective function. In particular, most of the existing works focus on finite-dimensional parameter spaces and one of the following objective functions: (i) convex-concave (Lin et al., 2020b; Ibrahim et al., 2019; Ouyang and Xu, 2021; Alkousa et al., 2019; Luo et al., 2021; Xie et al., 2020a; Han et al., 2024; Li et al., 2023; Jin et al., 2022), (ii) nonconvex-concave (Jin et al., 2019; Lin et al., 2020a; Lu et al., 2020a; Ostrovskii et al., 2021b; Zhao, 2023; Huang et al., 2022; Luo et al., 2020; Zhang et al., 2021a; Nouiehed et al., 2019; Thekumparampil et al., 2019), and (iii) nonconvex-nonconcave (Li et al., 2022; Diakonikolas et al., 2021; Ostrovskii et al., 2021a; Yang et al., 2022; Grimmer et al., 2022; Hajizadeh et al., 2024; Grimmer et al., 2023; Yang et al., 2020).

Our work can be viewed as an extension of convex-concave minimax optimization to the infinite-dimensional functional space. In particular, our objective is a regularized quadratic functional with respect to the input functions, which is then restricted to the class of overparameterized neural networks. Note that the objective of interest is in fact nonconvex-nonconcave in the neural network parameter space. Compared with the work on general nonconvex-nonconcave minimax optimization problems, our setting has a better underlying structure in the functional space in terms of convexity. This structure enables us to lift the network parameter updates to the Wasserstein space and analyze the gradient flow in the space of distributions. Our approach leverages the hidden convexity-concavity behind the seemingly nonconvex-nonconcave objective function and thus achieves better results in terms of algorithm convergence and complexity.

Mean-field Analysis in Deep Learning. Our work is closely related to the recent study of neural network training via gradient-based methods. One line of research establishes the convergence of gradient-based algorithms for training overparameterized neural networks under the “lazy training” regime, where the neural networks behave similarly to random kernel functions. Such a regime is also known as the neural tangent kernel regime (Jacot et al., 2018; Allen-Zhu et al., 2019a, b; Chen et al., 2020a; Frei and Gu, 2021; Zou and Gu, 2019; Du et al., 2018, 2019; Arora et al., 2019a, b; Huang and Yau, 2020). Our work is more related to another line of research based on the perspective of mean-field approximation (Mei et al., 2018, 2019; Chizat and Bach, 2018; Sirignano and Spiliopoulos, 2020b, a, 2022; Chen et al., 2020b; Fang et al., 2021b; Chen et al., 2019). Under the mean-field view, the neural network parameters are identified with a distribution over the parameter space. As a result, the evolution of parameters by gradient-based updates is captured by a differential equation that governs the evolution of the corresponding distribution. By elevating the training dynamics to an infinite-dimensional Wasserstein space, the optimization objective often enjoys a benign landscape, which admits a more tractable analysis and global convergence. See, e.g., Zhang et al. (2020, 2021b); Fang et al. (2021b); Lu et al. (2020b); Fang et al. (2019); Chizat (2022); Hu et al. (2021); Nitanda et al. (2022) and the references therein. Also, see Fang et al. (2021a) for a recent survey.

Our work is especially related to the mean-field analysis of the Neural Temporal Difference (TD) (Zhang et al., 2020) and the Neural Actor-Critic (AC) (Zhang et al., 2021b) in reinforcement learning. These previous works analyze the global convergence of the TD and AC algorithms with overparameterized two-layer neural networks. The optimization problem in these two tasks is the minimization of an objective where only one neural network is involved. In contrast, we focus on minimax optimization, which requires neural network parameterization of both the primal function and the dual function. This brings new challenges to the analysis, as the gradient dynamics of the primal and dual neural networks give rise to a coupled system of PDEs. To the best of our knowledge, our paper is the first to apply the mean-field limit to study the convergence of algorithms in solving the general form of functional conditional moment equations using neural networks.

Adversarial Estimation. Our work is also related to the literature on adversarial estimation, a method that solves a functional conditional moment equation by introducing a dual function and reformulating the original problem into a minimax optimization. Our work studies this type of minimax optimization with overparameterized neural networks. Thus, our work is more related to the study of adversarial estimation within neural network function classes (Dikkala et al., 2020; Chernozhukov et al., 2020; Bennett et al., 2019; Xu et al., 2021). Compared with our work, these studies focus on statistical errors pertinent to neural networks, assuming the optimization problem is solved perfectly. We instead study the optimization algorithm and establish the convergence of stochastic gradient descent-ascent with neural networks.

Several previous works have also explored the convergence of optimization dynamics in adversarial estimation with neural networks. In particular, Neural GTD (Wai et al., 2020) and Neural SEM (Liao et al., 2020) analyze the convergence for off-policy evaluation and structural equation model estimation, respectively, with overparameterized two-layer neural networks. However, their analyses are based on the neural tangent kernel (NTK), where the employed neural network has a fixed representation during training, and the representation is completely determined by the initialization. In contrast, our work adopts the mean-field approach, which enables learning a data-dependent representation.

2 Preliminaries

The functional conditional moment equations cover many important examples in statistics, machine learning, economics, and causal inference. In this section, we first introduce the general formulation of the functional conditional moment equations and then reformulate them into a minimax optimization problem. Then, we present a few concrete examples of functional conditional moment equations such as policy evaluation, nonparametric instrumental variables regression, asset pricing, and Riesz representer estimation. Finally, we introduce the background of mean-field neural networks and Wasserstein space, which are essential for the convergence analysis of the SGDA algorithm.

2.1 Functional Conditional Moment Equations

In this section, we introduce the general formulation of functional conditional moment equations. Let $X\in\mathcal{X}$ be a vector that includes all the endogenous variables, let $Z\in\mathcal{Z}$ denote all the exogenous variables, and let $\mathcal{D}\in\mathscr{P}(\mathcal{X}\times\mathcal{Z})$ denote the joint distribution of $(X,Z)$ . We let $\mathbb{E}_{\mathcal{D}}[\cdot]$ denote the expectation taken with respect to the joint distribution of $(X,Z)$ and $\mathbb{E}_{X|Z}[\cdot]$ denote the conditional expectation using the conditional distribution of $X$ given $Z$ . Let $W\in\mathcal{W}\subseteq\mathcal{X}\times\mathcal{Z}$ be a subset of variables that may contain both the endogenous and exogenous variables, and let $L^{2}(\mathcal{W})$ denote a Hilbert space of measurable functions of $W$ with finite second moment. Let $\mathcal{F}:=\{f:\mathcal{W}\rightarrow\mathbb{R}\}\subset L^{2}(\mathcal{W})$ denote a class of functions defined on $\mathcal{W}$ . In a functional conditional moment equation problem, we aim to find a function $f_{0}\in\mathcal{F}$ that solves the following functional equation involving the conditional distribution of $X$ given $Z$ over $\mathcal{F}$ :

$$
\mathbb{E}_{X|Z}\bigl[\Phi(X,Z;f_{0})\,\big|\,Z=z\bigr]=0,\qquad\forall z\in\mathcal{Z},
\tag{2.1}
$$

where $\Phi\colon\mathcal{X}\times\mathcal{Z}\times\mathcal{F}\rightarrow\mathbb{R}$ is a known functional.

For any function $f\in\mathcal{F}$ and any $z\in\mathcal{Z}$ , we define a functional $\bar{\delta}\colon\mathcal{Z}\times\mathcal{F}\rightarrow\mathbb{R}$ as

$$
\bar{\delta}(z;f):=\mathbb{E}_{X|Z}\bigl[\Phi(X,Z;f)\,\big|\,Z=z\bigr],\qquad\forall f\in\mathcal{F},\ z\in\mathcal{Z}.
\tag{2.2}
$$

In other words, the conditional moment equation problem in (2.1) boils down to finding a function $f_{0}\in\mathcal{F}$ such that $\bar{\delta}(\cdot;f_{0})$ is the zero function on $\mathcal{Z}$. Therefore, an equivalent way to solve for $f_{0}\in\mathcal{F}$ in (2.1) is to solve $\inf_{f\in\mathcal{F}}\mathbb{E}[(\bar{\delta}(Z;f))^{2}]$ (Ai and Chen, 2003; Chen and Pouzo, 2012). To control the complexity of the function class $\mathcal{F}$, Ai and Chen (2003) propose to use flexible sieve spaces $\mathcal{F}_{k(n)}$ that become dense in $\mathcal{F}$ as the sieve dimension $k(n)$ grows to infinity with the data sample size $n$, and propose the so-called sieve minimum distance criterion $\min_{f\in\mathcal{F}_{k(n)}}\mathbb{E}\bigl[\bar{\delta}(Z;f)^{2}\bigr]/2$. In particular, Ai and Chen (2003) allow for two-layer NNs, splines, wavelets, Fourier series, and all kinds of polynomial sieves $\mathcal{F}_{k(n)}$ to approximate functions in $\mathcal{F}\subseteq L^{2}(\mathcal{W})$. Alternatively, Chen and Pouzo (2012) propose the following penalized (or regularized) minimum distance criterion:

$$
\min_{f\in\mathcal{F}}\mathbb{E}\bigl[\bar{\delta}(Z;f)^{2}\bigr]/2+\lambda\cdot\mathcal{R}(f),
\tag{2.3}
$$

where $\lambda\geq 0$ is a regularization parameter and $\mathcal{R}(f)$ is a regularizer on the function $f\in\mathcal{F}$. They allow $\mathcal{R}(f)$ to be any convex or lower-semicompact regularizer. In the minimum distance approach, for any fixed $f$, the authors first estimate $\bar{\delta}(z;f)$ by the following least squares criterion:

$$
\mathop{\mathrm{argmin}}_{\delta\in L^{2}(\mathcal{Z})}\mathbb{E}\Bigl[1/2\cdot\bigl(\Phi(X,Z;f)-\delta(Z)\bigr)^{2}\Bigr]=\mathop{\mathrm{argmax}}_{\delta\in L^{2}(\mathcal{Z})}\mathbb{E}\Bigl[\Phi(X,Z;f)\delta(Z)-1/2\cdot\delta(Z)^{2}\Bigr].
$$
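The equality of the argmin and the argmax above can be checked by expanding the square. The short derivation below is a sketch that assumes $\Phi(X,Z;f)$ has a finite second moment, so that all expectations are well defined; the first term on the right does not depend on $\delta$, and the tower property identifies the common solution as $\bar{\delta}(\cdot;f)$ from (2.2).

```latex
\begin{align*}
\mathbb{E}\Bigl[\tfrac{1}{2}\bigl(\Phi(X,Z;f)-\delta(Z)\bigr)^{2}\Bigr]
&=\tfrac{1}{2}\,\mathbb{E}\bigl[\Phi(X,Z;f)^{2}\bigr]
-\mathbb{E}\Bigl[\Phi(X,Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr],\\
\mathbb{E}\Bigl[\Phi(X,Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr]
&=\mathbb{E}\Bigl[\bar{\delta}(Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr]
\quad\Longrightarrow\quad
\delta^{\star}(z)=\bar{\delta}(z;f).
\end{align*}
```

The last implication follows by maximizing the concave quadratic $\bar{\delta}(z;f)\,\delta-\delta^{2}/2$ pointwise in $z$.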

Furthermore, we assume that the functional $\Phi$ is affine in $f$, which captures several important applications in machine learning and causal inference listed in Section 2.2. Specifically, we define $\widetilde{\Phi}(x,z;f)=\Phi(x,z;f)-\Phi(x,z;0)$, where $0$ stands for the zero function on $\mathcal{W}$. Then for any two functions $f_{1},f_{2}\in\mathcal{F}$ and any $a,b\in\mathbb{R}$, we have

$$
\widetilde{\Phi}(x,z;af_{1}+bf_{2})=a\cdot\widetilde{\Phi}(x,z;f_{1})+b\cdot\widetilde{\Phi}(x,z;f_{2}),\qquad\forall(x,z)\in\mathcal{X}\times\mathcal{Z}.
\tag{2.4}
$$
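As a quick illustration of the affine condition, take the policy evaluation functional $\Phi(x,z;f)=r+\gamma\cdot f(x)-f(z)$ from Section 2.2. The following check is only a worked instance of (2.4) for that one example, not a general argument.

```latex
\begin{align*}
\Phi(x,z;0)&=r,\qquad
\widetilde{\Phi}(x,z;f)=\Phi(x,z;f)-\Phi(x,z;0)=\gamma\cdot f(x)-f(z),\\
\widetilde{\Phi}(x,z;af_{1}+bf_{2})
&=\gamma\bigl(af_{1}(x)+bf_{2}(x)\bigr)-\bigl(af_{1}(z)+bf_{2}(z)\bigr)
=a\cdot\widetilde{\Phi}(x,z;f_{1})+b\cdot\widetilde{\Phi}(x,z;f_{2}).
\end{align*}
```

The constant term $r$ is what makes $\Phi$ affine rather than linear; subtracting $\Phi(x,z;0)$ removes it.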

Solving (2.1) with Overparameterized Neural Networks. In the sequel, we aim to solve the problem in (2.1) based on i.i.d. data points sampled from $\mathcal{D}$ , with $\mathcal{F}$ being a class of overparameterized neural networks. In this case, it is possible that (2.1) does not have a solution within $\mathcal{F}$ . Furthermore, for the choice of regularizer, we consider the following specific form of $\mathcal{R}(f)$ :

$$
\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[\Psi(X,Z;f)],
\tag{2.5}
$$

where for any given $(x,z)\in\mathcal{X}\times\mathcal{Z}$, $\Psi(x,z;\cdot):\mathcal{F}\rightarrow\mathbb{R}_{+}$ is a convex functional that maps each function $f$ to a scalar. Moreover, $\Psi$ satisfies

\begin{align}
\Psi(x,z;0)&=0,\qquad\Psi(x,z;f)\geq 0,\qquad\forall f\in\mathcal{F}, \tag{2.6}\\
\frac{\delta\Psi(x,z;af_{1}+bf_{2})}{\delta f}&=a\cdot\frac{\delta\Psi(x,z;f_{1})}{\delta f}+b\cdot\frac{\delta\Psi(x,z;f_{2})}{\delta f},\qquad\forall f_{1},f_{2}\in\mathcal{F},\;a,b\in\mathbb{R}. \tag{2.7}
\end{align}

Equation (2.6) requires that $\Psi(X,Z;f)$ is a non-negative functional of $f$ that is equal to $0$ if and only if $f=0$. Equation (2.7) requires that the functional derivative of $\Psi(X,Z;f)$ with respect to $f$ is linear in $f$. One example of $\Psi$ is the $L_{2}$-regularizer of the following type, $\Psi(x,z;f)=f(w)^{2}$. Here $w\in\mathcal{W}$ is a subset of variables that may contain values from both the endogenous variables $x$ and the exogenous variables $z$.
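For the $L_{2}$-type example $\Psi(x,z;f)=f(w)^{2}$, conditions (2.6) and (2.7) can be verified directly. The sketch below interprets the functional derivative as the derivative with respect to the value $f(w)$, which is an assumption about the notation rather than a definition stated in the text.

```latex
\begin{align*}
\Psi(x,z;0)&=0^{2}=0,\qquad \Psi(x,z;f)=f(w)^{2}\geq 0,\\
\frac{\delta\Psi(x,z;f)}{\delta f}&=2f(w),\qquad
\frac{\delta\Psi(x,z;af_{1}+bf_{2})}{\delta f}
=2\bigl(af_{1}(w)+bf_{2}(w)\bigr)
=a\cdot\frac{\delta\Psi(x,z;f_{1})}{\delta f}+b\cdot\frac{\delta\Psi(x,z;f_{2})}{\delta f}.
\end{align*}
```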

Minimax Estimation. To solve the optimization problem in (2.3), we first transform it into an unconditional moment formulation by introducing a dual function. By Fenchel duality, we can rewrite the objective function of (2.3), denoted by $J(f)$, as follows,

\begin{align}
J(f)&=\mathbb{E}_{\mathcal{D}}\Bigl[1/2\cdot\bar{\delta}(Z;f)^{2}+\lambda\Psi(X,Z;f)\Bigr] \tag{2.8}\\
&=\mathbb{E}_{\mathcal{D}}\Bigl[\max_{g:\mathcal{Z}\rightarrow\mathbb{R}}\Bigl(g(Z)\cdot\mathbb{E}\bigl[\Phi(X,Z;f)\,\big|\,Z\bigr]-1/2\cdot g(Z)^{2}\Bigr)+\lambda\Psi(X,Z;f)\Bigr] \notag\\
&=\max_{g:\mathcal{Z}\rightarrow\mathbb{R}}\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr]. \notag
\end{align}

The formulation in (2.8) leads to the following minimax optimization problem:

$$
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
\tag{2.9}
$$
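The Fenchel duality step rests on the pointwise identity $\max_{g}\{g\cdot d-g^{2}/2\}=d^{2}/2$, attained at $g=d$. The following sketch checks this numerically by brute force over a grid; the residual values in `d_values` and the grid range are arbitrary illustrative choices.

```python
import numpy as np

def inner_objective(g, d):
    """Pointwise dual objective g * d - g^2 / 2 for a fixed residual value d."""
    return g * d - 0.5 * g ** 2

d_values = np.array([-2.0, -0.5, 0.0, 1.5])   # illustrative values of delta_bar(z; f)
g_grid = np.linspace(-5.0, 5.0, 100001)       # dense grid standing in for all values g(z)

# Maximize over g for each d by brute force on the grid.
vals = inner_objective(g_grid[None, :], d_values[:, None])
max_vals = vals.max(axis=1)                   # approximately d^2 / 2
argmax_g = g_grid[vals.argmax(axis=1)]        # approximately d
```

Since the maximizer equals the residual itself, the optimal dual function for a fixed $f$ is $g(z)=\bar{\delta}(z;f)$, and substituting it back recovers the squared-residual term $\bar{\delta}(Z;f)^{2}/2$.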

We note that $\mathcal{L}$ is a convex-concave functional with respect to the functions $f$ and $g$. We denote by $(f^{*},g^{*})$ the unique saddle point of (2.9). Here the uniqueness of $f^{*}$ comes from the convexity of the regularization $\Psi(X,Z;f)$, and $g^{*}(z)=\mathbb{E}[\Phi(X,Z;f^{*})\,|\,Z=z]$ implies the uniqueness of $g^{*}$. Without the regularization, i.e., $\lambda=0$, the saddle point of (2.9) is $f^{*}=f_{0}$ and $g^{*}=0$.

2.2 Examples of Functional Conditional Moment Equation

In this section, we discuss several important applications of the functional conditional moment equation, which serve as running examples of this paper.

Policy Evaluation. We consider a Markov decision process given by $({\mathcal{S}},\mathcal{A},\mathcal{P},r,\gamma)$, where ${\mathcal{S}}\subseteq\mathbb{R}^{d}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathscr{P}({\mathcal{S}})$ is the transition kernel, $r:{\mathcal{S}}\times\mathcal{A}\rightarrow[0,1]$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. Given a policy $\pi:{\mathcal{S}}\rightarrow\mathscr{P}(\mathcal{A})$, an agent interacts with the environment in the following manner. At a state $s_{t}$, the agent takes an action $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and receives a reward $r_{t}=r(s_{t},a_{t})$. Then, the agent transitions to the next state $s_{t+1}\sim\mathcal{P}(\cdot\,|\,s_{t},a_{t})$. We denote the transition kernel induced by policy $\pi$ by $\mathcal{P}^{\pi}(s^{\prime}\,|\,s)=\int_{\mathcal{A}}\mathcal{P}(s^{\prime}\,|\,s,a)\pi(a\,|\,s)\mathrm{d}a$ for any $s,s^{\prime}\in{\mathcal{S}}$. In policy evaluation, we aim to estimate the value function $V^{\pi}:{\mathcal{S}}\rightarrow\mathbb{R}$ defined as follows,

$$
V^{\pi}(s)=\mathbb{E}_{\pi}\Bigl[\sum_{i=0}^{\infty}\gamma^{i}r(s_{i},a_{i})\,\Big|\,s_{0}=s\Bigr],
$$

where the expectation $\mathbb{E}_{\pi}$ is taken with respect to $a_{t}\sim\pi(\cdot{\,|\,}s_{t})$ and $s_{t+1}\sim\mathcal{P}(\cdot{\,|\,}s_{t},a_{t})$ for $t\geq 0$ . By the Bellman equation (Sutton and Barto, 2018), it holds for any $s\in{\mathcal{S}}$ that

$$
V^{\pi}(s)-{\mathcal{T}}^{\pi}V^{\pi}(s)=0,\quad{\mathcal{T}}^{\pi}f(s)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\bigl[r(s,a)\bigr]+\gamma\mathbb{E}_{s^{\prime}\sim\mathcal{P}^{\pi}(\cdot\,|\,s)}\bigl[f(s^{\prime})\bigr].
\tag{2.10}
$$

Corresponding to the Bellman equation in (2.10), let $\mathcal{D}$ denote the joint distribution of the tuple $(s,a,s^{\prime})$ under policy $\pi$. Then the value function $V^{\pi}$ satisfies the following functional conditional moment equation,

$$
\mathbb{E}_{s^{\prime}|s}\Bigl[r(s,a)-V^{\pi}(s)+\gamma\cdot V^{\pi}(s^{\prime})\,\Big|\,s\Bigr]=0.
\tag{2.11}
$$

We notice that (2.11) is a special case of the functional conditional moment equation in (2.1) by setting the exogenous variable $Z$ to be the current state $s$ , the endogenous variable $X$ to be the next state $s^{\prime}$ and the function to be estimated $f:{\mathcal{S}}\rightarrow\mathbb{R}$ to be defined on the state space ${\mathcal{S}}$ . In this case, the functional is $\Phi(X,Z;f)=r+\gamma\cdot f(X)-f(Z)$ , where $r$ is the reward function. We remark that the reason function $f$ can be evaluated simultaneously on $X$ and $Z$ is that both $X$ and $Z$ are variables defined on ${\mathcal{S}}$ . Following the same derivation of (2.8), policy evaluation can be formulated as the following minimax optimization problem,

$$
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\bigl(r+\gamma\cdot f(X)-f(Z)\bigr)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
$$
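To see the conditional moment equation (2.11) in the simplest possible setting, the sketch below builds a small tabular Markov chain, solves the Bellman equation exactly, and evaluates the conditional moment $\mathbb{E}[r+\gamma f(s^{\prime})-f(s)\,|\,s]$ at the solution. The four-state chain, the random transition matrix `P_pi`, and the reward vector `r_pi` are hypothetical and serve only as an illustration; the paper itself works with continuous state spaces and neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 4, 0.9

# Hypothetical policy-induced transition matrix P_pi[s, s'] and expected reward r_pi[s].
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # normalize so each row sums to one
r_pi = rng.random(n_states)               # expected rewards in [0, 1)

# Solve the Bellman equation V = r_pi + gamma * P_pi V exactly.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def delta_bar(f):
    """Conditional moment E[r + gamma * f(s') - f(s) | s], one entry per state s."""
    return r_pi + gamma * (P_pi @ f) - f

residual_at_V = delta_bar(V)                        # vanishes at the value function
residual_at_zero = delta_bar(np.zeros(n_states))    # equals r_pi, generally nonzero
```

The residual is zero at every state only for $f=V^{\pi}$, which is exactly the statement that $V^{\pi}$ solves the functional conditional moment equation with $Z=s$ and $X=s^{\prime}$.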

Nonparametric Instrumental Variables Regression. The nonparametric instrumental variables (NPIV) model is common and useful in statistics and economics. The model is described by the equations

$$
Y=f_{0}(X)+\varepsilon,\quad\mathbb{E}\bigl[\varepsilon\,\big|\,Z\bigr]=0,
$$

where $Y$ is an observed outcome, $X$ is the endogenous variable, $Z$ is the exogenous variable, and $f_{0}$ is the true model that characterizes the relationship between $Y$ and $X$ and is also the function we want to estimate. In this model, $\varepsilon$ is a noise term possibly correlated with the endogenous $X$ but uncorrelated with the exogenous $Z$. It is straightforward to see that the NPIV model fits into the framework of the functional conditional moment equation by plugging the model equation into the equation about $\varepsilon$,

$$
\mathbb{E}_{\mathcal{D}}\bigl[Y-f_{0}(X)\,\big|\,Z\bigr]=0.
\tag{2.12}
$$

We notice that (2.12) is a special case of the functional conditional moment equation in (2.1) by identifying $X$ and $Z$ with the endogenous and exogenous variables, respectively, and setting the functional as $\Phi(X,Z;f)=Y-f(X)$. Following the same derivation of (2.8), the problem of NPIV is equivalent to the following minimax optimization problem,

$$
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\bigl(Y-f(X)\bigr)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
$$
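The structure of this minimax problem can be illustrated with a deliberately simplified sketch: linear function classes $f(x)=a\cdot x$ and $g(z)=b\cdot z$ in place of neural networks, full-batch gradients in place of stochastic ones, and the regularizer $\Psi(x,z;f)=f(x)^{2}$. The simulated data, the step size `eta`, and the value of `lam` are assumptions made for illustration; this is not the neural SGDA algorithm analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, eta, steps = 5000, 0.1, 0.05, 5000

# Simulated NPIV data (illustrative): eps is correlated with X but independent of Z.
Z = rng.normal(size=n)
U = rng.normal(size=n)                 # unobserved confounder
X = Z + U + 0.5 * rng.normal(size=n)
eps = U + 0.5 * rng.normal(size=n)
Y = 2.0 * X + eps                      # true f_0(x) = 2 x

# Sample moments that appear in the empirical objective
# L(a, b) = b * (m_zy - a * m_zx) - 0.5 * b^2 * m_zz + lam * a^2 * m_xx.
m_zy, m_zx = np.mean(Z * Y), np.mean(Z * X)
m_zz, m_xx = np.mean(Z * Z), np.mean(X * X)

a, b = 0.0, 0.0                        # f(x) = a * x, g(z) = b * z
for _ in range(steps):
    grad_a = -b * m_zx + 2.0 * lam * a * m_xx     # dL/da
    grad_b = (m_zy - a * m_zx) - b * m_zz         # dL/db
    a, b = a - eta * grad_a, b + eta * grad_b     # simultaneous descent in a, ascent in b

# Closed-form saddle point of the same quadratic objective, for comparison.
a_star = m_zx * m_zy / (m_zx ** 2 + 2.0 * lam * m_xx * m_zz)
b_star = (m_zy - a_star * m_zx) / m_zz
```

In this toy problem the iterates reach the saddle point of the regularized sample objective. With `lam > 0` the primal solution `a_star` is shrunk toward zero relative to the true slope, and as `lam` tends to zero it reduces to the sample instrumental variables estimate `m_zy / m_zx`.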

Asset Pricing. Asset pricing refers to the process of determining the fair value of financial assets. This field is fundamental in finance and underpins much of the work in investment, portfolio management, and risk assessment. The semiparametric Consumption Capital Asset Pricing Model (CCAPM) is a foundational model in asset pricing that describes the relationship between systematic risk and expected asset returns, and also incorporates the influence of the consumption preference of investors over time. Moreover, CCAPM can be characterized through a functional conditional moment equation (Chen et al., 2014; Chen and Ludvigson, 2009). To describe the model, let $C_{t}$ denote the consumption level at time $t$, and let $c_{t}\equiv C_{t}/C_{t-1}$ denote the consumption growth. The marginal utility of consumption at time $t$ is given by $\text{MU}_{t}=C_{t}^{-\gamma_{0}}f_{0}(c_{t})$, where $\gamma_{0}>0$ is the discount factor and $f_{0}:\mathcal{C}\to\mathbb{R}$ is the nonparametric structural demand function, which is the unknown positive function of interest and is defined on $\mathcal{C}$, the space of consumption growth. The unknown function $f_{0}$ can be understood as a taste shifter that describes how the marginal utility of consumption changes with the state of the economy in terms of consumption growth.

Now, consider the growth-return tuple $(c_{t},\widetilde{r}_{t+1},c_{t+1})$ for $t\in\mathbb{N}^{+}$ with joint distribution $\mathcal{D}$, where $c_{t}$ is the consumption growth at the current time $t$, and $c_{t+1}$ is the consumption growth at the next time $t+1$. The quantity $\widetilde{r}_{t+1}$ is a modified return observed in this period, which is a known function of the actual return $r_{t+1}$ and the consumption growth $c_{t+1}$ at time $t+1$. We consider the scenario where the time series of consumption growth $\{c_{t}\}_{t\geq 0}$ follows a time-homogeneous Markov chain with a smooth transition kernel. That is, both conditional distributions $c_{t+1}|c_{t}$ and $c_{t}|c_{t+1}$ admit smooth density functions. The CCAPM model captures the behavior of $f_{0}$ through the following equation:

$$
\mathbb{E}_{c_{t+1}|c_{t}}\bigl[\widetilde{r}_{t+1}\cdot f_{0}(c_{t+1})-f_{0}(c_{t})\,\big|\,c_{t}\bigr]=0,
\tag{2.13}
$$

where the modified return can be further expressed as $\widetilde{r}_{t+1}=\delta_{0}\cdot r_{t+1}\cdot c_{t+1}^{-\gamma_{0}}$, and $\delta_{0}\in(0,1]$ is the rate of time preference. We focus on a setting where $\mathcal{C}\subseteq\mathbb{R}$ is a compact set, and the modified return $\widetilde{r}_{t+1}$ is bounded for all $t\geq 0$. We notice that (2.13) is a special case of the functional conditional moment equation in (2.1). We can identify the exogenous variable $Z$ with $c_{t}$, the consumption growth at the current time $t$, and the endogenous variable $X$ with $c_{t+1}$, the consumption growth at the next time $t+1$. In this scenario, we identify the space $\mathcal{W}$ with $\mathcal{C}$, the space of consumption growth, and the function to be estimated $f:\mathcal{C}\to\mathbb{R}$ is defined on $\mathcal{C}$. The functional is $\Phi(X,Z;f)=\widetilde{r}_{t+1}\cdot f(X)-f(Z)$, where $\widetilde{r}_{t+1}$ again denotes the modified return. Similar to the scenario of policy evaluation, the reason the function $f$ can be evaluated simultaneously on $X$ and $Z$ is that both $X$ and $Z$ are variables defined on $\mathcal{C}$. Following the same derivation of (2.8), the problem of asset pricing through CCAPM is equivalent to the following minimax optimization problem,

\displaystyle\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Big{[}g% (Z)\cdot(\widetilde{r}_{t+1}\cdot f(X)-f(Z))-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z% ;f)\Big{]}

Adversarial Riesz Representer Estimation. Many problems in statistics, causal inference, and finance involve the task of learning a continuous linear functional of the following form,

\displaystyle\mathcal{V}(g)=\mathbb{E}\bigl{[}m(V;g)\bigr{]},

(2.14)

where $g\in\mathcal{G}$ is a function $\mathcal{X}\rightarrow\mathbb{R}$ , $\mathcal{V}$ is defined on the function space $\mathcal{G}$ , and $V$ is a random vector that represents the source of randomness in the functional and of which we have access to observations. Moreover, suppose the continuous linear functional $\mathcal{V}(\cdot)$ is also mean-square continuous with respect to the $L^{2}$ norm, which is often the case. Then it can be written in a more convenient form. Formally speaking, for such a linear functional $\mathcal{V}$ , there exists a function $f_{0}$ such that for any $g\in\mathcal{G}$ ,

\displaystyle\mathcal{V}(g)=\mathbb{E}\bigl{[}f_{0}(X)g(X)\bigr{]}.

(2.15)

The function $f_{0}$ here is called the Riesz representer of the linear functional $\mathcal{V}$ , and its existence in (2.15) is guaranteed by the Riesz representation theorem. Knowledge of the Riesz representer of such a linear functional is crucial to numerous applications and learning tasks. Therefore, we aim to estimate $f_{0}$ by exploiting the relationship characterized by (2.15). We make the simple observation that the true Riesz representer $f_{0}$ can be recovered by solving the following equation,

\displaystyle\mathbb{E}\Bigl{[}f_{0}(X)-f(X){\,\big{|}\,}X\Bigr{]}=0.

(2.16)

Of course, $f=f_{0}$ solves the equation above, and therefore the true Riesz representer is recovered. We remark that this is indeed a special case, since the conditioning in (2.16) is trivial: the equation only involves the endogenous variable $X$ , which also indicates that the exogenous variable $Z$ coincides with $X$ . While special, the problem still fits in the framework discussed here. By setting $\Phi(X,Z;f)=f_{0}(X)-f(X)$ , we recover the formulation of Riesz representer estimation in (2.16).

However, unlike the previous examples where we have access to observations of each term in the equation, here we have no direct access to values of $f_{0}$ , making the problem seemingly intractable. Fortunately, the alternative formulation of the original problem as a minimax optimization problem resolves this difficulty. When written in the minimax formulation, the linear functional $\mathcal{V}$ again appears in the form of (2.14), which can be approximated using empirical values calculated from accessible observations of the random vector $V$ . Following the same derivation as for (2.8) and the definition of the Riesz representer in (2.15), the problem of adversarial Riesz representer estimation is equivalent to the following minimax optimization problem,

\displaystyle\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl{[}m(V;g)-f(X)\cdot g(X)-1/2\cdot g(X)^{2}+\lambda\Psi(X,X;f)\Bigr{]}.

(2.17)

Again, we stress that in (2.17), the absence of $Z$ is due to the fact that both the endogenous and exogenous variables are described by $X$ , and the objective is computationally tractable since we have access to observations of both $X$ and $V$ .
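
As a quick sanity check of (2.17), one can carry out the inner maximization explicitly. The sketch below assumes that the dual function $g$ ranges over a class rich enough to contain $f_{0}-f$ ; it is meant as an illustration rather than as part of the formal derivation. Using (2.15) to write $\mathbb{E}[m(V;g)]=\mathbb{E}[f_{0}(X)g(X)]$ , the terms of the objective that involve $g$ become

\displaystyle\mathbb{E}\bigl{[}\bigl{(}f_{0}(X)-f(X)\bigr{)}\cdot g(X)-1/2\cdot g(X)^{2}\bigr{]}=1/2\cdot\mathbb{E}\bigl{[}\bigl{(}f_{0}(X)-f(X)\bigr{)}^{2}\bigr{]}-1/2\cdot\mathbb{E}\bigl{[}\bigl{(}g(X)-f_{0}(X)+f(X)\bigr{)}^{2}\bigr{]},

which is maximized at $g=f_{0}-f$ . Under this assumption,

\displaystyle\max_{g}\mathcal{L}(f,g)=1/2\cdot\mathbb{E}\bigl{[}\bigl{(}f_{0}(X)-f(X)\bigr{)}^{2}\bigr{]}+\lambda\cdot\mathbb{E}\bigl{[}\Psi(X,X;f)\bigr{]},

so the adversarial objective penalizes the mean-squared distance between $f$ and $f_{0}$ even though $f_{0}$ itself is never observed.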

2.3 Mean-Field Neural Network and Wasserstein Space

In the sequel, we will consider functions in the neural network function class. Consider a neuron function defined on a given state space $\Omega$ , $\sigma:\Omega\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ , that takes an input $x\in\Omega$ and a parameter $\theta\in\mathbb{R}^{D}$ and outputs a value in $\mathbb{R}$ . For $\bm{\theta}=(\theta_{1},\dots,\theta_{N})$ where $\theta_{i}\in\mathbb{R}^{D}$ , we can define an overparameterized two-layered neural network function $h$ using the neuron function $\sigma$ ,

\displaystyle h(x;\bm{\theta})=\frac{1}{N}\sum_{i=1}^{N}\sigma(x;\theta_{i}),\quad\forall x\in\Omega.

For such a form, we can further consider the infinite-width limit $N\rightarrow\infty$ . In this limit, the neural network function $h$ becomes a mean-field neural network and can be parameterized by a probability measure over the parameter space, $\mu\in\mathscr{P}(\mathbb{R}^{D})$ ,

\displaystyle h(x;\mu)=\int_{\mathbb{R}^{D}}\sigma(x;\theta)\mathrm{d}\mu(\theta),\quad\forall x\in\Omega.

When considering such a limit, the optimization problem over the neural network function class is turned from a finite-dimensional problem over the parameter space into an infinite-dimensional problem over the space of probability measures. Therefore, we will need to track the convergence of probability measures over the Wasserstein space when analyzing the convergence of algorithms.
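
The passage from the finite-width average to the integral against $\mu$ can be illustrated numerically. The following is a minimal sketch under assumptions chosen purely for convenience and not taken from the paper: the neuron is $\sigma(x;\theta)=\cos(\theta^{\top}x)$ and $\mu=\mathcal{N}(0,I_{D})$ , for which the mean-field network has the closed form $h(x;\mu)=\exp(-\|x\|^{2}/2)$ . The function names are hypothetical.

```python
import numpy as np

def finite_width_network(x, thetas):
    """h(x; theta) = (1/N) * sum_i sigma(x; theta_i), with the assumed
    neuron sigma(x; theta) = cos(theta^T x). thetas has shape (N, D)."""
    return np.cos(thetas @ x).mean()

def mean_field_network(x):
    """Closed form of the integral of cos(theta^T x) against mu = N(0, I_D)."""
    return np.exp(-0.5 * np.dot(x, x))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.5, 0.8])
for width in (10, 1_000, 100_000):
    thetas = rng.standard_normal((width, x.size))  # theta_i drawn i.i.d. from mu
    gap = abs(finite_width_network(x, thetas) - mean_field_network(x))
    print(f"N={width:>6d}  |h(x; theta) - h(x; mu)| = {gap:.4f}")
```

The gap is a Monte Carlo error, so it is expected to shrink roughly like $N^{-1/2}$ as the width grows, although individual runs fluctuate.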

We now introduce the background knowledge of the Wasserstein space for the reader’s information. Let $\mathscr{P}_{p}(\mathbb{R}^{D})$ be the space of all the probability measures over the $D$ -dimensional Euclidean space $\mathbb{R}^{D}$ with finite $p$ -th order moments. The Wasserstein- $p$ distance between two probability measures $\mu,\nu\in\mathscr{P}_{p}(\mathbb{R}^{D})$ is defined as follows,

\displaystyle\mathcal{W}_{p}(\mu,\nu)=\inf\biggl{\{}\Bigl{(}\int\|x-y\|^{p}\mathrm{d}\gamma(x,y)\Bigr{)}^{1/p}{\,\Big{|}\,}\gamma\in\mathscr{P}_{p}(\mathbb{R}^{D}\times\mathbb{R}^{D}),x_{\sharp}\gamma=\mu,y_{\sharp}\gamma=\nu\biggr{\}},

(2.18)

where the infimum is taken over all couplings of $\mu$ and $\nu$ . Here we denote by $x_{\sharp}\gamma$ and $y_{\sharp}\gamma$ the marginal distributions of $\gamma$ with respect to $x$ and $y$ , respectively. We call $\mathcal{M}_{p}=(\mathscr{P}_{p}(\mathbb{R}^{D}),\mathcal{W}_{p})$ the Wasserstein- $p$ space. For any $1\leq p\leq q$ , due to the relation that $\mathbb{E}[|X|^{p}]^{1/p}\leq\mathbb{E}[|X|^{q}]^{1/q}$ , we have that $\mathcal{W}_{p}(\mu,\nu)\leq\mathcal{W}_{q}(\mu,\nu)$ for any two measures $\mu,\nu$ . In this paper, we focus on the cases $p=1,2$ . Unless otherwise specified, we refer to the distance with $p=2$ as the Wasserstein distance in the sequel.
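
For intuition, the definition in (2.18) can be evaluated exactly in a simple special case. The sketch below is an illustration under restrictive assumptions that are not part of the paper: both measures are empirical measures on the real line with the same number of equally weighted atoms, in which case an optimal coupling for $p\geq 1$ matches the sorted samples. The function name `wasserstein_1d` is hypothetical.

```python
import numpy as np

def wasserstein_1d(xs, ys, p):
    """W_p between two equal-size empirical measures on the real line.

    In one dimension with equally weighted atoms, an optimal coupling for
    p >= 1 pairs the sorted samples, so no linear program is required.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes equally many atoms in both measures")
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
mu_samples = rng.standard_normal(5_000)
nu_samples = 0.5 + 2.0 * rng.standard_normal(5_000)
w1 = wasserstein_1d(mu_samples, nu_samples, p=1)
w2 = wasserstein_1d(mu_samples, nu_samples, p=2)
print(f"W_1 = {w1:.3f}, W_2 = {w2:.3f}, W_1 <= W_2: {w1 <= w2}")
```

The final comparison reflects the ordering between the Wasserstein- $1$ and Wasserstein- $2$ distances stated above.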

The Wasserstein-2 space $\mathcal{M}_{2}=(\mathscr{P}_{2}(\mathbb{R}^{D}),\mathcal{W}_{2})$ can be viewed as an infinite-dimensional Riemannian manifold (Villani, 2008). Formally, the tangent space at point $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D})$ is defined as

\displaystyle\mathrm{Tan}_{\rho}\bigl{(}\mathscr{P}_{2}(\mathbb{R}^{D})\bigr{)}=\Bigl{\{}v\in L^{2}(\rho){\,\Big{|}\,}\int\langle v,u\rangle\mathrm{d}\rho=0,\;\forall u\in L^{2}(\rho)\text{ s.t. }\mathrm{div}(u\rho)=0\Bigr{\}}.

Then, for any absolutely continuous curve $\rho:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ on the Wasserstein-2 space, there exists a family of vector fields $v_{t}\in\mathrm{Tan}_{\rho_{t}}(\mathscr{P}_{2}(\mathbb{R}^{D}))$ such that the continuity equation

\displaystyle\partial_{t}\rho_{t}+\mathrm{div}(v_{t}\rho_{t})=0

(2.19)

holds in the sense of distributions. For any two absolutely continuous curves $\rho,\widetilde{\rho}:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ , we define the inner product between $\partial_{t}\rho_{t},\partial_{t}\widetilde{\rho}_{t}$ for any $t\in[0,1]$ as follows,

\displaystyle\langle\partial_{t}\rho_{t},\partial_{t}\widetilde{\rho}_{t}\rangle_{\rho_{t}}=\int\langle v_{t},\widetilde{v}_{t}\rangle\mathrm{d}\rho_{t},

(2.20)

where $\langle v_{t},\widetilde{v}_{t}\rangle$ is the inner product over $\mathbb{R}^{D}$ , and $(\rho_{t},v_{t})$ and $(\widetilde{\rho}_{t},\widetilde{v}_{t})$ satisfy the continuity equation in (2.19). Note that (2.20) yields a Riemannian metric over $\mathcal{M}_{2}$ . Furthermore, the Riemannian metric induces a norm $\|\partial_{t}\rho_{t}\|_{\rho_{t}}=\langle\partial_{t}\rho_{t},\partial_{t}\rho_{t}\rangle_{\rho_{t}}^{1/2}$ .

3 Algorithms

In this section, we introduce the stochastic gradient descent-ascent algorithm (SGDA) and its mean-field limit, which is characterized by the continuity equation.

Stochastic Gradient Descent-Ascent Algorithm. We solve the minimax optimization problem in (2.9) via SGDA. Recall that the minimax objective involves two functions simultaneously, where the primal function $f$ represents the true model of interest and the dual function $g$ represents an adversarial player. Specifically, we parameterize both $f$ and $g$ with neural networks of width $N$ and parameters $\bm{\theta}=(\theta^{1},\theta^{2},\dots,\theta^{N})\in\mathbb{R}^{D\times N}$ and $\bm{\omega}=(\omega^{1},\omega^{2},\dots,\omega^{N})\in\mathbb{R}^{D\times N}$ as follows,

\displaystyle f(\cdot;\bm{\theta})=\frac{\alpha}{N}\sum_{i=1}^{N}\phi(\cdot;\theta^{i}),\quad g(\cdot;\bm{\omega})=\frac{\alpha}{N}\sum_{i=1}^{N}\psi(\cdot;\omega^{i}),

(3.1)

where we use bold symbols $\bm{\theta}$ and $\bm{\omega}$ to denote the full parameter of each neural network and non-bold symbols $\theta$ and $\omega$ to denote the parameter of an individual neuron. Here, $\phi:\mathcal{W}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ and $\psi:\mathcal{Z}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ are the neuron functions. In particular, we recover the general setting of two-layer neural network parameterization for $f$ and $g$ when we choose $\phi,\psi$ to have the following specific form,

\displaystyle\phi(w;\beta,W)=\beta\cdot\sigma_{f}(w;W),\quad\psi(z;\beta,W)=\beta\cdot\sigma_{g}(z;W),

where $\sigma_{f}:\mathcal{W}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ and $\sigma_{g}:\mathcal{Z}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ are activation functions with inputs $w$ and $z$ , respectively, and parameters $W$ . We note that it is not necessary to choose the same width $N$ for $f$ and $g$ , and the activation functions $\sigma_{f},\sigma_{g}$ need not have the same parameter dimension $D$ . Here we use the same width $N$ and parameter dimension $D$ to keep the notation simple, as these choices do not affect the validity of the results presented in this paper.

Besides, we have also introduced a scaling factor $\alpha>0$ in (3.1). Setting the scaling parameter $\alpha=\sqrt{N}$ in (3.1) recovers the neural tangent kernel regime (Jacot et al., 2018). Setting the parameter $\alpha=1$ recovers the mean-field regime (Mei et al., 2018, 2019). In a discrete-time finite-width (DF) scenario, at the $k$ th iteration, the primal function $f$ and adversarial player $g$ are updated as follows,

	DF-GD:	$\displaystyle\bm{\theta}_{k+1}=\bm{\theta}_{k}-\eta\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\bm{\theta}}\Phi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))-\eta\lambda\cdot\nabla_{\bm{\theta}}\Psi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k})),$
	DF-GA:	$\displaystyle\bm{\omega}_{k+1}=\bm{\omega}_{k}+\eta\cdot\Phi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))\cdot\nabla_{\bm{\omega}}g(z_{k};\bm{\omega}_{k})-\eta\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\bm{\omega}}g(z_{k};\bm{\omega}_{k}),$		(3.2)

where $\bm{\theta}_{k},\bm{\omega}_{k}$ denote the state of the parameters at iteration $k$ , $\eta>0$ is the stepsize, and the data samples $\{(x_{k},z_{k})\}_{k=0}^{\infty}$ are collected by independently sampling from the data distribution $\mathcal{D}$ . When $f,g$ are two-layered neural networks with width $N$ , we can plug in the form of $f,g$ described in (3.1). The update for the parameter of the $i$ -th neuron at the $k$ -th iteration can then be written as follows,

	$\displaystyle\theta_{k+1}^{i}$	$\displaystyle=\theta_{k}^{i}-\eta\alpha\epsilon\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\theta}\Phi(x_{k},z_{k};\phi(\cdot;\theta_{k}^{i}))-\eta\alpha\lambda\epsilon\cdot\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))}{\delta f}\cdot\nabla_{\theta}\phi(x_{k};\theta_{k}^{i}),$
	$\displaystyle\omega_{k+1}^{i}$	$\displaystyle=\omega_{k}^{i}+\eta\alpha\epsilon\cdot\Phi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))\cdot\nabla_{\omega}\psi(z_{k};\omega_{k}^{i})-\eta\alpha\epsilon\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\omega}\psi(z_{k};\omega_{k}^{i}),$		(3.3)

where $\bm{\theta}_{k}=(\theta^{1}_{k},\theta^{2}_{k},\dots,\theta^{N}_{k})$ and $\bm{\omega}_{k}=(\omega^{1}_{k},\omega^{2}_{k},\dots,\omega^{N}_{k})$ , $\delta\Psi/\delta f$ denotes the variation of $\Psi$ with respect to $f$ . Here, $\alpha$ is the neural network scaling parameter and $\epsilon=1/N$ is the stepsize scale. Both $\alpha$ and $\epsilon$ show up in (3) due to the finite width parameterization of two-layered neural networks described in (3.1).
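
To make the per-neuron update concrete, the following minimal sketch performs one SGDA step for the asset-pricing functional $\Phi(X,Z;f)=\widetilde{r}\cdot f(X)-f(Z)$ with scalar inputs. Several ingredients are assumptions made only for illustration and are not prescribed by the paper: the regularizer is taken to be $\Psi(X,Z;f)=f(X)^{2}/2$ , the neurons are $\tanh(\beta)\cdot\operatorname{sigmoid}(aw+c)$ , and all function names are hypothetical. The step is obtained by differentiating the per-sample objective through the parameterization (3.1), so each term carries the factor $\alpha\epsilon=\alpha/N$ .

```python
import numpy as np

def _sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neuron(w, params):
    """tanh(beta) * sigmoid(a*w + c) for all N neurons; params has rows (beta, a, c)."""
    beta, a, c = params.T
    return np.tanh(beta) * _sigmoid(a * w + c)

def neuron_grad(w, params):
    """Gradient of each neuron w.r.t. its own (beta, a, c); shape (N, 3)."""
    beta, a, c = params.T
    s, tb = _sigmoid(a * w + c), np.tanh(beta)
    ds = tb * s * (1.0 - s)
    return np.stack([(1.0 - tb**2) * s, ds * w, ds], axis=1)

def network(w, params, alpha):
    """Scaled two-layer network (alpha / N) * sum_i neuron_i(w), as in (3.1)."""
    return alpha * neuron(w, params).mean()

def sample_objective(theta, omega, x, z, r, alpha, lam):
    """Per-sample minimax objective with the assumed regularizer Psi = f(x)^2 / 2."""
    f_x, f_z = network(x, theta, alpha), network(z, theta, alpha)
    g_z = network(z, omega, alpha)
    return g_z * (r * f_x - f_z) - 0.5 * g_z**2 + lam * 0.5 * f_x**2

def sgda_step(theta, omega, x, z, r, alpha, lam, eta):
    """One simultaneous descent step on theta and ascent step on omega."""
    scale_f = alpha / theta.shape[0]  # alpha * epsilon for the primal network
    scale_g = alpha / omega.shape[0]  # alpha * epsilon for the dual network
    f_x, f_z = network(x, theta, alpha), network(z, theta, alpha)
    g_z = network(z, omega, alpha)
    phi_val = r * f_x - f_z                                        # Phi(x, z; f)
    grad_phi = r * neuron_grad(x, theta) - neuron_grad(z, theta)   # Phi is linear in f
    grad_psi = f_x * neuron_grad(x, theta)                         # for Psi = f(x)^2 / 2
    theta_new = theta - eta * scale_f * (g_z * grad_phi + lam * grad_psi)
    omega_new = omega + eta * scale_g * (phi_val - g_z) * neuron_grad(z, omega)
    return theta_new, omega_new

rng = np.random.default_rng(0)
N, alpha, lam, eta = 8, 2.0, 0.1, 0.5
theta, omega = rng.standard_normal((N, 3)), rng.standard_normal((N, 3))
x, z, r = 0.7, -0.2, 0.95  # one sample (x_k, z_k) and a modified return
print("objective before step:", sample_objective(theta, omega, x, z, r, alpha, lam))
theta, omega = sgda_step(theta, omega, x, z, r, alpha, lam, eta)
print("objective after step: ", sample_objective(theta, omega, x, z, r, alpha, lam))
```

Because the primal and dual players move simultaneously, the per-sample objective need not decrease or increase monotonically after a single step; the sketch only illustrates the mechanics of the update.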

For a given space ${\mathcal{S}}$ , let $\mathcal{H}$ denote a set of functions ${\mathcal{S}}\rightarrow\mathbb{R}$ . For a functional defined over the function class $\mathcal{H}$ , $F:\mathcal{H}\rightarrow\mathbb{R}$ , its variation at $f\in\mathcal{H}$ is a function $\frac{\delta F}{\delta f}:{\mathcal{S}}\rightarrow\mathbb{R}$ such that for any test function $h\in\mathcal{H}$ ,

\displaystyle\Big{[}\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F(f+\varepsilon h)\Big{]}_{\varepsilon=0}=\int_{{\mathcal{S}}}\frac{\delta F}{\delta f}(s)\cdot h(s)~{}\mathrm{d}s.

(3.4)
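
As a simple illustration of (3.4), offered as a sketch with a hypothetical functional that is not used elsewhere in the paper, take $F(f)=1/2\cdot\int_{\mathcal{S}}f(s)^{2}~\mathrm{d}s$ . Differentiating under the integral gives

\displaystyle\Big{[}\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F(f+\varepsilon h)\Big{]}_{\varepsilon=0}=\Big{[}\int_{\mathcal{S}}\bigl{(}f(s)h(s)+\varepsilon h(s)^{2}\bigr{)}~{}\mathrm{d}s\Big{]}_{\varepsilon=0}=\int_{\mathcal{S}}f(s)\cdot h(s)~{}\mathrm{d}s,

so that $\frac{\delta F}{\delta f}=f$ . Similarly, a linear functional $F(f)=\int_{\mathcal{S}}a(s)f(s)~\mathrm{d}s$ with a fixed weight function $a$ has variation $\frac{\delta F}{\delta f}=a$ , independent of $f$ .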

We initialize the parameters with $\theta_{0}^{i}\sim\mu_{0}$ and $\omega_{0}^{i}\sim\nu_{0}$ , where $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ is the standard Gaussian distribution on $\mathbb{R}^{D}$ . In addition, to keep track of the evolution of the parameter distribution, we denote the empirical distributions of $\bm{\theta}$ and $\bm{\omega}$ at the $k$ th iteration by

\displaystyle\widehat{\mu}_{k}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta^{i}_{k}}(\theta),\quad\widehat{\nu}_{k}(\omega)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\omega^{i}_{k}}(\omega),

where $\delta_{a}$ denotes the Dirac mass at $a$ .

Mean-Field (MF) Limit. To analyze the convergence of SGDA for solving functional conditional moment equations with neural networks, we study the mean-field limit (Mei et al., 2018, 2019) of the discrete-time dynamics described in (3). Here, by the mean-field limit, we refer to the infinite-width limit, i.e., $N\rightarrow\infty$ for the neural network width, combined with the continuous-time limit, i.e., $t=k\epsilon$ where the stepsize scale $\epsilon\rightarrow 0$ in (3). For $\bm{\theta}=\{\theta^{i}\}_{i=1}^{N}$ and $\bm{\omega}=\{\omega^{i}\}_{i=1}^{N}$ independently sampled from $\mu,\nu\in\mathscr{P}(\mathbb{R}^{D})$ , respectively, we can write the infinite-width limit of the neural networks in (3.1) as

\displaystyle f(\cdot;\mu)=\alpha\int\phi(\cdot;\theta)\mu(\mathrm{d}\theta),\quad g(\cdot;\nu)=\alpha\int\psi(\cdot;\omega)\nu(\mathrm{d}\omega).

(3.5)

From now on, we denote by $\mu_{t}$ the distribution of $\theta_{t}^{i}$ and $\nu_{t}$ the distribution of $\omega_{t}^{i}$ for the infinite-width and continuous limit of the neural networks at time $t$ . For notational simplicity, we overload the notation of the objective function in (2.9) via $\mathcal{L}(\mu,\nu)=\mathcal{L}(f(\cdot;\mu),g(\cdot;\nu))$ . This is to further emphasize the dependence of objective $\mathcal{L}$ on $(\mu,\nu)$ when we parameterize the function pair $(f,g)$ using distributions on the parameter space. By Otto’s calculus (Villani, 2008), the mean-field limit of the update direction takes the following form,

$\displaystyle v^{f}(\theta;\mu,\nu)$	$\displaystyle=-\nabla_{\theta}\frac{\delta\mathcal{L}(\mu,\nu)}{\delta\mu}(\theta)$
	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\Bigl{[}-g(Z;\nu)\cdot\Big{\langle}\frac{\delta\Phi(X,Z;f(\cdot;\mu))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}-\lambda\cdot\Big{\langle}\frac{\delta\Psi(X,Z;f(\cdot;\mu))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}\Bigr{]},$
$\displaystyle v^{g}(\omega;\mu,\nu)$	$\displaystyle=\nabla_{\omega}\frac{\delta\mathcal{L}(\mu,\nu)}{\delta\nu}(\omega)$
	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\Bigl{[}\Phi(X,Z;f(\cdot;\mu))\cdot\nabla_{\omega}\psi(Z;\omega)-g(Z;\nu)\cdot\nabla_{\omega}\psi(Z;\omega)\Bigr{]}.$	(3.6)

Here $\langle\cdot,\cdot\rangle_{L^{2}}$ is the inner product on $L^{2}(\mathcal{X}\times\mathcal{Z})$ with respect to the Lebesgue measure. Recall that $\mathcal{D}$ is the data distribution of the random variables $(X,Z)\in\mathcal{X}\times\mathcal{Z}$ . We denote by $\rho_{\mathcal{X},\mathcal{Z}}$ the density of $\mathcal{D}$ with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Z}$ , and we use $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ to represent the inner product on $L^{2}(\mathcal{X}\times\mathcal{Z})$ with respect to the probability distribution $\mathcal{D}$ . That is to say, for any two functions $h_{1},h_{2}\in L^{2}(\mathcal{X}\times\mathcal{Z})$ , $\langle h_{1},h_{2}\rangle_{\mathcal{D}}=\int_{\mathcal{X}\times\mathcal{Z}}h_{1}h_{2}\,\rho_{\mathcal{X},\mathcal{Z}}~{}\mathrm{d}x\,\mathrm{d}z$ .

In the sequel, we will also slightly abuse this notation and use $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ to denote the inner product on sub-spaces of $L^{2}(\mathcal{X}\times\mathcal{Z})$ , with the measure being the marginals of $\mathcal{D}$ on these sub-spaces. In (3), $\delta\Phi/\delta f$ and $\delta\Psi/\delta f$ are the variations of $\Phi$ and $\Psi$ with respect to $f$ under $\langle\cdot,\cdot\rangle_{L^{2}}$ , where the test functions are chosen over the function class $\mathcal{F}$ . In the same way, $\delta\mathcal{L}/\delta\mu$ and $\delta\mathcal{L}/\delta\nu$ respectively denote the variations of the objective $\mathcal{L}$ with respect to the distributions $\mu$ and $\nu$ under $\langle\cdot,\cdot\rangle_{L^{2}}$ , following the definition in (3.4) with the test functions chosen over $\mathscr{P}(\mathcal{X}\times\mathcal{Z})$ . We remark that one can also define the variation under $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ , which differs from the variation under $\langle\cdot,\cdot\rangle_{L^{2}}$ only by a multiplicative factor given by the density of the corresponding marginal of $\mathcal{D}$ . Then, the mean-field limit of the SGDA update in (3) is characterized by the continuity equation, which is a system of PDEs given by

\displaystyle\partial_{t}\mu_{t}(\theta)=-\eta\cdot\mathrm{div}_{\theta}\bigl{(}\mu_{t}(\theta)v^{f}(\theta;\mu_{t},\nu_{t})\bigr{)},\quad\partial_{t}\nu_{t}(\omega)=-\eta\cdot\mathrm{div}_{\omega}\bigl{(}\nu_{t}(\omega)v^{g}(\omega;\mu_{t},\nu_{t})\bigr{)},

(3.7)

where $\mathrm{div}_{\theta}$ and $\mathrm{div}_{\omega}$ denote the divergence with respect to $\theta$ and $\omega$ , respectively. Note that the initializations $\mu_{0}$ and $\nu_{0}$ are the same as those of the discrete-time dynamics in (3), i.e., $\mu_{0}=\mathcal{N}(0,I_{D})$ and $\nu_{0}=\mathcal{N}(0,I_{D})$ are taken to be the distribution of standard Gaussian random variables in $\mathbb{R}^{D}$ .
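
One way to read (3.7) is through its characteristics. As a sketch, and assuming the velocity fields $v^{f}$ and $v^{g}$ are regular enough (for instance, Lipschitz in the parameter) for the flows below to be well defined, consider particles evolving as

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\theta_{t}=\eta\cdot v^{f}(\theta_{t};\mu_{t},\nu_{t}),\quad\frac{\mathrm{d}}{\mathrm{d}t}\omega_{t}=\eta\cdot v^{g}(\omega_{t};\mu_{t},\nu_{t}),\quad\theta_{0}\sim\mu_{0},\;\omega_{0}\sim\nu_{0}.

Then the laws $\mu_{t}=\mathrm{Law}(\theta_{t})$ and $\nu_{t}=\mathrm{Law}(\omega_{t})$ solve (3.7) in the sense of distributions. In this reading, each neuron of the infinitely wide network follows the same descent-ascent direction as in the finite-width update, with the empirical distributions replaced by $(\mu_{t},\nu_{t})$ .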

4 Main Results

In this section, we introduce the main theoretical results on the stochastic gradient descent-ascent dynamics. We first present the assumptions in §4.1. Then in §4.2 we show that the SGDA dynamics converge to a mean-field limit when the network width $N$ goes to infinity and the stepsize scale $\epsilon$ goes to zero. Finally, in §4.3 we prove that the mean-field limiting dynamics converge to a globally optimal solution of the primal objective $J$ under proper assumptions. Moreover, we show that the mean-field dynamics learn a data-dependent representation that is $\mathcal{O}(\alpha^{-1})$ away from the initial representation.

4.1 Assumptions

We consider two types of assumptions in this work. The first type of assumption is about the function class in which we search for solutions to the minimax optimization problem. In this category, Assumption 4.1 and Assumption 4.2 discuss the richness and regularity of the two-layered neural network function class. The second type of assumption is about the feasible class of problems to apply our framework. In this category, Assumption 4.3 discusses several technical assumptions on the data space and the regularity/smoothness of the functionals.

We start with the discussion of the two-layered neural network function class. Consider neuron functions $\phi$ and $\psi$ of the following form,

\displaystyle\phi(w;\theta)=b(\beta)\cdot\sigma({\widetilde{\theta}}^{\top}(w,1)),\quad\psi(z;\omega)=b(\beta)\cdot\sigma(\widetilde{\omega}^{\top}(z,1)),

(4.1)

where $\theta=(\beta,\widetilde{\theta})\in\mathbb{R}\times\mathbb{R}^{1+\operatorname{dim}(\mathcal{W})}$ and $\omega=(\beta,\widetilde{\omega})\in\mathbb{R}\times\mathbb{R}^{1+\operatorname{dim}(\mathcal{Z})}$ contain the parameters in the output layer and the hidden layer, $b:\mathbb{R}\rightarrow\mathbb{R}$ is an odd re-scaling function, and $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ is the activation function. Note that neurons of this form satisfy the condition of the universal approximation theorem (Theorem 3.1 in Pinkus (1999)) if $\sigma$ is not a polynomial. For notational simplicity, we write $\sigma(w;\widetilde{\theta})=\sigma({\widetilde{\theta}}^{\top}(w,1))$ . The re-scaling function $b:\mathbb{R}\rightarrow\mathbb{R}$ is introduced to ensure that the value of the neural network is bounded. When $b(\mathbb{R})=(-B_{0},B_{0})$ , the function class induced by the neural network in (3.5) is equivalent to the following class,

\displaystyle\mathcal{F}=\Bigl{\{}f:\mathcal{W}\rightarrow\mathbb{R}{\,\Big{|}\,}f(w)=\int\beta^{\prime}\cdot\sigma(w;\widetilde{\theta})\;\mathrm{d}\mu(\beta^{\prime},\widetilde{\theta}),\mu\in\mathscr{P}_{2}\bigl{(}(-B_{0},B_{0})\times\mathbb{R}^{d+1}\bigr{)}\Bigr{\}},

(4.2)

where $d=\operatorname{dim}(\mathcal{W})$ . This captures a rich function class due to the universal approximation theorem (Barron, 1993; Pinkus, 1999). We remark that we introduce the re-scaling function $b(\beta)$ in (4.1) to avoid the study of the space of probability measures over $(-B_{0},B_{0})\times\mathbb{R}^{d+1}$ , which has a boundary and thus lacks regularity in the study of optimal transport. Moreover, note that a scaling hyperparameter $\alpha>0$ is introduced in the definition of the mean-field neural networks in (3.5). Taking $\alpha>1$ has the effect of overparameterization. In brief, $\alpha$ controls the error between $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ and the optimizer $(f^{*},g^{*})$ according to Theorem 4.7. Furthermore, the overparameterization scale $\alpha$ has an influence through Lemma 4.6, which shows that the Wasserstein distance between the Gaussian initialization $(\mu_{0},\nu_{0})$ and the optimal distribution $(\mu^{*},\nu^{*})$ is upper-bounded by $\mathcal{O}(\alpha^{-1})$ . Next, we impose the following regularity assumptions on the neuron functions $\phi$ and $\psi$ .

Assumption 4.1 (Regularity of Neural Networks).

We assume that there exist absolute constants $B_{0}>0$ , $B_{1}>0$ and $B_{2}>0$ such that

	$\displaystyle\|\phi(w;\theta)\|\leq B_{0},\quad\big{\|}\nabla_{\theta}\phi(w;\theta)\big{\|}\leq B_{1},\quad\big{\|}\nabla^{2}_{\theta\theta}\phi(w;\theta)\big{\|}_{F}\leq B_{2},\qquad\forall w\in\mathcal{W},\;\theta\in\mathbb{R}^{D},$
	$\displaystyle\|\psi(z;\omega)\|\leq B_{0},\quad\big{\|}\nabla_{\omega}\psi(z;\omega)\big{\|}\leq B_{1},\quad\big{\|}\nabla^{2}_{\omega\omega}\psi(z;\omega)\big{\|}_{F}\leq B_{2},\qquad\forall z\in\mathcal{Z},\;\omega\in\mathbb{R}^{D},$

where $\nabla^{2}_{\theta\theta},\nabla^{2}_{\omega\omega}$ denote the Hessians with respect to $\theta$ and $\omega$ , respectively, $\|\cdot\|$ denotes the vector $2$ -norm, and $\|\cdot\|_{F}$ denotes the matrix Frobenius norm. Moreover, we assume that the re-scaling function $b:\mathbb{R}\rightarrow\mathbb{R}$ is odd and its range satisfies $b(\mathbb{R})=(-B_{0},B_{0})$ .

Assumption 4.1 is satisfied by a broad class of neuron functions. For example, it is satisfied when we set the activation function $\sigma(x)=\operatorname{sigmoid}(x)$ and rescaling function $b(\beta)=\tanh(\beta)$ .
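
As a sketch of why this example works, consider a single neuron $\phi(w;\theta)=\tanh(\beta)\cdot\operatorname{sigmoid}(\widetilde{\theta}^{\top}(w,1))$ and write $s=\operatorname{sigmoid}(\widetilde{\theta}^{\top}(w,1))\in(0,1)$ . The computation below additionally assumes $\|w\|\leq C_{1}$ for some constant $C_{1}>0$ , in the spirit of a compact data space, and the explicit constants are only illustrative. Since $|\tanh(\beta)|<1$ and $\tanh$ is odd with range $(-1,1)$ , the choice $B_{0}=1$ is admissible. For the gradient,

\displaystyle\nabla_{\theta}\phi(w;\theta)=\Bigl{(}\bigl{(}1-\tanh^{2}(\beta)\bigr{)}\cdot s,\;\tanh(\beta)\cdot s(1-s)\cdot(w,1)\Bigr{)},\quad\big{\|}\nabla_{\theta}\phi(w;\theta)\big{\|}^{2}\leq 1+\frac{1+C_{1}^{2}}{16},

where we used $s(1-s)\leq 1/4$ . The Hessian involves only products of bounded derivatives of $\tanh$ and the sigmoid with polynomials in $(w,1)$ of degree at most two, so its Frobenius norm is bounded in the same way.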

We also make the following assumption regarding the realizability of the saddle point solution $(f^{*},g^{*})$ to (2.9).

Assumption 4.2 (Realizability).

We assume that the saddle point solution $(f^{*},g^{*})$ of (2.9) belongs to the function class defined in (4.2), i.e., $f^{*},g^{*}\in\mathcal{F}$ .

In general, problem (2.9) may not admit a saddle point within the given neural network function class. Therefore, Assumption 4.2 is introduced to guarantee that the discussion in this paper is meaningful. By the universal approximation theorem (Barron, 1993; Pinkus, 1999), the function class defined in (4.2) captures a rich class of functions. Therefore, such an assumption is quite general and does not meaningfully restrict the applicability of our results.

We impose the following assumptions on the integrability of the functionals $\Phi$ and $\Psi$ and their variations, as well as the compactness of the data spaces $\mathcal{X}$ and $\mathcal{Z}$ .

Assumption 4.3 (Data regularity and Functional Integrability).

(i) For the data space $\mathcal{X}$ , $\mathcal{Z}$ , we assume that $\mathcal{X}\times\mathcal{Z}$ is compact, in the sense that there exists a positive constant $C_{1}>0$ such that for any data tuple $(x,z)\in\mathcal{X}\times\mathcal{Z}$ , it satisfies that $\|(x,z)\|\leq C_{1}$ . Moreover, the data distribution $\mathcal{D}$ admits a positive, smooth density $\rho_{\mathcal{D}}$ with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Z}$ .

(ii) For the functionals $\Phi(x,z;f)$ and $\Psi(x,z;f)$ , there exists a positive constant $C_{2}>0$ such that

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Phi(x,z;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq C_{2},\quad\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq C_{2},\quad\forall(x,z)\in\mathcal{X}\times\mathcal{Z}.

(iii) We assume that $\int_{\mathcal{W}}\big{|}\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\big{|}\mathrm{d}w^{\prime}$ , viewed as a functional of $f$ , is upper-bounded by a constant multiple of the absolute value of $f$ at a data point. That is to say, there exist $w\in\mathcal{W}$ that is a part of the data tuple $(x,z)$ and a positive constant $C_{\Psi}>0$ such that

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq C_{\Psi}\cdot\big{|}f(w)\big{|}.

(iv) We assume that the variations of the minimax objective $\mathcal{L}(f,g)$ with respect to $f$ and $g$ are continuous functions defined on $\mathcal{W}$ and $\mathcal{Z}$ , respectively. That is to say,

\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta f}\in\mathscr{C}(\mathcal{W}),\quad\frac{\delta\mathcal{L}(f,g)}{\delta g}\in\mathscr{C}(\mathcal{Z}).

Item (i) of Assumption 4.3 restricts our scenarios to data spaces with bounded values and smooth densities for technical reasons. Items (ii) and (iii) of Assumption 4.3 are integrability conditions that we additionally require to avoid discussion of improper functionals that potentially have singularities with exploding values. Item (iv) is a smoothness condition that requires the variations of the minimax objective averaged over data to be continuous on the respective spaces. We also remark that a sufficient condition for item (iv) to hold is that the variations of $\Phi$ and $\Psi$ with respect to $f$ , averaged under the marginal of $\mathcal{D}$ on $\mathcal{W}$ , are continuous. We will also use this condition to verify item (iv) in practice. These are general and reasonable assumptions satisfied by a wide range of applications in machine learning, causal inference, and statistics.

4.2 Convergence of SGDA dynamics to the Mean-Field Limit

In the following proposition, we show that the empirical distributions of the parameters $\widehat{\mu}_{k}$ and $\widehat{\nu}_{k}$ converge weakly to the mean-field limit in (3.7) as the width $N$ goes to infinity and the stepsize scale $\epsilon$ goes to zero. Let $\rho_{t}(\theta,\omega)=\mu_{t}(\theta)\otimes\nu_{t}(\omega)$ , where $(\mu_{t},\nu_{t})$ is the PDE solution to the continuous deterministic dynamics in (3.7), and let $\widehat{\rho}_{k}=N^{-1}\cdot\sum_{i=1}^{N}\delta_{\theta_{k}^{i}}\cdot\delta_{\omega_{k}^{i}}$ be the empirical distribution of $(\bm{\theta}_{k},\bm{\omega}_{k})$ , which is the $k$ -th iterate of the discrete-time stochastic dynamics in (3) with stepsize scale $\epsilon$ . The following proposition proves that the PDE solution $\rho_{t}$ in (3.7) well approximates the discrete-time stochastic gradient descent-ascent dynamics in (3).

Proposition 4.4 (Convergence of SGDA to Mean-Field Limit).

Let $\{\rho_{t}\}_{t\geq 0}$ be the solution to (3.7) with $\rho_{0}=\mathcal{N}(0,I_{D})\otimes\mathcal{N}(0,I_{D})$ , and let $\{\widehat{\rho}_{k}\}_{k\geq 0}$ be the solution to (3) with $\widehat{\rho}_{0}=\mathcal{N}(0,I_{D})\otimes\mathcal{N}(0,I_{D})$ . Under Assumptions 4.1 and 4.3, $\widehat{\rho}_{\lfloor t/\epsilon\rfloor}$ converges weakly to $\rho_{t}$ as $\epsilon\rightarrow 0^{+}$ and $N\rightarrow\infty$ . That is, it holds for any Lipschitz continuous, bounded function $F:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ that

\displaystyle\lim_{\epsilon\rightarrow 0^{+},N\rightarrow\infty}\int F(\theta,\omega)\mathrm{d}\widehat{\rho}_{\lfloor t/\epsilon\rfloor}(\theta,\omega)=\int F(\theta,\omega)\mathrm{d}\rho_{t}(\theta,\omega).

Proof.

See §B for a detailed proof. ∎

The proof of Proposition 4.4 is based on the propagation of chaos (Mei et al., 2018, 2019; Araújo et al., 2019; Zhang et al., 2020; Sznitman, 1991). Proposition 4.4 allows us to convert the discrete-time SGDA dynamics over the finite-dimensional parameter space to its continuous-time, infinite-dimensional counterpart in Wasserstein space, in which the training is amenable to analysis since our infinitely wide neural networks $f(\cdot;\mu)$ and $g(\cdot;\nu)$ in (3.5) are linear in $\mu$ and $\nu$ , respectively.

4.3 Global Optimality and Convergence of the Mean-Field Limit

In this section, we introduce our main results, which characterize the global optimality and convergence of the mean-field neural networks, parameterized by the parameter distribution $\rho_{t}=(\mu_{t},\nu_{t})$ . The proof contains two steps. We first show that it is sufficient to find a stationary point of the Wasserstein gradient flow defined in (3.7) in order to solve the minimax optimization problem in (2.9). Then we characterize the convergence of $\rho_{t}$ to the stationary point. Before presenting the two steps of the proof, we need to clarify the notion of stationarity for the Wasserstein gradient flow. We introduce the following definition.

Definition 4.5 (Stationary point of Wasserstein Gradient Flow).

A distribution pair $(\mu,\nu)$ is called a stationary point of the Wasserstein gradient flow (3.7) if it satisfies

\displaystyle v^{f}(\theta;\mu,\nu)=v^{g}(\omega;\mu,\nu)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}.

From Definition 4.5, the stationary point of Wasserstein gradient flow (3.7) is a distribution pair $(\mu,\nu)$ , at which the associated vector field $(v^{f}(\cdot;\mu,\nu),v^{g}(\cdot;\mu,\nu))$ is a zero function on the parameter space $\mathbb{R}^{D}\times\mathbb{R}^{D}$ . Moreover, for the Wasserstein gradient flow following vector field $(v^{f},v^{g})$ and initial condition $(\mu,\nu)$ , the solution to its associated continuity equation $(\mu_{t},\nu_{t})$ is a constant flow such that for all $t\geq 0$ , $\mu_{t}=\mu,\nu_{t}=\nu$ . Now, we have the following important supporting lemma that characterizes the relation between stationary points of Wasserstein gradient flow (3.7) and saddle points of (2.9).

Lemma 4.6.

Under Assumptions 4.1 and 4.2, the following two properties hold.

(i)

Suppose that $(\mu^{*},\nu^{*})$ is a stationary point of the Wasserstein gradient flow as is defined in Definition 4.5. Then, the corresponding function $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is the saddle point of the objective function $\mathcal{L}(f,g)$ defined in (2.9).

(ii)

There exists a stationary distribution pair $(\mu^{*},\nu^{*})$ and constant $\bar{D}>0$ such that

\displaystyle W_{2}(\mu_{0},\mu^{*})\leq\alpha^{-1}\bar{D},\quad W_{2}(\nu_{0},\nu^{*})\leq\alpha^{-1}\bar{D}.

Lemma 4.6 demonstrates that the stationary point of the Wasserstein gradient flow in (3.7) achieves global optimality as a solution to the minimax objective (2.9). Lemma 4.6 allows us to bypass the hardness of solving the nonconvex-nonconcave optimization problem (2.9) of finding saddle points in the space of neural network parameters $(\bm{\theta},\bm{\omega})$ by searching for a stationary point of the Wasserstein gradient flow instead. Moreover, there exist good pairs of stationary points that are close to the Gaussian initialization $(\mu_{0},\nu_{0})$ , with Wasserstein distance upper bounded by order $\mathcal{O}(\alpha^{-1})$ .

Proof.

See §A.1 for a detailed proof. ∎
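To make the $\mathcal{O}(\alpha^{-1})$ scale in item (ii) of Lemma 4.6 concrete, the sketch below computes the 2-Wasserstein distance between two one-dimensional empirical measures with the same number of atoms, for which the optimal coupling matches sorted samples. It is only an illustration under stated assumptions: the dimension is reduced to one, the helper `w2_empirical_1d` and the shift `d_bar` are hypothetical, and the shifted measure merely stands in for a stationary point rather than being computed from the flow.

```python
import numpy as np

def w2_empirical_1d(x, y):
    # W_2 between two 1-D empirical measures with equally many, equally
    # weighted atoms: the optimal transport plan pairs the sorted samples.
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
mu0 = rng.normal(size=1000)        # samples from the Gaussian initialization (D = 1)
d_bar = 3.0                        # hypothetical constant in the role of \bar{D}

for alpha in (1.0, 10.0, 100.0):
    mu_star = mu0 + d_bar / alpha  # every particle is translated by d_bar / alpha
    print(alpha, w2_empirical_1d(mu0, mu_star))
```

Because a translation preserves the ordering of the samples, the printed distance equals `d_bar / alpha` up to floating-point error, so a larger scaling parameter keeps the target closer to the initialization.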

We are now ready to show our main results. The following theorem characterizes the global optimality and convergence of the Wasserstein gradient flow $\rho_{t}$ .

Theorem 4.7 (Global Convergence to Saddle Point).

Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ , and let $(f^{*},g^{*})$ be the saddle point of the minimax objective $\mathcal{L}(f,g)$ in (2.9). Under Assumptions 4.1, 4.2, and 4.3, it holds that

\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi\big{(}X,Z;f(\cdot;\mu_{t})-f^{*}(\cdot)\big{)}+\bigl{(}g(Z;\nu_{t})-g^{*}(Z)\bigr{)}^{2}\Bigr{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}).

(4.3)

Proof.

See §A.2 for a detailed proof. ∎

Theorem 4.7 says that the optimality gap between $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ and $(f^{*},g^{*})$ , quantified by the $\Psi$ -induced distance and the $L^{2}$ distance respectively, decays to zero at a sublinear rate in terms of time $T$ up to an error of $\mathcal{O}(\alpha^{-1})$ , where $\alpha>0$ is the scaling parameter in (3.1) and (3.5). In order to prove the convergence, we construct a potential $V(\mu,\nu)=\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi\big{(}X,Z;f(\cdot;\mu)-f^{*}(\cdot)\big{)}+\bigl{(}g(Z;\nu)-g^{*}(Z)\bigr{)}^{2}\Bigr{]}$ , with $V(\mu,\nu)=0$ if and only if $(\mu,\nu)=(\mu^{*},\nu^{*})$ . Such a potential characterizes the saddle point of the minimax objective. We show that the Wasserstein gradient flow decreases the potential at a sublinear rate, thus suggesting the convergence of the gradient flow to the saddle point. Moreover, varying $\alpha$ allows a trade-off between the error of order $\mathcal{O}(\alpha^{-1})$ in the optimality gap and the maximum deviation between $\rho_{t}$ and the Gaussian initialization $\rho_{0}$ over all $t$ . In the proof of item (ii) of Lemma 4.6, we prove that the deviation of $\rho_{t}$ from $\rho_{0}$ , quantified by the Wasserstein distance, is of order $\mathcal{O}(\alpha^{-1})$ . Regarding representation learning, this suggests that SGDA induces a data-dependent representation that is significantly different from the initialization. Choosing a small $\alpha$ of order $\mathcal{O}(1)$ corresponds to the mean-field regime (Mei et al., 2018, 2019), which allows $\rho_{t}$ to move further away from the initialization, with the potential drawback of yielding a large error of order $\mathcal{O}(\alpha^{-1})$ . On the other hand, choosing a large $\alpha$ of order $\mathcal{O}(\sqrt{N})$ corresponds to the NTK regime (Jacot et al., 2018), which causes the Wasserstein flow $\rho_{t}$ to stay close to the initial distribution $\rho_{0}$ along the trajectory, inducing a data-independent representation.

As we have commented before, an important class of regularizers $\Psi(X,Z;f)$ is the $L^{2}$ regularizer. In this scenario, the left-hand side of (4.3) should be understood as a weighted $L^{2}$ distance between the gradient flow iterate at time $t$ and the optimal solution $(f^{*},g^{*})$ . As $T$ and $\alpha$ go to infinity, this distance shrinks to $0$ , so the gradient flow converges globally, in the minimal distance sense, to the optimal solution. Motivated by this observation, in the sequel we discuss several additional results for the case where the regularizer $\Psi(X,Z;f)$ is strongly convex, in the sense that it is bounded below by a quadratic function. We formalize this additional constraint with the following assumption.

Assumption 4.8 (Strong Convexity).

We assume that the regularizer $\Psi(X,Z;f)$ is $c_{\Psi}$ -strongly convex, in the sense that there exists a constant $c_{\Psi}>0$ such that for any $f\in\mathcal{F}$ ,

\displaystyle\Psi(x,z;f)\geq c_{\Psi}\cdot|f(w)|^{2},\quad\forall(x,z)\in\mathcal{X}\times\mathcal{Z},

where $w\in\mathcal{W}$ is part of the data tuple $(x,z)$ .

Assumption 4.8 implies that the regularizer $\Psi(X,Z;f)$ is equivalent to a quadratic regularizer, because $\Psi$ is simultaneously bounded above and below by quadratic functionals. In this case, we have the following strengthened version of Theorem 4.7:

\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda c_{\Psi}\cdot\big{(}f(\cdot;\mu_{t})-f^{*}(\cdot)\big{)}^{2}+\bigl{(}g(Z;\nu_{t})-g^{*}(Z)\bigr{)}^{2}\Bigr{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}).

(4.4)

Equation (4.4) shows that the iterates $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ converge to the saddle point solution $(f^{*},g^{*})$ , in the sense that a weighted $L^{2}$ distance decays to zero at a sublinear rate up to an error of $\mathcal{O}(\alpha^{-1})$ . Under Assumption 4.2, the component $f^{*}$ of the saddle point is the global optimizer of the primal functional $J(f)$ defined in (2.3). Therefore, as a direct consequence of Theorem 4.7, when the regularizer $\Psi$ is strongly convex, $f(\cdot;\mu_{t})$ converges globally to $f^{*}$ at a sublinear rate in terms of $T$ up to an error of $\mathcal{O}(\alpha^{-1})$ .
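For completeness, the step from (4.3) to (4.4) can be spelled out as follows. This is a short sketch that only uses Assumption 4.8 applied to the difference $f(\cdot;\mu_{t})-f^{*}$ , assuming this difference belongs to the class covered by Assumption 4.8 and writing $W$ for the component of the data tuple at which the function is evaluated. For every $t$ ,

\displaystyle\mathbb{E}_{\mathcal{D}}\Bigl[\lambda c_{\Psi}\cdot\bigl(f(W;\mu_{t})-f^{*}(W)\bigr)^{2}+\bigl(g(Z;\nu_{t})-g^{*}(Z)\bigr)^{2}\Bigr]\leq\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_{t})-f^{*}(\cdot)\bigr)+\bigl(g(Z;\nu_{t})-g^{*}(Z)\bigr)^{2}\Bigr].

Taking the infimum over $t\in[0,T]$ on both sides and applying (4.3) to the right-hand side gives (4.4).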

Under Assumption 4.8, we can also quantify the optimality gap between $J(f_{t})$ and $J(f^{*})$ in terms of the minimal gap $\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$ . The following theorem characterizes the global convergence of $J(f_{t})$ to $J(f^{*})$ .

Theorem 4.9 (Global Convergence to Primal Solution).

Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ , and let $f_{t}=f(\cdot;\mu_{t})$ . Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

\displaystyle\inf_{t\in[0,T]}J(f_{t})-J(f^{*})\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}),

where $f^{*}$ is the global minimizer of the objective function defined in (2.3).

Proof.

See §A.3 for a detailed proof. ∎

Theorem 4.9 proves that under the additional strong convexity assumption on the regularizer $\Psi(X,Z;f)$ , the best optimality gap along the trajectory, $\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$ , for the primal objective defined in (2.3) decays to zero at the rate of $T^{-1/2}$ in terms of the time horizon $T$ , up to an error of $\mathcal{O}(\alpha^{-1/2})$ . Here we use $f^{*}$ to denote the global minimizer instead of the saddle point. This does not create any confusion, since for each global minimizer $f^{*}$ of the primal objective (2.3), we can find $g^{*}\in\mathcal{F}$ such that $(f^{*},g^{*})$ is a saddle point of (2.9).
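As a worked example of how the two rates compare, suppose the hidden constants in (4.4) and in Theorem 4.9 are both bounded by a single constant $C>0$ (an assumption made here only for illustration). For a target accuracy $\epsilon>0$ ,

\displaystyle T\geq\frac{2C}{\epsilon},\ \alpha\geq\frac{2C}{\epsilon}\ \Longrightarrow\ C\,(T^{-1}+\alpha^{-1})\leq\epsilon,\qquad T\geq\frac{4C^{2}}{\epsilon^{2}},\ \alpha\geq\frac{4C^{2}}{\epsilon^{2}}\ \Longrightarrow\ C\,(T^{-1/2}+\alpha^{-1/2})\leq\epsilon.

In this reading, guaranteeing an $\epsilon$ -small primal gap through Theorem 4.9 requires $T$ and $\alpha$ of order $\epsilon^{-2}$ , whereas the weighted $L^{2}$ bound (4.4) only requires order $\epsilon^{-1}$ .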

5 Applications

In this section, we apply Theorem 4.7 and Theorem 4.9 to several special cases of functional conditional moment equations: policy evaluation, nonparametric instrumental variables regression, asset pricing, and adversarial Riesz representer estimation. In Section 2.2, we already discussed why these problems are special cases of functional conditional moment equations, so Theorem 4.7 and Theorem 4.9 apply once their assumptions are verified. We recall the problem settings and examine the technical assumptions for each case.

5.1 Application 1: Policy Evaluation

Let $\mathcal{D}$ denote the joint distribution of the state-action tuple $(S,A,S^{\prime})$ under policy $\pi$ . In this scenario, the endogenous variable $X=S^{\prime}$ is the next state, while the exogenous variable $Z=S$ is the current state. Therefore, $\mathcal{X}={\mathcal{S}}$ , $\mathcal{Z}={\mathcal{S}}$ and $\mathcal{W}={\mathcal{S}}$ . We attempt to estimate the value function $V$ , which maps $\mathcal{W}={\mathcal{S}}$ to $\mathbb{R}$ . The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

\displaystyle\Phi(s^{\prime},s;f)=r+\gamma\cdot f(s^{\prime})-f(s),\quad\Psi(s^{\prime},s;f)=f(s^{\prime})^{2}.

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator evaluated at the next state $s^{\prime}$ . With these specific choices of functional $\Phi$ and regularizer $\Psi$ , the SGDA algorithm coincides with the Gradient Temporal Difference Learning (GTD) algorithm (Wai et al., 2020). Therefore, the application of our general framework to the problem of policy evaluation contributes to the reinforcement learning literature by providing an analysis of the neural GTD algorithm in the mean-field regime. Before presenting the theoretical results, we first verify that Assumption 4.3 and Assumption 4.8 hold.
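To illustrate the resulting update, the following is a minimal NumPy sketch of one SGDA step for policy evaluation on a scalar state. It rests on several assumptions that are not taken from the text: the per-sample objective is taken to be $g(s)\,\Phi(s^{\prime},s;f)-g(s)^{2}/2+\lambda f(s^{\prime})^{2}$ (inferred from the variations computed in this section), the network is a hypothetical $\tanh$ two-layer model with an $\alpha/N$ output scaling, and the toy transition, reward, and step size are placeholders.

```python
import numpy as np

def net(w, params, alpha):
    # Two-layer network (alpha / N) * sum_i tanh(a_i * w + b_i); params is (N, 2).
    return alpha * np.mean(np.tanh(params[:, 0] * w + params[:, 1]))

def net_grad(w, params, alpha):
    # Gradient of net(w, params, alpha) with respect to params, shape (N, 2).
    pre = params[:, 0] * w + params[:, 1]
    d = (1.0 - np.tanh(pre) ** 2) * alpha / params.shape[0]
    return np.stack([d * w, d], axis=1)

def sample_loss(theta, omega, s, r, s_next, gamma, lam, alpha):
    # Assumed per-sample objective: g(s) * Phi(s', s; f) - g(s)^2 / 2 + lam * f(s')^2.
    f_s, f_next = net(s, theta, alpha), net(s_next, theta, alpha)
    g_s = net(s, omega, alpha)
    return g_s * (r + gamma * f_next - f_s) - 0.5 * g_s ** 2 + lam * f_next ** 2

def sgda_step(theta, omega, s, r, s_next, gamma, lam, alpha, lr):
    # Simultaneous update: descent on theta (network f), ascent on omega (network g).
    f_s, f_next = net(s, theta, alpha), net(s_next, theta, alpha)
    g_s = net(s, omega, alpha)
    phi = r + gamma * f_next - f_s
    df_s, df_next = net_grad(s, theta, alpha), net_grad(s_next, theta, alpha)
    grad_theta = g_s * (gamma * df_next - df_s) + 2.0 * lam * f_next * df_next
    grad_omega = (phi - g_s) * net_grad(s, omega, alpha)
    n = theta.shape[0]
    # Rescale by N so that each neuron moves at scale lr despite the 1/N averaging.
    return theta - lr * n * grad_theta, omega + lr * n * grad_omega

rng = np.random.default_rng(0)
n_neurons, alpha, gamma, lam = 64, 2.0, 0.9, 0.1
theta = rng.normal(size=(n_neurons, 2))   # Gaussian initialization for f
omega = rng.normal(size=(n_neurons, 2))   # Gaussian initialization for g
lr = 0.01 * alpha ** -2                   # step size scaled by eta = alpha^{-2}

s = 0.0
for _ in range(2000):
    # Toy transition kept inside [-1, 1]; the reward r = s is also a placeholder.
    s_next = float(np.clip(0.8 * s + 0.2 * rng.normal(), -1.0, 1.0))
    theta, omega = sgda_step(theta, omega, s, s, s_next, gamma, lam, alpha, lr)
    s = s_next
```

Both networks are updated from the same pre-update values, which matches a simultaneous descent-ascent step; an alternating variant would instead recompute $g$ after the update of $f$ .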

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, it is reasonable to assume that $\|(x,z)\|\leq 1$ , since we can always rescale the state space without changing the nature of the problem. The compactness assumption is therefore satisfied.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variation of the functional $\Phi$ and $\Psi$ ,

\displaystyle\frac{\delta\Phi(s^{\prime},s;f)}{\delta f}(w^{\prime})=\gamma\delta_{s^{\prime}}(w^{\prime})-\delta_{s}(w^{\prime}),\quad\frac{\delta\Psi(s^{\prime},s;f)}{\delta f}(w^{\prime})=2f(s^{\prime})\delta_{s^{\prime}}(w^{\prime}).

Therefore, the desired integrability conditions hold since

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Phi(s^{\prime},s;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq\gamma+1,\quad\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(s^{\prime},s;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 2\cdot|f(s^{\prime})|.

(5.1)
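The variations of $\Phi$ and $\Psi$ used here can be read off from a first-order perturbation. As a brief sketch, for a perturbation direction $h$ and a scalar $\epsilon$ (both introduced here only for illustration),

\displaystyle\Phi(s^{\prime},s;f+\epsilon h)-\Phi(s^{\prime},s;f)=\epsilon\bigl(\gamma h(s^{\prime})-h(s)\bigr)=\epsilon\int_{\mathcal{W}}\bigl(\gamma\delta_{s^{\prime}}(w^{\prime})-\delta_{s}(w^{\prime})\bigr)h(w^{\prime})\,\mathrm{d}w^{\prime},

\displaystyle\Psi(s^{\prime},s;f+\epsilon h)-\Psi(s^{\prime},s;f)=2\epsilon f(s^{\prime})h(s^{\prime})+\epsilon^{2}h(s^{\prime})^{2}=\epsilon\int_{\mathcal{W}}2f(s^{\prime})\delta_{s^{\prime}}(w^{\prime})h(w^{\prime})\,\mathrm{d}w^{\prime}+\mathcal{O}(\epsilon^{2}).

Since each Dirac mass integrates to one, the bounds in (5.1) follow from the triangle inequality.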

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=s^{\prime}$ , $C_{\Psi}=2$ . The desired condition holds due to (5.1).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})=\mathbb{E}_{S|S^{\prime}}\Big{[}\gamma\cdot g(S){\,\big{|}\,}S^{\prime}=w^{\prime}\Big{]}-g(w^{\prime})\rho_{S}(w^{\prime})+2\lambda\cdot f(w^{\prime})\rho_{S^{\prime}}(w^{\prime}),\quad\forall w^{\prime}\in{\mathcal{S}},$
	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})=\mathbb{E}_{S^{\prime}|S}\Big{[}r+\gamma\cdot f(S^{\prime}){\,\big{|}\,}S=z^{\prime}\Big{]}-f(z^{\prime})-g(z^{\prime})\rho_{S}(z^{\prime}),\quad\forall z^{\prime}\in{\mathcal{S}},$

where $\rho_{S}$ and $\rho_{S^{\prime}}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the current state $S$ and the next state $S^{\prime}$ , respectively. By item (i) of Assumption 4.3, the variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional transitions $S^{\prime}{\,\big{|}\,}S$ and $S{\,\big{|}\,}S^{\prime}$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=s^{\prime}$ . The desired condition holds by definition of our choice of regularizer $\Psi(s^{\prime},s;f)=f(s^{\prime})^{2}$ .

We have checked that the technical Assumption 4.3 and Assumption 4.8 hold for the case of policy evaluation. Assumption 4.3 allows us to apply Theorem 4.7. This implies the global convergence of the estimated value function to the minimizer of the primal objective (2.3) applied in this case. The convergence is quantified in a weighted $L^{2}$ distance. Additionally, Assumption 4.8 enables us to apply Theorem 4.9 and further characterize such convergence using the optimality gap between the value of primal objectives. We summarize the conclusions in the following corollary.

Corollary 5.1 (Global Convergence of Mean-field Neural Nets in Policy Evaluation).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(S^{\prime},S;f)=r+\gamma\cdot f(S^{\prime})-f(S)$ , $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(S^{\prime})^{2}]$ . Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ . Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

	$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl{[}(f(S^{\prime};\mu_{t})-f^{*}(S^{\prime}))^{2}\Bigr{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$
	$\displaystyle\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$

Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of policy evaluation. As we have examined above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied. Thus, the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.1 proves that in the setting of policy evaluation, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$ . Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ in terms of primal objective values decays to zero at the rate of $\mathcal{O}(T^{-1/2})$ , up to an error $\mathcal{O}(\alpha^{-1/2})$ caused by overparameterization. Corollary 5.1 allows us to efficiently and globally solve the policy evaluation problem using overparameterized two-layer neural networks. We also remark that in this scenario, the primal objective $J(f)$ is known as the regularized mean-squared Bellman error (MSBE) in the reinforcement learning literature. As we have commented before, in the setting of policy evaluation, applying the SGDA algorithm within neural network function classes is equivalent to applying the neural GTD algorithm. Therefore, Corollary 5.1 states that, in the mean-field regime, the neural GTD algorithm converges globally to the minimizer at a sublinear rate up to an additional overparameterization error $\mathcal{O}(\alpha^{-1})$ . The neural GTD algorithm also reduces the regularized MSBE at the rate of $\mathcal{O}(T^{-1/2})$ up to an additional overparameterization error $\mathcal{O}(\alpha^{-1/2})$ . Moreover, the global convergence of mean-field neural networks also implies the global convergence of the discrete dynamics in (3), due to the proximity between the discrete dynamics and the continuous dynamics proved in Proposition 4.4.

5.2 Application 2: Nonparametric Instrumental Variables Regression

Let $\mathcal{D}$ denote the joint distribution of the endogenous variable $X$ , the exogenous variable $Z$ , and the observed outcome $Y$ . In this scenario, the endogenous variable takes values in the space $\mathcal{X}$ , the exogenous variable takes values in the space $\mathcal{Z}$ , and $\mathcal{W}=\mathcal{X}$ . We attempt to estimate the model function $f_{0}$ , which maps $\mathcal{W}=\mathcal{X}$ to $\mathbb{R}$ . The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

\displaystyle\Phi(x,z;f)=y-f(x),\quad\Psi(x,z;f)=f(x)^{2}.

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator of the model function evaluated at the endogenous variable $x$ . We examine Assumption 4.3 and Assumption 4.8 in order to apply the results from Section 4.3.

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, the NPIV problem with a compact data space captures a large class of important applications, so the setting remains general under this assumption.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variation of the functional $\Phi$ and $\Psi$ ,

\displaystyle\frac{\delta\Phi(x,z;f)}{\delta f}(w^{\prime})=-\delta_{x}(w^{\prime}),\quad\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})=2f(x)\delta_{x}(w^{\prime}).

Therefore, the desired integrability conditions hold since

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Phi(x,z;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 1,\quad\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 2\cdot|f(x)|.

(5.2)

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=x$ , $C_{\Psi}=2$ . The desired condition holds due to (5.2).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})=\mathbb{E}_{Z|X}\Big{[}-g(Z){\,\big{|}\,}X=w^{\prime}\Big{]}+2\lambda\cdot f(w^{\prime})\rho_{X}(w^{\prime}),\quad\forall w^{\prime}\in\mathcal{X},$
	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})=\mathbb{E}_{X|Z}\Big{[}Y-f(X){\,\big{|}\,}Z=z^{\prime}\Big{]}-g(z^{\prime})\rho_{Z}(z^{\prime}),\quad\forall z^{\prime}\in\mathcal{Z},$

where $\rho_{X}$ and $\rho_{Z}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the endogenous variable $X$ and the exogenous variable $Z$ , respectively. By item (i) of Assumption 4.3, the variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional distributions $Z{\,\big{|}\,}X$ and $X{\,\big{|}\,}Z$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=x$ . The desired condition holds by definition of our choice of regularizer $\Psi(x,z;f)=f(x)^{2}$ .

We have checked that the technical Assumption 4.3 and Assumption 4.8 hold for the case of nonparametric instrumental variables regression. Since Assumption 4.3 holds, Theorem 4.7 can be applied in this case. This implies the global convergence of the estimated model function to the minimizer of the primal objective. The convergence is quantified in a weighted $L^{2}$ distance. The choice of a quadratic regularizer ensures that Assumption 4.8 holds, which further enables us to apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.2 (Global Convergence of Mean-field Neural Nets in NPIV).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(X,Z;f)=Y-f(X)$ , $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(X)^{2}]$ . Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ . Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

	$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl{[}(f(X;\mu_{t})-f^{*}(X))^{2}\Bigr{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$
	$\displaystyle\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$

Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of NPIV. As we have examined above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied. Thus, the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.2 proves that in the setting of NPIV, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$ . Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$ , up to an error $\mathcal{O}(\alpha^{-1/2})$ . Corollary 5.2 allows us to solve the NPIV problem globally using overparameterized two-layer neural networks. We also remark that when the true model function is linear in the input, we recover instrumental variables (IV) regression as an important special instance of NPIV. Therefore, Corollary 5.2 also implies that IV regression can be solved globally and efficiently using overparameterized two-layer neural networks.

5.3 Application 3: Asset Pricing

Let $\mathcal{D}$ denote the joint distribution of the growth-return tuple $(c_{t},\widetilde{r}_{t+1},c_{t+1})$ . In this scenario, the exogenous variable $Z=c_{t}$ is the consumption growth at the current time $t$ , and the endogenous variable $X=c_{t+1}$ is the consumption growth at the next time $t+1$ . Therefore, $\mathcal{X}=\mathcal{Z}=\mathcal{C}$ and $\mathcal{W}=\mathcal{C}$ , where $\mathcal{C}$ is the space of consumption growth, assumed to be a compact subset of $\mathbb{R}$ . Here, we consider the scenario where the modified return $\widetilde{r}_{t+1}$ is also bounded for all $t\geq 0$ , i.e., $\|\widetilde{r}_{t+1}\|\leq R$ for some $R>0$ . We attempt to estimate the function $f_{0}$ , which maps $\mathcal{W}=\mathcal{C}$ to $\mathbb{R}$ . The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

\displaystyle\Phi(c_{t+1},c_{t};f)=\widetilde{r}_{t+1}\cdot f(c_{t+1})-f(c_{t}),\quad\Psi(c_{t+1},c_{t};f)=f(c_{t+1})^{2}.

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator evaluated at the next-period consumption growth $c_{t+1}$ . Before presenting the theoretical results, we first verify that Assumption 4.3 and Assumption 4.8 hold.

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, since we assume that the space of consumption growth $\mathcal{C}$ is a compact subset of $\mathbb{R}$ , there exists $C_{1}>0$ such that $\|(c_{t+1},c_{t})\|\leq C_{1}$ for all $t\geq 0$ . Moreover, it is reasonable to assume that the consumption growth is bounded, since in practice the data often fluctuates within a limited range.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variation of the functional $\Phi$ and $\Psi$ ,

\displaystyle\frac{\delta\Phi(c_{t+1},c_{t};f)}{\delta f}(w^{\prime})=\widetilde{r}_{t+1}\cdot\delta_{c_{t+1}}(w^{\prime})-\delta_{c_{t}}(w^{\prime}),\quad\frac{\delta\Psi(c_{t+1},c_{t};f)}{\delta f}(w^{\prime})=2f(c_{t+1})\cdot\delta_{c_{t+1}}(w^{\prime}).

Therefore, the desired integrability conditions hold since

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Phi(c_{t+1},c_{t};f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq R+1,\quad\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(c_{t+1},c_{t};f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 2\cdot|f(c_{t+1})|.

(5.3)

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=c_{t+1}$ , $C_{\Psi}=2$ . The desired property holds due to (5.3).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})=\mathbb{E}_{c_{t}|c_{t+1}}\Big{[}\widetilde{r}_{t+1}\cdot g(c_{t}){\,\big{|}\,}c_{t+1}=w^{\prime}\Big{]}-g(w^{\prime})\rho_{c_{t}}(w^{\prime})+2\lambda\cdot f(w^{\prime})\rho_{c_{t+1}}(w^{\prime}),\quad\forall w^{\prime}\in\mathcal{C},$
	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})=\mathbb{E}_{c_{t+1}|c_{t}}\Big{[}\widetilde{r}_{t+1}\cdot f(c_{t+1}){\,\big{|}\,}c_{t}=z^{\prime}\Big{]}-f(z^{\prime})-g(z^{\prime})\rho_{c_{t}}(z^{\prime}),\quad\forall z^{\prime}\in\mathcal{C},$

where $\rho_{c_{t}}$ and $\rho_{c_{t+1}}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the current consumption growth $c_{t}$ and the next-period consumption growth $c_{t+1}$ , respectively. The variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional transitions $c_{t+1}{\,\big{|}\,}c_{t}$ and $c_{t}{\,\big{|}\,}c_{t+1}$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=c_{t+1}$ . The desired condition holds by definition of our choice of regularizer $\Psi(c_{t+1},c_{t};f)=f(c_{t+1})^{2}$ .

We have checked that the technical Assumption 4.3 and Assumption 4.8 hold for the case of asset pricing with the CCAPM. Since Assumption 4.3 holds, Theorem 4.7 can be applied in this case. This implies the global convergence of the estimated function to the minimizer of the primal objective. The convergence is quantified in a weighted $L^{2}$ distance. Since Assumption 4.8 holds, we can apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.3 (Global Convergence of Mean-field Neural Nets in Asset Pricing).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(c_{t+1},c_{t};f)=\widetilde{r}_{t+1}\cdot f(c_{t+1})-f(c_{t})$ , $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(c_{t+1})^{2}]$ . Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ . Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

	$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Big{[}(f(c_{t+1};\mu_{t})-f^{*}(c_{t+1}))^{2}\Big{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$
	$\displaystyle\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$

Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of asset pricing. As we have examined above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied. Thus, the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.3 proves that in the setting of asset pricing, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$ . Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$ , up to an error $\mathcal{O}(\alpha^{-1/2})$ . Corollary 5.3 allows us to solve the CCAPM globally by estimating the nonparametric structural demand function with overparameterized two-layer neural networks. Since the return on investment is linked to the marginal utility of consumption through the CCAPM equation, we can price the assets fairly by accounting for consumption risk and using the marginal utility information.

5.4 Application 4: Adversarial Riesz Representer Estimation

Let $\mathcal{D}$ denote the joint distribution of the endogenous variable $X$ and the random vector $V$ . In this scenario, the exogenous variable $Z$ coincides with the endogenous variable $X$ , so the problem is essentially unconditional. The endogenous variable takes values in the space $\mathcal{X}$ , the exogenous variable takes values in $\mathcal{Z}=\mathcal{X}$ , and $\mathcal{W}=\mathcal{X}$ . We attempt to estimate the Riesz representer $f_{0}$ , which maps $\mathcal{W}=\mathcal{X}$ to $\mathbb{R}$ . The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

\displaystyle\Phi(x,x;f)=f_{0}(x)-f(x),\quad\Psi(x,x;f)=f(x)^{2}.

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator of the Riesz representer evaluated at the variable $x$ . We examine Assumption 4.3 and Assumption 4.8 in order to apply the results from Section 4.3.

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, we restrict our attention to estimating Riesz representers of functionals defined on a compact space. In practice, this assumption is mild, since a data distribution on an unbounded space with exponentially decaying tails is often treated as a distribution on a compact space.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variation of the functional $\Phi$ and $\Psi$ ,

\displaystyle\frac{\delta\Phi(x,x;f)}{\delta f}(w^{\prime})=-\delta_{x}(w^{\prime}),\quad\frac{\delta\Psi(x,x;f)}{\delta f}(w^{\prime})=2f(x)\delta_{x}(w^{\prime}).

Therefore, the desired integrability conditions hold since

\displaystyle\int_{\mathcal{W}}\Big{|}\frac{\delta\Phi(x,x;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 1,\quad\int_{\mathcal{W}}\Big{|}\frac{\delta\Psi(x,x;f)}{\delta f}(w^{\prime})\Big{|}\mathrm{d}w^{\prime}\leq 2\cdot|f(x)|.

(5.4)

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=x$ , $C_{\Psi}=2$ . The desired condition holds due to (5.4).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})=\mathbb{E}_{Z|X}\Big{[}-g(Z){\,\big{|}\,}X=w^{\prime}\Big{]}+2\lambda\cdot f(w^{\prime})\rho_{X}(w^{\prime}),\quad\forall w^{\prime}\in\mathcal{X},$
	$\displaystyle\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})=\mathbb{E}_{X|Z}\Big{[}f_{0}(X)-f(X){\,\big{|}\,}Z=z^{\prime}\Big{]}-g(z^{\prime})\rho_{Z}(z^{\prime}),\quad\forall z^{\prime}\in\mathcal{Z}.$

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=x$ . The desired condition holds by definition of our choice of regularizer $\Psi(x,x;f)=f(x)^{2}$ .

We have checked that the technical Assumption 4.3 and Assumption 4.8 hold for the case of adversarial Riesz representer estimation. Since Assumption 4.3 holds, Theorem 4.7 can be applied in this case. This implies the global convergence of the estimated Riesz representer to the minimizer of the primal objective. The convergence is quantified in a weighted $L^{2}$ distance. The choice of a quadratic regularizer ensures that Assumption 4.8 holds, which further enables us to apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.4 (Global Convergence of Mean-field Neural Nets in Adversarial Riesz Representer Estimation).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(X,X;f)=f_{0}(X)-f(X)$ , $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(X)^{2}]$ . Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ . Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

	$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl{[}(f(X;\mu_{t})-f^{*}(X))^{2}\Bigr{]}\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$
	$\displaystyle\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$

Corollary 5.4 proves that in the setting of adversarial Riesz representer estimation, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$ . Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$ , up to an error $\mathcal{O}(\alpha^{-1/2})$ . Corollary 5.4 allows us to estimate the Riesz representer of a given functional using overparameterized two-layer neural networks.
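The four applications in this section differ only in the functional $\Phi$ and share the same $L^{2}$ regularizer, which the following sketch makes explicit. It is a hedged illustration rather than the paper's implementation: the function names are hypothetical, the per-sample objective $g\,\Phi-g^{2}/2+\lambda\Psi$ is an assumed form inferred from the variations computed in this section, and `phi_riesz` takes $f_{0}$ as an input purely to mirror the formula for $\Phi$ in Section 5.4.

```python
def phi_policy_evaluation(f, s_next, s, r, gamma):
    # Phi(s', s; f) = r + gamma * f(s') - f(s)
    return r + gamma * f(s_next) - f(s)

def phi_npiv(f, x, y):
    # Phi(x, z; f) = y - f(x)
    return y - f(x)

def phi_asset_pricing(f, c_next, c, r_tilde):
    # Phi(c_{t+1}, c_t; f) = r_tilde_{t+1} * f(c_{t+1}) - f(c_t)
    return r_tilde * f(c_next) - f(c)

def phi_riesz(f, x, f0):
    # Phi(x, x; f) = f0(x) - f(x)
    return f0(x) - f(x)

def psi_l2(f, w):
    # Shared L2 regularizer Psi = f(w)^2, with w = s', x, c_{t+1}, or x respectively.
    return f(w) ** 2

def sample_objective(g_value, phi_value, psi_value, lam):
    # Assumed per-sample minimax objective: g * Phi - g^2 / 2 + lam * Psi.
    return g_value * phi_value - 0.5 * g_value ** 2 + lam * psi_value

f = lambda w: 2.0 * w   # toy candidate function
g = lambda z: 0.5 * z   # toy dual function
value = sample_objective(g(1.0), phi_npiv(f, x=0.5, y=3.0), psi_l2(f, 0.5), lam=0.1)
```

Swapping `phi_npiv` for any of the other three functionals changes the application without touching the rest of the objective.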

6 Conclusion

In this paper, we study the minimax optimization problem derived from solving functional conditional moment equations using overparameterized two-layer neural networks. For this problem, we first prove that the stochastic gradient descent-ascent algorithm converges to a mean-field limit as the stepsize goes to zero and the network width goes to infinity. In this mean-field limit, the optimization dynamics are characterized by a Wasserstein gradient flow in the space of probability distributions. We then establish the global convergence of the Wasserstein gradient flow, and prove that the feature representation induced by the neural networks is allowed to move by a considerable distance from its initial value. We apply our general results to policy evaluation with high-dimensional state space, nonparametric instrumental variables regression with high-dimensional endogenous and exogenous variables, asset pricing with a nonparametric structural demand function, and general Riesz representer estimation. Our analysis opens avenues for studying functional minimax optimization problems with more complicated objectives, such as nonlinear functional conditional moment equations, a setting that includes nonparametric quantile instrumental variables regression as a leading application. We leave the study of the convergence properties of the algorithm in such a general setting to future research.

References

Ai and Chen (2003) Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71 1795–1843.
Alkousa et al. (2019) Alkousa, M., Dvinskikh, D., Stonyakin, F., Gasnikov, A. and Kovalev, D. (2019). Accelerated methods for composite non-bilinear saddle point problem. arXiv preprint arXiv:1906.03620.
Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y. and Liang, Y. (2019a). Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32.
Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y. and Song, Z. (2019b). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning. PMLR.
Ambrosio and Gigli (2013) Ambrosio, L. and Gigli, N. (2013). A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks. Springer, 1–155.
Ambrosio et al. (2008) Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient flows: In metric spaces and in the space of probability measures. Springer.
Araújo et al. (2019) Araújo, D., Oliveira, R. I. and Yukimura, D. (2019). A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193.
Arora et al. (2019a) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R. and Wang, R. (2019a). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019b). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
Ben-Tal et al. (2009) Ben-Tal, A., El Ghaoui, L. and Nemirovski, A. (2009). Robust optimization, vol. 28. Princeton university press.
Bennett et al. (2019) Bennett, A., Kallus, N. and Schnabel, T. (2019). Deep generalized method of moments for instrumental variable analysis. Advances in neural information processing systems, 32.
Blundell et al. (2007) Blundell, R., Chen, X. and Kristensen, D. (2007). Semi-nonparametric iv estimation of shape-invariant engel curves. Econometrica, 75 1613–1669.
Cai et al. (2019) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
Chen et al. (2024) Chen, L., Pelger, M. and Zhu, J. (2024). Deep learning in asset pricing. Management Science, 70 714–750.
Chen et al. (2014) Chen, X., Chernozhukov, V., Lee, S. and Newey, W. K. (2014). Local identification of nonparametric and semiparametric models. Econometrica, 82 785–809.
Chen and Christensen (2018) Chen, X. and Christensen, T. M. (2018). Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric iv regression. Quantitative Economics, 9 39–84.
Chen and Ludvigson (2009) Chen, X. and Ludvigson, S. C. (2009). Land of addicts? an empirical investigation of habit-based asset pricing models. Journal of Applied Econometrics, 24 1057–1093.
Chen and Pouzo (2012) Chen, X. and Pouzo, D. (2012). Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80 277–321.
Chen and Qi (2022) Chen, X. and Qi, Z. (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. In International Conference on Machine Learning. PMLR.
Chen et al. (2020a) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020a). A generalized neural tangent kernel analysis for two-layer neural networks. Advances in Neural Information Processing Systems, 33 13363–13373.
Chen et al. (2020b) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020b). Mean-field analysis of two-layer neural networks: Non-asymptotic rates and generalization bounds. arXiv preprint arXiv:2002.04026.
Chen et al. (2019) Chen, Z., Cao, Y., Zou, D. and Gu, Q. (2019). How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360.
Chernozhukov et al. (2020) Chernozhukov, V., Newey, W., Singh, R. and Syrgkanis, V. (2020). Adversarial estimation of riesz representers. arXiv preprint arXiv:2101.00009.
Chizat (2022) Chizat, L. (2022). Mean-field langevin dynamics: Exponential convergence and annealing. arXiv preprint arXiv:2202.01009.
Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems.
Diakonikolas et al. (2021) Diakonikolas, J., Daskalakis, C. and Jordan, M. I. (2021). Efficient methods for structured nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and Statistics. PMLR.
Dikkala et al. (2020) Dikkala, N., Lewis, G., Mackey, L. and Syrgkanis, V. (2020). Minimax estimation of conditional moment models. Advances in Neural Information Processing Systems, 33 12248–12262.
Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In International conference on machine learning. PMLR.
Du et al. (2018) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
Duan et al. (2020) Duan, Y., Jia, Z. and Wang, M. (2020). Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning. PMLR.
Duan et al. (2021) Duan, Y., Jin, C. and Li, Z. (2021). Risk bounds and rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning. PMLR.
Fang et al. (2019) Fang, C., Dong, H. and Zhang, T. (2019). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
Fang et al. (2021a) Fang, C., Dong, H. and Zhang, T. (2021a). Mathematical models of overparameterized neural networks. Proceedings of the IEEE, 109 683–703.
Fang et al. (2021b) Fang, C., Lee, J., Yang, P. and Zhang, T. (2021b). Modeling from features: a mean-field framework for over-parameterized deep neural networks. In Conference on learning theory. PMLR.
Frei and Gu (2021) Frei, S. and Gu, Q. (2021). Proxy convexity: A unified framework for the analysis of neural networks trained by gradient descent. Advances in Neural Information Processing Systems, 34 7937–7949.
Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M. and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The journal of machine learning research, 17 2096–2030.
Goodfellow et al. (2020) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63 139–144.
Grimmer et al. (2022) Grimmer, B., Lu, H., Worah, P. and Mirrokni, V. (2022). Limiting behaviors of nonconvex-nonconcave minimax optimization via continuous-time systems. In International Conference on Algorithmic Learning Theory. PMLR.
Grimmer et al. (2023) Grimmer, B., Lu, H., Worah, P. and Mirrokni, V. (2023). The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201 373–407.
Hajizadeh et al. (2024) Hajizadeh, S., Lu, H. and Grimmer, B. (2024). On the linear convergence of extragradient methods for nonconvex–nonconcave minimax problems. INFORMS Journal on Optimization, 6 19–31.
Han et al. (2024) Han, Y., Xie, G. and Zhang, Z. (2024). Lower complexity bounds of finite-sum optimization problems: The results and construction. Journal of Machine Learning Research, 25 1–86.
Holte (2009) Holte, J. M. (2009). Discrete Gronwall lemma and applications. In MAA-NCS meeting at the University of North Dakota, vol. 24.
Hu et al. (2021) Hu, K., Ren, Z., Šiška, D. and Szpruch, Ł. (2021). Mean-field langevin dynamics and energy landscape of neural networks. In Annales de l’Institut Henri Poincare (B) Probabilites et statistiques, vol. 57. Institut Henri Poincaré.
Huang and Yau (2020) Huang, J. and Yau, H.-T. (2020). Dynamics of deep neural networks and neural tangent hierarchy. In International conference on machine learning. PMLR.
Huang et al. (2022) Huang, M., Chen, X., Ji, K., Ma, S. and Lai, L. (2022). Efficiently escaping saddle points in bilevel optimization. arXiv preprint arXiv:2202.03684.
Ibrahim et al. (2019) Ibrahim, A., Azizian, W., Gidel, G. and Mitliagkas, I. (2019). Lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300 31.
Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, vol. 31.
Jin et al. (2019) Jin, C., Netrapalli, P. and Jordan, M. I. (2019). Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618.
Jin et al. (2022) Jin, Y., Sidford, A. and Tian, K. (2022). Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. In Conference on Learning Theory. PMLR.
Jin et al. (2021) Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning. PMLR.
Levy et al. (2020) Levy, D., Carmon, Y., Duchi, J. C. and Sidford, A. (2020). Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33 8847–8860.
Li et al. (2023) Li, C. J., Yuan, H., Gidel, G., Gu, Q. and Jordan, M. (2023). Nesterov meets optimism: rate-optimal separable minimax optimization. In International Conference on Machine Learning. PMLR.
Li et al. (2022) Li, J., Zhu, L. and So, A. M.-C. (2022). Nonsmooth nonconvex-nonconcave minimax optimization: Primal-dual balancing and iteration complexity analysis. arXiv preprint arXiv:2209.10825.
Liao et al. (2020) Liao, L., Chen, Y.-L., Yang, Z., Dai, B., Kolar, M. and Wang, Z. (2020). Provably efficient neural estimation of structural equation models: An adversarial approach. Advances in Neural Information Processing Systems, 33 8947–8958.
Lin et al. (2020a) Lin, T., Jin, C. and Jordan, M. (2020a). On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning. PMLR.
Lin et al. (2020b) Lin, T., Jin, C. and Jordan, M. I. (2020b). Near-optimal algorithms for minimax optimization. In Conference on Learning Theory. PMLR.
Lu et al. (2020a) Lu, S., Tsaknakis, I., Hong, M. and Chen, Y. (2020a). Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 68 3676–3691.
Lu et al. (2020b) Lu, Y., Ma, C., Lu, Y., Lu, J. and Ying, L. (2020b). A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth.
Luo et al. (2021) Luo, L., Xie, G., Zhang, T. and Zhang, Z. (2021). Near optimal stochastic algorithms for finite-sum unbalanced convex-concave minimax optimization. arXiv preprint arXiv:2106.01761.
Luo et al. (2020) Luo, L., Ye, H., Huang, Z. and Zhang, T. (2020). Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33 20566–20577.
Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Mei et al. (2019) Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015.
Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 E7665–E7671.
Nitanda et al. (2022) Nitanda, A., Wu, D. and Suzuki, T. (2022). Convex analysis of the mean field langevin dynamics. In International Conference on Artificial Intelligence and Statistics. PMLR.
Nouiehed et al. (2019) Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D. and Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32.
Ostrovskii et al. (2021a) Ostrovskii, D. M., Barazandeh, B. and Razaviyayn, M. (2021a). Nonconvex-nonconcave min-max optimization with a small maximization domain. arXiv preprint arXiv:2110.03950.
Ostrovskii et al. (2021b) Ostrovskii, D. M., Lowy, A. and Razaviyayn, M. (2021b). Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems. SIAM Journal on Optimization, 31 2508–2538.
Otto and Villani (2000) Otto, F. and Villani, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173 361–400.
Ouyang and Xu (2021) Ouyang, Y. and Xu, Y. (2021). Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185 1–35.
Pinkus (1999) Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 143–195.
Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W. and Cheng, G. (2022). Online bootstrap inference for policy evaluation in reinforcement learning. Journal of the American Statistical Association 1–14.
Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X. (2016). Improved techniques for training gans. Advances in neural information processing systems, 29.
Sirignano and Spiliopoulos (2020a) Sirignano, J. and Spiliopoulos, K. (2020a). Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130 1820–1852.
Sirignano and Spiliopoulos (2020b) Sirignano, J. and Spiliopoulos, K. (2020b). Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80 725–752.
Sirignano and Spiliopoulos (2022) Sirignano, J. and Spiliopoulos, K. (2022). Mean field analysis of deep neural networks. Mathematics of Operations Research, 47 120–152.
Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
Sznitman (1991) Sznitman, A.-S. (1991). Topics in propagation of chaos. In Ecole d’Été de Probabilités de Saint-Flour XIX—1989. Springer, 165–251.
Thekumparampil et al. (2019) Thekumparampil, K. K., Jain, P., Netrapalli, P. and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. Advances in Neural Information Processing Systems, 32.
Uehara et al. (2020) Uehara, M., Huang, J. and Jiang, N. (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning. PMLR.
Villani (2003) Villani, C. (2003). Topics in optimal transportation. American Mathematical Society.
Villani (2008) Villani, C. (2008). Optimal transport: Old and new. Springer.
Wai et al. (2020) Wai, H.-T., Yang, Z., Wang, Z. and Hong, M. (2020). Provably efficient neural GTD for off-policy learning. Advances in Neural Information Processing Systems, 33.
Wainwright (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press.
Wang et al. (2022) Wang, S., Yu, X. and Perdikaris, P. (2022). When and why pinns fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449 110768.
Xie et al. (2020a) Xie, G., Luo, L., Lian, Y. and Zhang, Z. (2020a). Lower complexity bounds for finite-sum convex-concave minimax optimization problems. In International Conference on Machine Learning. PMLR.
Xie et al. (2020b) Xie, Q., Chen, Y., Wang, Z. and Yang, Z. (2020b). Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on learning theory. PMLR.
Xu et al. (2020) Xu, L., Chen, Y., Srinivasan, S., de Freitas, N., Doucet, A. and Gretton, A. (2020). Learning deep features in instrumental variable regression. arXiv preprint arXiv:2010.07154.
Xu et al. (2021) Xu, L., Kanagawa, H. and Gretton, A. (2021). Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34 26264–26275.
Xu and Gu (2020) Xu, P. and Gu, Q. (2020). A finite-time analysis of q-learning with neural network function approximation. In International Conference on Machine Learning. PMLR.
Yang et al. (2020) Yang, J., Kiyavash, N. and He, N. (2020). Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33 1153–1165.
Yang et al. (2022) Yang, J., Orvieto, A., Lucchi, A. and He, N. (2022). Faster single-loop algorithms for minimax optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics. PMLR.
Zhang et al. (2021a) Zhang, S., Yang, J., Guzmán, C., Kiyavash, N. and He, N. (2021a). The complexity of nonconvex-strongly-concave minimax optimization. In Uncertainty in Artificial Intelligence. PMLR.
Zhang et al. (2020) Zhang, Y., Cai, Q., Yang, Z., Chen, Y. and Wang, Z. (2020). Can temporal-difference and q-learning learn representation? A mean-field theory. arXiv preprint arXiv:2006.04761.
Zhang et al. (2021b) Zhang, Y., Chen, S., Yang, Z., Jordan, M. and Wang, Z. (2021b). Wasserstein flow meets replicator dynamics: A mean-field analysis of representation learning in actor-critic. Advances in Neural Information Processing Systems, 34 15993–16006.
Zhao (2023) Zhao, R. (2023). A primal-dual smoothing framework for max-structured non-convex optimization. Mathematics of operations research.
Zhao et al. (2022) Zhao, Y., Tian, Y., Lee, J. and Du, S. (2022). Provably efficient policy optimization for two-player zero-sum markov games. In International Conference on Artificial Intelligence and Statistics. PMLR.
Zou and Gu (2019) Zou, D. and Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems.

Appendix A Proof of Main Results

In this section, we provide proofs for the main theorems and technical lemmas in our work.

A.1 Proof of Lemma 4.6

Proof of (i). The proof of Claim (i) proceeds in two stages. First, we show that if a function pair $(f^{*},g^{*})$ is a stationary point of $\mathcal{L}(f,g)$ with respect to $(f,g)$, then it is also a saddle point of the same objective. Second, we show that if the distribution pair $(\mu^{*},\nu^{*})$ is a stationary point of $\mathcal{L}(\mu,\nu)$, then the corresponding $(f^{*},g^{*})$ is a stationary point of $\mathcal{L}(f,g)$, which concludes the claim. We start with the first stage and define the following functionals $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$,

\displaystyle\mathcal{L}_{1}(f,g)=\mathbb{E}_{\mathcal{D}}\Big{[}g(Z)\cdot\Phi(X,Z;f)\Big{]},\quad\mathcal{L}_{2}(f,g)=\mathbb{E}_{\mathcal{D}}\Big{[}-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Big{]}.

We see that the minimax objective in (2.9) is the sum of these two functionals,

\displaystyle\mathcal{L}(f,g)=\mathcal{L}_{1}(f,g)+\mathcal{L}_{2}(f,g).

For any function pair $(f,g)$ , we can verify that the following chain of equalities holds,

\displaystyle\max_{g^{\prime}}~{}\mathcal{L}(f,g^{\prime})-\min_{f^{\prime}}~{}\mathcal{L}(f^{\prime},g)

\displaystyle=\max_{g^{\prime}}~{}\bigl{(}\mathcal{L}(f,g^{\prime})-\mathcal{L}(f,g)\bigr{)}+\max_{f^{\prime}}~{}\bigl{(}\mathcal{L}(f,g)-\mathcal{L}(f^{\prime},g)\bigr{)}.

(A.1)

We consider the function spaces $L^{2}(\mathcal{W})$ and $L^{2}(\mathcal{Z})$ equipped with the inner product $\langle\cdot,\cdot\rangle_{L^{2}}$, which are Hilbert spaces. Since $\mathcal{X}\times\mathcal{Z}$ is compact, the continuous functions $f$ and $g$ parameterized in the form of (3.5) are square-integrable and thus belong to $L^{2}(\mathcal{W})$ and $L^{2}(\mathcal{Z})$, respectively.

For a fixed $f$, $\mathcal{L}_{1}(f,g)$ is a continuous linear functional in $g$ defined on $L^{2}(\mathcal{Z})$. Thus, by the Riesz representation theorem, there exists a function $h_{f}$ in $L^{2}(\mathcal{Z})$ such that $\mathcal{L}_{1}(f,g)=\big{\langle}h_{f},g\big{\rangle}_{L^{2}}$. Similarly, for a fixed $g$, $\mathcal{L}_{1}(f,g)$ is a continuous linear functional in $f$, so there exists a function $h_{g}$ in $L^{2}(\mathcal{W})$ such that $\mathcal{L}_{1}(f,g)=\big{\langle}h_{g},f\big{\rangle}_{L^{2}}$. In fact, $h_{f}$ and $h_{g}$ match the variations of $\mathcal{L}_{1}$ with respect to $g$ and $f$, respectively,

\displaystyle h_{f}=\frac{\delta\mathcal{L}_{1}(f,g)}{\delta g},\quad h_{g}=\frac{\delta\mathcal{L}_{1}(f,g)}{\delta f}.

Since $\mathcal{L}_{1}$ is linear in $g$ and $\mathcal{L}_{2}$ is a concave functional with respect to $g$, the first-order condition for concavity gives

$\displaystyle\mathcal{L}(f,g^{\prime})-\mathcal{L}(f,g)$	$\displaystyle=\mathcal{L}_{1}\bigl{(}f,g^{\prime}\bigr{)}-\mathcal{L}_{1}\bigl{(}f,g\bigr{)}+\mathcal{L}_{2}\bigl{(}f,g^{\prime}\bigr{)}-\mathcal{L}_{2}\bigl{(}f,g\bigr{)}$
	$\displaystyle\leq\Big{\langle}\frac{\delta\mathcal{L}_{1}(f,g)}{\delta g},g^{\prime}-g\Big{\rangle}_{L^{2}}+\Big{\langle}\frac{\delta\mathcal{L}_{2}(f,g)}{\delta g},g^{\prime}-g\Big{\rangle}_{L^{2}}$
	$\displaystyle=\Big{\langle}\frac{\delta\mathcal{L}(f,g)}{\delta g},g^{\prime}-g\Big{\rangle}_{L^{2}}.$	(A.2)

Following similar reasoning, and using the fact that $\mathcal{L}_{1}$ is a linear functional with respect to $f$ and $\mathcal{L}_{2}$ is a convex functional with respect to $f$, it holds that

\displaystyle\mathcal{L}(f,g)-\mathcal{L}(f^{\prime},g)\leq\Big{\langle}\frac{\delta\mathcal{L}(f,g)}{\delta f},f-f^{\prime}\Big{\rangle}_{L^{2}}.

(A.3)

Plugging (A.2) and (A.3) into (A.1), we bound the minimax expression in (A.1) in terms of the variations of $\mathcal{L}(f,g)$,

\displaystyle\max_{g^{\prime}}~{}\mathcal{L}(f,g^{\prime})-\min_{f^{\prime}}~{}\mathcal{L}(f^{\prime},g)\leq\max_{f^{\prime},g^{\prime}}~{}\Big{\langle}\frac{\delta\mathcal{L}(f,g)}{\delta g},g^{\prime}-g\Big{\rangle}_{L^{2}}+\Big{\langle}\frac{\delta\mathcal{L}(f,g)}{\delta f},f-f^{\prime}\Big{\rangle}_{L^{2}}.

(A.4)

Thus, if $(f^{*},g^{*})$ is a stationary point, i.e.,

\displaystyle\frac{\delta\mathcal{L}(f^{*},g^{*})}{\delta f}=\frac{\delta\mathcal{L}(f^{*},g^{*})}{\delta g}=0,\quad\text{a.s.},

(A.5)

then (A.4) implies that, for such a stationary point $(f^{*},g^{*})$, the following inequality holds,

\displaystyle\max_{g^{\prime}}~{}\mathcal{L}(f^{*},g^{\prime})-\min_{f^{\prime}}~{}\mathcal{L}(f^{\prime},g^{*})\leq 0.

(A.6)

Since the left-hand side of (A.6) is nonnegative by definition, (A.6) in fact holds with equality, which proves that $(f^{*},g^{*})$ is a saddle point of the minimax objective $\mathcal{L}(f,g)$. Therefore, every stationary point of $\mathcal{L}(f,g)$ is a saddle point.
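The first stage can be sanity-checked numerically in a finite-dimensional analogue. The sketch below is only an illustration and not the paper's setting: it assumes $f$ and $g$ are vectors, takes $\Phi(f)=b-Af$ (affine in $f$) and $\Psi(f)=\tfrac{1}{2}\|f\|^{2}$, and the matrix `A`, vector `b`, and all helper names are hypothetical.

```python
import numpy as np


def objective(f, g, A, b, lam):
    """Toy objective L(f, g) = <g, b - A f> - 0.5 |g|^2 + 0.5 * lam * |f|^2."""
    return g @ (b - A @ f) - 0.5 * (g @ g) + 0.5 * lam * (f @ f)


def stationary_point(A, b, lam):
    """Solve grad_f L = 0 and grad_g L = 0 in closed form.

    grad_g L = (b - A f) - g = 0        =>  g = b - A f
    grad_f L = -A^T g + lam * f = 0     =>  (lam I + A^T A) f = A^T b
    """
    f_star = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ b)
    g_star = b - A @ f_star
    return f_star, g_star


def count_saddle_violations(A, b, lam, n_trials=1000, seed=0, tol=1e-9):
    """Count random perturbations that break L(f*, g') <= L(f*, g*) <= L(f', g*)."""
    rng = np.random.default_rng(seed)
    f_star, g_star = stationary_point(A, b, lam)
    value = objective(f_star, g_star, A, b, lam)
    violations = 0
    for _ in range(n_trials):
        f_alt = f_star + rng.normal(size=f_star.shape)
        g_alt = g_star + rng.normal(size=g_star.shape)
        if objective(f_star, g_alt, A, b, lam) > value + tol:
            violations += 1  # the ascent player could improve
        if objective(f_alt, g_star, A, b, lam) < value - tol:
            violations += 1  # the descent player could improve
    return violations


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 3))
    b = rng.normal(size=5)
    print("saddle violations:", count_saddle_violations(A, b, lam=0.5))
```

In this toy problem the objective is concave in $g$ and convex in $f$, mirroring the structure of $\mathcal{L}_{1}+\mathcal{L}_{2}$, so the stationary point should admit no improving deviation for either player.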

We now turn to the second stage of the proof. We show that if $(\mu^{*},\nu^{*})$ is a stationary point of $\mathcal{L}$, i.e., $v^{f}(\cdot;\mu^{*},\nu^{*})=v^{g}(\cdot;\mu^{*},\nu^{*})=0$, then the corresponding function pair $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is a stationary point of $\mathcal{L}(f,g)$ with respect to $(f,g)$. Recall that the correspondence between $(\mu^{*},\nu^{*})$ and $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is given by (3.5). Let $(\mu^{*},\nu^{*})$ be a stationary point of (2.9), that is,

\displaystyle\nabla_{\theta}\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=\nabla_{\omega}\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}.

(A.7)

We can also compute the variation of $\mathcal{L}(\mu,\nu)$ explicitly.

	$\displaystyle\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\Bigl{[}\alpha\Big{\langle}g(Z;\nu^{*})\cdot\frac{\delta\Phi(X,Z;f(\cdot;\mu^{*}))}{\delta f}+\lambda\cdot\frac{\delta\Psi(X,Z;f(\cdot;\mu^{*}))}{\delta f},\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}\Bigr{]},$
	$\displaystyle\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\Bigl{[}\alpha\big{(}\Phi(X,Z;f(\cdot;\mu^{*}))-g(Z;\nu^{*})\big{)}\cdot\psi(Z;\omega)\Bigr{]}.$

By the oddness of $b$ in Assumption 4.1, we have $\phi(\cdot;\bm{0})=0$ and, likewise, $\psi(\cdot;\bm{0})=0$. This implies that the variations of $\mathcal{L}(\mu^{*},\nu^{*})$ with respect to $\mu$ and $\nu$ vanish at $\theta=\omega=\bm{0}$, i.e.,

\displaystyle\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\bm{0})=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\bm{0})=0.

Combined with (A.7), we deduce that

\displaystyle\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}.
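The passage from a vanishing gradient and a vanishing value at the origin to a function that vanishes everywhere is not spelled out. One way to see it, assuming that the variation is continuously differentiable in $\theta$, is to integrate the gradient along the segment from $\bm{0}$ to $\theta$. The shorthand $F(\theta)=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)$ is introduced here only for this illustration.

```latex
F(\theta)
  = F(\bm{0}) + \int_{0}^{1} \big\langle \nabla_{\theta} F(s\theta),\, \theta \big\rangle \,\mathrm{d}s
  = 0 + \int_{0}^{1} \big\langle \bm{0},\, \theta \big\rangle \,\mathrm{d}s
  = 0,
  \qquad \forall \theta \in \mathbb{R}^{D}.
```

The same argument applies to the variation with respect to $\nu$, with $\omega$ in place of $\theta$.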

Note that we can expand the variation of $\mathcal{L}$ with respect to $\mu$ ,

\displaystyle\alpha\Big{\langle}\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f},\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=0.

(A.8)

By the universal function approximation theorem (Lemma D.1), since $\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$ is in $\mathscr{C}(\mathcal{W})$ as assumed in item (iv) of Assumption 4.3, there exists a sequence $\{\phi_{n}\}_{n=1}^{\infty}\subset\mathcal{G}(\phi)$ such that $\phi_{n}\rightarrow\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$ uniformly. Here, $\mathcal{G}(\phi)$ denotes the linear span of the functions $\phi(\cdot;\theta)$. By (A.8) and the linearity of the inner product, it holds for every $n$ that

\displaystyle\Big{\langle}\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}(\cdot),\phi_{n}(\cdot)\Big{\rangle}_{L^{2}}=0.

(A.9)

Following a similar strategy, we can show that there exists a sequence $\{\psi_{n}\}_{n=1}^{\infty}\subset\mathcal{G}(\psi)$ such that $\psi_{n}\rightarrow\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}$, where for each $\psi_{n}$ it holds that

\displaystyle\Big{\langle}\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}(\cdot),\psi_{n}(\cdot)\Big{\rangle}_{L^{2}}=0.

(A.10)

We take the limit of (A.9) and (A.10) by passing $n\rightarrow\infty$ and conclude,

\displaystyle\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}=0,\quad\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}=0,\quad\text{a.s.}

(A.11)

Equation (A.11) proves that if $(\mu^{*},\nu^{*})$ is a stationary point of the Wasserstein gradient flow, then the associated function pair $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is a stationary point of the minimax objective $\mathcal{L}(f,g)$, which matches the condition in (A.5). By the first stage, $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is therefore a saddle point of the minimax objective $\mathcal{L}(f,g)$. This completes the proof of item (i).
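The limiting step that leads from (A.9) to (A.11) can be made explicit. The short derivation below writes $h$ as shorthand (introduced only here) for $\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$ and assumes that the uniform convergence $\phi_{n}\rightarrow h$ on the compact set $\mathcal{W}$ also gives $\|h-\phi_{n}\|_{L^{2}}\rightarrow 0$, which holds whenever the underlying measure is finite.

```latex
\|h\|_{L^{2}}^{2}
  = \langle h, h \rangle_{L^{2}}
  = \langle h, h-\phi_{n} \rangle_{L^{2}}
    + \underbrace{\langle h, \phi_{n} \rangle_{L^{2}}}_{=\,0 \text{ by (A.9)}}
  \leq \|h\|_{L^{2}} \cdot \|h-\phi_{n}\|_{L^{2}}
  \xrightarrow{\; n\to\infty \;} 0 .
```

Hence $h=0$ almost surely. The same Cauchy-Schwarz argument with $\psi_{n}$ and (A.10) handles the variation with respect to $g$.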

Proof of (ii). We now show that there exists a solution pair $(\mu^{*},\nu^{*})$ that is both optimal and close to the initialization $(\mu_{0},\nu_{0})$ in Wasserstein distance. By Assumption 4.2, there exist distributions $\mu^{\dagger},\nu^{\dagger}\in\mathscr{P}_{2}(\mathbb{R}^{D})$ such that the optimal solution $(f^{*},g^{*})$ to the optimization problem (2.9) satisfies

\displaystyle f^{*}(w)=\int\phi(w;\theta)\mathrm{d}\mu^{\dagger}(\theta),\quad g^{*}(z)=\int\psi(z;\omega)\mathrm{d}\nu^{\dagger}(\omega),\quad\forall w\in\mathcal{W},z\in\mathcal{Z}.

Recall that $\alpha>0$ is the scaling parameter in the neural network parameterization. For $\alpha\geq 1$, we construct $(\mu^{*},\nu^{*})$ as a convex combination of $(\mu^{\dagger},\nu^{\dagger})$ and the initialization $(\mu_{0},\nu_{0})$,

\displaystyle\mu^{*}(\theta)=\alpha^{-1}\mu^{\dagger}(\theta)+(1-\alpha^{-1})\mu_{0}(\theta),\quad\nu^{*}(\omega)=\alpha^{-1}\nu^{\dagger}(\omega)+(1-\alpha^{-1})\nu_{0}(\omega).

(A.12)

We claim that $(\mu^{*},\nu^{*})$ constructed in (A.12) satisfies all the desired requirements. Since $\mu_{0}$ and $\nu_{0}$ are standard Gaussian distributions, which are symmetric about the origin, the integrals of $\phi(\cdot;\theta)$ with respect to $\mu_{0}$ and of $\psi(\cdot;\omega)$ with respect to $\nu_{0}$ are identically $0$ due to the oddness of the neuron functions,

\displaystyle\int\phi(w;\theta)\mathrm{d}\mu_{0}(\theta)=0,\quad\int\psi(z;\omega)\mathrm{d}\nu_{0}(\omega)=0,\quad\forall w\in\mathcal{W},z\in\mathcal{Z}.

Thus, by (A.12), the expressions for $(f^{*},g^{*})$ simplify to

\displaystyle f^{*}(w)=\alpha\int\phi(w;\theta)\mathrm{d}\mu^{*}(\theta),\quad g^{*}(z)=\alpha\int\psi(z;\omega)\mathrm{d}\nu^{*}(\omega).

By Talagrand’s inequality (Lemma D.5), the following chain of inequalities holds,

$\displaystyle\mathcal{W}_{2}(\mu_{0},\mu^{*})^{2}$	$\displaystyle\leq 2D_{\mathrm{KL}}(\mu^{*}\,\|\,\mu_{0})\leq D_{\chi^{2}}(\mu^{*}\,\|\,\mu_{0})$
	$\displaystyle=\int\biggl{(}\frac{\mu^{*}(\theta)}{\mu_{0}(\theta)}-1\biggr{)}^{2}\,\mathrm{d}\mu_{0}(\theta)=\int\biggl{(}\frac{(1-\alpha^{-1})\cdot\mu_{0}(\theta)+\alpha^{-1}\cdot\mu^{\dagger}(\theta)}{\mu_{0}(\theta)}-1\biggr{)}^{2}\,\mathrm{d}\mu_{0}(\theta)$
	$\displaystyle=\alpha^{-2}D_{\chi^{2}}(\mu^{\dagger}\,\|\,\mu_{0}).$	(A.13)

A similar bound on $\mathcal{W}_{2}(\nu_{0},\nu^{*})$ also applies,

\displaystyle\mathcal{W}_{2}(\nu_{0},\nu^{*})^{2}\leq\alpha^{-2}D_{\chi^{2}}(\nu^{\dagger}\,\|\,\nu_{0}).

(A.14)

Letting $\bar{D}=\max\{D_{\chi^{2}}(\mu^{\dagger}\,\|\,\mu_{0})^{1/2},D_{\chi^{2}}(\nu^{\dagger}\,\|\,\nu_{0})^{1/2}\}$, we conclude the proof of item (ii).
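Two facts used in the proof of item (ii) are easy to check numerically on a discrete toy example: the mixture in (A.12) represents the same function once multiplied by $\alpha$, and its chi-square divergence from the initialization shrinks by exactly $\alpha^{-2}$. The sketch below is illustrative only. It assumes scalar parameters on a finite symmetric grid, a `tanh` neuron as a stand-in for the odd neuron function, and a uniform symmetric distribution in place of the Gaussian initialization; all names are hypothetical.

```python
import numpy as np


def chi2_divergence(p, q):
    """Chi-square divergence D_chi2(p || q) between discrete distributions."""
    return float(np.sum((p / q - 1.0) ** 2 * q))


def mix(p_dagger, p0, alpha):
    """Convex combination alpha^{-1} p_dagger + (1 - alpha^{-1}) p0, as in (A.12)."""
    return p_dagger / alpha + (1.0 - 1.0 / alpha) * p0


def network(w, atoms, weights, alpha=1.0):
    """alpha * sum_i weights_i * tanh(atoms_i * w), a toy version of alpha * int phi d mu."""
    return alpha * float(np.sum(weights * np.tanh(atoms * w)))


# Symmetric support and a symmetric "initialization" (stand-in for N(0, I)).
atoms = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
p0 = np.full(6, 1.0 / 6.0)
# An arbitrary target distribution playing the role of mu^dagger.
p_dagger = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.30])

alpha = 10.0
p_star = mix(p_dagger, p0, alpha)

if __name__ == "__main__":
    w = 0.7
    print("integral against p0:   ", network(w, atoms, p0))
    print("f* via p_dagger:       ", network(w, atoms, p_dagger))
    print("f* via alpha * p_star: ", network(w, atoms, p_star, alpha))
    print("chi2(p_star || p0):    ", chi2_divergence(p_star, p0))
    print("alpha^-2 chi2(p_dagger):", chi2_divergence(p_dagger, p0) / alpha**2)
```

The last two printed values should agree up to rounding, because $\mu^{*}/\mu_{0}-1=\alpha^{-1}(\mu^{\dagger}/\mu_{0}-1)$ pointwise, which is the identity behind the final equality in (A.13).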

A.2 Proof of Theorem 4.7

By Lemma 4.6, there exists a distribution pair $(\mu^{*},\nu^{*})$ that is a stationary point of the Wasserstein gradient flow (3.7) and satisfies the distance bound in item (ii) of Lemma 4.6. For such $(\mu^{*},\nu^{*})$, we denote by $\rho^{*}(\theta,\omega)=\mu^{*}(\theta)\nu^{*}(\omega)$ their product measure. Similarly, for any distribution pair $(\mu,\nu)$, we write $\rho(\theta,\omega)=\mu(\theta)\nu(\omega)$ for their product measure. To rewrite the Wasserstein gradient flow for $(\mu,\nu)$ as a flow for $\rho$, we define the stacked vector field $v$ as

\displaystyle v(\theta,\omega;\mu,\nu)=\bigl{(}v^{f}(\theta;\mu,\nu),v^{g}(\omega;\mu,\nu)\bigr{)}.

(A.15)

By Lemma D.2, (A.13), and (A.14), it holds that $\mathcal{W}_{2}(\rho^{*},\rho_{0})\leq\alpha^{-1}\bar{D}$ , where $\bar{D}$ is defined in Lemma 4.6. Note that

	$\displaystyle f(w;\mu)=\alpha\int\phi(w;\theta)\mu(\theta)\mathrm{d}\theta=% \alpha\int\phi(w;\theta)\rho(\theta,\omega)\mathrm{d}(\theta,\omega),\quad% \forall w\in\mathcal{W},$
	$\displaystyle g(z;\nu)=\alpha\int\psi(z;\omega)\nu(\omega)\mathrm{d}\omega=% \alpha\int\psi(z;\omega)\rho(\theta,\omega)\mathrm{d}(\theta,\omega),\quad% \forall z\in\mathcal{Z}.$

Thus, we overload the notation to write $f(\cdot;\rho)=f(\cdot;\mu)$ and $g(\cdot;\rho)=g(\cdot;\nu)$ for $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D}\times\mathbb{R}^{D})$ . By writing $\rho_{t}=(\mu_{t},\nu_{t})$ , the update in (3.7) takes the following form

\displaystyle\partial_{t}\rho_{t}(\theta,\omega)=-\mathrm{div}\bigl{(}\rho_{t}% (\theta,\omega)v(\theta,\omega;\rho_{t})\bigr{)},\quad\rho_{0}=(\mu_{0},\nu_{0% }).

Before we prove Theorem 4.7, we first show the following important technical lemma.

Lemma A.1.

Assume that $\mathcal{W}_{2}(\rho_{t},\rho^{*})\leq 2\mathcal{W}_{2}(\rho_{0},\rho^{*})$ . Then, under Assumptions 4.1, 4.2, and 4.3, it holds that

\displaystyle\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}% }{\mathrm{d}t}

\displaystyle\;\leq-\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi\big{(% }X,Z;f(\cdot;\mu_{t})-f^{*}(\cdot)\big{)}+\bigl{(}g(Z;\nu_{t})-g^{*}(Z)\bigr{)% }^{2}\Bigr{]}+C_{*}\cdot\eta\alpha^{-1}.

(A.16)

where $C_{*}>0$ is a constant depending on $B_{0},B_{1},B_{2}$ , $\lambda$ , and $\bar{D}$ .

Proof.

Let $\{\beta_{s}\}_{s\in[0,1]}$ be the geodesic connecting $\rho_{t}$ and $\rho^{*}$ with $\beta_{0}=\rho_{t}$ and $\beta_{1}=\rho^{*}$ . Let $u$ be the corresponding velocity field such that $\partial_{s}\beta_{s}=-\mathrm{div}(\beta_{s}u_{s})$ . By the first variation formula of the Wasserstein distance in Lemma D.3, it holds that

	$\displaystyle\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}}{\mathrm{d}t}$	$\displaystyle=-\eta\big\langle v(\cdot;\rho_{t}),u_{0}\big\rangle_{\rho_{t}}=-\eta\big\langle v(\cdot;\rho^{*}),u_{1}\big\rangle_{\rho^{*}}+\eta\int_{0}^{1}\partial_{s}\big\langle v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}\mathrm{d}s$		(A.17)
		$\displaystyle=\eta\underbrace{\int_{0}^{1}\big\langle\partial_{s}v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}\mathrm{d}s}_{\displaystyle\mathrm{(i)}}+\eta\underbrace{\int_{0}^{1}\int\big\langle v(\theta,\omega;\beta_{s}),\partial_{s}(u_{s}(\theta,\omega)\beta_{s}(\theta,\omega))\big\rangle\mathrm{d}(\theta,\omega)\mathrm{d}s}_{\displaystyle\mathrm{(ii)}},$

where $\big\langle h_{1},h_{2}\big\rangle_{\rho}=\int h_{1}\cdot h_{2}\,\mathrm{d}\rho$ for any distribution $\rho$ and functions $h_{1},h_{2}$ . We bound terms (i) and (ii) separately in the sequel.

Upper bounding term (i). For term (i) of (A.17), by the definitions of $v$ , $v^{f}$ , and $v^{g}$ in (A.15) and (3), we have that

	$\displaystyle\partial_{s}v^{f}(\theta,\omega;\beta_{s})$	$\displaystyle=\alpha\partial_{s}\mathbb{E}_{\mathcal{D}}\Bigl{[}-g(Z;\beta_{s}% )\cdot\Big{\langle}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\nabla_% {\theta}\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}-\lambda\cdot\Big{\langle}\frac% {\delta\Psi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\nabla_{\theta}\phi(\cdot;% \theta)\Big{\rangle}_{L^{2}}\Bigr{]}$
		$\displaystyle=\alpha\nabla_{\theta}\mathbb{E}_{\mathcal{D}}\Bigl{[}-g(Z;% \partial_{s}\beta_{s})\cdot\Big{\langle}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}% ))}{\delta f},\phi(\cdot;\theta)\Big{\rangle}_{L^{2}}-\lambda\cdot\Big{\langle% }\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},\phi(\cdot;% \theta)\Big{\rangle}_{L^{2}}\Bigr{]}.$

where the second equality holds since $\frac{\delta\Phi(X,Z;f)}{\delta f}$ is a constant, $s$-independent function, $\frac{\delta\Psi(X,Z;f)}{\delta f}$ is linear in $f$ , and $\partial_{s}f(\cdot;\beta_{s})$ and $\partial_{s}g(\cdot;\beta_{s})$ satisfy

	$\displaystyle\partial_{s}f(w;\beta_{s})=\int\partial_{s}\big{(}\phi(w;\theta)% \beta_{s}(\theta,\omega)\big{)}\mathrm{d}(\theta,\omega)=\int\phi(w;\theta)% \partial_{s}\beta_{s}\mathrm{d}(\theta,\omega)=f(w;\partial_{s}\beta_{s}),% \quad\forall w\in\mathcal{W}$
	$\displaystyle\partial_{s}g(z;\beta_{s})=\int\partial_{s}\big{(}\psi(z;\omega)% \beta_{s}(\theta,\omega)\big{)}\mathrm{d}(\theta,\omega)=\int\psi(z;\omega)% \partial_{s}\beta_{s}\mathrm{d}(\theta,\omega)=g(z;\partial_{s}\beta_{s}),% \quad\forall z\in\mathcal{Z}$

A similar computation for $\partial_{s}v^{g}(\theta,\omega;\beta_{s})$ gives

\displaystyle\partial_{s}v^{g}(\theta,\omega;\beta_{s})=\alpha\nabla_{\omega}\mathbb{E}_{\mathcal{D}}\Bigl[\widetilde{\Phi}(X,Z;f(\cdot;\partial_{s}\beta_{s}))\cdot\psi(Z;\omega)-g(Z;\partial_{s}\beta_{s})\cdot\psi(Z;\omega)\Bigr].

We recall that $\widetilde{\Phi}(x,z;f)=\Phi(x,z;f)-\Phi(x,z;\bm{0})$ is the linear component in $\Phi$ . We note that the variation of $\widetilde{\Phi}$ is the same as the variation of $\Phi$ with respect to $f$ , $\frac{\delta\Phi(X,Z;f)}{\delta f}=\frac{\delta\widetilde{\Phi}(X,Z;f)}{\delta f}.$

We define the potential $\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})$ as

	$\displaystyle\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\Bigl{[}g(Z;\partial_{s}\beta_{s})\cdot% \Big{\langle}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\phi(\cdot;% \theta)\Big{\rangle}_{L^{2}}+\lambda\cdot\Big{\langle}\frac{\delta\Psi(X,Z;f(% \cdot;\partial_{s}\beta_{s}))}{\delta f},\phi(\cdot;\theta)\Big{\rangle}_{L^{2% }}\Bigr{]}$
		$\displaystyle-\mathbb{E}_{\mathcal{D}}\Bigl{[}\widetilde{\Phi}(X,Z;f(\cdot,% \partial_{s}\beta_{s}))\cdot\psi(Z;\omega)-g(Z;\partial_{s}\beta_{s})\cdot\psi% (Z;\omega)\Bigr{]}$

Then the vector field $\partial_{s}v(\theta,\omega;\beta_{s})$ is, up to the factor $-\alpha$ , the gradient of the potential $\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})$ ,

\displaystyle\partial_{s}v(\theta,\omega;\beta_{s})=\begin{pmatrix}\partial_{s}v^{f}(\theta;\beta_{s})\\ \partial_{s}v^{g}(\omega;\beta_{s})\end{pmatrix}=-\alpha\nabla\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s}),

where $\nabla=(\nabla_{\theta},\nabla_{\omega})$ denotes the gradient operator. Then, by Stokes' formula and integration by parts, we have

	$\displaystyle\big\langle\partial_{s}v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}$	$\displaystyle=-\int\alpha\nabla\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\cdot u_{s}(\theta,\omega)\beta_{s}(\theta,\omega)\mathrm{d}(\theta,\omega)$
		$\displaystyle\quad=\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\mathrm{div}(u_{s}\beta_{s})\mathrm{d}(\theta,\omega)=-\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\partial_{s}\beta_{s}\mathrm{d}(\theta,\omega).$

Integrating the potential $\mathcal{V}$ against $\partial_{s}\beta_{s}$ simplifies the expression to

		$\displaystyle\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})% \partial_{s}\beta_{s}\mathrm{d}(\theta,\omega)$
		$\displaystyle\qquad=\mathbb{E}_{\mathcal{D}}\Big{[}g(Z;\partial_{s}\beta_{s})% \cdot\Big{\langle}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\int% \alpha\phi(\cdot;\theta)\partial_{s}\beta_{s}(\mathrm{d}\theta)\Big{\rangle}_{% L^{2}}\Big{]}$
		$\displaystyle\qquad\qquad+\mathbb{E}_{\mathcal{D}}\Big{[}\lambda\cdot\big{% \langle}\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},\int% \alpha\phi(\cdot;\theta)\partial_{s}\beta_{s}(\mathrm{d}\theta)\big{\rangle}_{% L^{2}}\Big{]}$
		$\displaystyle\qquad\qquad-\mathbb{E}_{\mathcal{D}}\Bigl{[}\widetilde{\Phi}(X,Z% ;f(\cdot,\partial_{s}\beta_{s}))\cdot\int\alpha\psi(Z;\omega)\partial_{s}\beta% _{s}(\mathrm{d}\omega)-g(Z;\partial_{s}\beta_{s})\cdot\int\alpha\psi(Z;\omega)% \partial_{s}\beta_{s}(\mathrm{d}\omega)\Bigr{]}$
		$\displaystyle\qquad=\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\cdot\Big{\langle}% \frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;% \partial_{s}\beta_{s})\Big{\rangle}_{L^{2}}+g(Z;\partial_{s}\beta_{s})^{2}% \Bigr{]}.$		(A.18)

By convexity of $\Psi(x,z;f)$ and $\Psi(x,z;\bm{0})=0$ for all $(x,z)\in\mathcal{X}\times\mathcal{Z}$ , it holds that

\displaystyle\Psi(x,z;f(\cdot;\partial_{s}\beta_{s}))\leq\Big{\langle}\frac{% \delta\Psi(x,z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;\partial_{s}% \beta_{s})\Big{\rangle}_{L^{2}},\quad\forall(x,z)\in\mathcal{X}\times\mathcal{% Z}.

(A.19)

Integrating (A.18) with respect to $s\in[0,1]$ , we have that

$\displaystyle\int_{0}^{1}\big\langle\partial_{s}v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}\mathrm{d}s$	$\displaystyle=-\int_{0}^{1}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;\partial_{s}\beta_{s})\Big\rangle_{L^{2}}+g(Z;\partial_{s}\beta_{s})^{2}\Bigr]\mathrm{d}s$
	$\displaystyle\leq-\int_{0}^{1}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))+g(Z;\partial_{s}\beta_{s})^{2}\Bigr]\mathrm{d}s$
	$\displaystyle\leq-\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\rho_{t})-f(\cdot;\rho^{*}))+\bigl(g(Z;\rho_{t})-g(Z;\rho^{*})\bigr)^{2}\Bigr]$
	$\displaystyle=-\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\rho_{t})-f^{*}(\cdot))+\bigl(g(Z;\rho_{t})-g^{*}(Z)\bigr)^{2}\Bigr],$	(A.20)

where the first inequality holds due to (A.19), and the second holds by Jensen’s inequality.
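The Jensen step in the second inequality can be unpacked as follows. This is a sketch under the assumptions that $f(\cdot;\rho)$ and $g(\cdot;\rho)$ are linear in $\rho$ (so that integrating $\partial_{s}\beta_{s}$ over $s\in[0,1]$ yields $\beta_{1}-\beta_{0}=\rho^{*}-\rho_{t}$) and that $\Psi(x,z;\cdot)$ is convex and even, as is the case for a quadratic regularizer. Evenness is an assumption of this sketch and is used only to flip the sign of the argument of $\Psi$.

```latex
\begin{align*}
\int_{0}^{1} g(Z;\partial_{s}\beta_{s})^{2}\,\mathrm{d}s
  &\geq \Bigl(\int_{0}^{1} g(Z;\partial_{s}\beta_{s})\,\mathrm{d}s\Bigr)^{2}
   = \bigl(g(Z;\rho^{*})-g(Z;\rho_{t})\bigr)^{2}, \\
\int_{0}^{1} \Psi\bigl(X,Z;f(\cdot;\partial_{s}\beta_{s})\bigr)\,\mathrm{d}s
  &\geq \Psi\Bigl(X,Z;\int_{0}^{1} f(\cdot;\partial_{s}\beta_{s})\,\mathrm{d}s\Bigr)
   = \Psi\bigl(X,Z;f(\cdot;\rho^{*})-f(\cdot;\rho_{t})\bigr).
\end{align*}
```

Taking expectations over $\mathcal{D}$ and multiplying by $-1$ reverses both inequalities, which gives the upper bound in terms of $f(\cdot;\rho_{t})-f(\cdot;\rho^{*})$ and $g(Z;\rho_{t})-g(Z;\rho^{*})$.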

Upper bounding term (ii). By Lemma D.6, for term (ii) in (A.17), it holds that

		$\displaystyle\int\big\langle v(\theta,\omega;\beta_{s}),\partial_{s}(u_{s}(\theta,\omega)\beta_{s}(\theta,\omega))\big\rangle\mathrm{d}(\theta,\omega)$
		$\displaystyle\qquad=\int\big\langle\nabla v(\theta,\omega;\beta_{s}),u_{s}(\theta,\omega)\otimes u_{s}(\theta,\omega)\beta_{s}(\theta,\omega)\big\rangle\mathrm{d}(\theta,\omega)$
		$\displaystyle\qquad\leq\sup_{\theta,\omega}~\bigl\|\nabla v(\theta,\omega;\beta_{s})\bigr\|_{F}\cdot\|u_{s}\|_{\beta_{s}}^{2},$		(A.21)

where $\|\cdot\|_{F}$ denotes the Frobenius norm. Since $u_{s}$ is the velocity field of the geodesic connecting $\rho_{t}$ and $\rho^{*}$ , the assumption of Lemma A.1 together with item (ii) of Lemma 4.6 gives

\displaystyle\|u_{s}\|_{\beta_{s}}^{2}=\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}\leq 4\mathcal{W}_{2}(\rho_{0},\rho^{*})^{2}\leq 4\alpha^{-2}\bar{D}^{2}=\mathcal{O}(\alpha^{-2}).

(A.22)

On the other hand, by the definition of $v$ in (A.15), we have that

\displaystyle\bigl{\|}\nabla v(\theta,\omega;\beta_{s})\bigl{\|}_{F}^{2}=\bigl% {\|}\nabla_{\theta}v^{f}(\theta;\beta_{s})\bigl{\|}_{F}^{2}+\bigl{\|}\nabla_{% \omega}v^{g}(\omega;\beta_{s})\bigl{\|}_{F}^{2}

(A.23)

By the definition of $v^{f}$ in (3), we have that

$\displaystyle\big\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\big\|_{F}$	$\displaystyle\leq\alpha\cdot\mathbb{E}_{\mathcal{D}}\Big[\Big\|g(Z;\beta_{s})\cdot\int_{\mathcal{W}}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f}(w^{\prime})\mathrm{d}w^{\prime}\Big\|\Big]\cdot\underset{w\in\mathcal{W}}{\sup}\big\|\nabla^{2}_{\theta,\theta}\phi(w;\theta)\big\|_{F}^{2}$
	$\displaystyle\qquad+\alpha\cdot\mathbb{E}_{\mathcal{D}}\Big[\lambda\cdot\Big\|\int_{\mathcal{W}}\frac{\delta\Psi(X,Z;f(\cdot;\beta_{s}))}{\delta f}(w^{\prime})\mathrm{d}w^{\prime}\Big\|\Big]\cdot\underset{w\in\mathcal{W}}{\sup}\big\|\nabla^{2}_{\theta,\theta}\phi(w;\theta)\big\|_{F}^{2}$
	$\displaystyle\leq\alpha B_{2}\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\lambda C_{\Psi}\bigl\|f(W;\beta_{s})\bigr\|+C_{2}\big\|g(Z;\beta_{s})\big\|\Bigr],$	(A.24)

where the first inequality follows from Assumption 4.1, and the second inequality follows from the integrability conditions in Assumption 4.3. Thus, it suffices to upper bound $f(w;\beta_{s})$ and $g(z;\beta_{s})$ for all $(w,z)\in\mathcal{W}\times\mathcal{Z}$ . For $f(w;\beta_{s})$ , we have that

	$\displaystyle\bigl\|f(w;\beta_{s})\bigr\|$	$\displaystyle=\alpha\cdot\Bigl\|\int\phi(w;\theta)\mathrm{d}\beta_{s}(\theta,\omega)\Bigr\|=\alpha\cdot\Bigl\|\int\phi(w;\theta)~\mathrm{d}(\beta_{s}-\rho_{0})(\theta,\omega)\Bigr\|$
		$\displaystyle\leq\alpha B_{1}\cdot\mathcal{W}_{1}(\beta_{s},\rho_{0})\leq\alpha B_{1}\cdot\mathcal{W}_{2}(\beta_{s},\rho_{0}).$		(A.25)

Moreover, it holds that

\displaystyle\mathcal{W}_{2}(\beta_{s},\rho_{0})\leq\mathcal{W}_{2}(\beta_{s},% \rho^{*})+\mathcal{W}_{2}(\rho^{*},\rho_{0})\leq\mathcal{W}_{2}(\rho_{t},\rho^% {*})+\mathcal{W}_{2}(\rho_{0},\rho^{*})\leq 3\alpha^{-1}\bar{D},

(A.26)

where the second inequality follows from the fact that $\{\beta_{s}\}_{s\in[0,1]}$ is the geodesic connecting $\rho_{t}$ and $\rho^{*}$ , and the last inequality follows from the assumption of Lemma A.1 and item (ii) of Lemma 4.6. Plugging (A.26) into (A.25), we have that

\displaystyle\bigl{|}f(w;\beta_{s})\bigr{|}\leq\mathcal{O}(1),\quad\forall w% \in\mathcal{W}.

(A.27)

Through a similar argument, such an upper bound can also be established for $g(z;\beta_{s})$ for all $z\in\mathcal{Z}$ ,

\displaystyle\bigl{|}g(z;\beta_{s})\bigr{|}\leq\mathcal{O}(1),\quad z\in% \mathcal{Z}.

(A.28)
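The estimate behind (A.25), namely that integrating a function that is $B_{1}$-Lipschitz in $\theta$ against the difference of two measures is controlled by $B_{1}$ times their $\mathcal{W}_{1}$ distance, can be illustrated in one dimension. The sketch below assumes a hypothetical neuron $\phi(w;\theta)=\tanh(w\theta)$ with $|w|\leq 1$ (so $B_{1}=1$) and two equal-size empirical measures, for which $\mathcal{W}_{1}$ is the average gap between sorted samples. None of these choices come from the paper.

```python
import math
import random


def phi(w, theta):
    # Hypothetical neuron; |d/dtheta tanh(w * theta)| <= |w|.
    return math.tanh(w * theta)


def wasserstein1_1d(xs, ys):
    # W_1 between two equal-size empirical measures on the real line:
    # match sorted samples and average the absolute gaps.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def integral_gap(w, xs, ys):
    # |int phi(w; theta) d(mu - nu)(theta)| for empirical mu and nu.
    mean_x = sum(phi(w, t) for t in xs) / len(xs)
    mean_y = sum(phi(w, t) for t in ys) / len(ys)
    return abs(mean_x - mean_y)


rng = random.Random(0)
mu_samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
nu_samples = [t + 0.1 * rng.gauss(0.0, 1.0) for t in mu_samples]

w = 0.8    # any input with |w| <= 1
B1 = 1.0   # Lipschitz constant of theta -> phi(w; theta) when |w| <= 1
gap = integral_gap(w, mu_samples, nu_samples)
w1 = wasserstein1_1d(mu_samples, nu_samples)
```

In this toy setting `gap` never exceeds `B1 * w1`, mirroring the first inequality in (A.25); multiplying by $\alpha$ and using $\mathcal{W}_{1}\leq\mathcal{W}_{2}$ gives the remaining steps.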

Plugging (A.27) and (A.28) into (A.24), we establish an upper bound for $\big\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\big\|_{F}$ ,

\displaystyle\big{\|}\nabla_{\theta}v^{f}(\theta;\beta_{s})\big{\|}_{F}\leq% \mathcal{O}(\alpha).

(A.29)

Similarly, by the definition of $v^{g}$ in (3) we have that

	$\displaystyle\big\|\nabla_{\omega}v^{g}(\omega;\beta_{s})\big\|_{F}$	$\displaystyle\leq\alpha\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\big\|\Phi(X,Z;f(\cdot;\beta_{s}))\big\|+\big\|g(Z;\beta_{s})\big\|\Bigr]\cdot\underset{z\in\mathcal{Z}}{\sup}\big\|\nabla^{2}_{\omega,\omega}\psi(z;\omega)\big\|_{F}^{2}$
		$\displaystyle\leq\alpha B_{2}\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\big\|\Phi(X,Z;\bm{0})\big\|+C_{2}\bigl\|f(W;\beta_{s})\bigr\|+\big\|g(Z;\beta_{s})\big\|\Bigr]=\mathcal{O}(\alpha).$		(A.30)

Combining the bounds in (A.29) and (A.30) and plugging them into (A.23), it holds that

\displaystyle\Big{\|}\nabla v(\theta,\omega;\beta_{s})\Big{\|}_{F}^{2}=\Big{\|% }\nabla_{\theta}v^{f}(\theta;\beta_{s})\Big{\|}_{F}^{2}+\Big{\|}\nabla_{\omega% }v^{g}(\omega;\beta_{s})\Big{\|}_{F}^{2}\leq\mathcal{O}(\alpha^{2}).

(A.31)

Equations (A.22) and (A.31) provide upper bounds on the two factors appearing in (A.21). Plugging in these bounds, it holds that

\displaystyle\int\Big\langle v(\theta,\omega;\beta_{s}),\partial_{s}(u_{s}(\theta,\omega)\beta_{s}(\theta,\omega))\Big\rangle\mathrm{d}(\theta,\omega)\leq\mathcal{O}(\alpha^{-1}).

(A.32)

Now combining (A.17), (A.20), and (A.32), we have that

\displaystyle\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}% }{\mathrm{d}t}

\displaystyle\;\leq-\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi(X,Z;f% (\cdot,\rho_{t})-f(\cdot;\rho^{*}))+\bigl{(}g(Z;\rho_{t})-g(Z;\rho^{*})\bigr{)% }^{2}\Bigr{]}+C_{*}\cdot\eta\cdot\alpha^{-1}.

where $C_{*}=C_{*}\bigl{(}B_{0},B_{1},B_{2},C,\lambda,\bar{D}\bigr{)}>0$ is a constant. This completes the proof of Lemma A.1. ∎

We are now ready to present the proof of Theorem 4.7 with the help of Lemma A.1.

Proof.

We define

\displaystyle t^{*}=\inf\Bigl{\{}\tau\in\mathbb{R}_{+}{\,\bigg{|}\,}\mathbb{E}% _{\mathcal{D}}\bigl{[}\lambda\Psi(X,Z;f(\cdot,\rho_{\tau})-f(\cdot;\rho^{*}))+% \bigl{(}g(Z;\rho_{\tau})-g(Z;\rho^{*})\bigr{)}^{2}\bigr{]}<C_{*}\cdot\alpha^{-% 1}\Bigr{\}}

(A.33)

Also, we define

\displaystyle t_{*}=\inf\Bigl{\{}\tau\in\mathbb{R}_{+}{\,\bigg{|}\,}\mathcal{W% }_{2}(\rho_{\tau},\rho^{*})>2\mathcal{W}_{2}(\rho_{0},\rho^{*})\Bigr{\}}

(A.34)

In other words, (A.16) of Lemma A.1 holds for $t\leq t_{*}$ , and for $0\leq t\leq\min\{t_{*},t^{*}\}$ , we have

\displaystyle\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}% }{\mathrm{d}t}\leq-\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi(X,Z;f(% \cdot,\rho_{t})-f(\cdot;\rho^{*}))+\bigl{(}g(Z;\rho_{t})-g(Z;\rho^{*})\bigr{)}% ^{2}\Bigr{]}+C_{*}\cdot\eta\alpha^{-1}\leq 0

We now show that $t_{*}>t^{*}$ by contradiction. By the continuity of $\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}$ with respect to $t$ (Ambrosio et al., 2008), and since $\mathcal{W}_{2}(\rho_{0},\rho^{*})<2\mathcal{W}_{2}(\rho_{0},\rho^{*})$ , it holds that $t_{*}>0$ . Suppose that $t_{*}\leq t^{*}$ , so that $t_{*}=\min\{t_{*},t^{*}\}$ . Then, by (A.16), (A.33), and (A.34), it holds for all $0\leq t\leq t_{*}$ that

\displaystyle\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}% }{\mathrm{d}t}\leq 0

which further implies that $\mathcal{W}_{2}(\rho_{t_{*}},\rho^{*})\leq\mathcal{W}_{2}(\rho_{0},\rho^{*})$ . This contradicts the definition of $t_{*}$ in (A.34). Thus, it holds that $t_{*}\geq t^{*}$ , which implies that (A.16) of Lemma A.1 holds for any $0\leq t\leq t^{*}$ . We now discuss two different situations.

Scenario (i) If $t_{*}\leq T$ , then it holds that

		$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi(X,Z;f(\cdot;\mu_{t})-f^{*})+\bigl(g(Z;\nu_{t})-g^{*}\bigr)^{2}\Bigr]$
		$\displaystyle\quad\leq\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi(X,Z;f(\cdot;\mu_{t_{*}})-f^{*})+\bigl(g(Z;\nu_{t_{*}})-g^{*}\bigr)^{2}\Bigr]$
		$\displaystyle\quad<C_{*}\alpha^{-1}=\mathcal{O}(T^{-1}+\alpha^{-1}).$		(A.35)

Therefore, (A.35) implies Theorem 4.7 in this scenario.

Scenario (ii) If $t_{*}>T$ , then (A.16) in Lemma A.1 holds for $0\leq t\leq T$ . Re-arranging the terms, we have the following inequality for all $0\leq t\leq T$ ,

\displaystyle\mathbb{E}_{\mathcal{D}}\Bigl{[}\lambda\Psi(X,Z;f(\cdot,\mu_{t})-% f^{*})+\bigl{(}g(Z;\nu_{t})-g^{*}\bigr{)}^{2}\Bigr{]}\leq-\eta^{-1}\cdot\frac{% 1}{2}\frac{\mathrm{d}\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}}{\mathrm{d}t}+C_{*% }\cdot\alpha^{-1}

(A.36)

This further suggests the following upper bound,

		$\displaystyle\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi(X,Z;f(\cdot;\mu_{t})-f^{*})+\bigl(g(Z;\nu_{t})-g^{*}\bigr)^{2}\Bigr]$
		$\displaystyle\quad\leq T^{-1}\cdot\int_{0}^{T}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi(X,Z;f(\cdot;\mu_{t})-f^{*})+\bigl(g(Z;\nu_{t})-g^{*}\bigr)^{2}\Bigr]\mathrm{d}t$
		$\displaystyle\quad\leq 1/2\cdot\eta^{-1}\cdot T^{-1}\cdot\mathcal{W}_{2}(\rho_{0},\rho^{*})^{2}+C_{*}\cdot\alpha^{-1}$
		$\displaystyle\quad\leq 1/2\cdot\alpha^{-2}\cdot\bar{D}^{2}\cdot\eta^{-1}\cdot T^{-1}+C_{*}\cdot\alpha^{-1}=\mathcal{O}(T^{-1}+\alpha^{-1}),$		(A.37)

where the first inequality bounds the infimum by the time average, the second inequality follows from integrating (A.36) over $t\in[0,T]$ , the third inequality follows from item (ii) of Lemma 4.6, and the last equality follows from setting $\eta=\alpha^{-2}$ . Therefore, (A.37) implies Theorem 4.7 in this scenario.
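For the reader's convenience, the integration step can be written out explicitly. The following is a sketch that assumes only (A.36), the nonnegativity of $\mathcal{W}_{2}(\rho_{T},\rho^{*})^{2}$, the bound $\mathcal{W}_{2}(\rho_{0},\rho^{*})\leq\alpha^{-1}\bar{D}$ from Lemma 4.6, and the choice $\eta=\alpha^{-2}$.

```latex
\begin{align*}
\int_{0}^{T}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi(X,Z;f(\cdot;\mu_{t})-f^{*})
   +\bigl(g(Z;\nu_{t})-g^{*}\bigr)^{2}\Bigr]\,\mathrm{d}t
 &\leq \frac{1}{2\eta}\Bigl(\mathcal{W}_{2}(\rho_{0},\rho^{*})^{2}
   -\mathcal{W}_{2}(\rho_{T},\rho^{*})^{2}\Bigr)+C_{*}\alpha^{-1}T \\
 &\leq \frac{1}{2\eta}\,\mathcal{W}_{2}(\rho_{0},\rho^{*})^{2}+C_{*}\alpha^{-1}T .
\end{align*}
% Divide by T, then use W_2(rho_0, rho^*) <= alpha^{-1} \bar{D} and eta = alpha^{-2}:
\[
\frac{1}{2\eta T}\,\alpha^{-2}\bar{D}^{2}+C_{*}\alpha^{-1}
 \;=\; \frac{\bar{D}^{2}}{2T}+\frac{C_{*}}{\alpha}.
\]
```

The right-hand side makes the two sources of error visible: the $T^{-1}$ term comes from the initial distance to the stationary point, and the $\alpha^{-1}$ term from the residual $C_{*}\alpha^{-1}$ in Lemma A.1.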

Based on the discussion of scenarios (i) and (ii) above, we finish the proof of Theorem 4.7. ∎

A.3 Proof of Theorem 4.9

Proof.

We now prove Theorem 4.9. For notational simplicity, we denote by $f_{t}=f(\cdot;\mu_{t})$ the estimator at time $t$ . Recall the definitions of $J(f)$ in (2.3) and $\bar{\delta}(z;f)$ in (2.2),

\displaystyle J(f)=\mathbb{E}_{\mathcal{D}}\bigl{[}1/2\cdot\bar{\delta}(Z;f)^{% 2}+\lambda\cdot\Psi(X,Z;f)\bigr{]},\quad\bar{\delta}(z;f)=\mathbb{E}_{X|Z}% \bigl{[}\Phi(X,Z;f){\,\big{|}\,}Z=z\bigr{]}.

Plugging the definition of $J(f)$ , it holds that

		$\displaystyle\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$
		$\displaystyle\qquad=\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Big[1/2\cdot\Big(\bar{\delta}(Z;f_{t})^{2}-\bar{\delta}(Z;f^{*})^{2}\Big)+\lambda\Big(\Psi(X,Z;f_{t})-\Psi(X,Z;f^{*})\Big)\Big].$		(A.38)

Similar to the proof of Theorem 4.7, we define $t_{*}$ as,

\displaystyle t_{*}=\inf\Bigl{\{}\tau\in\mathbb{R}_{+}{\,\bigg{|}\,}\mathcal{W% }_{2}(\rho_{\tau},\rho^{*})>2\mathcal{W}_{2}(\rho_{0},\rho^{*})\Bigr{\}}.

We upper-bound the term in (A.38) separately in two scenarios, depending on the value of $t_{*}$ compared with $T$ .

Scenario (i) If $t_{*}\leq T$ , then we have that

\displaystyle\inf_{t\in[0,T]}J(f_{t})-J(f^{*})\leq J(f_{t_{*}})-J(f^{*}).

(A.39)

In order to upper-bound the right-hand side of (A.39), we need to uniformly upper-bound $f_{t_{*}}(w)$ and $f^{*}(w)$ for all $w\in\mathcal{W}$ . For $f_{t_{*}}(w)=f(w;\mu_{t_{*}})$ , we have that

$\displaystyle\underset{w\in\mathcal{W}}{\sup}\|f(w;\mu_{t_{*}})\|$	$\displaystyle\;=\alpha\cdot\underset{w\in\mathcal{W}}{\sup}\Bigl\|\int\phi(w;\theta)\mathrm{d}\mu_{t_{*}}(\theta)\Bigr\|=\alpha\cdot\underset{w\in\mathcal{W}}{\sup}\Bigl\|\int\phi(w;\theta)\mathrm{d}(\mu_{t_{*}}-\mu_{0})(\theta)\Bigr\|$
	$\displaystyle\;\leq\alpha B_{1}\cdot\mathcal{W}_{1}(\mu_{t_{*}},\mu_{0})\leq\alpha B_{1}\cdot\mathcal{W}_{2}(\mu_{t_{*}},\mu_{0})\;\leq\alpha B_{1}\cdot\Bigl(\mathcal{W}_{2}(\rho_{t_{*}},\rho^{*})+\mathcal{W}_{2}(\rho_{0},\rho^{*})\Bigr)$
	$\displaystyle\;\leq 3B_{1}\cdot\bar{D}=\mathcal{O}(1),$	(A.40)

where the first inequality follows from Lemma D.7, the second inequality follows from Lemma D.2, and the last inequality follows from item (ii) of Lemma 4.6 and the definition of $t_{*}$ . For $f^{*}$ , a similar chain of inequalities applies,

$\displaystyle\underset{w\in\mathcal{W}}{\sup}\|f(w;\mu^{*})\|$	$\displaystyle\;=\alpha\cdot\underset{w\in\mathcal{W}}{\sup}\Bigl\|\int\phi(w;\theta)\mathrm{d}\mu^{*}(\theta)\Bigr\|=\alpha\cdot\underset{w\in\mathcal{W}}{\sup}\Bigl\|\int\phi(w;\theta)\mathrm{d}(\mu^{*}-\mu_{0})(\theta)\Bigr\|$
	$\displaystyle\;\leq\alpha B_{1}\cdot\mathcal{W}_{1}(\mu^{*},\mu_{0})\leq\alpha B_{1}\cdot\mathcal{W}_{2}(\mu^{*},\mu_{0})\;\leq\alpha B_{1}\cdot\mathcal{W}_{2}(\rho^{*},\rho_{0})$
	$\displaystyle\;\leq B_{1}\cdot\bar{D}=\mathcal{O}(1).$	(A.41)

With uniform bounds on $f_{t_{*}}$ and $f^{*}$ , we are now ready to upper-bound $\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$ through upper-bounding $J(f_{t_{*}})-J(f^{*})$ ,

$\displaystyle J(f_{t_{*}})-J(f^{*})$	$\displaystyle\;\leq\mathbb{E}_{\mathcal{D}}\Big[\bar{\delta}(Z;f_{t_{*}})\cdot\mathbb{E}_{X|Z}\big[\widetilde{\Phi}(X,Z;f_{t_{*}}-f^{*})\,\big|\,Z\big]+\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f_{t_{*}})}{\delta f},f_{t_{*}}-f^{*}\Big\rangle_{L^{2}}\Big]$
	$\displaystyle\;\leq\Big(\underset{x,z}{\sup}\|\Phi(x,z;0)\|+C_{\Phi}\cdot\underset{w\in\mathcal{W}}{\sup}\|f(w;\mu_{t_{*}})\|\Big)\cdot\mathbb{E}_{\mathcal{D}}\Big[C_{\Phi}\cdot\|f(W;\mu_{t_{*}})-f(W;\mu^{*})\|\Big]$
	$\displaystyle\qquad+\lambda C_{\Psi}\cdot\mathbb{E}_{\mathcal{D}}\Big[\|f(W;\mu_{t_{*}})-f(W;\mu^{*})\|\Big]$
	$\displaystyle\;\leq B_{*}\cdot\mathbb{E}_{\mathcal{D}}\Big[\|f(W;\mu_{t_{*}})-f(W;\mu^{*})\|\Big]\leq B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\lambda\|f(W;\mu_{t_{*}})-f(W;\mu^{*})\|^{2}\Big]\Big)^{1/2}$
	$\displaystyle\;\leq B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t_{*}}-f^{*})\Big]\Big)^{1/2}\leq B_{*}\cdot\alpha^{-1/2},$	(A.42)

where $B_{*}=B_{*}(\Phi,c_{\phi},C_{\Phi},C_{\Psi},\lambda,C,B_{1},\bar{D},C_{*})>0$ is a constant whose value may change from line to line. The second inequality follows from (A.40) and (A.41). The last inequality follows from (A.35) in the proof of Theorem 4.7. Therefore, in this scenario, we have that

\displaystyle\inf_{t\in[0,T]}J(f_{t})-J(f^{*})\leq J(f_{t_{*}})-J(f^{*})\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).

(A.43)

Equation (A.43) concludes the proof of Theorem 4.9 in the scenario of $t_{*}\leq T$ .

Scenario (ii) If $t_{*}>T$ , by definition of $t_{*}$ , we have that

\displaystyle\mathcal{W}_{2}(\mu_{t},\mu^{*})\leq\mathcal{W}_{2}(\rho_{t},\rho^{*})\leq 2\mathcal{W}_{2}(\rho_{0},\rho^{*})\leq 2\alpha^{-1}\cdot\bar{D},\quad\forall\;t\in[0,T].

Following the same arguments as in (A.40) and (A.41), we obtain a uniform upper bound for $f_{t}$ , for all $t\in[0,T]$ , and for $f^{*}$ ,

\displaystyle\underset{w\in\mathcal{W}}{\sup}\bigl|f(w;\mu_{t})\bigr|+\bigl|f(w;\mu^{*})\bigr|\leq 4B_{1}\cdot\bar{D}=\mathcal{O}(1),\quad\forall\;t\in[0,T].

Following the same derivation as in (A.42), we have that

$\displaystyle\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$	$\displaystyle\;\leq\inf_{t\in[0,T]}B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\Big)^{1/2}$
	$\displaystyle\;\leq B_{*}\cdot\Big(\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\Big)^{1/2}$
	$\displaystyle\;\leq B_{*}\cdot\Bigl(T^{-1}\cdot\int_{0}^{T}\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\mathrm{d}t\Bigr)^{1/2}$
	$\displaystyle\;\leq B_{*}\cdot\sqrt{1/2\cdot\bar{D}^{2}\cdot T^{-1}+C_{*}\cdot\alpha^{-1}}=\mathcal{O}(T^{-1/2}+\alpha^{-1/2}),$	(A.44)

where the last inequality follows from (A.36) and (A.37) in the proof of Theorem 4.7. Equation (A.44) concludes the proof of Theorem 4.9 in the scenario of $t_{*}>T$ .

Based on the discussion of scenarios (i) and (ii) above, we finish the proof of Theorem 4.9. ∎

Appendix B Mean Field Limit of Neural Networks

In this section, we prove Proposition 4.4. The formal version is presented as follows. Let $\rho_{t}(\theta,\omega)=\mu_{t}(\theta)\otimes\nu_{t}(\omega)$ , where $(\mu_{t},\nu_{t})$ is the PDE solution in (3.7) and $\widehat{\rho}_{k}(\theta,\omega)=N^{-1}\cdot\sum_{i=1}^{N}\delta_{\theta_{k}^% {i}}(\theta)\cdot\delta_{\omega_{k}^{i}}(\omega)$ is the empirical distribution of $(\bm{\theta}_{k},\bm{\omega}_{k})$ . Here we omit the dependence of the empirical distribution $\widehat{\rho}_{k}$ on $N$ and stepsize scale $\epsilon$ for notational simplicity.

Proposition B.1 (Formal Version of Proposition 4.4).

Let $h:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ be any continuous function such that $\|h\|_{\infty}\leq 1$ and $\operatorname{Lip}(h)\leq 1$ . Under Assumptions 4.1 and 4.3, with probability at least $1-5\delta$ , it holds that

\displaystyle\sup_{\begin{subarray}{c}k\leq T/\epsilon\\ (k\in\mathbb{N})\end{subarray}}\left|\int h(\theta,\omega)\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-\int h(\theta,\omega)\mathrm{d}\widehat{\rho}_{k}(\theta,\omega)\right|\leq B\cdot e^{BT}\cdot\Bigl(\sqrt{\log(N/\delta)/N}+\sqrt{\epsilon\cdot(D+\log(N/\delta))}\Bigr).

Here $B$ is a constant that depends on $\alpha,\eta,\lambda,B_{0},B_{1}$ and $B_{2}$ .
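To get a feel for how the bound in Proposition B.1 scales with the width $N$ and the stepsize scale $\epsilon$, the sketch below evaluates only the bracketed rate factor, leaving out the prefactor $B\cdot e^{BT}$ whose value is not specified. The function name and the sample parameter values are hypothetical.

```python
import math


def mean_field_rate(num_particles, step_scale, dim, delta):
    # Rate factor sqrt(log(N/delta)/N) + sqrt(eps * (D + log(N/delta))),
    # i.e. the bound of Proposition B.1 without the prefactor B * exp(B*T).
    log_term = math.log(num_particles / delta)
    width_error = math.sqrt(log_term / num_particles)
    step_error = math.sqrt(step_scale * (dim + log_term))
    return width_error + step_error


# Hypothetical settings: widening the network and shrinking the stepsize
# together reduces both contributions to the rate factor.
coarse = mean_field_rate(num_particles=1_000, step_scale=1e-3, dim=10, delta=0.01)
fine = mean_field_rate(num_particles=100_000, step_scale=1e-5, dim=10, delta=0.01)
```

The first summand shrinks as $N$ grows and the second as $\epsilon$ shrinks, so both the infinite-width and the continuous-time limits are needed for the factor to vanish.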

The proof of Proposition B.1 is based heavily on Mei et al. (2018, 2019); Araújo et al. (2019); Zhang et al. (2020), which make use of the propagation of chaos arguments in Sznitman (1991). Recall that $(v^{f}(\cdot;\rho),v^{g}(\cdot;\rho))$ is the vector field defined as

	$\displaystyle v^{f}(\theta;\rho)$	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\Bigl[-g(Z;\rho)\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\rho))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}-\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\rho))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\Bigr],$
	$\displaystyle v^{g}(\omega;\rho)$	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\Bigl[\Phi(X,Z;f(\cdot;\rho))\cdot\nabla_{\omega}\psi(Z;\omega)-g(Z;\rho)\cdot\nabla_{\omega}\psi(Z;\omega)\Bigr].$		(B.1)

From now on, we equivalently write $\theta_{k}^{i}=\theta_{i}(k)$ and $\omega_{k}^{i}=\omega_{i}(k)$ to emphasize the dependence on the iteration. For brevity, we denote $\theta^{(N)}(k)=\{\theta_{i}(k)\}_{i=1}^{N}$ and $\omega^{(N)}(k)=\{\omega_{i}(k)\}_{i=1}^{N}$ . We recall that the finite-width representations of $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are

\displaystyle f(\cdot,\theta^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}^{N}\phi(% \cdot;\theta_{i}),\qquad g(\cdot,\omega^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}% ^{N}\psi(\cdot;\omega_{i}).

Correspondingly, we define the finite-width counterparts of $v^{f}$ and $v^{g}$ as follows,

$\displaystyle\widehat{v}^{f}(\theta;\theta^{(N)},\omega^{(N)})$	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\bigg[-g(Z;\omega^{(N)})\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\theta^{(N)}))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}$
	$\displaystyle\quad\quad-\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\theta^{(N)}))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\bigg],$
$\displaystyle\widehat{v}^{g}(\omega;\theta^{(N)},\omega^{(N)})$	$\displaystyle=\alpha\mathbb{E}_{\mathcal{D}}\bigg[\Phi(X,Z;f(\cdot;\theta^{(N)}))\cdot\nabla_{\omega}\psi(Z;\omega)-g(Z;\omega^{(N)})\cdot\nabla_{\omega}\psi(Z;\omega)\bigg].$	(B.2)

We also define their stochastic counterparts,

$\displaystyle\widehat{V}_{k}^{f}(\theta;\theta^{(N)},\omega^{(N)})$	$\displaystyle=\alpha\bigg[-g(z_{k};\omega^{(N)})\cdot\Big\langle\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}$
	$\displaystyle\quad\quad-\lambda\cdot\Big\langle\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\bigg],$
$\displaystyle\widehat{V}_{k}^{g}(\omega;\theta^{(N)},\omega^{(N)})$	$\displaystyle=\alpha\Big(\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))\cdot\nabla_{\omega}\psi(z_{k};\omega)-g(z_{k};\omega^{(N)})\cdot\nabla_{\omega}\psi(z_{k};\omega)\Big),$	(B.3)

where $(x_{k},z_{k})\sim\mathcal{D}$ . Following Mei et al. (2019); Araújo et al. (2019), we consider the following four dynamics.

•

Stochastic Gradient Descent Ascent (SGDA). We consider the following SGDA dynamics for $\theta^{(N)}(k)$ and $\omega^{(N)}(k)$ , where $k\in\mathbb{N}$ , with $\theta_{i}(0)\stackrel{\mathrm{i.i.d.}}{\sim}\mu_{0}$ , $\omega_{i}(0)\stackrel{\mathrm{i.i.d.}}{\sim}\nu_{0}\;(i\in[N])$ as initialization,

	$\displaystyle\theta_{i}(k+1)$	$\displaystyle=\theta_{i}(k)+\eta\epsilon\cdot\widehat{V}_{k}^{f}(\theta_{i}(k)% ;\theta^{(N)}(k),\omega^{(N)}(k)),$
	$\displaystyle\omega_{i}(k+1)$	$\displaystyle=\omega_{i}(k)+\eta\epsilon\cdot\widehat{V}_{k}^{g}(\omega_{i}(k)% ;\theta^{(N)}(k),\omega^{(N)}(k)).$		(B.4)

Note that this dynamics is equivalent to (3).

•

Population Gradient Descent Ascent (PGDA). We consider the following population gradient descent ascent dynamics for $\breve{\theta}^{(N)}(k)$ and $\breve{\omega}^{(N)}(k)$ , where $k\in\mathbb{N}$ , with $\breve{\theta}_{i}(0)=\theta_{i}(0)$ , $\breve{\omega}_{i}(0)=\omega_{i}(0)\;(i\in[N])$ as its initialization,

	$\displaystyle\breve{\theta}_{i}(k+1)$	$\displaystyle=\breve{\theta}_{i}(k)+\eta\epsilon\cdot\widehat{v}^{f}(\breve{% \theta}_{i}(k);\breve{\theta}^{(N)}(k),\breve{\omega}^{(N)}(k)),$
	$\displaystyle\breve{\omega}_{i}(k+1)$	$\displaystyle=\breve{\omega}_{i}(k)+\eta\epsilon\cdot\widehat{v}^{g}(\breve{\omega}_{i}(k);\breve{\theta}^{(N)}(k),\breve{\omega}^{(N)}(k)).$		(B.5)

•

Continuous-time Population Gradient Descent Ascent (CTPGDA). We consider the following continuous time population gradient descent ascent dynamics for $\widetilde{\theta}^{(N)}(t)$ and $\widetilde{\omega}^{(N)}(t)$ , where $t\in\mathbb{R}_{+}$ , with $\widetilde{\theta}_{i}(0)=\theta_{i}(0)$ , $\widetilde{\omega}_{i}(0)=\omega_{i}(0)\;(i\in[N])$ as initialization,

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\widetilde{\theta}_{i}(t)

\displaystyle=\eta\cdot\widehat{v}^{f}(\widetilde{\theta}_{i}(t);\widetilde{% \theta}^{(N)}(t),\widetilde{\omega}^{(N)}(t)),\qquad\frac{\mathrm{d}}{\mathrm{% d}t}\widetilde{\omega}_{i}(t)

\displaystyle=\eta\cdot\widehat{v}^{g}(\widetilde{\omega}_{i}(t);\widetilde{% \theta}^{(N)}(t),\widetilde{\omega}^{(N)}(t)).

(B.6)

•

Ideal particle (IP). We consider the following ideal particle dynamics for $\bar{\theta}^{(N)}(t)$ and $\bar{\omega}^{(N)}(t)$ , where $t\in\mathbb{R}_{+}$ , with $\bar{\theta}_{i}(0)=\theta_{i}(0)$ , $\bar{\omega}_{i}(0)=\omega_{i}(0)\;(i\in[N])$ as initialization,

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\bar{\theta}_{i}(t)=\eta\cdot v^{f}% (\bar{\theta}_{i}(t);\rho_{t}),\qquad\frac{\mathrm{d}}{\mathrm{d}t}\bar{\omega% }_{i}(t)=\eta\cdot v^{g}(\bar{\omega}_{i}(t);\rho_{t}).

(B.7)
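To make the SGDA update (B.4) with the stochastic fields (B.3) concrete, here is a minimal sketch on a toy instance. Everything specific is an assumption made for illustration: scalar inputs with $\mathcal{W}=\mathcal{X}$, the neuron $\tanh$ for both $\phi$ and $\psi$, the functionals $\Phi(x,z;f)=f(x)-x$ and $\Psi(x,z;f)=f(x)^{2}/2$ (so that the pairing with $\delta\Phi/\delta f$ reduces to evaluation at $x$), the hyperparameter values, and the sampling scheme for $(x_{k},z_{k})$. It is not the paper's experimental setup.

```python
import math
import random

ALPHA, ETA, EPS, LAM = 2.0, 0.25, 0.01, 0.1   # hypothetical hyperparameters


def neuron(u, param):
    # Hypothetical odd neuron used for both phi(w; theta) and psi(z; omega).
    return math.tanh(param * u)


def neuron_grad(u, param):
    # Derivative of tanh(param * u) with respect to param.
    return u * (1.0 - math.tanh(param * u) ** 2)


def network(u, params):
    # Finite-width network: (alpha / N) * sum_i neuron(u; params_i).
    return ALPHA / len(params) * sum(neuron(u, p) for p in params)


def sgda_step(thetas, omegas, x, z):
    # One step of (B.4) with the stochastic fields of (B.3), for the toy
    # choices Phi(x, z; f) = f(x) - x and Psi(x, z; f) = f(x)^2 / 2.
    f_x, g_z = network(x, thetas), network(z, omegas)
    phi_val = f_x - x
    new_thetas = [
        t + ETA * EPS * ALPHA * (-g_z - LAM * f_x) * neuron_grad(x, t)
        for t in thetas
    ]
    new_omegas = [
        o + ETA * EPS * ALPHA * (phi_val - g_z) * neuron_grad(z, o)
        for o in omegas
    ]
    return new_thetas, new_omegas


rng = random.Random(0)
thetas = [rng.gauss(0.0, 1.0) for _ in range(50)]   # theta_i(0) ~ mu_0
omegas = [rng.gauss(0.0, 1.0) for _ in range(50)]   # omega_i(0) ~ nu_0
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    z = x + 0.1 * rng.gauss(0.0, 1.0)               # toy correlated sample
    thetas, omegas = sgda_step(thetas, omegas, x, z)
```

Replacing the single sample `(x, z)` by an average over the data distribution would give the PGDA update (B.5), and letting the step `ETA * EPS` tend to zero would give the continuous-time dynamics (B.6); both particle lists are updated simultaneously from the same pre-update values, as in (B.4).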

We aim to prove that $\widehat{\rho}_{k}=N^{-1}\cdot\sum_{i=1}^{N}\delta_{\theta_{i}(k)}\cdot\delta_{\omega_{i}(k)}$ converges weakly to $\rho_{k\epsilon}$ . For any continuous function $h$ that satisfies the assumptions of Proposition B.1, using the IP, CTPGDA, and PGDA dynamics as interpolating dynamics, we have

		$\displaystyle\overbrace{\Bigl\|\int h(\theta,\omega)\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-\int h(\theta,\omega)\mathrm{d}\widehat{\rho}_{k}(\theta,\omega)\Bigr\|}^{\mathrm{PDE}-\mathrm{SGDA}}$
		$\displaystyle\qquad\leq\underbrace{\left\|\int h(\theta,\omega)\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-N^{-1}\cdot\sum_{i=1}^{N}h\left(\bar{\theta}_{i}(k\epsilon),\bar{\omega}_{i}(k\epsilon)\right)\right\|}_{\mathrm{PDE}-\mathrm{IP}}+\underbrace{\left\|(\bar{\theta},\bar{\omega})^{(N)}(k\epsilon)-(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)\right\|_{(N)}}_{\mathrm{IP}-\mathrm{CTPGDA}}$
		$\displaystyle\qquad+\underbrace{\left\|(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)-(\breve{\theta},\breve{\omega})^{(N)}(k)\right\|_{(N)}}_{\mathrm{CTPGDA}-\mathrm{PGDA}}+\underbrace{\left\|(\breve{\theta},\breve{\omega})^{(N)}(k)-(\theta,\omega)^{(N)}(k)\right\|_{(N)}}_{\mathrm{PGDA}-\mathrm{SGDA}}.$		(B.8)

The inequality follows from the triangle inequality and the fact that $\operatorname{Lip}(h)\leq 1$ . Here the norm $\|\cdot\|_{(N)}$ denotes the supremum norm over the sequence of vectors $(\theta,\omega)^{(N)}=\{(\theta_{i},\omega_{i})\}_{i=1}^{N}$ ,

\displaystyle\Bigl{\|}(\theta,\omega)^{(N)}\Bigr{\|}_{(N)}=\sup_{i\in[N]}~{}% \Bigl{\|}(\theta_{i},\omega_{i})\Bigr{\|}.

(B.9)
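As a small illustration of the norm in (B.9) and of the Lipschitz step used in (B.8), the following sketch computes the supremum norm for two toy particle systems and compares the gap between their empirical averages of a test function with their distance. The function names, the particular test function `h`, and the numerical values are hypothetical choices made only for this example.

```python
import math

def joint_norm(theta, omega):
    """Euclidean norm of the concatenated vector (theta_i, omega_i)."""
    return math.sqrt(sum(x * x for x in theta) + sum(y * y for y in omega))

def particle_sup_norm(thetas, omegas):
    """The norm in (B.9): maximum of joint_norm over the N particles."""
    return max(joint_norm(t, o) for t, o in zip(thetas, omegas))

def particle_distance(thetas_a, omegas_a, thetas_b, omegas_b):
    """Sup-norm distance between two particle systems of equal size."""
    diff_t = [[x - y for x, y in zip(a, b)] for a, b in zip(thetas_a, thetas_b)]
    diff_o = [[x - y for x, y in zip(a, b)] for a, b in zip(omegas_a, omegas_b)]
    return particle_sup_norm(diff_t, diff_o)

def h(theta, omega):
    """Hypothetical test function: bounded by 1 and 1-Lipschitz."""
    return min(1.0, joint_norm(theta, omega))

def mean_h(thetas, omegas):
    """Empirical average N^{-1} * sum_i h(theta_i, omega_i)."""
    return sum(h(t, o) for t, o in zip(thetas, omegas)) / len(thetas)

# Two toy systems with N = 2 particles and two-dimensional parameters.
thetas_a = [[0.3, 0.0], [0.0, 0.1]]
omegas_a = [[0.4, 0.0], [0.0, 0.1]]
thetas_b = [[0.3, 0.05], [0.0, 0.1]]
omegas_b = [[0.4, 0.0], [0.02, 0.1]]

sup_norm_value = particle_sup_norm(thetas_a, omegas_a)
dist = particle_distance(thetas_a, omegas_a, thetas_b, omegas_b)
gap = abs(mean_h(thetas_a, omegas_a) - mean_h(thetas_b, omegas_b))
print(sup_norm_value, dist, gap)
```

Because `h` is 1-Lipschitz, `gap` cannot exceed `dist`, which is the mechanism that turns the last three terms of (B.8) into distances between particle systems.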

In what follows, $B>0$ denotes a constant whose value may vary from line to line. We establish the following lemmas, which upper-bound the four terms on the right-hand side of (B.8).

Lemma B.2 (Upper Bound of $\mathrm{PDE}-\mathrm{IP}$ ).

Under Assumptions 4.1 and 4.3, with probability at least $1-\delta$ , it holds that

\displaystyle\sup_{t\in[0,T]}\Bigl{|}\int h(\theta,\omega)\mathrm{d}\rho_{t}(% \theta,\omega)-N^{-1}\sum_{i=1}^{N}h\bigl{(}\bar{\theta}_{i}(t),\bar{\omega}_{% i}(t)\bigr{)}\Bigr{|}\leq B\cdot\sqrt{\log(NT/\delta)/N}.

(B.10)

Lemma B.3 (Upper Bound of $\mathrm{IP}-\mathrm{CTPGDA}$ ).

Under Assumptions 4.1 and 4.3, with probability at least $1-2\delta$ , it holds that

\displaystyle\sup_{t\in[0,T]}\Bigl{\|}(\bar{\theta},\bar{\omega})^{(N)}(t)-(% \widetilde{\theta},\widetilde{\omega})^{(N)}(t)\Bigr{\|}_{(N)}\leq B\cdot e^{% BT}\cdot\sqrt{\log(N/\delta)/N}.

(B.11)

Lemma B.4 (Upper Bound of $\mathrm{CTPGDA}-\mathrm{PGDA}$ ).

Under Assumptions 4.1 and 4.3, it holds that

\displaystyle\sup_{k\leq T/\epsilon}\Bigl\|(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)-(\breve{\theta},\breve{\omega})^{(N)}(k)\Bigr\|_{(N)}\leq B\cdot e^{BT}\cdot\epsilon.

(B.12)

Lemma B.5 (Upper Bound of $\mathrm{PGDA}-\mathrm{SGDA}$ ).

Under Assumptions 4.1 and 4.3, with probability at least $1-2\delta$ , it holds that

\displaystyle\sup_{k\leq T/\epsilon}\Bigl\|(\breve{\theta},\breve{\omega})^{(N)}(k)-(\theta,\omega)^{(N)}(k)\Bigr\|_{(N)}\leq B\cdot e^{BT}\cdot\sqrt{\epsilon\cdot(D+\log(N/\delta))}.

(B.13)

With these lemmas, we are now ready to present the proof of Proposition B.1.

Proof.

See §B.1.1, B.1.2, B.1.3, B.1.4 for detailed proofs for Lemma B.2 to Lemma B.5.

Plugging (B.10), (B.11), (B.12), and (B.13) into (B.8) and conditioning on the intersection of the events in Lemmas B.2, B.3, B.4, and B.5, we have that

\displaystyle\Bigl{|}\int h(\theta,\omega)\mathrm{d}\rho_{k\epsilon}(\theta,% \omega)-\int h(\theta,\omega)\mathrm{d}\widehat{\rho}_{k}(\theta,\omega)\Bigr{% |}\leq B\cdot e^{BT}\cdot\Big{(}\sqrt{\log(N/\delta)/N}+\sqrt{\epsilon\cdot(D+% \log(N/\delta))}\Big{)},

with probability at least $1-5\delta$ . Thus, we complete the proof of Proposition B.1. ∎
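To get a feel for how the two terms in the bound above scale, the following sketch evaluates them for given $N$, $\epsilon$, $D$, and $\delta$. The prefactor $B\cdot e^{BT}$ is not specified numerically in the text, so it is left out; the function name and the sample values are hypothetical.

```python
import math

def rate_terms(N, eps, D, delta):
    """Return the two terms of the bound, without the B * exp(B*T) prefactor.

    particle_term: sqrt(log(N / delta) / N), the finite-width error.
    step_term:     sqrt(eps * (D + log(N / delta))), the stochastic and
                   discretization error.
    """
    log_term = math.log(N / delta)
    particle_term = math.sqrt(log_term / N)
    step_term = math.sqrt(eps * (D + log_term))
    return particle_term, step_term

# Hypothetical settings: a small, coarse run versus a wide, fine-step run.
coarse = rate_terms(N=100, eps=1e-2, D=10, delta=0.05)
fine = rate_terms(N=10_000, eps=1e-4, D=10, delta=0.05)
print(coarse, fine)
```

Increasing $N$ shrinks only the first term, while decreasing $\epsilon$ shrinks only the second, so both must be refined for the overall bound to become small.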

B.1 Proofs of Lemmas B.2-B.5

In this section, we present the proofs of Lemmas B.2-B.5, which are based heavily on Mei et al. (2018, 2019); Araújo et al. (2019); Zhang et al. (2020). The required supporting technical lemmas are in §C. For notational simplicity, $B$ denotes a positive constant whose value varies from line to line.

B.1.1 Proof of Lemma B.2

Proof.

We first consider the ideal particle dynamics in (B.7). It holds that $\bar{\theta}_{i}(t)\sim\mu_{t},\bar{\omega}_{i}(t)\sim\nu_{t},\;(i\in[N])$ (Proposition 8.1.8 in Ambrosio et al. (2008)). Since the randomness of $\bar{\theta}_{i}(t)$ and $\bar{\omega}_{i}(t)$ comes from $\theta_{i}(0)$ and $\omega_{i}(0)$ respectively while $\theta_{i}(0)$ and $\omega_{i}(0)\;(i\in[N])$ are independent, $\bar{\theta}_{i}(t)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mu_{t}$ , $\bar{\omega}_{i}(t)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\nu_{t}\;(i% \in[N])$ . Due to independence of $\bar{\theta}_{i}(t)$ and $\bar{\omega}_{i}(t)$ , we also have $(\bar{\theta}_{i}(t),\bar{\omega}_{i}(t))\stackrel{{\scriptstyle\mathrm{i.i.d.% }}}{{\sim}}\rho_{t}\;(i\in[N])$ . This implies the following,

\displaystyle\mathbb{E}_{\rho_{t}}\Bigl[N^{-1}\cdot\sum_{i=1}^{N}h(\bar{\theta}_{i}(t),\bar{\omega}_{i}(t))\Bigr]=\int h(\theta,\omega)\mathrm{d}\rho_{t}(\theta,\omega).

For notational simplicity, we denote $\gamma_{i}=(\theta_{i},\omega_{i})$ , with analogous notation $\bar{\gamma}_{i},\widetilde{\gamma}_{i},\breve{\gamma}_{i}$ . Let $\gamma^{1,(N)}=\{\gamma_{1},\dots,\gamma_{i}^{1},\dots,\gamma_{N}\}$ and $\gamma^{2,(N)}=\{\gamma_{1},\dots,\gamma_{i}^{2},\dots,\gamma_{N}\}$ be two sets of variables that differ only in the $i$ -th element. Then, by the assumption that $\|h\|_{\infty}\leq 1$ , we have the following bounded difference property,

\displaystyle\Bigl{|}N^{-1}\sum_{j=1}^{N}h(\gamma_{j}^{1})-N^{-1}\sum_{j=1}^{N% }h(\gamma_{j}^{2})\Bigr{|}=N^{-1}\cdot|h(\gamma_{i}^{1})-h(\gamma_{i}^{2})|% \leq 2/N.

Applying McDiarmid’s inequality (Wainwright, 2019), we have for a fixed $t\in[0,T]$ that

\displaystyle\mathbb{P}\left(\left|N^{-1}\sum_{i=1}^{N}h\left(\bar{\gamma}_{i}% (t)\right)-\int h(\gamma)\mathrm{d}\rho_{t}(\gamma)\right|\geq p\right)\leq% \exp\left(-Np^{2}/4\right).

(B.14)

Moreover, we have for any $s,t\in[0,T]$ that,

	$\displaystyle\biggl|\Bigl|N^{-1}\sum_{i=1}^{N}h\bigl(\bar{\gamma}_{i}(t)\bigr)-\int h(\gamma)\mathrm{d}\rho_{t}(\gamma)\Bigr|-\Bigl|N^{-1}\sum_{i=1}^{N}h\bigl(\bar{\gamma}_{i}(s)\bigr)-\int h(\gamma)\mathrm{d}\rho_{s}(\gamma)\Bigr|\biggr|$
	$\displaystyle\qquad\leq\Bigl|N^{-1}\sum_{i=1}^{N}h\bigl(\bar{\gamma}_{i}(t)\bigr)-N^{-1}\sum_{i=1}^{N}h\bigl(\bar{\gamma}_{i}(s)\bigr)\Bigr|+\Bigl|\int h(\gamma)\mathrm{d}\rho_{t}(\gamma)-\int h(\gamma)\mathrm{d}\rho_{s}(\gamma)\Bigr|$
	$\displaystyle\qquad\leq\bigl\|\bar{\gamma}^{(N)}(t)-\bar{\gamma}^{(N)}(s)\bigr\|_{(N)}+\mathcal{W}_{1}\left(\rho_{t},\rho_{s}\right)\leq\bigl\|\bar{\gamma}^{(N)}(t)-\bar{\gamma}^{(N)}(s)\bigr\|_{(N)}+\mathcal{W}_{2}\left(\rho_{t},\rho_{s}\right)$
	$\displaystyle\qquad\leq\bigl\|\bar{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(s)\bigr\|_{(N)}+\bigl\|\bar{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(s)\bigr\|_{(N)}+\mathcal{W}_{2}\left(\mu_{t},\mu_{s}\right)+\mathcal{W}_{2}\left(\nu_{t},\nu_{s}\right),$

where the second inequality follows from the fact that $\operatorname{Lip}(h)\leq 1$ and Lemma D.7. The last inequality follows from the definition of $\gamma^{(N)}$ , (B.9) and Lemma D.2. Applying (C.12), (C.14) of Lemma C.2, we have for any $s,t\in[0,T]$ that

\displaystyle\left|\Bigl{|}N^{-1}\sum_{i=1}^{N}h(\bar{\gamma}_{i}(t))-\int h(% \gamma)\mathrm{d}\rho_{t}\Bigl{|}-\Bigl{|}N^{-1}\cdot\sum_{i=1}^{N}h(\bar{% \gamma}_{i}(s))-\int h(\gamma)\mathrm{d}\rho_{s}\Bigl{|}\right|\leq B\cdot% \Bigl{|}t-s\Bigr{|}.

Applying the union bound to (B.14) for $t\in\iota\cdot\{0,1,\ldots,\lfloor T/\iota\rfloor\}$ , we have that

\displaystyle\mathbb{P}\left(\sup_{t\in[0,T]}\left|N^{-1}\sum_{i=1}^{N}h\left(% \bar{\gamma}_{i}(t)\right)-\int h(\gamma)\mathrm{d}\rho_{t}(\gamma)\right|\geq p% +B\cdot\iota\right)\leq(T/\iota+1)\cdot\exp\left(-Np^{2}/4\right).

Setting $\iota=N^{-1/2}$ and $p=B\cdot\sqrt{\log(NT/\delta)/N}$ , we have that

\displaystyle\sup_{t\in[0,T]}\left|N^{-1}\sum_{i=1}^{N}h\left(\bar{\theta}_{i}(t),\bar{\omega}_{i}(t)\right)-\int h(\theta,\omega)\mathrm{d}\rho_{t}(\theta,\omega)\right|\leq B\cdot\sqrt{\log(NT/\delta)/N}

with probability at least $1-\delta$ . Thus, we complete the proof of Lemma B.2. ∎
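The fixed-time concentration step (B.14) can be illustrated numerically. The sketch below is only a toy stand-in: it draws i.i.d. standard Gaussian samples in place of the ideal particles and uses $h=\cos$, which is bounded by one and 1-Lipschitz and has the closed-form mean $e^{-1/2}$. The threshold is obtained by solving $\exp(-Np^{2}/4)=\delta$ for $p$; all names and numerical values are hypothetical.

```python
import math
import random

def mcdiarmid_threshold(N, delta):
    """Solve exp(-N * p**2 / 4) = delta for p, matching the tail in (B.14)."""
    return math.sqrt(4.0 * math.log(1.0 / delta) / N)

def empirical_gap(N, rng):
    """|N^{-1} * sum_i h(x_i) - E[h(X)]| for h = cos and X ~ N(0, 1)."""
    exact = math.exp(-0.5)  # E[cos(X)] for a standard Gaussian X
    mean = sum(math.cos(rng.gauss(0.0, 1.0)) for _ in range(N)) / N
    return abs(mean - exact)

rng = random.Random(0)      # fixed seed so the run is reproducible
N, delta = 10_000, 1e-9     # a very small delta makes the bound conservative
gap = empirical_gap(N, rng)
bound = mcdiarmid_threshold(N, delta)
print(gap, bound)
```

The threshold decays like $N^{-1/2}$ up to the logarithmic factor, which is the same scaling as the right-hand side of (B.10).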

B.1.2 Proof of Lemma B.3

Following from the definitions of $\widetilde{\theta}_{i}(t)$ , $\widetilde{\omega}_{i}(t)$ and $\bar{\theta}_{i}(t)$ , $\bar{\omega}_{i}(t)$ in (B.6) and (B.7), we have for any $i\in[N]$ and $t\in[0,T]$ that

$\displaystyle\bigl\|\bar{\theta}_{i}(t)-\widetilde{\theta}_{i}(t)\bigr\|$	$\displaystyle\;\leq\int_{0}^{t}\Bigl\|\frac{\mathrm{d}\widetilde{\theta}_{i}(s)}{\mathrm{d}s}-\frac{\mathrm{d}\bar{\theta}_{i}(s)}{\mathrm{d}s}\Bigr\|\mathrm{d}s$
	$\displaystyle\;\leq\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\widetilde{\theta}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s))-\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))\Bigr\|\mathrm{d}s$
	$\displaystyle\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|\mathrm{d}s$
	$\displaystyle\;\leq B\cdot\int_{0}^{t}\Bigl(\Bigl\|\bar{\theta}^{(N)}(s)-\widetilde{\theta}^{(N)}(s)\Bigr\|_{(N)}+\Bigl\|\bar{\omega}^{(N)}(s)-\widetilde{\omega}^{(N)}(s)\Bigr\|_{(N)}\Bigr)\mathrm{d}s$
	$\displaystyle\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|\mathrm{d}s,$	(B.15)

where the last inequality follows from (C.8) of Lemma C.1. Similarly, we have that

	$\displaystyle\bigl\|\bar{\omega}_{i}(t)-\widetilde{\omega}_{i}(t)\bigr\|$	$\displaystyle\;\leq B\cdot\int_{0}^{t}\Bigl(\Bigl\|\bar{\theta}^{(N)}(s)-\widetilde{\theta}^{(N)}(s)\Bigr\|_{(N)}+\Bigl\|\bar{\omega}^{(N)}(s)-\widetilde{\omega}^{(N)}(s)\Bigr\|_{(N)}\Bigr)\mathrm{d}s$
		$\displaystyle\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{g}(\bar{\omega}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{g}(\bar{\omega}_{i}(s);\rho_{s})\Bigr\|\mathrm{d}s,$		(B.16)

where the inequality follows from (C.9). We now upper-bound the second term of (B.15) and (B.16). We start with (B.15). Following from the definition of $v^{f}$ and $\widehat{v}^{f}$ in (B) and (B), we have for any $s\in[0,T]$ and $i\in[N]$ that

\displaystyle\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|=\alpha^{2}\cdot\Bigl\|N^{-1}\cdot\sum_{j=1}^{N}Z_{i}^{j}(s)\Bigr\|,

(B.17)

where $Z_{i}^{j}(s)$ is given by,

	$\displaystyle Z_{i}^{j}(s)$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\bigg{[}\Big{\langle}\Big{(}\int\psi(Z;% \omega)\mathrm{d}\nu_{s}(\omega)-\psi(Z;\bar{\omega}_{j}(s))\Big{)}\cdot\frac{% \delta\Phi(X,Z;f)}{\delta f},\nabla_{\theta}\phi(\cdot;\bar{\theta}_{i}(s))% \Big{\rangle}_{L^{2}}$
		$\displaystyle\qquad\qquad+\lambda\cdot\Big{\langle}\frac{\delta\Psi(X,Z;\int% \phi(\cdot;\theta)\mathrm{d}\mu_{s}(\theta))}{\delta f}-\frac{\delta\Psi(X,Z;% \phi(\cdot;\bar{\theta}_{j}(s)))}{\delta f},\nabla_{\theta}\phi(\cdot;\bar{% \theta}_{i}(s))\Big{\rangle}_{L^{2}}\bigg{]}.$

Following from Assumption 4.1 and 4.3, we have that $\|Z_{i}^{j}(s)\|\leq B$ . When $j\neq i$ , since $\bar{\theta}_{j}(s)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mu_{s},% \bar{\omega}_{j}(s)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\nu_{s}\;(j% \in[N])$ , it holds that $\mathbb{E}[Z_{i}^{j}(s){\,\big{|}\,}\bar{\theta}_{i}(s)]=0$ . Following from Lemma C.3, we have for fixed $s\in[0,T]$ and $i\in[N]$ that

	$\displaystyle\mathbb{P}\bigg(\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|\geq B\cdot\left(N^{-1/2}+p\right)\bigg)$	$\displaystyle=\mathbb{E}\Bigl[\mathbb{P}\Bigl(\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|\geq B\cdot\left(N^{-1/2}+p\right)\,\Big|\,\bar{\theta}_{i}(s)\Bigr)\Bigr]$
		$\displaystyle\leq\exp\left(-Np^{2}\right).$		(B.18)

From Lemma D.7 and (C.14) of Lemma C.2, we have that

\displaystyle\sup_{w\in\mathcal{W}}\Bigl{|}\int\phi(w;\theta)\mathrm{d}\mu_{s}% (\theta)-\int\phi(w;\theta)\mathrm{d}\mu_{t}(\theta)\Bigr{|}\leq B\cdot% \mathcal{W}_{1}(\mu_{s},\mu_{t})\leq B\cdot\mathcal{W}_{2}(\mu_{s},\mu_{t})% \leq B\cdot\bigl{|}s-t\bigr{|}.

Following from Assumptions 4.1 and 4.3 and Lemma C.2, we have for any $s,t\in[0,T]$ that,

\displaystyle\left|\Bigl{\|}N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr{\|}-% \Bigl{\|}N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(t)\Bigr{\|}\right|\leq B\cdot\Bigl% {|}t-s\Bigr{|}.

Applying the union bound to (B.18) for $i\in[N]$ and $s\in\iota\cdot\{0,1,\ldots,\lfloor T/\iota\rfloor\}$ , we have that

\displaystyle\mathbb{P}\Bigl{(}\sup_{\begin{subarray}{c}i\in[N]\\ s\in[0,T]\end{subarray}}\bigl{\|}N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\bigr{\|% }\geq B\cdot\left(N^{-1/2}+p\right)+B\iota\Bigr{)}\leq N\cdot(T/\iota+1)\cdot% \exp\left(-Np^{2}\right).

Setting $\iota=N^{-1/2}$ and $p=B\cdot\sqrt{\log(NT/\delta)/N}$ , we have that

\displaystyle\sup_{\begin{subarray}{c}i\in[N]\\ s\in[0,T]\end{subarray}}\Bigl{\|}N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr{\|% }\leq B\cdot\sqrt{\log(NT/\delta)/N}.

(B.19)

with probability at least $1-\delta$ . Following from Assumption 4.1, when $i=j$ , $\|N^{-1}Z_{i}^{i}(s)\|\leq B/N$ in (B.17). Plugging (B.19) into (B.17), with probability at least $1-\delta$ , we have that

	$\displaystyle\sup_{\begin{subarray}{c}i\in[N]\\ s\in[0,T]\end{subarray}}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|$	$\displaystyle\leq\sup_{\begin{subarray}{c}i\in[N]\\ s\in[0,T]\end{subarray}}\alpha^{2}\cdot\Bigl\|N^{-1}\sum_{j=1}^{N}Z_{i}^{j}(s)\Bigr\|$
		$\displaystyle\leq B\cdot\sqrt{\log(NT/\delta)/N}.$		(B.20)

Through similar arguments, with probability at least $1-\delta$ , the second term of (B.16) satisfies

\displaystyle\sup_{\begin{subarray}{c}i\in[N]\\ s\in[0,T]\end{subarray}}\Bigl\|\widehat{v}^{g}(\bar{\omega}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{g}(\bar{\omega}_{i}(s);\rho_{s})\Bigr\|\leq B\cdot\sqrt{\log(NT/\delta)/N}.

(B.21)

Now, conditioning on the intersection of the events in (B.20) and (B.21), and plugging these bounds into (B.15) and (B.16), the following inequalities hold simultaneously for any $t\in[0,T]$ ,

	$\displaystyle\bigl\|\widetilde{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(t)\bigr\|_{(N)}\leq B\cdot\int_{0}^{t}\Bigl(\bigl\|\widetilde{\theta}^{(N)}(s)-\bar{\theta}^{(N)}(s)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(s)-\bar{\omega}^{(N)}(s)\bigr\|_{(N)}\Bigr)\mathrm{d}s+BT\cdot\sqrt{\log(NT/\delta)/N},$		(B.22)
	$\displaystyle\bigl\|\widetilde{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(t)\bigr\|_{(N)}\leq B\cdot\int_{0}^{t}\Bigl(\bigl\|\widetilde{\theta}^{(N)}(s)-\bar{\theta}^{(N)}(s)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(s)-\bar{\omega}^{(N)}(s)\bigr\|_{(N)}\Bigr)\mathrm{d}s+BT\cdot\sqrt{\log(NT/\delta)/N}.$		(B.23)

Summing (B.22) and (B.23) and applying Gronwall’s Lemma (Holte, 2009), with probability at least $1-2\delta$ , for any $t\in[0,T]$ , it holds that

	$\displaystyle\bigl\|\widetilde{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(t)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(t)\bigr\|_{(N)}$	$\displaystyle\leq B\cdot e^{Bt}\cdot 2BT\cdot\sqrt{\log(NT/\delta)/N}$
		$\displaystyle\leq B\cdot e^{BT}\cdot\sqrt{\log(N/\delta)/N}.$		(B.24)

The last inequality holds because the value of $B$ is allowed to change from line to line, so that the additional factors depending on $T$ are absorbed into $B\cdot e^{BT}$ . Therefore, (B.24) implies (B.11), which completes the proof of Lemma B.3.

B.1.3 Proof of Lemma B.4

By the definition of $\widehat{v}^{f},\widehat{v}^{g}$ in (B), of $\breve{\theta}_{i}(k),\breve{\omega}_{i}(k)$ in (B.5), and of $\widetilde{\theta}_{i}(t),\widetilde{\omega}_{i}(t)$ in (B.6), the distances $\left\|\widetilde{\theta}_{i}(k\epsilon)-\breve{\theta}_{i}(k)\right\|$ and $\left\|\widetilde{\omega}_{i}(k\epsilon)-\breve{\omega}_{i}(k)\right\|$ satisfy

	$\displaystyle\bigl\|\widetilde{\theta}_{i}(k\epsilon)-\breve{\theta}_{i}(k)\bigr\|$
	$\displaystyle\qquad\leq\eta\cdot\int_{0}^{k\epsilon}\Bigl\|\widehat{v}^{f}\bigl(\widetilde{\theta}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s)\bigr)-\widehat{v}^{f}\bigl(\widetilde{\theta}_{i}(\lfloor s/\epsilon\rfloor\epsilon);\widetilde{\theta}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon),\widetilde{\omega}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon)\bigr)\Bigr\|\mathrm{d}s$
	$\displaystyle\qquad\qquad+\eta\epsilon\cdot\sum_{\ell=0}^{k-1}\Bigl\|\widehat{v}^{f}\bigl(\widetilde{\theta}_{i}(\ell\epsilon);\widetilde{\theta}^{(N)}(\ell\epsilon),\widetilde{\omega}^{(N)}(\ell\epsilon)\bigr)-\widehat{v}^{f}\bigl(\breve{\theta}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\bigr)\Bigr\|$
	$\displaystyle\qquad\leq B\cdot k\cdot\epsilon^{2}+B\epsilon\cdot\sum_{\ell=0}^{k-1}\Big(\bigl\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\bigr\|_{(N)}\Big),$		(B.25)

	$\displaystyle\bigl\|\widetilde{\omega}_{i}(k\epsilon)-\breve{\omega}_{i}(k)\bigr\|$
	$\displaystyle\qquad\leq\eta\cdot\int_{0}^{k\epsilon}\Bigl\|\widehat{v}^{g}\bigl(\widetilde{\omega}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s)\bigr)-\widehat{v}^{g}\bigl(\widetilde{\omega}_{i}(\lfloor s/\epsilon\rfloor\epsilon);\widetilde{\theta}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon),\widetilde{\omega}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon)\bigr)\Bigr\|\mathrm{d}s$
	$\displaystyle\qquad\qquad+\eta\epsilon\cdot\sum_{\ell=0}^{k-1}\Bigl\|\widehat{v}^{g}\bigl(\widetilde{\omega}_{i}(\ell\epsilon);\widetilde{\theta}^{(N)}(\ell\epsilon),\widetilde{\omega}^{(N)}(\ell\epsilon)\bigr)-\widehat{v}^{g}\bigl(\breve{\omega}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\bigr)\Bigr\|$
	$\displaystyle\qquad\leq B\cdot k\cdot\epsilon^{2}+B\epsilon\cdot\sum_{\ell=0}^{k-1}\Big(\bigl\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\bigr\|_{(N)}\Big),$		(B.26)

where (B.25) follows from (C.8) of Lemma C.1 and (C.13) of Lemma C.2, (B.26) follows from (C.9) of Lemma C.1 and (C.13) of Lemma C.2, and the factor $\epsilon$ in front of the sums comes from the step size $\eta\epsilon$ in (B.5). Combining the inequalities in (B.25) and (B.26) and using $k\epsilon\leq T$ , it holds for any $k\leq T/\epsilon\;(k\in\mathbb{N})$ that

	$\displaystyle\bigl\|\widetilde{\theta}^{(N)}(k\epsilon)-\breve{\theta}^{(N)}(k)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(k\epsilon)-\breve{\omega}^{(N)}(k)\bigr\|_{(N)}$
	$\displaystyle\qquad\leq 2BT\epsilon+B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigl\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\bigr\|_{(N)}+B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigl\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\bigr\|_{(N)}.$		(B.27)

Applying the discrete Gronwall’s lemma (Holte, 2009) to (B.27) , we have that

\displaystyle\sup_{\begin{subarray}{c}k\leq T/\epsilon\\ (k\in\mathbb{N})\end{subarray}}\left\|\widetilde{\theta}^{(N)}(k\epsilon)-% \breve{\theta}^{(N)}(k)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(k% \epsilon)-\breve{\omega}^{(N)}(k)\right\|_{(N)}\leq 2B^{2}\cdot T\cdot\epsilon% \cdot e^{BT}\leq B\cdot e^{BT}\cdot\epsilon,

where the inequalities hold since we allow the value of $B$ to vary from line to line. Thus, we complete the proof of Lemma B.4.
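The linear dependence on $\epsilon$ in Lemma B.4 can be seen on a toy problem. The sketch below is not the paper's dynamics: it assumes the scalar bilinear objective $L(\theta,\omega)=\theta\omega$ with $\eta=1$, for which the continuous-time descent-ascent flow is an explicit rotation, and compares it with the discrete update of step size $\epsilon$ at time $T$. All names and values are hypothetical.

```python
import math

def euler_gda(theta0, omega0, eps, T):
    """Discrete descent-ascent on L(theta, omega) = theta * omega with step eps."""
    steps = round(T / eps)
    theta, omega = theta0, omega0
    for _ in range(steps):
        # Simultaneous update: descent in theta (dL/dtheta = omega),
        # ascent in omega (dL/domega = theta).
        theta, omega = theta - eps * omega, omega + eps * theta
    return theta, omega

def exact_flow(theta0, omega0, T):
    """Closed-form solution of d(theta)/dt = -omega, d(omega)/dt = theta."""
    c, s = math.cos(T), math.sin(T)
    return theta0 * c - omega0 * s, omega0 * c + theta0 * s

def discretization_error(eps, T=1.0):
    """Euclidean distance between the discrete and continuous iterates at time T."""
    th_d, om_d = euler_gda(1.0, 0.0, eps, T)
    th_c, om_c = exact_flow(1.0, 0.0, T)
    return math.hypot(th_d - th_c, om_d - om_c)

err_coarse = discretization_error(0.01)
err_fine = discretization_error(0.005)
print(err_coarse, err_fine, err_coarse / err_fine)
```

Halving the step size roughly halves the error at a fixed horizon, consistent with a bound of order $\epsilon$ for fixed $T$.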

B.1.4 Proof of Lemma B.5

Proof.

Let $\mathcal{G}_{k}=\sigma(\theta^{(N)}(0),\omega^{(N)}(0),u_{0},\dots,u_{k})$ be the $\sigma$ -algebra generated by $\theta^{(N)}(0),\omega^{(N)}(0)$ and $u_{\ell}=(x_{\ell},z_{\ell})\;(\ell\leq k)$ . Following from the definitions of $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ and $\widehat{v}^{f},\widehat{v}^{g}$ in (B) and (B), we have for any $i\in[N]$ and $k\in\mathbb{N}_{+}$ that

	$\displaystyle\mathbb{E}\Bigl[\widehat{V}_{k}^{f}(\theta_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k))\,\big|\,\mathcal{G}_{k-1}\Bigr]=\widehat{v}^{f}(\theta_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)),$
	$\displaystyle\mathbb{E}\Bigl[\widehat{V}_{k}^{g}(\omega_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k))\,\big|\,\mathcal{G}_{k-1}\Bigr]=\widehat{v}^{g}(\omega_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)).$

Recall that $\theta^{(N)},\omega^{(N)}$ denote the SGDA dynamics and $\breve{\theta}^{(N)},\breve{\omega}^{(N)}$ denote the PGDA dynamics defined in (B.5). We have for any $i\in[N]$ , $k\in\mathbb{N}_{+}$ that

	$\displaystyle\bigl\|\breve{\theta}_{i}(k)-\theta_{i}(k)\bigr\|$
	$\displaystyle\qquad\leq\eta\epsilon\cdot\Bigl\|\sum_{\ell=0}^{k-1}X_{i}(\ell)\Bigr\|+\eta\epsilon\cdot\sum_{\ell=0}^{k-1}\Bigl\|\widehat{v}^{f}\bigl(\breve{\theta}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\bigr)-\widehat{v}^{f}\bigl(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\bigr)\Bigr\|$
	$\displaystyle\qquad\leq\eta\epsilon\cdot\bigl\|A_{i}(k)\bigr\|+B\epsilon\cdot\sum_{\ell=0}^{k-1}\Bigl(\bigl\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\bigr\|_{(N)}+\bigl\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\bigr\|_{(N)}\Bigr),$		(B.28)

where the last inequality follows from (C.8) of Lemma C.1. $X_{i}(\ell)$ and $A_{i}(k)$ are defined as,

	$\displaystyle X_{i}(\ell)=\widehat{V}^{f}_{\ell}\left(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\right)-\mathbb{E}\left[\widehat{V}^{f}_{\ell}\left(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\right)\,\big|\,\mathcal{G}_{\ell-1}\right]\quad\forall\ell\geq 1,$
	$\displaystyle X_{i}(0)=0,\quad A_{i}(k)=\sum_{\ell=0}^{k-1}X_{i}(\ell).$

Following from (C.7) of Lemma C.1, it holds that $\|X_{i}(\ell)\|\leq B$ , thus the stochastic process $\{A_{i}(k)\}_{k\in\mathbb{N}_{+}}$ is a martingale with $\|A_{i}(k)-A_{i}(k-1)\|\leq B$ . Applying the Azuma-Hoeffding bound in Lemma C.4, we have that

\displaystyle\mathbb{P}\Bigl{(}\underset{\begin{subarray}{c}k\leq T/\epsilon\\ \left(k\in\mathbb{N}_{+}\right)\end{subarray}}{\max}\left\|A_{i}(k)\right\|% \geq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}+p)\Bigr{)}\leq\exp\left(-p^{2}% \right).

(B.29)

Applying the union bound to (B.29) for $i\in[N]$ , we have that

\displaystyle\mathbb{P}\Bigl{(}\underset{\begin{subarray}{c}i\in[N]\\ k\leq T/\epsilon,\left(k\in\mathbb{N}_{+}\right)\end{subarray}}{\max}\left\|A_% {i}(k)\right\|\geq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}+p)\Bigr{)}\leq N\cdot% \exp\left(-p^{2}\right).

Setting $p=\sqrt{\log(N/\delta)}$ , with probability at least $1-\delta$ , it holds that

\displaystyle\left\|A_{i}(k)\right\|\leq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}% +\sqrt{\log(N/\delta)}),\quad\forall i\in[N],k\leq T/\epsilon\left(k\in\mathbb% {N}_{+}\right).

(B.30)

Plugging (B.30) into (B.28) and taking the supremum over $i\in[N]$ , we have that

	$\displaystyle\Bigl\|\breve{\theta}^{(N)}(k)-\theta^{(N)}(k)\Bigr\|_{(N)}$	$\displaystyle\leq B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigg(\bigl\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\bigr\|_{(N)}+\bigl\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\bigr\|_{(N)}\bigg)$
		$\displaystyle\qquad+B\cdot\sqrt{T\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)}).$		(B.31)

Through similar arguments for $\breve{\omega}_{i}(k)$ and $\omega_{i}(k)$ , with probability at least $1-\delta$ ,

	$\displaystyle\Bigl\|\breve{\omega}^{(N)}(k)-\omega^{(N)}(k)\Bigr\|_{(N)}$	$\displaystyle\leq B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigg(\bigl\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\bigr\|_{(N)}+\bigl\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\bigr\|_{(N)}\bigg)$
		$\displaystyle\qquad+B\cdot\sqrt{T\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)}).$		(B.32)

Conditioning on the intersection of event in (B.31) and event in (B.32), summing (B.31), (B.32), and applying the discrete Gronwall’s lemma (Holte, 2009), for any $k\leq T/\epsilon,k\in\mathbb{N}_{+}$ , the following inequality holds with probability at least $1-2\delta$ ,

	$\displaystyle\Bigl\|\breve{\theta}^{(N)}(k)-\theta^{(N)}(k)\Bigr\|_{(N)}+\Bigl\|\breve{\omega}^{(N)}(k)-\omega^{(N)}(k)\Bigr\|_{(N)}$	$\displaystyle\leq B\cdot e^{Bk\epsilon}\cdot B\cdot\sqrt{T\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)})$
		$\displaystyle\leq B\cdot e^{BT}\cdot\sqrt{\epsilon\cdot(D+\log(N/\delta))}.$

Here the last inequality holds since we allow the value of $B$ to vary from line to line. Thus, we complete the proof of Lemma B.5. ∎
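The $\sqrt{T\epsilon}$ scaling of the noise term in (B.31) comes from summing roughly $T/\epsilon$ bounded martingale differences and multiplying by the step size $\epsilon$. The sketch below illustrates this with a scalar toy martingale whose increments are independent random signs, a stand-in for $X_{i}(\ell)$ rather than the paper's actual gradient noise. The multiplier 7 in the final check is a hypothetical, deliberately loose choice of the deviation parameter $p$.

```python
import math
import random

def max_scaled_noise(eps, T, rng):
    """Return max over k <= T/eps of eps * |A(k)|, where A(k) sums k random signs."""
    steps = round(T / eps)
    running, worst = 0.0, 0.0
    for _ in range(steps):
        # Each increment is bounded by 1 and has conditional mean zero.
        running += rng.choice((-1.0, 1.0))
        worst = max(worst, abs(running))
    return eps * worst

rng = random.Random(0)  # fixed seed so the run is reproducible
T = 1.0
noise_coarse = max_scaled_noise(1e-2, T, rng)
noise_fine = max_scaled_noise(1e-4, T, rng)
print(noise_coarse, math.sqrt(T * 1e-2))
print(noise_fine, math.sqrt(T * 1e-4))
```

In both runs the scaled noise stays within a moderate multiple of $\sqrt{T\epsilon}$, even though the number of summed increments grows like $T/\epsilon$.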

Appendix C Supporting Lemmas

C.1 Supporting Lemmas for §B

In what follows, we present the technical lemmas used heavily in $\S$ B. We recall the definition of $v^{f},v^{g}$ , $\widehat{v}^{f},\widehat{v}^{g}$ and $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ as in (B), (B), and (B) respectively. Let $B>0$ be a constant depending on $\alpha,\eta,B_{0},B_{1},B_{2},C$ , whose value varies from line to line. Recall that $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are the finite-width representations with parameters $\theta^{(N)},\omega^{(N)}$ , whose definitions are given by

\displaystyle f(\cdot;\theta^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}^{N}\phi(% \cdot;\theta_{i}),\quad g(\cdot;\omega^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}^% {N}\psi(\cdot;\omega_{i}).

Lemma C.1.

Under Assumptions 4.1 and 4.3, for any $\theta^{(N)}=\{\theta_{i}\}_{i=1}^{N}$ , $\underline{\theta}^{(N)}=\{\underline{\theta}_{i}\}_{i=1}^{N}$ , $\omega^{(N)}=\{\omega_{i}\}_{i=1}^{N}$ , and $\underline{\omega}^{(N)}=\{\underline{\omega}_{i}\}_{i=1}^{N}$ , the functions $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are uniformly bounded and Lipschitz in $\theta^{(N)}$ and $\omega^{(N)}$ , respectively, that is,

	$\displaystyle\sup_{w\in\mathcal{W}}\bigl\|f(w;\theta^{(N)})\bigr\|+\sup_{z\in\mathcal{Z}}\bigl\|g(z;\omega^{(N)})\bigr\|\leq B,$		(C.1)
	$\displaystyle\sup_{w\in\mathcal{W}}\bigl\|f(w;\theta^{(N)})-f(w;\underline{\theta}^{(N)})\bigr\|\leq B\cdot\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)},$		(C.2)
	$\displaystyle\sup_{z\in\mathcal{Z}}\bigl\|g(z;\omega^{(N)})-g(z;\underline{\omega}^{(N)})\bigr\|\leq B\cdot\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}.$		(C.3)

Recall the definitions of $\widehat{v}^{f},\widehat{v}^{g}$ and $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ in (B), (B). The finite-width representation of the velocity field and its stochastic counterpart, when evaluated at arbitrary $\theta_{i},\omega_{i}$ , are also uniformly bounded and Lipschitz in $(\theta^{(N)},\omega^{(N)})$ . Specifically, for $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ , the following inequalities hold,

	$\displaystyle\bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|+\bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\|\leq B,$		(C.4)
	$\displaystyle\bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr),$		(C.5)
	$\displaystyle\bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr).$		(C.6)

A similar series of inequalities also holds for $\widehat{v}^{f},\widehat{v}^{g}$ ,

	$\displaystyle\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|+\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\|\leq B,$		(C.7)
	$\displaystyle\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{f}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr),$		(C.8)
	$\displaystyle\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{g}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr).$		(C.9)

As a corollary of the inequalities stated above, the uniform bounds in fact hold for any $f,g\in\mathcal{F}$ , which says,

\displaystyle\sup_{w\in\mathcal{W}}\bigl{|}f(w)\bigr{|}+\sup_{z\in\mathcal{Z}}% \bigl{|}g(z)\bigr{|}\leq B.

(C.10)

Similarly, the uniform bounds also hold for the velocity field $v^{f},v^{g}$ , such that for any $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D}\times\mathbb{R}^{D})$ , it holds that

\displaystyle\bigl{\|}v^{f}(\theta;\rho)\bigr{\|}+\bigl{\|}v^{g}(\omega;\rho)% \bigr{\|}\leq B.

(C.11)

Proof.

We will prove these results separately.

(i) Proof of (C.1), (C.2), and (C.3)

For (C.1) of Lemma C.1, since $\phi$ , $\psi$ are bounded as is assumed in Assumption 4.1, we have for any $w\in\mathcal{W},z\in\mathcal{Z}$ , any $\theta^{(N)}$ and $\omega^{(N)}$ that

\displaystyle\bigl{|}f(w;\theta^{(N)})\bigr{|}+\bigl{|}g(z;\omega^{(N)})\bigr{% |}\leq\alpha\cdot N^{-1}\sum_{i=1}^{N}\bigl{|}\phi(w;\theta_{i})\bigr{|}+\bigl% {|}\psi(z;\omega_{i})\bigr{|}\leq B.

For (C.2) and (C.3) of Lemma C.1, recall that for any $w\in\mathcal{W}$ , $z\in\mathcal{Z}$ , $\phi(w;\theta)$ has a bounded gradient in $\theta$ and $\psi(z;\omega)$ has a bounded gradient in $\omega$ . Since a uniform upper bound on the gradient controls the Lipschitz constant of the function, it holds for any $w\in\mathcal{W},z\in\mathcal{Z}$ , any $\theta^{(N)},\underline{\theta}^{(N)}$ and $\omega^{(N)},\underline{\omega}^{(N)}$ that

	$\displaystyle\bigl\|f(w;\theta^{(N)})-f(w;\underline{\theta}^{(N)})\bigr\|\leq\alpha N^{-1}\cdot B_{1}\sum_{i=1}^{N}\bigl\|\theta_{i}-\underline{\theta}_{i}\bigr\|\leq B\cdot\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)},$
	$\displaystyle\bigl\|g(z;\omega^{(N)})-g(z;\underline{\omega}^{(N)})\bigr\|\leq\alpha N^{-1}\cdot B_{1}\sum_{i=1}^{N}\bigl\|\omega_{i}-\underline{\omega}_{i}\bigr\|\leq B\cdot\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}.$

(ii) Proof of (C.4), (C.5) and (C.6)

For (C.4) of Lemma C.1, recall the definition of $\widehat{V}^{f}_{k}$ , $\widehat{V}^{g}_{k}$ in (B), for any $\theta^{(N)}$ and $\omega^{(N)}$ ,

	$\displaystyle\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})\Bigr\|$	$\displaystyle\;\leq\alpha\cdot\sup_{w\in\mathcal{W}}\big\|\nabla_{\theta}\phi(w;\theta_{i})\big\|\cdot\sup_{z\in\mathcal{Z}}\big\|g(z;\omega^{(N)})\big\|\cdot\int_{\mathcal{W}}\Big\|\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}(w^{\prime})\Big\|\mathrm{d}w^{\prime}$
		$\displaystyle\qquad+\alpha\cdot\sup_{w\in\mathcal{W}}\big\|\nabla_{\theta}\phi(w;\theta_{i})\big\|\cdot\lambda\cdot\int_{\mathcal{W}}\Big\|\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}(w^{\prime})\Big\|\mathrm{d}w^{\prime}\leq B,$
	$\displaystyle\Bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})\Bigr\|$	$\displaystyle\;\leq\alpha\cdot\Big(\big\|\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))\big\|+\sup_{z\in\mathcal{Z}}\big\|g(z;\omega^{(N)})\big\|\Big)\cdot\sup_{z\in\mathcal{Z}}\big\|\nabla_{\omega}\psi(z;\omega_{i})\big\|\leq B.$

For notational simplicity, we further define

	$\displaystyle u^{f}(\theta^{(N)},\omega^{(N)})=-\alpha g(z_{k};\omega^{(N)})\cdot\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}-\alpha\lambda\cdot\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f},$
	$\displaystyle u^{g}(\theta^{(N)},\omega^{(N)})=\alpha\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))-\alpha g(z_{k};\omega^{(N)}).$

For (C.5) of Lemma C.1, following from Assumption 4.3 and the definition of $\widehat{V}^{f}_{k}$ in (B), we have for any $\theta^{(N)},\underline{\theta}^{(N)}$ and $\omega^{(N)},\underline{\omega}^{(N)}$ that

	$\displaystyle\Bigl{\|}\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}$
	$\displaystyle\qquad\leq\Bigl{\|}\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\theta_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}+\Bigl{\|}\widehat{V}^{f}_{k}(\theta_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}$
	$\displaystyle\qquad\leq\|u^{f}(\theta^{(N)},\omega^{(N)})-u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\|\cdot\sup_{w\in\mathcal{W}}\|\nabla_{\theta}\phi(w;\theta_{i})\|+\Bigl{\|}\Big{\langle}u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)}),\nabla_{\theta}\phi(\cdot;\theta_{i})-\nabla_{\theta}\phi(\cdot;\underline{\theta}_{i})\Big{\rangle}_{L^{2}}\Bigr{\|}.$

Moreover, $u^{f}(\theta^{(N)},\omega^{(N)})$ is also Lipschitz in $(\theta^{(N)},\omega^{(N)})$ since

	$\displaystyle\|u^{f}(\theta^{(N)},\omega^{(N)})-u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\|$	$\displaystyle\;\leq B\cdot|f(w_{k};\theta^{(N)})-f(w_{k};\underline{\theta}^{(N)})|+B\cdot|g(z_{k};\omega^{(N)})-g(z_{k};\underline{\omega}^{(N)})|$
		$\displaystyle\;\leq B\cdot\Bigl{(}\bigl{\|}\theta^{(N)}-\underline{\theta}^{(N)}\bigr{\|}_{(N)}+\bigl{\|}\omega^{(N)}-\underline{\omega}^{(N)}\bigr{\|}_{(N)}\Bigr{)},$

where the second inequality follows from (C.2) and (C.3). Therefore, $\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})$ is Lipschitz in $(\theta^{(N)},\omega^{(N)})$, because $\|\nabla_{\theta}\phi(w;\theta_{i})\|$ and $\big{|}\int u^{f}(\theta^{(N)},\omega^{(N)})(w^{\prime})\mathrm{d}w^{\prime}\big{|}$ are uniformly bounded.

For (C.6) of Lemma C.1, following from Assumption 4.3 and the definition of $\widehat{V}^{g}_{k}$ in (B), by an argument similar to the proof of (C.5), we have for any $\theta^{(N)},\underline{\theta}^{(N)}$ and $\omega^{(N)},\underline{\omega}^{(N)}$ that

	$\displaystyle\Bigl{\|}\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}$
	$\displaystyle\qquad\leq\Bigl{\|}\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\omega_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}+\Bigl{\|}\widehat{V}^{g}_{k}(\omega_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr{\|}$
	$\displaystyle\qquad\leq\|u^{g}(\theta^{(N)},\omega^{(N)})-u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\|\cdot\sup_{z\in\mathcal{Z}}\|\nabla_{\omega}\psi(z;\omega_{i})\|+\Bigl{\|}\Big{\langle}u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)}),\nabla_{\omega}\psi(\cdot;\omega_{i})-\nabla_{\omega}\psi(\cdot;\underline{\omega}_{i})\Big{\rangle}_{L^{2}}\Bigr{\|}.$

Again, $u^{g}(\theta^{(N)},\omega^{(N)})$ is Lipschitz in $(\theta^{(N)},\omega^{(N)})$ since

	$\displaystyle\|u^{g}(\theta^{(N)},\omega^{(N)})-u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\|$	$\displaystyle\;\leq B\cdot|f(w_{k};\theta^{(N)})-f(w_{k};\underline{\theta}^{(N)})|+B\cdot|g(z_{k};\omega^{(N)})-g(z_{k};\underline{\omega}^{(N)})|$
		$\displaystyle\;\leq B\cdot\Bigl{(}\bigl{\|}\theta^{(N)}-\underline{\theta}^{(N)}\bigr{\|}_{(N)}+\bigl{\|}\omega^{(N)}-\underline{\omega}^{(N)}\bigr{\|}_{(N)}\Bigr{)}.$

Therefore, $\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})$ is Lipschitz in $(\theta^{(N)},\omega^{(N)})$, because $\|\nabla_{\omega}\psi(z;\omega_{i})\|$ and $\big{|}\int u^{g}(\theta^{(N)},\omega^{(N)})(z^{\prime})\mathrm{d}z^{\prime}\big{|}$ are uniformly bounded.

(iii) Proof of (C.7), (C.8), and (C.9)

Equations (C.7), (C.8), (C.9) of Lemma C.1 for $\widehat{v}^{f}$ and $\widehat{v}^{g}$ follow from the fact that

\displaystyle\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})=\mathbb{E}_{\mathcal{D}}\Big{[}\widehat{V}_{k}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\Big{]},\quad\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})=\mathbb{E}_{\mathcal{D}}\Big{[}\widehat{V}_{k}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\Big{]}.

Therefore, (C.7) follows from (C.4) and the triangle inequality,

\displaystyle\bigl{\|}\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr{\|}+\bigl{\|}\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr{\|}

\displaystyle\;\leq\mathbb{E}_{\mathcal{D}}\Bigl{[}\bigl{\|}\widehat{V}_{k}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr{\|}\Bigr{]}+\mathbb{E}_{\mathcal{D}}\Bigl{[}\bigl{\|}\widehat{V}_{k}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr{\|}\Bigr{]}\leq B.

Equations (C.8) and (C.9) follow from (C.5), (C.6), and the triangle inequality,

	$\displaystyle\bigl{\|}\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{f}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr{\|}$	$\displaystyle\;\leq\mathbb{E}_{\mathcal{D}}\Big{[}\bigl{\|}\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr{\|}\Big{]}$
		$\displaystyle\;\leq B\cdot\Bigl{(}\bigl{\|}\theta^{(N)}-\underline{\theta}^{(N)}\bigr{\|}_{(N)}+\bigl{\|}\omega^{(N)}-\underline{\omega}^{(N)}\bigr{\|}_{(N)}\Bigr{)},$
	$\displaystyle\bigl{\|}\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{g}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr{\|}$	$\displaystyle\;\leq\mathbb{E}_{\mathcal{D}}\Big{[}\bigl{\|}\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr{\|}\Big{]}$
		$\displaystyle\;\leq B\cdot\Bigl{(}\bigl{\|}\theta^{(N)}-\underline{\theta}^{(N)}\bigr{\|}_{(N)}+\bigl{\|}\omega^{(N)}-\underline{\omega}^{(N)}\bigr{\|}_{(N)}\Bigr{)}.$

(iv) Proof of (C.10) and (C.11)

Equation (C.10) follows from the definition of $\mathcal{F}$ in (4.2) and the uniform bounds on the neuron functions $\phi$ and $\psi$. For any $f,g\in\mathcal{F}$, there exist probability measures $\widehat{\mu},\widehat{\nu}$ over the parameter space such that

\displaystyle f(w)=\int\phi(w;\theta)\widehat{\mu}(\mathrm{d}\theta),\quad g(z)=\int\psi(z;\omega)\widehat{\nu}(\mathrm{d}\omega),\quad\forall w\in\mathcal{W},z\in\mathcal{Z}.

Applying the triangle inequality, we obtain

\displaystyle\sup_{w\in\mathcal{W}}|f(w)|+\sup_{z\in\mathcal{Z}}|g(z)|\leq\int\sup_{w\in\mathcal{W}}|\phi(w;\theta)|\widehat{\mu}(\mathrm{d}\theta)+\int\sup_{z\in\mathcal{Z}}|\psi(z;\omega)|\widehat{\nu}(\mathrm{d}\omega)\leq B.

Equation (C.11) follows from the definition of $v^{f},v^{g}$ in (B) and the proof of (C.4) and (C.7). The proof of (C.11) is the same as that of (C.4) and (C.7), except that a uniform bound is needed for the infinite-width representations of $f$ and $g$, which is established in (C.10).

Combining the proofs of items (i), (ii), (iii), and (iv) above completes the proof of Lemma C.1. ∎

Now, recall that $\rho_{t}$ is the PDE solution to (3.7), $\bar{\theta}^{(N)}(t),\bar{\omega}^{(N)}(t)$ are the IP dynamics defined in (B.7), and $\widetilde{\theta}^{(N)}(t),\widetilde{\omega}^{(N)}(t)$ are the CTPGDA dynamics defined in (B.6). The following lemma bounds the difference between the iterates of the IP and CTPGDA dynamics at times $s$ and $t$.

Lemma C.2.

Under Assumptions 4.1 and 4.3, it holds for any $s,t\in[0,T]$ that

	$\displaystyle\bigl{\|}\bar{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(s)\bigr{\|}_{(N)}+\bigl{\|}\bar{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(s)\bigr{\|}_{(N)}\leq B\cdot|t-s|,$		(C.12)
	$\displaystyle\bigl{\|}\widetilde{\theta}^{(N)}(t)-\widetilde{\theta}^{(N)}(s)\bigr{\|}_{(N)}+\bigl{\|}\widetilde{\omega}^{(N)}(t)-\widetilde{\omega}^{(N)}(s)\bigr{\|}_{(N)}\leq B\cdot|t-s|,$		(C.13)
	$\displaystyle\mathcal{W}_{2}(\mu_{t},\mu_{s})+\mathcal{W}_{2}(\nu_{t},\nu_{s})\leq B\cdot|t-s|.$		(C.14)

Proof.

For (C.12) of Lemma C.2, by the definition of $\bar{\theta}_{i}(t)$ and $\bar{\omega}_{i}(t)$ in (B.7) and (C.11) of Lemma C.1, we have for any $s,t\in[0,T]$ and $i\in[N]$ that

	$\displaystyle\bigl{\|}\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr{\|}\leq\eta\cdot\int_{s}^{t}\bigl{\|}v^{f}(\bar{\theta}_{i}(\tau);\rho_{\tau})\bigr{\|}\mathrm{d}\tau\leq B\cdot|t-s|,$
	$\displaystyle\bigl{\|}\bar{\omega}_{i}(t)-\bar{\omega}_{i}(s)\bigr{\|}\leq\eta\cdot\int_{s}^{t}\bigl{\|}v^{g}(\bar{\omega}_{i}(\tau);\rho_{\tau})\bigr{\|}\mathrm{d}\tau\leq B\cdot|t-s|.$

Similarly, for (C.13) of Lemma C.2, by the definition of $\widetilde{\theta}_{i}(t)$ and $\widetilde{\omega}_{i}(t)$ in (B.6), and (C.7) of Lemma C.1, we have for any $s,t\in[0,T]$ and $i\in[N]$ ,

\displaystyle\big{\|}\widetilde{\theta}_{i}(t)-\widetilde{\theta}_{i}(s)\big{\|}\leq B\cdot\big{|}t-s\big{|},\quad\big{\|}\widetilde{\omega}_{i}(t)-\widetilde{\omega}_{i}(s)\big{\|}\leq B\cdot\big{|}t-s\big{|}.

For (C.14) of Lemma C.2, since $\bar{\theta}_{i}(t)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mu_{t}$ and $\bar{\omega}_{i}(t)\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\nu_{t}$, the definition of $\mathcal{W}_{2}$ in (2.18) implies that, for any $s,t\in[0,T]$,

	$\displaystyle\mathcal{W}_{2}(\mu_{t},\mu_{s})\leq\mathbb{E}\Bigl{[}\bigl{\|}\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr{\|}^{2}\Bigr{]}^{1/2}\leq B\cdot|t-s|,$
	$\displaystyle\mathcal{W}_{2}(\nu_{t},\nu_{s})\leq\mathbb{E}\Bigl{[}\bigl{\|}\bar{\omega}_{i}(t)-\bar{\omega}_{i}(s)\bigr{\|}^{2}\Bigr{]}^{1/2}\leq B\cdot|t-s|.$

Therefore, we complete the proof of Lemma C.2. ∎
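The coupling step behind (C.14) can be spelled out as follows. This is a sketch that assumes $\mathcal{W}_{2}$ in (2.18) is defined as an infimum over couplings; the symbol $\Gamma(\mu_{t},\mu_{s})$ for the set of couplings of $\mu_{t}$ and $\mu_{s}$ is our notation and not taken from the main text.

```latex
\begin{align*}
\mathcal{W}_{2}(\mu_{t},\mu_{s})^{2}
 &=\inf_{\gamma\in\Gamma(\mu_{t},\mu_{s})}\int\|x-y\|^{2}\,\mathrm{d}\gamma(x,y)\\
 &\leq\mathbb{E}\Bigl[\bigl\|\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr\|^{2}\Bigr]
 \leq B^{2}\cdot|t-s|^{2}.
\end{align*}
```

The first inequality holds because the joint law of $(\bar{\theta}_{i}(t),\bar{\theta}_{i}(s))$ is one particular coupling of $\mu_{t}$ and $\mu_{s}$, and the second uses the pointwise bound $\|\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\|\leq B\cdot|t-s|$ from the proof of (C.12). The same argument applies to $\nu_{t}$ and $\nu_{s}$.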

Lemma C.3.

Let $\{X_{i}\}_{i=1}^{N}$ be i.i.d. random variables with $\|X_{i}\|\leq\xi$ and $\mathbb{E}[X_{i}]=0$. Then there exists an absolute constant $C>0$ such that, for any $p>0$,

\displaystyle\mathbb{P}\left(\left\|N^{-1}\cdot\sum_{i=1}^{N}X_{i}\right\|\geq C\xi\cdot\left(N^{-1/2}+p\right)\right)\leq\exp\left(-Np^{2}\right).

Proof.

See Lemma 30 in Mei et al. (2019). ∎
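The $N^{-1/2}$ scaling in Lemma C.3 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the analysis: the function name `mean_norms`, the choice of scaled random sign vectors for $X_{i}$ (so that $\|X_{i}\|=\xi$ and $\mathbb{E}[X_{i}]=0$), and all default parameters are illustrative assumptions.

```python
import numpy as np

def mean_norms(n, dim=5, xi=1.0, trials=200, seed=0):
    """Return ||N^{-1} sum_i X_i|| for `trials` independent draws.

    Assumption for illustration: X_i = xi * s_i / sqrt(dim), where s_i is a
    vector of i.i.d. random signs, so ||X_i|| = xi and E[X_i] = 0.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(trials, n, dim))
    x = xi * signs / np.sqrt(dim)
    # Average over the N samples, then take the Euclidean norm per trial.
    return np.linalg.norm(x.mean(axis=1), axis=1)

if __name__ == "__main__":
    for n in (100, 400, 1600):
        # The rescaled average norm should stay of order one as N grows.
        print(n, mean_norms(n).mean() * np.sqrt(n))
```

Multiplying the average norm by $\sqrt{N}$ should give values that stay of order one across $N$, which is the behavior the bound $C\xi\cdot(N^{-1/2}+p)$ describes for small $p$.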

Lemma C.4 (Azuma-Hoeffding bound).

Let $X_{k}\in\mathbb{R}^{D}$ be a martingale with respect to the filtration $\mathcal{G}_{k}\;(k\geq 0)$ with $X_{0}=0$ . We assume for $\xi>0$ and any $\lambda\in\mathbb{R}^{D}$ that,

\displaystyle\mathbb{E}\left[\exp\left(\left\langle\lambda,X_{k}-X_{k-1}\right\rangle\right)\mid\mathcal{G}_{k-1}\right]\leq\exp\left(\xi^{2}\cdot\|\lambda\|^{2}/2\right).

Then, for an absolute constant $C>0$, it holds that

\displaystyle\mathbb{P}\left(\max_{\begin{subarray}{c}k\leq n\\ (k\in\mathbb{N})\end{subarray}}\left\|X_{k}\right\|\geq C\xi\cdot\sqrt{n}\cdot(\sqrt{D}+p)\right)\leq\exp\left(-p^{2}\right).

Proof.

See Lemma 31 in Mei et al. (2019) and Lemma A.3 in Araújo et al. (2019). ∎

Appendix D Technical Results

D.1 Universal Function Approximation Theorem

In what follows, we introduce the universal function approximation theorem (Pinkus, 1999). For any given activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ , we consider the following function class,

\displaystyle\mathcal{G}(\sigma)=\Bigl{\{}\sum_{i=1}^{r}c_{i}\sigma(x^{\top}w^{i}+\theta_{i}){\,\Big{|}\,}r\in\mathbb{N},c_{i},\theta_{i}\in\mathbb{R},w^{i}\in\mathbb{R}^{d}\Bigr{\}}.

We denote by $\mathscr{C}(\mathbb{R}^{d})$ the class of continuous functions over $\mathbb{R}^{d}$ . Then, the following result holds.

Lemma D.1 (Universal Function Approximation Theorem, Theorem 3.1 in Pinkus (1999)).

If the activation function $\sigma\in\mathscr{C}(\mathbb{R})$ is not a polynomial, the function class $\mathcal{G}(\sigma)$ is dense in $\mathscr{C}(\mathbb{R}^{d})$ in the topology of uniform convergence on compact sets.
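As a numerical illustration of Lemma D.1, the sketch below fits an element of $\mathcal{G}(\tanh)$ to a continuous target on the compact set $[-1,1]$. It is not the training procedure analyzed in this paper: the inner weights $w^{i},\theta_{i}$ are drawn at random and frozen, only the outer coefficients $c_{i}$ are obtained by least squares, and the names `fit_error` and `target` as well as all numerical choices are assumptions made for the example.

```python
import numpy as np

def target(x):
    """A smooth continuous target on [-1, 1], chosen only for illustration."""
    return np.sin(3.0 * x) + x ** 2

def fit_error(r=100, n_grid=400, seed=0):
    """Sup-norm error on a grid of the best least-squares fit
    sum_i c_i * tanh(x * w_i + theta_i) to `target`.

    The inner parameters (w_i, theta_i) are random and frozen; only the
    outer coefficients c_i are fitted.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_grid)
    w = rng.normal(scale=4.0, size=r)
    theta = rng.uniform(-4.0, 4.0, size=r)
    features = np.tanh(np.outer(x, w) + theta)  # shape (n_grid, r)
    c, *_ = np.linalg.lstsq(features, target(x), rcond=None)
    return float(np.max(np.abs(features @ c - target(x))))

if __name__ == "__main__":
    print(fit_error())
```

The error is measured only on the grid, so this is a sanity check of the density statement on a compact set rather than a proof of it.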

D.2 Wasserstein Space

We use the definition of absolutely continuous curves in $\mathscr{P}_{2}(\mathbb{R}^{D})$ from Ambrosio et al. (2008) and state the following lemmas.

Lemma D.2.

For any probability measures $\mu,\nu,\mu^{\prime},\nu^{\prime}\in\mathscr{P}_{2}(\mathbb{R}^{D})$ , it holds that

\displaystyle\mathcal{W}_{2}(\mu\otimes\nu,\mu^{\prime}\otimes\nu^{\prime})^{2}\leq\mathcal{W}_{2}(\mu,\mu^{\prime})^{2}+\mathcal{W}_{2}(\nu,\nu^{\prime})^{2}.
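One standard way to see Lemma D.2 is sketched below. We assume optimal couplings $\gamma_{1}$ of $(\mu,\mu^{\prime})$ and $\gamma_{2}$ of $(\nu,\nu^{\prime})$ for the quadratic cost, which exist for measures in $\mathscr{P}_{2}(\mathbb{R}^{D})$; the symbols $\gamma_{1},\gamma_{2}$ are our notation.

```latex
\begin{align*}
\mathcal{W}_{2}(\mu\otimes\nu,\mu'\otimes\nu')^{2}
 &\leq\int\bigl(\|x-x'\|^{2}+\|y-y'\|^{2}\bigr)\,
   \mathrm{d}\gamma_{1}(x,x')\,\mathrm{d}\gamma_{2}(y,y')\\
 &=\int\|x-x'\|^{2}\,\mathrm{d}\gamma_{1}(x,x')
  +\int\|y-y'\|^{2}\,\mathrm{d}\gamma_{2}(y,y')\\
 &=\mathcal{W}_{2}(\mu,\mu')^{2}+\mathcal{W}_{2}(\nu,\nu')^{2}.
\end{align*}
```

The inequality holds because $\gamma_{1}\otimes\gamma_{2}$, viewed as a law of the pair $((x,y),(x^{\prime},y^{\prime}))$, is a coupling of $\mu\otimes\nu$ and $\mu^{\prime}\otimes\nu^{\prime}$, and the squared Euclidean norm on the product space splits into the two squared norms.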

Lemma D.3 (First Variation Formula, Theorem 8.4.7 in Ambrosio et al. (2008)).

Given $\nu\in\mathscr{P}_{2}(\mathbb{R}^{D})$ and an absolutely continuous curve $\mu:[0,T]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ , let $\beta:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ be the geodesic connecting $\mu_{t}$ and $\nu$ . It holds that

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\frac{\mathcal{W}_{2}(\mu_{t},\nu)^{2}}{2}=-\langle\dot{\mu}_{t},\dot{\beta}_{0}\rangle_{\mu_{t}},

where $\dot{\mu}_{t}=\partial_{t}\mu_{t}$ and $\dot{\beta}_{0}=\partial_{s}\beta_{s}|_{s=0}$ .

Lemma D.4 (Benamou-Brenier formula, Proposition 2.30 in Ambrosio and Gigli (2013)).

Let $\mu^{0},\mu^{1}\in\mathscr{P}_{2}(\mathbb{R}^{D})$ . Then, it holds that

\displaystyle\mathcal{W}_{2}(\mu^{0},\mu^{1})=\inf\biggl{\{}\int_{0}^{1}\|\dot{\mu}_{t}\|_{\mu_{t}}\,\mathrm{d}t{\,\bigg{|}\,}\mu:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}),\mu_{0}=\mu^{0},\mu_{1}=\mu^{1}\biggr{\}}.
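A simple worked instance of Lemma D.4 is a pure translation of a Gaussian. The sketch assumes that $\|\dot{\mu}_{t}\|_{\mu_{t}}$ denotes the $L^{2}(\mu_{t})$ norm of the velocity field $v_{t}$ that drives the curve through the continuity equation $\partial_{t}\mu_{t}=-\mathrm{div}(\mu_{t}\cdot v_{t})$; the vector $m\in\mathbb{R}^{D}$ and the field $v_{t}$ are our notation.

```latex
\begin{align*}
&\mu^{0}=N(0,I_{D}),\qquad \mu^{1}=N(m,I_{D}),\qquad
 \mu_{t}=N(t\cdot m,I_{D}),\qquad v_{t}(x)\equiv m,\\
&\int_{0}^{1}\|\dot{\mu}_{t}\|_{\mu_{t}}\,\mathrm{d}t
 =\int_{0}^{1}\Bigl(\int\|m\|^{2}\,\mathrm{d}\mu_{t}(x)\Bigr)^{1/2}\mathrm{d}t
 =\|m\|=\mathcal{W}_{2}(\mu^{0},\mu^{1}).
\end{align*}
```

Here the last equality uses the fact that the Wasserstein-2 distance between two Gaussians with the same covariance equals the distance between their means, so the straight-line translation attains the infimum.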

Lemma D.5 (Talagrand’s Inequality, Corollary 2.1 in Otto and Villani (2000)).

Let $\nu$ be $N(0,\kappa\cdot I_{D})$ . It holds for any $\mu\in\mathscr{P}_{2}(\mathbb{R}^{D})$ that

\displaystyle\mathcal{W}_{2}(\mu,\nu)^{2}\leq 2\kappa\cdot D_{\rm KL}(\mu\,\|\,\nu).

Lemma D.6 (Eulerian Representation of Geodesics, Proposition 5.38 in Villani (2003)).

Let $\beta:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ be a geodesic and $u$ be the corresponding vector field such that $\partial_{t}\beta_{t}=-\mathrm{div}(\beta_{t}\cdot u_{t})$ . It holds that

\displaystyle\partial_{t}(\beta_{t}\cdot u_{t})=-\mathrm{div}(\beta_{t}\cdot u_{t}\otimes u_{t}),

where $\otimes$ is the outer product of two vectors.

Lemma D.7 (Dual Representation of the first order Wasserstein Distance, Villani (2008)).

The first-order Wasserstein distance has the following dual representation:

\displaystyle\mathcal{W}_{1}(\mu,\nu)=\sup\biggl{\{}\int f(x)\mathrm{d}(\mu-\nu)(x){\,\bigg{|}\,}f:\mathbb{R}^{D}\rightarrow\mathbb{R}\text{ that is 1-Lipschitz continuous}\biggr{\}}

for any two probability measures $\mu,\nu\in\mathscr{P}_{1}(\mathbb{R}^{D})$ .
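The dual representation in Lemma D.7 gives a lower bound on $\mathcal{W}_{1}$ for every 1-Lipschitz test function. The sketch below checks this for one-dimensional empirical measures, where the optimal coupling of two equal-size samples matches sorted points. The helper names `w_p_1d` and `dual_value`, the sample distributions, and the test functions are illustrative assumptions.

```python
import numpy as np

def w_p_1d(x, y, p):
    """p-Wasserstein distance between two equal-size empirical measures on R.

    In one dimension, matching sorted samples is optimal for p >= 1.
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

def dual_value(f, x, y):
    """Dual objective int f d(mu - nu) for the empirical measures of x and y."""
    return float(np.mean(f(x)) - np.mean(f(y)))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=1000)
y = rng.normal(loc=0.0, scale=2.0, size=1000)

w1 = w_p_1d(x, y, 1)
w2 = w_p_1d(x, y, 2)

if __name__ == "__main__":
    # Both test functions are 1-Lipschitz, so each dual value is at most W_1,
    # and W_1 is at most W_2 by Jensen's inequality.
    print(dual_value(np.sin, x, y), dual_value(lambda t: t, x, y), w1, w2)
```

Any 1-Lipschitz $f$ yields a value no larger than $\mathcal{W}_{1}$, and $\mathcal{W}_{1}\leq\mathcal{W}_{2}$, which is how a Lipschitz bound on the neuron function translates into a bound in terms of $\mathcal{W}_{2}$.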
