A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization

Yuchen Zhu (Georgia Institute of Technology), Yufeng Zhang (Northwestern University), Zhaoran Wang (Northwestern University), Zhuoran Yang (Yale University), Xiaohong Chen (Yale University)
Abstract

This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks. In particular, we consider the minimax optimization problem stemming from estimating linear functional equations defined by conditional expectations, where the objective functions are quadratic in the functional spaces. We address (i) the convergence of the stochastic gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. We establish convergence under the mean-field regime by considering the continuous-time and infinite-width limit of the optimization dynamics. Under this regime, the stochastic gradient descent-ascent corresponds to a Wasserstein gradient flow over the space of probability measures defined over the space of neural network parameters. We prove that the Wasserstein gradient flow converges globally to a stationary point of the minimax objective at a sublinear rate of $\mathcal{O}(T^{-1}+\alpha^{-1})$, and additionally finds the solution to the functional equation when the regularizer of the minimax objective is strongly convex. Here $T$ denotes the time and $\alpha$ is a scaling parameter of the neural networks. In terms of representation learning, our results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $\mathcal{O}(\alpha^{-1})$, measured in terms of the Wasserstein distance. Finally, we apply our general results to concrete examples including policy evaluation, nonparametric instrumental variable regression, asset pricing, and adversarial Riesz representer estimation.

1 Introduction

Minimax optimization problems are ubiquitous in machine learning, statistics, economics, and other fields. Examples include generative adversarial networks (GANs) (Goodfellow et al., 2020; Salimans et al., 2016), adversarial training (Ganin et al., 2016; Madry et al., 2017), robust optimization (Ben-Tal et al., 2009; Levy et al., 2020), and zero-sum games (Xie et al., 2020b; Zhao et al., 2022). The goal in minimax optimization is to find a solution $(f^{*},g^{*})$ to the problem $\min_{f\in\mathcal{F}}\max_{g\in\mathcal{G}}\mathcal{L}(f,g)$, where $\mathcal{L}$ is a bivariate objective function, and $\mathcal{F}$ and $\mathcal{G}$ are the feasible sets of the decision variables $f$ and $g$. In modern machine learning applications, $\mathcal{F}$ and $\mathcal{G}$ are often function classes flexibly parameterized by neural networks, and the objective $\mathcal{L}(f,g)$ can be approximated using data. The minimax optimization problem is often solved using first-order optimization algorithms. Although these algorithms have been hugely successful in diverse applications, there is not yet a global convergence theory for the popular first-order algorithms that solve general minimax optimization problems with neural networks.

In this work, we study the convergence of first-order algorithms for solving minimax optimization problems where $\mathcal{F}$ and $\mathcal{G}$ are both flexibly parameterized by two-layer neural networks, and the objective functional is quadratic in $f$ and $g$ up to regularization:

$$\min_{f\in\mathcal{F}}\max_{g\in\mathcal{G}}\mathcal{L}(f,g),\qquad \mathcal{L}(f,g)=\mathbb{E}\bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\mathtt{Reg}(f)\bigr], \tag{1.1}$$

where $\mathtt{Reg}(f)$ is a convex regularizer that penalizes the complexity of $f\in\mathcal{F}$. Here the expectation is taken with respect to the joint distribution of random variables $(X,Z)$, $g$ is a function of $Z$, and $\Phi$ takes $(X,Z)$ and a function $f$ as its input and is linear in $f$. The objective function (1.1) arises from solving a linear functional conditional moment equation of the form $\mathbb{E}[\Phi(X,Z;f)\,|\,Z=\cdot]=0$ if and only if $f=f^{*}\in\mathcal{F}$. Here $X$ is a vector containing all the endogenous variables and $Z$ contains all the exogenous/predetermined variables. This problem has ample applications, including policy evaluation (Cai et al., 2019; Duan et al., 2020; Jin et al., 2021; Chen and Qi, 2022; Ramprasad et al., 2022), nonparametric instrumental variable regression (Blundell et al., 2007; Chen and Pouzo, 2012; Chen and Christensen, 2018; Xu et al., 2020), and asset pricing (Chen and Ludvigson, 2009; Chen et al., 2014, 2024). The minimax objective in (1.1) arises when we solve the conditional moment equation via adversarial estimation (Uehara et al., 2020; Duan et al., 2021; Chernozhukov et al., 2020; Liao et al., 2020; Wai et al., 2020; Bennett et al., 2019), which introduces a dual function and transforms equation solving into a minimax optimization.

We study the infinite-dimensional minimax optimization in (1.1) over the space of overparameterized two-layer neural networks. Specifically, a neural network is represented by $f_{\mathtt{NN}}(\cdot;\bm{\theta})=\alpha/N\sum_{i=1}^{N}\phi(\cdot;\theta^{i})$, where $N$ is the number of neurons, $\phi(\cdot;\theta^{i})$ denotes the $i$-th neuron, $\{\theta^{i}\}_{i\in[N]}$ are the network parameters, and $\alpha$ is a scaling parameter. We aim to solve the minimax optimization in (1.1) with both $f$ and $g$ represented by overparameterized two-layer neural networks, which is favorable especially when $Z$ is a high-dimensional vector. To solve this problem, we consider arguably the simplest first-order algorithm, stochastic gradient descent-ascent (SGDA), where the parameters of $f$ and $g$ are simultaneously updated using stochastic gradients of the objective functional. Specifically, we aim to address the following two questions:

  • Does SGDA with overparameterized neural networks converge to some solution?

  • Does SGDA learn data-dependent features that yield a statistically accurate solution?
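As a minimal illustration of the simultaneous update in SGDA, the following sketch runs descent-ascent on a quadratic objective of the form (1.1) with no regularizer. Everything specific here is an assumption made for illustration and is not taken from the paper: the scalar linear classes $f(x)=a\cdot x$ and $g(z)=b\cdot z$ stand in for neural networks, the residual is $\Phi(x,z;f)=y-f(x)$, the data are noiseless with $y=2x$ and $z=x$, and cycling through the samples stands in for stochastic sampling.

```python
# Toy simultaneous gradient descent-ascent on a quadratic minimax objective of the
# form (1.1) with Reg = 0. Assumptions (not from the paper): scalar linear classes
# f(x) = a * x and g(z) = b * z, residual Phi(x, z; f) = y - f(x), noiseless data.

def sgda(data, a=0.0, b=0.0, step=0.1, epochs=200):
    """Run per-sample descent on a (primal) and ascent on b (dual)."""
    for _ in range(epochs):
        for x, z, y in data:
            residual = y - a * x                 # Phi(x, z; f)
            grad_a = -b * z * x                  # d/da of g(z) * Phi - g(z)^2 / 2
            grad_b = z * residual - b * z * z    # d/db of the same per-sample objective
            # Both players move at once, each using the other's current iterate.
            a, b = a - step * grad_a, b + step * grad_b
    return a, b

# Hypothetical data set: z = x and y = 2 * x, so f(x) = 2 * x solves the moment equation.
data = [(x, x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
a_hat, b_hat = sgda(data)
```

In this toy problem the primal parameter approaches 2 and the dual parameter approaches 0, the saddle point of the objective. The example says nothing about the neural network setting, where the objective is nonconvex-nonconcave in the parameters.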

Answering these questions involves two intricate challenges in terms of optimization and representation learning using neural networks. First, the minimax objective is nonconvex-nonconcave with respect to the neural network parameters of $f$ and $g$, so it is unclear whether first-order algorithms converge. Second, the representation of the neural network evolves during the course of optimization, and it is unclear how to track and assess the data-dependent features learned by the neural networks. While there are some existing works on neural network optimization using the technique of neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2018; Cai et al., 2019; Xu and Gu, 2020; Wang et al., 2022), such an approach suggests that the feature representation of the neural networks is fixed throughout training and is only determined by the initialization of the network parameters. Despite being an elegant theoretical framework, the NTK approach is limited in its ability to capture the representation learning aspect of neural network optimization. To show that neural network optimization algorithms learn useful data-dependent features, we need to show, beyond mere convergence, that (i) the algorithm approximately finds a proper solution concept, e.g., a stationary point or a local or global optimizer of the minimax objective function, and (ii) the representation of the neural networks moves from the initialization by a considerable amount.

In this paper, we tackle both challenges by leveraging the framework of mean-field analysis of overparameterized neural networks (Chizat and Bach, 2018; Mei et al., 2018, 2019; Zhang et al., 2020; Lu et al., 2020b; Zhang et al., 2021b; Sirignano and Spiliopoulos, 2020b, a, 2022; Chen et al., 2020b; Fang et al., 2021b). In particular, we focus on the continuous-time and infinite-width limit of the SGDA algorithm, where the stepsize goes to zero and the width $N$ goes to infinity. From the mean-field lens, a neural network $f(\cdot;\bm{\theta})$ can be identified with a probability measure $\mu$ by writing $f(\cdot;\bm{\theta})=\alpha\cdot\int\phi(\cdot;\theta)\,\mu(\mathrm{d}\theta)$, where $\mu$ is the empirical distribution of $\{\theta^{i}\}_{i\in[N]}$ and $\alpha$ is the scaling parameter of the neural network. Thus, parameter updates of SGDA can be regarded as updates of the probability measure $\mu$. From this perspective, we prove that in the continuous-time and infinite-width limit, SGDA corresponds to a gradient flow of the minimax objective $\mathcal{L}$ in the Wasserstein space, i.e., the space of probability measures over the parameter space equipped with the Wasserstein-2 distance.
Besides, by defining a proper potential function that characterizes the stationary point of the minimax objective, we prove that the Wasserstein gradient flow converges to a stationary point at a sublinear rate of $\mathcal{O}(1/T+1/\alpha)$, where $T$ is the time horizon and $\alpha$ is a scaling parameter of the neural network. Moreover, we prove that the Wasserstein distance between the parameter distribution found by SGDA and its initialization is $\mathcal{O}(\alpha^{-1})$, which shows that the representation of the neural networks is allowed to move from the initialization by a considerable amount. Such a behavior is not captured by the NTK analysis, in which the representation is shown to be fixed at the initialization. Furthermore, when the regularization on $f$ satisfies a version of strong convexity, we prove that the Wasserstein gradient flow converges to the global optimizer $f^{*}$ at a sublinear $\mathcal{O}(1/T+1/\alpha)$ rate.
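The identification of a finite-width network with the empirical distribution of its neurons can be checked numerically. The sketch below is illustrative only: the neuron $\phi(x;\theta)=b\tanh(wx+c)$ with $\theta=(w,b,c)$ and all parameter values are assumptions, not choices made in the paper. Because the output is $\alpha$ times an average over neurons, it is unchanged when the neurons are reordered or when every neuron is duplicated, since both operations leave the empirical distribution unchanged.

```python
import math

def neuron(x, theta):
    """Hypothetical neuron phi(x; theta) = b * tanh(w * x + c), with theta = (w, b, c)."""
    w, b, c = theta
    return b * math.tanh(w * x + c)

def f_nn(x, thetas, alpha):
    """Two-layer network (alpha / N) * sum_i phi(x; theta_i)."""
    return alpha / len(thetas) * sum(neuron(x, th) for th in thetas)

# Illustrative parameter values; any finite collection of neurons would do.
thetas = [(0.5, 1.0, 0.0), (-1.2, 0.3, 0.1), (2.0, -0.7, -0.4)]
alpha, x = 10.0, 0.8

base = f_nn(x, thetas, alpha)
permuted = f_nn(x, list(reversed(thetas)), alpha)   # same empirical measure
duplicated = f_nn(x, thetas * 2, alpha)             # same empirical measure, width 2N
```

All three values agree up to floating-point rounding, which is the finite-width counterpart of writing the network as $\alpha\int\phi(\cdot;\theta)\,\mu(\mathrm{d}\theta)$.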

To the best of our knowledge, our work provides the first theoretical analysis of an optimization algorithm solving functional conditional moment equations using neural networks with representation learning. We apply our general theory to several important examples: policy evaluation, instrumental variables regression, asset pricing, and adversarial Riesz representer estimation. In these examples, we prove that the SGDA algorithm finds the global solution with overparameterized neural networks. Moreover, SGDA learns data-dependent features that enable these statistically accurate estimators.

1.1 Related Works

Minimax Optimization. Our work is closely related to the literature on first-order methods for solving minimax optimization problems. These works establish the convergence rate or iteration complexity of first-order methods under various assumptions on the objective function. In particular, most of the existing works focus on finite-dimensional parameter spaces and one of the following objective functions: (i) convex-concave (Lin et al., 2020b; Ibrahim et al., 2019; Ouyang and Xu, 2021; Alkousa et al., 2019; Luo et al., 2021; Xie et al., 2020a; Han et al., 2024; Li et al., 2023; Jin et al., 2022), (ii) nonconvex-concave (Jin et al., 2019; Lin et al., 2020a; Lu et al., 2020a; Ostrovskii et al., 2021b; Zhao, 2023; Huang et al., 2022; Luo et al., 2020; Zhang et al., 2021a; Nouiehed et al., 2019; Thekumparampil et al., 2019), and (iii) nonconvex-nonconcave (Li et al., 2022; Diakonikolas et al., 2021; Ostrovskii et al., 2021a; Yang et al., 2022; Grimmer et al., 2022; Hajizadeh et al., 2024; Grimmer et al., 2023; Yang et al., 2020).

Our work can be viewed as an extension of convex-concave minimax optimization to the infinite-dimensional functional space. In particular, our objective is a regularized quadratic functional with respect to the input functions, which is then restricted to the class of overparameterized neural networks. Note that the objective of interest is in fact nonconvex-nonconcave in the neural network parameter space. Compared with the work on general nonconvex-nonconcave minimax optimization problems, our setting has a better underlying structure in the functional space in terms of convexity. This structure enables us to lift the network parameter updates to the Wasserstein space and analyze the gradient flow in the space of distributions. Our approach leverages the hidden convexity-concavity behind the seemingly nonconvex-nonconcave objective function and thus achieves better results in terms of algorithm convergence and complexity.

Mean-field Analysis in Deep Learning. Our work is closely related to the recent study of neural network training via gradient-based methods. One line of research establishes the convergence of gradient-based algorithms for training overparameterized neural networks under the “lazy training” regime, where the neural networks behave similarly to random kernel functions. Such a regime is also known as the neural tangent kernel regime (Jacot et al., 2018; Allen-Zhu et al., 2019a, b; Chen et al., 2020a; Frei and Gu, 2021; Zou and Gu, 2019; Du et al., 2018, 2019; Arora et al., 2019a, b; Huang and Yau, 2020). Our work is more related to another line of research based on the perspective of mean-field approximation (Mei et al., 2018, 2019; Chizat and Bach, 2018; Sirignano and Spiliopoulos, 2020b, a, 2022; Chen et al., 2020b; Fang et al., 2021b; Chen et al., 2019). Under the mean-field view, the neural network parameters are identified as a distribution over the parameter space. As a result, the evolution of parameters by gradient-based updates is captured by a differential equation that governs the evolution of the corresponding distribution. By elevating the training dynamics to an infinite-dimensional Wasserstein space, the optimization objective often enjoys a benign landscape, which admits a more tractable analysis and global convergence. See, e.g., Zhang et al. (2020, 2021b); Fang et al. (2021b); Lu et al. (2020b); Fang et al. (2019); Chizat (2022); Hu et al. (2021); Nitanda et al. (2022) and the references therein. Also, see Fang et al. (2021a) for a recent survey.

Our work is especially related to the mean-field analysis of the Neural Temporal Difference (TD) (Zhang et al., 2020) and the Neural Actor-Critic (AC) (Zhang et al., 2021b) in reinforcement learning. These previous works analyze the global convergence of the TD and AC algorithms with overparameterized two-layer neural networks. The optimization problem in these two tasks is the minimization of an objective where only one neural network is involved. In contrast to these works, we focus on minimax optimization, which requires neural network parameterization of both the primal function and the dual function. This brings new challenges to the analysis, as the gradient dynamics of the primal and dual neural networks give rise to a coupled system of PDEs. To the best of our knowledge, our paper is the first to apply the mean-field limit to study the convergence of algorithms in solving the general form of functional conditional moment equations using neural networks.

Adversarial Estimation. Our work is also related to the literature on adversarial estimation, a method that solves a functional conditional moment equation by introducing a dual function and reformulating the original problem into a minimax optimization. Our work studies this type of minimax optimization with overparameterized neural networks. Thus, our work is more related to the study of adversarial estimation within neural network function classes (Dikkala et al., 2020; Chernozhukov et al., 2020; Bennett et al., 2019; Xu et al., 2021). Compared with our work, these studies focus on statistical errors pertinent to neural networks, assuming the optimization problem is solved perfectly. We instead study the optimization algorithm and establish the convergence of stochastic gradient descent-ascent with neural networks.

Several previous works have also explored the convergence of optimization dynamics in adversarial estimation with neural networks. In particular, Neural GTD (Wai et al., 2020) and Neural SEM (Liao et al., 2020) analyze the convergence for off-policy evaluation and for structural equation model estimation, respectively, with overparameterized two-layer neural networks. However, their analyses are based on the idea of neural tangent kernel (NTK), where the employed neural network has a fixed representation during training, and the representation is completely determined by the initialization. In contrast, our work adopts the mean-field approach, which enables learning a data-dependent representation.

2 Preliminaries

The functional conditional moment equations cover many important examples in statistics, machine learning, economics, and causal inference. In this section, we first introduce the general formulation of the functional conditional moment equations and then reformulate them into a minimax optimization problem. Then, we present a few concrete examples of functional conditional moment equations such as policy evaluation, nonparametric instrumental variables regression, asset pricing, and Riesz representer estimation. Finally, we introduce the background of mean-field neural networks and Wasserstein space, which are essential for the convergence analysis of the SGDA algorithm.

2.1 Functional Conditional Moment Equations

In this section, we introduce the general formulation of functional conditional moment equations. Let $X\in\mathcal{X}$ be a vector that includes all the endogenous variables, let $Z\in\mathcal{Z}$ denote all the exogenous variables, and let $\mathcal{D}\in\mathscr{P}(\mathcal{X}\times\mathcal{Z})$ denote the joint distribution of $(X,Z)$. We let $\mathbb{E}_{\mathcal{D}}[\cdot]$ denote the expectation taken with respect to the joint distribution of $(X,Z)$ and $\mathbb{E}_{X|Z}[\cdot]$ denote the conditional expectation using the conditional distribution of $X$ given $Z$. Let $W\in\mathcal{W}\subseteq\mathcal{X}\times\mathcal{Z}$ be a subset of variables that may contain both the endogenous and exogenous variables, and let $L^{2}(\mathcal{W})$ denote a Hilbert space of measurable functions of $W$ with finite second moment. Let $\mathcal{F}:=\{f:\mathcal{W}\rightarrow\mathbb{R}\}\subset L^{2}(\mathcal{W})$ denote a class of functions defined on $\mathcal{W}$.
In a functional conditional moment equation problem, we aim to find a function $f_{0}\in\mathcal{F}$ that solves the following functional equation involving the conditional distribution of $X$ given $Z$ over $\mathcal{F}$:

$$\mathbb{E}_{X|Z}\bigl[\Phi(X,Z;f_{0})\,\big|\,Z=z\bigr]=0,\qquad\forall z\in\mathcal{Z}, \tag{2.1}$$

where $\Phi\colon\mathcal{X}\times\mathcal{Z}\times\mathcal{F}\rightarrow\mathbb{R}$ is a known functional.

For any function $f\in\mathcal{F}$ and any $z\in\mathcal{Z}$, we define a functional $\bar{\delta}\colon\mathcal{Z}\times\mathcal{F}\rightarrow\mathbb{R}$ as

$$\bar{\delta}(z;f):=\mathbb{E}_{X|Z}\bigl[\Phi(X,Z;f)\,\big|\,Z=z\bigr],\qquad\forall f\in\mathcal{F},\ z\in\mathcal{Z}. \tag{2.2}$$
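For intuition, when $Z$ takes finitely many values the conditional expectation in (2.2) can be approximated by a within-group sample mean. The sketch below is a minimal illustration under assumptions that are not from the paper: a hypothetical residual $\Phi=y-f(x)$, a binary $Z$, and four hand-picked samples in which the noise averages to zero within each group.

```python
from collections import defaultdict

def delta_bar(samples, f):
    """Empirical delta_bar(z; f): the mean of Phi over the samples sharing the same z."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, x, z in samples:
        sums[z] += y - f(x)      # hypothetical residual Phi = y - f(x)
        counts[z] += 1
    return {z: sums[z] / counts[z] for z in sums}

# Hypothetical samples (y, x, z) with y = 2 * x + noise, noise summing to zero per group.
samples = [(2.5, 1.0, 0), (3.5, 2.0, 0), (7.0, 3.0, 1), (1.0, 1.0, 1)]

exact = delta_bar(samples, lambda x: 2.0 * x)   # candidate that solves the moment equation
wrong = delta_bar(samples, lambda x: x)         # candidate that does not
```

Here `exact` is zero for both values of $z$, while `wrong` is nonzero in each group, so only the first candidate satisfies the empirical analogue of (2.1).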

In other words, the conditional moment equation problem in (2.1) boils down to finding a function $f_{0}\in\mathcal{F}$ such that $\bar{\delta}(\cdot;f_{0})$ is a zero function on $\mathcal{Z}$. Therefore, an equivalent way to solve for $f_{0}\in\mathcal{F}$ in (2.1) is by solving $\inf_{f\in\mathcal{F}}\mathbb{E}[(\bar{\delta}(Z;f))^{2}]$ (Ai and Chen, 2003; Chen and Pouzo, 2012). To control the complexity of the function class $\mathcal{F}$, Ai and Chen (2003) propose to use flexible sieve spaces $\mathcal{F}_{k(n)}$ that become dense in $\mathcal{F}$ as the sieve dimension $k(n)$ grows to infinity with the data sample size $n$, and propose the so-called sieve minimum distance criterion $\min_{f\in\mathcal{F}_{k(n)}}\mathbb{E}\bigl[\bar{\delta}(Z;f)^{2}\bigr]/2$.
In particular, Ai and Chen (2003) allow for two-layer NNs, splines, wavelets, Fourier series, and all kinds of polynomial sieves $\mathcal{F}_{k(n)}$ to approximate functions in $\mathcal{F}\subseteq L^{2}(\mathcal{W})$. Alternatively, Chen and Pouzo (2012) propose the following penalized (or regularized) minimum distance criterion:

$$\min_{f\in\mathcal{F}}\mathbb{E}\bigl[\bar{\delta}(Z;f)^{2}\bigr]/2+\lambda\cdot\mathcal{R}(f), \tag{2.3}$$

where $\lambda\geq 0$ is a regularization parameter and $\mathcal{R}(f)$ is a regularizer on the function $f\in\mathcal{F}$. They allow $\mathcal{R}(f)$ to be any convex or lower-semicompact regularizer. In the minimum distance approach, for any fixed $f$, the authors first estimate $\bar{\delta}(z;f)$ by the following least squares criterion:

$$\mathop{\mathrm{argmin}}_{\delta\in L^{2}(\mathcal{Z})}\mathbb{E}\Bigl[1/2\cdot\bigl(\Phi(X,Z;f)-\delta(Z)\bigr)^{2}\Bigr]=\mathop{\mathrm{argmax}}_{\delta\in L^{2}(\mathcal{Z})}\mathbb{E}\Bigl[\Phi(X,Z;f)\,\delta(Z)-1/2\cdot\delta(Z)^{2}\Bigr].$$
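The equality of the two criteria, and the fact that both are solved by $\bar{\delta}(\cdot;f)$, follow from expanding the square and applying the tower property of conditional expectation. The short derivation below is a sketch that assumes $\Phi(X,Z;f)$ is square-integrable, so that every expectation is finite.

```latex
\begin{aligned}
\mathbb{E}\Bigl[\tfrac{1}{2}\bigl(\Phi(X,Z;f)-\delta(Z)\bigr)^{2}\Bigr]
  &= \tfrac{1}{2}\,\mathbb{E}\bigl[\Phi(X,Z;f)^{2}\bigr]
     - \mathbb{E}\Bigl[\Phi(X,Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr], \\
\mathbb{E}\Bigl[\Phi(X,Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr]
  &= \mathbb{E}\Bigl[\bar{\delta}(Z;f)\,\delta(Z)-\tfrac{1}{2}\,\delta(Z)^{2}\Bigr] \\
  &= \tfrac{1}{2}\,\mathbb{E}\bigl[\bar{\delta}(Z;f)^{2}\bigr]
     - \tfrac{1}{2}\,\mathbb{E}\Bigl[\bigl(\bar{\delta}(Z;f)-\delta(Z)\bigr)^{2}\Bigr].
\end{aligned}
```

The first line shows that the two criteria differ only by a term that does not depend on $\delta$, so they share the same optimizer. The last line shows that the optimizer is $\delta=\bar{\delta}(\cdot;f)$ and that the maximal value is $\mathbb{E}[\bar{\delta}(Z;f)^{2}]/2$, which is the first term of (2.3).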

Furthermore, we assume that the functional $\Phi$ is affine in $f$, which captures several important applications in machine learning and causal inference listed in Section 2.2. Specifically, we define $\widetilde{\Phi}(x,z;f)=\Phi(x,z;f)-\Phi(x,z;0)$, where $0$ stands for the zero function on $\mathcal{W}$. Then for any two functions $f_{1},f_{2}\in\mathcal{F}$ and any $a,b\in\mathbb{R}$, we have

$$\widetilde{\Phi}(x,z;af_{1}+bf_{2})=a\cdot\widetilde{\Phi}(x,z;f_{1})+b\cdot\widetilde{\Phi}(x,z;f_{2}),\qquad\forall(x,z)\in\mathcal{X}\times\mathcal{Z}. \tag{2.4}$$
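To see what the affine assumption amounts to, consider a hypothetical residual of the form $\Phi=y-f(x)$, where $y$ is treated as one of the endogenous variables. This particular $\Phi$ and the test functions below are assumptions made for illustration. Subtracting the value at the zero function removes the constant term $y$, and what remains satisfies (2.4).

```python
def phi(y, x, f):
    """Hypothetical residual Phi = y - f(x); affine, but not linear, in f."""
    return y - f(x)

def phi_tilde(y, x, f):
    """Linear part Phi(.; f) - Phi(.; 0), where 0 is the zero function."""
    return phi(y, x, f) - phi(y, x, lambda _: 0.0)

# Two arbitrary test functions and coefficients (illustrative values only).
f1 = lambda x: x * x
f2 = lambda x: 3.0 * x + 1.0
a, b = 2.0, -0.5
combo = lambda x: a * f1(x) + b * f2(x)

y, x = 4.0, 1.5
lhs = phi_tilde(y, x, combo)
rhs = a * phi_tilde(y, x, f1) + b * phi_tilde(y, x, f2)
# The same identity fails for phi itself because of the constant term y.
affine_gap = phi(y, x, combo) - (a * phi(y, x, f1) + b * phi(y, x, f2))
```

The two sides `lhs` and `rhs` coincide, while `affine_gap` is nonzero, which is the difference between being linear and being affine in $f$.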

Solving (2.1) with Overparameterized Neural Networks. In the sequel, we aim to solve the problem in (2.1) based on i.i.d. data points sampled from $\mathcal{D}$, with $\mathcal{F}$ being a class of overparameterized neural networks. In this case, it is possible that (2.1) does not have a solution within $\mathcal{F}$. Furthermore, for the choice of regularizer, we consider the following specific form of $\mathcal{R}(f)$:

$$\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[\Psi(X,Z;f)], \tag{2.5}$$

where for any given $(x,z)\in\mathcal{X}\times\mathcal{Z}$, $\Psi(x,z;f):\mathcal{F}\rightarrow\mathbb{R}_{+}$ is a convex functional of $f$ that maps each function $f$ to a scalar. Moreover, $\Psi$ satisfies

$$\Psi(x,z;0)=0,\qquad\Psi(x,z;f)\geq 0,\qquad\forall f\in\mathcal{F}, \tag{2.6}$$
$$\frac{\delta\Psi(x,z;af_{1}+bf_{2})}{\delta f}=a\cdot\frac{\delta\Psi(x,z;f_{1})}{\delta f}+b\cdot\frac{\delta\Psi(x,z;f_{2})}{\delta f},\qquad\forall f_{1},f_{2}\in\mathcal{F},\;a,b\in\mathbb{R}. \tag{2.7}$$

Equation (2.6) requires that $\Psi(X,Z;f)$ is a non-negative functional of $f$ that equals $0$ at $f=0$. Equation (2.7) requires that the functional derivative of $\Psi(X,Z;f)$ with respect to $f$ is linear in $f$. One example of $\Psi$ is the $L_{2}$-regularizer $\Psi(x,z;f)=f(w)^{2}$. Here $w\in\mathcal{W}$ denotes a subset of the variables, which may contain values from both the endogenous variables $x$ and the exogenous variables $z$.
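
As a quick check, the $L_{2}$-regularizer satisfies both conditions. The computation below is a sketch that treats the functional derivative as the pointwise derivative of $\Psi$ with respect to the value $f(w)$; this simplified reading is our assumption, not the paper's formal definition.

```latex
\begin{align*}
\Psi(x,z;0) &= 0^{2} = 0, \qquad \Psi(x,z;f) = f(w)^{2} \geq 0,\\
\frac{\delta\Psi(x,z;f)}{\delta f} &= 2f(w),\\
\frac{\delta\Psi(x,z;af_{1}+bf_{2})}{\delta f} &= 2\bigl(af_{1}(w)+bf_{2}(w)\bigr)
 = a\cdot\frac{\delta\Psi(x,z;f_{1})}{\delta f}+b\cdot\frac{\delta\Psi(x,z;f_{2})}{\delta f}.
\end{align*}
```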

Minimax Estimation. To solve the optimization problem in (2.3), we first transform it into an unconditional moment formulation by introducing a dual function. By Fenchel duality, we can rewrite the objective function $J(f)$ as follows,

\begin{align}
J(f) &= \mathbb{E}_{\mathcal{D}}\Bigl[1/2\cdot\bar{\delta}(z;f)^{2}+\lambda\Psi(X,Z;f)\Bigr] \tag{2.8}\\
&= \mathbb{E}_{\mathcal{D}}\Bigl[\max_{g:\mathcal{Z}\rightarrow\mathbb{R}}\Bigl(g(z)\cdot\mathbb{E}\bigl[\Phi(X,z;f)\,\big|\,z\bigr]-1/2\cdot g(z)^{2}\Bigr)+\lambda\Psi(X,Z;f)\Bigr] \notag\\
&= \max_{g:\mathcal{Z}\rightarrow\mathbb{R}}\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr]. \notag
\end{align}
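
The two steps in (2.8) can be unpacked as follows. This is a sketch under the assumption, suggested by the second line of (2.8), that $\bar{\delta}(z;f)=\mathbb{E}[\Phi(X,Z;f)\,|\,Z=z]$. The first identity is the Fenchel conjugate of the scalar quadratic, applied with $u=\bar{\delta}(z;f)$ and $v=g(z)$. The second is the tower property, used once the maximization over $g$ is moved outside the expectation, which is possible because $g$ ranges over all functions on $\mathcal{Z}$ and can therefore be chosen pointwise.

```latex
\begin{align*}
\frac{1}{2}u^{2} &= \max_{v\in\mathbb{R}}\Bigl(v\,u-\frac{1}{2}v^{2}\Bigr), \qquad \text{attained at } v=u,\\
\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\mathbb{E}\bigl[\Phi(X,Z;f)\,\big|\,Z\bigr]\Bigr] &= \mathbb{E}_{\mathcal{D}}\bigl[g(Z)\cdot\Phi(X,Z;f)\bigr].
\end{align*}
```

In particular, for a fixed $f$ the inner maximizer is $g(z)=\bar{\delta}(z;f)$, since $v\mapsto vu-v^{2}/2$ is concave with derivative $u-v$.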

The formulation in (2.8) leads to the following minimax optimization problem:

\begin{equation}
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\Phi(X,Z;f)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr]. \tag{2.9}
\end{equation}

We note that $\mathcal{L}$ is a convex-concave functional with respect to the functions $f$ and $g$. We denote by $(f^{*},g^{*})$ the unique saddle point of (2.9). Here the uniqueness of $f^{*}$ comes from the convexity of the regularization $\Psi(X,Z;f)$, and $g^{*}(z)=\mathbb{E}[\Phi(X,Z;f^{*})\,|\,Z=z]$ implies the uniqueness of $g^{*}$. Without the regularization, i.e., $\lambda=0$, the saddle point of (2.9) is $f^{*}=f_{0}$ and $g^{*}=0$.

2.2 Examples of Functional Conditional Moment Equation

In this section, we discuss several important applications of the functional conditional moment equation, which serve as running examples of this paper.

Policy Evaluation. We consider a Markov decision process given by $(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$, where $\mathcal{S}\subseteq\mathbb{R}^{d}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\rightarrow\mathscr{P}(\mathcal{S})$ is the transition kernel, $r:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. Given a policy $\pi:\mathcal{S}\rightarrow\mathscr{P}(\mathcal{A})$, an agent interacts with the environment in the following manner. At a state $s_{t}$, the agent takes an action $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and receives a reward $r_{t}=r(s_{t},a_{t})$. Then, the agent transitions to the next state $s_{t+1}\sim\mathcal{P}(\cdot\,|\,s_{t},a_{t})$.
We denote the transition kernel induced by policy $\pi$ by $\mathcal{P}^{\pi}(s'\,|\,s)=\int_{\mathcal{A}}\mathcal{P}(s'\,|\,s,a)\pi(a\,|\,s)\,\mathrm{d}a$ for any $s,s'\in\mathcal{S}$. In policy evaluation, we aim to estimate the value function $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$ defined as follows,

\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\Bigl[\sum_{i=0}^{\infty}\gamma^{i}r(s_{i},a_{i})\,\Big|\,s_{0}=s\Bigr],
\end{equation*}

where the expectation $\mathbb{E}_{\pi}$ is taken with respect to $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and $s_{t+1}\sim\mathcal{P}(\cdot\,|\,s_{t},a_{t})$ for $t\geq 0$. By the Bellman equation (Sutton and Barto, 2018), it holds for any $s\in\mathcal{S}$ that

\begin{equation}
V^{\pi}(s)-\mathcal{T}^{\pi}V^{\pi}(s)=0,\quad \mathcal{T}^{\pi}f(s)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\bigl[r(s,a)\bigr]+\gamma\,\mathbb{E}_{s'\sim\mathcal{P}^{\pi}(\cdot\,|\,s)}\bigl[f(s')\bigr]. \tag{2.10}
\end{equation}

Let $\mathcal{D}$ denote the joint distribution of the state-action tuple $(s,a,s')$ under policy $\pi$. Corresponding to the Bellman equation in (2.10), the value function $V^{\pi}$ satisfies the following functional conditional moment equation,

\begin{equation}
\mathbb{E}_{s'|s}\Bigl[r(s,a)-V^{\pi}(s)+\gamma\cdot V^{\pi}(s')\,\Big|\,s\Bigr]=0. \tag{2.11}
\end{equation}

We notice that (2.11) is a special case of the functional conditional moment equation in (2.1) by setting the exogenous variable $Z$ to be the current state $s$, the endogenous variable $X$ to be the next state $s'$, and the function to be estimated $f:\mathcal{S}\rightarrow\mathbb{R}$ to be defined on the state space $\mathcal{S}$. In this case, the functional is $\Phi(X,Z;f)=r+\gamma\cdot f(X)-f(Z)$, where $r$ is the reward function. We remark that the function $f$ can be evaluated simultaneously on $X$ and $Z$ because both $X$ and $Z$ are variables defined on $\mathcal{S}$. Following the same derivation as in (2.8), policy evaluation can be formulated as the following minimax optimization problem,

\begin{equation*}
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\bigl(r+\gamma\cdot f(X)-f(Z)\bigr)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
\end{equation*}
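
To make the conditional moment in (2.11) concrete, the following minimal sketch evaluates it on a finite state space, where the conditional expectation is a matrix-vector product. The two-state chain, the names `P_pi` and `r_pi` (the policy-induced kernel and the expected one-step reward), and the helper functions are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np

def value_function(P_pi, r_pi, gamma):
    """Solve the Bellman equation V = r_pi + gamma * P_pi @ V on a finite state space."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def conditional_moment(f, P_pi, r_pi, gamma):
    """E[r + gamma * f(s') - f(s) | s], evaluated at every state s."""
    return r_pi + gamma * P_pi @ f - f

# Hypothetical two-state chain: P_pi[s, s'] is the transition probability
# under the policy, r_pi[s] is the expected reward in state s.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
gamma = 0.9

V = value_function(P_pi, r_pi, gamma)
# The conditional moment vanishes at the value function ...
residual_at_V = conditional_moment(V, P_pi, r_pi, gamma)
# ... and reduces to the expected reward at f = 0.
residual_at_zero = conditional_moment(np.zeros(2), P_pi, r_pi, gamma)
```

In this tabular setting the conditional moment is available in closed form; the minimax formulation matters when only sampled transitions are observed and the conditional expectation cannot be computed directly.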

Nonparametric Instrumental Variables Regression. The nonparametric instrumental variables (NPIV) model is widely used in statistics and economics. The model can be described by a single equation together with a moment condition on the noise,

\begin{equation*}
Y=f_{0}(X)+\varepsilon,\quad \mathbb{E}\bigl[\varepsilon\,\big|\,Z\bigr]=0,
\end{equation*}

where $Y$ is an observed outcome, $X$ is the endogenous variable, $Z$ is the exogenous variable, and $f_{0}$ is the true model that characterizes the relationship between $Y$ and $X$ and is also the function we want to estimate. In this model, $\varepsilon$ is a noise term possibly correlated with the endogenous $X$ but uncorrelated with the exogenous $Z$. It is straightforward to see that the NPIV model fits into the framework of the functional conditional moment equation by plugging the model equation into the equation about $\varepsilon$,

\begin{equation}
\mathbb{E}_{\mathcal{D}}\Bigl[Y-f_{0}(X)\,\big|\,Z\Bigr]=0. \tag{2.12}
\end{equation}

We notice that (2.12) is a special case of the functional conditional moment equation in (2.4) by identifying $X$, $Z$ with the endogenous and exogenous variables respectively and setting the functional as $\Phi(X,Z;f)=Y-f(X)$. Following the same derivation as in (2.8), the problem of NPIV is equivalent to the following minimax optimization problem,

\begin{equation*}
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\bigl(Y-f(X)\bigr)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
\end{equation*}
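
The NPIV objective can be illustrated with a toy gradient descent-ascent loop. The sketch below is not the neural, mean-field algorithm analyzed in this paper: it assumes linear classes $f(x)=ax$ and $g(z)=bz$, full-batch gradients, a simulated data-generating process with $f_{0}(x)=2x$, and $\Psi(x,z;f)=f(x)^{2}$ with $\lambda=0$ by default. All of these are hypothetical choices made only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)   # instrument (exogenous)
u = rng.normal(size=n)   # unobserved confounder
x = z + u                # endogenous regressor, correlated with the noise
y = 2.0 * x + u          # f_0(x) = 2x, noise eps = u satisfies E[eps | Z] = 0

def grads(a, b, lam=0.0):
    """Gradients of the empirical objective for f(x) = a*x and g(z) = b*z."""
    resid = y - a * x
    grad_a = np.mean(-b * z * x) + lam * np.mean(2.0 * a * x**2)
    grad_b = np.mean(z * resid) - b * np.mean(z**2)
    return grad_a, grad_b

a, b, eta = 0.0, 0.0, 0.1
for _ in range(2000):
    ga, gb = grads(a, b)
    # Simultaneous update: descent in the primal f, ascent in the dual g.
    a, b = a - eta * ga, b + eta * gb

ols = np.sum(x * y) / np.sum(x * x)  # least squares, biased by the confounder
iv = np.sum(z * y) / np.sum(z * x)   # classical instrumental-variable estimate
```

With $\lambda=0$ the stationary point of this loop has $b=0$ and $a$ equal to the sample instrumental-variable estimate, so the primal iterate approaches the true slope $2$ while least squares stays near $2.5$ in this simulation.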

Asset Pricing. Asset pricing refers to the process of determining the fair value of financial assets. This field is fundamental in finance and underpins much of the work in investment, portfolio management, and risk assessment. The semiparametric Consumption Capital Asset Pricing Model (CCAPM) is a foundational model in asset pricing that describes the relationship between systematic risk and expected asset returns, and that also incorporates the influence of the consumption preference of investors over time. Moreover, CCAPM can be characterized through a functional conditional moment equation (Chen et al., 2014; Chen and Ludvigson, 2009). To describe the model, let $C_{t}$ denote the consumption level at time $t$ and $c_{t}\equiv C_{t}/C_{t-1}$ the consumption growth. The marginal utility of consumption at time $t$ is given by $\text{MU}_{t}=C_{t}^{-\gamma_{0}}f_{0}(c_{t})$, where $\gamma_{0}>0$ is the discount factor and $f_{0}:\mathcal{C}\to\mathbb{R}$ is the nonparametric structural demand function, which is an unknown positive function of our interest and is defined on $\mathcal{C}$, the space of consumption growth.
The unknown function $f_{0}$ can be understood as a taste shifter that describes how the marginal utility of consumption changes with the state of the economy in terms of consumption growth.

Now, consider the growth-return tuple $(c_{t},\widetilde{r}_{t+1},c_{t+1})$ for $t\in\mathbb{N}^{+}$ with joint distribution $\mathcal{D}$, where $c_{t}$ is the consumption growth at the current time $t$, and $c_{t+1}$ is the consumption growth at the next time $t+1$. Here $\widetilde{r}_{t+1}$ is a modified return observed in this period, which is a known function of the actual return $r_{t+1}$ and the consumption growth $c_{t+1}$ at time $t+1$. We consider the scenario where the time series of consumption growth $\{c_{t}\}_{t\geq 0}$ follows a time-homogeneous Markov chain with a smooth transition kernel. That is, both conditional distributions $c_{t+1}\,|\,c_{t}$ and $c_{t}\,|\,c_{t+1}$ admit a smooth density function.
The CCAPM model captures the behavior of $f_{0}$ through the following equation:

\begin{equation}
\mathbb{E}_{c_{t+1}|c_{t}}\bigl[\widetilde{r}_{t+1}\cdot f_{0}(c_{t+1})-f_{0}(c_{t})\,\big|\,c_{t}\bigr]=0, \tag{2.13}
\end{equation}

where the modified return can be further expressed as $\widetilde{r}_{t+1}=\delta_{0}\cdot r_{t+1}\cdot c_{t+1}^{-\gamma_{0}}$, and $\delta_{0}\in(0,1]$ is the rate of time preference. We focus on a setting where $\mathcal{C}\subseteq\mathbb{R}$ is a compact set, and the modified return $\widetilde{r}_{t+1}$ is bounded for all $t\geq 0$. We notice that (2.13) is a special case of the functional conditional moment equation in (2.4). We can identify the exogenous variable $Z$ with $c_{t}$, the consumption growth at the current time $t$, and the endogenous variable $X$ with $c_{t+1}$, the consumption growth at the next time $t+1$. In this scenario, we identify the space $\mathcal{W}$ with $\mathcal{C}$, the space of consumption growth, and the function to be estimated $f:\mathcal{C}\to\mathbb{R}$ is defined on $\mathcal{C}$.
The functional is $\Phi(X,Z;f)=\widetilde{r}_{t+1}\cdot f(X)-f(Z)$, where $\widetilde{r}_{t+1}$ again denotes the modified return. As in the case of policy evaluation, the function $f$ can be evaluated simultaneously on $X$ and $Z$ because both $X$ and $Z$ are variables defined on $\mathcal{C}$. Following the same derivation as in (2.8), the problem of asset pricing through CCAPM is equivalent to the following minimax optimization problem,

\begin{equation*}
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z)\cdot\bigl(\widetilde{r}_{t+1}\cdot f(X)-f(Z)\bigr)-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Bigr].
\end{equation*}
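
The ingredients of the CCAPM objective are simple to compute from data. The sketch below evaluates the modified return, the sample values of $\Phi$, and an empirical version of the objective. The function names are hypothetical, and the regularizer $\Psi(X,Z;f)=f(Z)^{2}$ is an assumed choice made for illustration; the paper only requires $\Psi$ to satisfy (2.6) and (2.7).

```python
import numpy as np

def modified_return(r_next, c_next, delta0, gamma0):
    """r_tilde_{t+1} = delta0 * r_{t+1} * c_{t+1}^(-gamma0)."""
    return delta0 * r_next * np.power(c_next, -gamma0)

def ccapm_residual(f, c_now, c_next, r_next, delta0, gamma0):
    """Sample values of Phi(X, Z; f) = r_tilde * f(X) - f(Z), with Z = c_t, X = c_{t+1}."""
    r_tilde = modified_return(r_next, c_next, delta0, gamma0)
    return r_tilde * f(c_next) - f(c_now)

def empirical_objective(f, g, c_now, c_next, r_next, delta0, gamma0, lam):
    """Sample average of g(Z) * Phi - g(Z)^2 / 2 + lam * Psi, with Psi = f(Z)^2 (assumed)."""
    phi = ccapm_residual(f, c_now, c_next, r_next, delta0, gamma0)
    dual = g(c_now)
    penalty = f(c_now) ** 2
    return np.mean(dual * phi - 0.5 * dual**2 + lam * penalty)
```

Here `f` and `g` are any vectorized callables on arrays of consumption growth, for instance the outputs of two neural networks evaluated on a batch of observations.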

Adversarial Riesz Representer Estimation. Many problems in statistics, causal inference, and finance involve the task of learning a continuous linear functional of the following form,

\begin{equation}
\mathcal{V}(g)=\mathbb{E}\bigl[m(V;g)\bigr], \tag{2.14}
\end{equation}

where $g:\mathcal{X}\rightarrow\mathbb{R}$ is a function in $\mathcal{G}$, the functional $\mathcal{V}$ is defined on the function space $\mathcal{G}$, and $V$ is a random vector that represents the source of randomness in the functional and of which we have access to observations. Moreover, suppose the continuous linear functional $\mathcal{V}(\cdot)$ is also mean-square continuous with respect to the $L^{2}$ norm, which is often the case. Then it can be written in a more benign and useful form. Formally speaking, for such a linear functional $\mathcal{V}$, there exists a function $f_{0}$ such that for any $g\in\mathcal{G}$,

\begin{equation}
\mathcal{V}(g)=\mathbb{E}\bigl[f_{0}(X)g(X)\bigr]. \tag{2.15}
\end{equation}

The function $f_{0}$ here is called the Riesz representer of the linear functional $\mathcal{V}$, and the representation (2.15) follows from the Riesz representation theorem. Information about the Riesz representer of such a linear functional is crucial to numerous applications and learning tasks. Therefore, we aim to estimate $f_{0}$ by exploiting the relationship characterized by this equation. We make the trivial observation that the true Riesz representer $f_{0}$ can be recovered by solving the following equation in $f$,

\begin{equation}
\mathbb{E}\Bigl[f_{0}(X)-f(X)\,\big|\,X\Bigr]=0. \tag{2.16}
\end{equation}

Clearly, $f=f_{0}$ solves the equation above, so the true Riesz representer is recovered. We remark that this is indeed a special case, since the expectation taken in (2.16) is effectively unconditioned: the equation involves only the endogenous variable $X$, which indicates that the exogenous variable $Z$ coincides with $X$. While special, the problem still fits in the framework discussed here. By setting $\Phi(X,Z;f)=f_{0}(X)-f(X)$, we recover the intractable formulation of Riesz representer estimation.

However, unlike the previous examples where we have access to observations of each term in the equation, here we have no direct access to values of $f_{0}$, making the problem seemingly intractable. Fortunately, reformulating the original problem as a minimax optimization problem resolves this difficulty. In the minimax formulation, the linear functional $\mathcal{V}$ appears in the form of (2.14), which can be approximated using empirical values calculated from accessible observations of the random vector $V$. Following the same derivation as in (2.8) and the definition of the Riesz representer in (2.15), the problem of adversarial Riesz representer estimation is equivalent to the following minimax optimization problem,

\begin{equation}
\min_{f}\max_{g}\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}\Bigl[m(V;g)-f(X)\cdot g(X)-1/2\cdot g(X)^{2}+\lambda\Psi(X,X;f)\Bigr]. \tag{2.17}
\end{equation}

Again, we stress that in (2.17), the absence of $Z$ is due to the fact that both the endogenous and exogenous variables are described by $X$, and the objective is computationally tractable since we have access to observations of both $X$ and $V$.
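
The way $f_{0}$ drops out of the objective can be seen in a few lines. The following is a sketch that takes $\Phi(X,Z;f)=f_{0}(X)-f(X)$ with $Z=X$ and applies (2.15) and then (2.14); it is our reading of the derivation rather than a step spelled out in the text.

```latex
\begin{align*}
\mathbb{E}\bigl[g(X)\cdot\bigl(f_{0}(X)-f(X)\bigr)\bigr]
&= \mathbb{E}\bigl[f_{0}(X)g(X)\bigr]-\mathbb{E}\bigl[f(X)g(X)\bigr]\\
&= \mathcal{V}(g)-\mathbb{E}\bigl[f(X)g(X)\bigr]\\
&= \mathbb{E}\bigl[m(V;g)-f(X)\cdot g(X)\bigr].
\end{align*}
```

Adding the terms $-1/2\cdot g(X)^{2}$ and $\lambda\Psi(X,X;f)$ inside the expectation gives the objective in (2.17), which no longer involves $f_{0}$.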

2.3 Mean-Field Neural Network and Wasserstein Space

In the sequel, we will consider functions in the neural network function class. Consider a neuron function defined on a given state space $\Omega$, $\sigma:\Omega\times\mathbb{R}^{D}\rightarrow\mathbb{R}$, that takes an input $x\in\Omega$ and a parameter $\theta\in\mathbb{R}^{D}$ and outputs a value in $\mathbb{R}$. For $\bm{\theta}=(\theta_{1},\dots,\theta_{N})$ where $\theta_{i}\in\mathbb{R}^{D}$, we can define an overparameterized two-layer neural network function $h$ using the neuron function $\sigma$,

\begin{equation*}
h(x,\bm{\theta})=\frac{1}{N}\sum_{i=1}^{N}\sigma(x;\theta_{i}),\quad\forall x\in\Omega.
\end{equation*}

For such a form, we can further consider the infinite-width limit $N\rightarrow\infty$. In this limit, the neural network function $h$ becomes a mean-field neural network and can be parameterized by a probability measure over the parameter space, $\mu\in\mathscr{P}(\mathbb{R}^{D})$,

\begin{equation*}
h(x;\mu)=\int_{\mathbb{R}^{D}}\sigma(x;\theta)\,\mathrm{d}\mu(\theta),\quad\forall x\in\Omega.
\end{equation*}

When considering such a limit, the optimization problem over the neural network function class is turned from a finite-dimensional problem over the parameter space into an infinite-dimensional problem over the space of probability measures. Therefore, we will need to track the convergence of probability measures over the Wasserstein space when analyzing the convergence of algorithms.
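The passage from the finite-width average to the integral against $\mu$ can be illustrated numerically. The following is a minimal sketch under assumptions that are not part of the paper: the neuron is taken to be $\sigma(x;\theta)=\tanh(\theta x)$ with $D=1$, and $\mu=\mathcal{N}(0,1)$. Under these choices the mean-field value $h(x;\mu)$ is zero by symmetry, so the finite-width output should shrink toward zero as $N$ grows.

```python
import math
import random

def neuron(x, theta):
    # Hypothetical neuron function sigma(x; theta) = tanh(theta * x), with D = 1.
    return math.tanh(theta * x)

def finite_width_network(x, thetas):
    # h(x, theta) = (1/N) * sum_i sigma(x; theta_i)
    return sum(neuron(x, th) for th in thetas) / len(thetas)

def sample_parameters(n, rng):
    # Draw theta_1, ..., theta_N i.i.d. from the assumed measure mu = N(0, 1).
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

rng = random.Random(0)
x = 1.5
for n in (10, 1000, 100000):
    # The finite-width value is a Monte Carlo estimate of the integral h(x; mu).
    h_n = finite_width_network(x, sample_parameters(n, rng))
    print(n, h_n)
```

The sampling error of the average decays like $N^{-1/2}$, which is the sense in which the empirical measure of the neurons stands in for $\mu$.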

We now review the necessary background on the Wasserstein space. Let $\mathscr{P}_{p}(\mathbb{R}^{D})$ be the space of all probability measures over the $D$-dimensional Euclidean space $\mathbb{R}^{D}$ with finite $p$-th order moments. The Wasserstein-$p$ distance between two probability measures $\mu,\nu\in\mathscr{P}_{p}(\mathbb{R}^{D})$ is defined as follows,

$$\mathcal{W}_{p}(\mu,\nu)=\inf\biggl\{\Bigl(\int\|x-y\|^{p}\,\mathrm{d}\gamma(x,y)\Bigr)^{1/p}\,\Big|\,\gamma\in\mathscr{P}_{p}(\mathbb{R}^{D}\times\mathbb{R}^{D}),\ x_{\sharp}\gamma=\mu,\ y_{\sharp}\gamma=\nu\biggr\}, \tag{2.18}$$

where the infimum is taken over all couplings of $\mu$ and $\nu$. Here we denote by $x_{\sharp}\gamma$ and $y_{\sharp}\gamma$ the marginal distributions of $\gamma$ with respect to $x$ and $y$, respectively. We call $\mathcal{M}_{p}=(\mathscr{P}_{p}(\mathbb{R}^{D}),\mathcal{W}_{p})$ the Wasserstein-$p$ space. For any $1\leq p\leq q$, due to the relation $\mathbb{E}[|X|^{p}]^{1/p}\leq\mathbb{E}[|X|^{q}]^{1/q}$, we have $\mathcal{W}_{p}(\mu,\nu)\leq\mathcal{W}_{q}(\mu,\nu)$ for any two measures $\mu,\nu$. In this paper, we focus on the cases $p=1,2$. Unless otherwise specified, we refer to the distance with $p=2$ as the Wasserstein distance in the sequel.
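For intuition, the definition in (2.18) can be evaluated exactly in a special case. The sketch below assumes two uniform empirical measures on the real line ($D=1$) with the same number of atoms. In that setting the monotone coupling, which pairs sorted atoms, is optimal for every $p\geq 1$. The function name `wasserstein_1d` and the sample atoms are illustrative choices, not objects from the paper.

```python
def wasserstein_1d(xs, ys, p):
    # W_p between two uniform empirical measures on the real line with the
    # same number of atoms: the monotone (sorted) pairing is optimal for p >= 1.
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    pairs = zip(sorted(xs), sorted(ys))
    cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return cost ** (1.0 / p)

mu_atoms = [0.0, 1.0, 3.0]
nu_atoms = [2.0, 0.5, 5.0]
w1 = wasserstein_1d(mu_atoms, nu_atoms, 1)
w2 = wasserstein_1d(mu_atoms, nu_atoms, 2)
# The ordering W_1 <= W_2 stated in the text can be checked on this example.
print(w1, w2, w1 <= w2)
```

In higher dimensions no sorting shortcut exists and the infimum over couplings has to be computed by a linear program or an approximation.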

The Wasserstein-2 space $\mathcal{M}_{2}=(\mathscr{P}_{2}(\mathbb{R}^{D}),\mathcal{W}_{2})$ can be viewed as an infinite-dimensional Riemannian manifold (Villani, 2008). Formally, the tangent space at a point $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D})$ is defined as

$$\mathrm{Tan}_{\rho}\bigl(\mathscr{P}_{2}(\mathbb{R}^{D})\bigr)=\Bigl\{v\in L^{2}(\rho)\,\Big|\,\int\langle v,u\rangle\,\mathrm{d}\rho=0,\ \forall u\in L^{2}(\rho)\text{ s.t. }\mathrm{div}(u\rho)=0\Bigr\}.$$

Then, for any absolutely continuous curve $\rho:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$ on the Wasserstein-2 space, there exists a family of vector fields $v_{t}\in\mathrm{Tan}_{\rho_{t}}(\mathscr{P}_{2}(\mathbb{R}^{D}))$ such that the continuity equation

$$\partial_{t}\rho_{t}+\mathrm{div}(v_{t}\rho_{t})=0 \tag{2.19}$$

holds in the sense of distributions. For any two absolutely continuous curves $\rho,\widetilde{\rho}:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D})$, we define the inner product between $\partial_{t}\rho_{t}$ and $\partial_{t}\widetilde{\rho}_{t}$ for any $t\in[0,1]$ as follows,

$$\langle\partial_{t}\rho_{t},\partial_{t}\widetilde{\rho}_{t}\rangle_{\rho_{t}}=\int\langle v_{t},\widetilde{v}_{t}\rangle\,\mathrm{d}\rho_{t}, \tag{2.20}$$

where $\langle v_{t},\widetilde{v}_{t}\rangle$ is the inner product over $\mathbb{R}^{D}$, and $(\rho_{t},v_{t})$ and $(\widetilde{\rho}_{t},\widetilde{v}_{t})$ satisfy the continuity equation in (2.19). Note that (2.20) yields a Riemannian metric over $\mathcal{M}_{2}$. Furthermore, the Riemannian metric induces a norm $\|\partial_{t}\rho_{t}\|_{\rho_{t}}=\langle\partial_{t}\rho_{t},\partial_{t}\rho_{t}\rangle_{\rho_{t}}^{1/2}$.
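As a simple illustration of the continuity equation (2.19) and the induced norm, consider a curve that rigidly translates an initial measure. The direction $c$ and the curve below are illustrative assumptions and are not used elsewhere in the paper.

```latex
% Worked example (illustrative): rigid translation of \rho_0 \in \mathscr{P}_2(\mathbb{R}^D)
% along a fixed direction c \in \mathbb{R}^D.
\begin{align*}
\rho_t &= (\mathrm{id} + t c)_{\sharp}\rho_0, \qquad v_t(x) \equiv c = \nabla_x \langle c, x\rangle, \\
\partial_t \rho_t + \mathrm{div}(v_t \rho_t) &= 0 \quad \text{(in the sense of distributions)}, \\
\|\partial_t \rho_t\|_{\rho_t} &= \Bigl(\int \|c\|^2 \,\mathrm{d}\rho_t\Bigr)^{1/2} = \|c\|, \\
\mathcal{W}_2(\rho_0, \rho_1) &= \|c\| = \int_0^1 \|\partial_t \rho_t\|_{\rho_t}\,\mathrm{d}t .
\end{align*}
```

Since $v_t$ is a gradient field, integration by parts shows that it is orthogonal in $L^2(\rho_t)$ to every $u$ with $\mathrm{div}(u\rho_t)=0$, so it lies in the tangent space. The length of the curve equals the Wasserstein-2 distance between its endpoints, so this translation is a constant-speed geodesic.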

3 Algorithms

In this section, we introduce the stochastic gradient descent-ascent algorithm (SGDA) and its mean-field limit, which is characterized by the continuity equation.

Stochastic Gradient Descent-Ascent Algorithm. We solve the minimax optimization problem in (2.9) via SGDA. Recall that the minimax objective involves two functions simultaneously, where the primal function $f$ represents the true model of interest and the dual function $g$ represents an adversarial player. Specifically, we parameterize both $f$ and $g$ with neural networks of width $N$ and parameters $\bm{\theta}=(\theta^{1},\theta^{2},\dots,\theta^{N})\in\mathbb{R}^{D\times N}$ and $\bm{\omega}=(\omega^{1},\omega^{2},\dots,\omega^{N})\in\mathbb{R}^{D\times N}$,

$$f(\cdot;\bm{\theta})=\frac{\alpha}{N}\sum_{i=1}^{N}\phi(\cdot;\theta^{i}),\quad g(\cdot;\bm{\omega})=\frac{\alpha}{N}\sum_{i=1}^{N}\psi(\cdot;\omega^{i}), \tag{3.1}$$

where we use bold symbols $\bm{\theta}$ and $\bm{\omega}$ to denote the full parameter of each neural network and non-bold symbols $\theta$ and $\omega$ to denote the parameter of a single neuron. Here, $\phi(\cdot;\theta):\mathcal{W}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ and $\psi(\cdot;\omega):\mathcal{Z}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ are the neuron functions. In particular, we recover the standard two-layer neural network parameterization for $f$ and $g$ when we choose $\phi,\psi$ to have the following specific form,

$$\phi(w;\beta,W)=\beta\cdot\sigma_{f}(w;W),\quad\psi(z;\beta,W)=\beta\cdot\sigma_{g}(z;W),$$

where $\sigma_{f}:\mathcal{W}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ and $\sigma_{g}:\mathcal{Z}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ are activation functions with inputs $w$ and $z$, respectively, and parameters $W$. We note that it is not necessary to choose the same width $N$ for $f$ and $g$, and the activation functions $\sigma_{f},\sigma_{g}$ need not have the same parameter dimension $D$. Here we use the same width $N$ and parameter dimension $D$ to keep the notation simple, as this does not affect the validity of the results presented in this paper.
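The parameterization in (3.1) with neurons of the form $\beta\cdot\sigma_{f}(w;W)$ can be sketched in a few lines. In this hedged example the activation $\sigma_{f}(w;W)=\tanh(\langle W,w\rangle)$, the input dimension, the width, and the helper names `phi`, `two_layer`, and `init_neurons` are all assumptions made for illustration. Each neuron parameter is the pair $(\beta,W)$.

```python
import math
import random

def phi(w, beta, W):
    # Neuron phi(w; beta, W) = beta * sigma_f(w; W); tanh of an inner product
    # is an assumed choice of sigma_f, not one fixed by the text.
    return beta * math.tanh(sum(Wj * wj for Wj, wj in zip(W, w)))

def two_layer(w, neurons, alpha):
    # f(w; theta) = (alpha / N) * sum_i phi(w; theta^i), as in (3.1).
    n = len(neurons)
    return alpha / n * sum(phi(w, beta, W) for beta, W in neurons)

def init_neurons(n, dim, rng):
    # Each theta^i = (beta, W) with standard Gaussian entries.
    return [(rng.gauss(0.0, 1.0), [rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(n)]

rng = random.Random(0)
neurons = init_neurons(n=256, dim=3, rng=rng)
w = [0.2, -0.1, 0.4]
f_scale_one = two_layer(w, neurons, alpha=1.0)                      # alpha = 1
f_scale_sqrt = two_layer(w, neurons, alpha=math.sqrt(len(neurons)))  # alpha = sqrt(N)
print(f_scale_one, f_scale_sqrt)
```

The two outputs differ only by the factor $\alpha$, which makes explicit that the scaling changes the magnitude of the network output at initialization rather than the neurons themselves.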

In addition, we have introduced a scaling factor $\alpha>0$ in (3.1). Setting the scaling parameter $\alpha=\sqrt{N}$ in (3.1) recovers the neural tangent kernel regime (Jacot et al., 2018). Setting $\alpha=1$ recovers the mean-field regime (Mei et al., 2018, 2019). In a discrete-time finite-width (DF) scenario, at the $k$-th iteration, the primal function $f$ and the adversarial player $g$ are updated as follows,

$$\begin{aligned}
\text{DF-GD:}\quad \bm{\theta}_{k+1}&=\bm{\theta}_{k}-\eta\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\bm{\theta}}\Phi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))-\eta\lambda\cdot\nabla_{\bm{\theta}}\Psi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k})),\\
\text{DF-GA:}\quad \bm{\omega}_{k+1}&=\bm{\omega}_{k}+\eta\cdot\Phi(x_{k},z_{k};f(\cdot;\bm{\theta}_{k}))\cdot\nabla_{\bm{\omega}}g(z_{k};\bm{\omega}_{k})-\eta\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\bm{\omega}}g(z_{k};\bm{\omega}_{k}),
\end{aligned} \tag{3.2}$$

where $\bm{\theta}_{k},\bm{\omega}_{k}$ denote the state of the parameters at iteration $k$, $\eta>0$ is the step size, and the data samples $\{(x_{k},z_{k})\}_{k=0}^{\infty}$ are collected by independently sampling from the data distribution $\mathcal{D}$. When $f,g$ are two-layer neural networks of width $N$, we can plug in the form of $f,g$ described in (3.1). The update for the parameter of the $i$-th neuron at the $k$-th iteration can then be specified as follows,

$$\begin{aligned}
\theta_{k+1}^{i}&=\theta_{k}^{i}-\eta\alpha\epsilon\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\theta}\Phi(x_{k},z_{k};\phi(\cdot,\theta_{k}^{i}))-\eta\lambda\epsilon\cdot\frac{\delta\Psi(x_{k},z_{k};f(\cdot,\bm{\theta}_{k}))}{\delta f}\cdot\nabla_{\theta}\phi(x_{k};\theta_{k}^{i}),\\
\omega_{k+1}^{i}&=\omega_{k}^{i}+\eta\alpha\epsilon\cdot\Phi(x_{k},z_{k};f(\cdot,\bm{\theta}_{k}))\cdot\nabla_{\omega}\psi(z_{k};\omega_{k}^{i})-\eta\alpha\epsilon\cdot g(z_{k};\bm{\omega}_{k})\cdot\nabla_{\omega}\psi(z_{k};\omega_{k}^{i}),
\end{aligned} \tag{3.3}$$

where $\bm{\theta}_{k}=(\theta^{1}_{k},\theta^{2}_{k},\dots,\theta^{N}_{k})$ and $\bm{\omega}_{k}=(\omega^{1}_{k},\omega^{2}_{k},\dots,\omega^{N}_{k})$, and $\delta\Psi/\delta f$ denotes the variation of $\Psi$ with respect to $f$. Here, $\alpha$ is the neural network scaling parameter and $\epsilon=1/N$ is the step-size scale. Both $\alpha$ and $\epsilon$ show up in (3.3) due to the finite-width parameterization of two-layer neural networks described in (3.1).
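To make the structure of the per-neuron update in (3.3) concrete, the following sketch performs one simultaneous descent-ascent step for a toy instance. Everything specific here is an assumption chosen for illustration and is not taken from the paper: scalar inputs, $D=1$, neurons $\phi(x;\theta)=\tanh(\theta x)$ and $\psi(z;\omega)=\tanh(\omega z)$, the residual $\Phi(x,z;f)=f(x)-y$ for a hypothetical response $y$, and the regularizer $\Psi(x,z;f)=f(x)^{2}/2$, whose pointwise variation is $f(x)$.

```python
import math

def sgda_step(theta, omega, x, z, y, eta, alpha, lam):
    # One per-neuron SGDA update following the pattern of (3.3), for a toy
    # instance with scalar inputs and D = 1. Assumptions (not from the text):
    #   phi(x; th) = tanh(th * x),  psi(z; om) = tanh(om * z),
    #   Phi(x, z; f) = f(x) - y,    Psi(x, z; f) = f(x)**2 / 2.
    n = len(theta)
    eps = 1.0 / n  # step-size scale epsilon = 1/N
    f_x = alpha / n * sum(math.tanh(th * x) for th in theta)
    g_z = alpha / n * sum(math.tanh(om * z) for om in omega)
    residual = f_x - y   # value of Phi at the current f
    dpsi_df = f_x        # pointwise variation of Psi with respect to f

    new_theta = []
    for th in theta:
        dphi = x * (1.0 - math.tanh(th * x) ** 2)  # d/d(th) of phi(x; th)
        # Phi is affine in f here, so its gradient through one neuron is dphi.
        step = eta * alpha * eps * g_z * dphi + eta * lam * eps * dpsi_df * dphi
        new_theta.append(th - step)  # descent for the primal player

    new_omega = []
    for om in omega:
        dpsi = z * (1.0 - math.tanh(om * z) ** 2)  # d/d(om) of psi(z; om)
        step = eta * alpha * eps * (residual - g_z) * dpsi
        new_omega.append(om + step)  # ascent for the adversarial player
    return new_theta, new_omega

theta, omega = [0.3, -0.2, 0.1], [0.0, 0.5, -0.4]
theta, omega = sgda_step(theta, omega, x=1.0, z=0.7, y=0.5,
                         eta=0.1, alpha=1.0, lam=0.01)
print(theta, omega)
```

Both players are updated from the same iterate, which matches the simultaneous form of the updates. An alternating scheme would instead feed the new primal parameters into the dual step.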

For a given space $\mathcal{S}$, let $\mathcal{H}$ denote a set of functions $\mathcal{S}\rightarrow\mathbb{R}$. For a functional $F:\mathcal{H}\rightarrow\mathbb{R}$ defined over the function class $\mathcal{H}$, its variation at $f\in\mathcal{H}$ is a function $\frac{\delta F}{\delta f}:\mathcal{S}\rightarrow\mathbb{R}$ such that for any test function $h\in\mathcal{H}$,

$$\Bigl[\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F(f+\varepsilon h)\Bigr]_{\varepsilon=0}=\int_{\mathcal{S}}\frac{\delta F}{\delta f}(s)\cdot h(s)\,\mathrm{d}s. \tag{3.4}$$
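The defining identity (3.4) can be checked numerically on a simple functional. The sketch below assumes $\mathcal{S}=[0,1]$ and the example functional $F(f)=\frac{1}{2}\int_{0}^{1}f(s)^{2}\,\mathrm{d}s$, for which the variation is $\delta F/\delta f=f$. The grid size, the function $f$, and the test function $h$ are arbitrary illustrative choices.

```python
import math

def integrate(values, ds):
    # Riemann-sum quadrature on a uniform grid of midpoint values.
    return sum(values) * ds

def functional(f_vals, ds):
    # Hypothetical example functional F(f) = 0.5 * integral of f(s)^2 ds.
    return 0.5 * integrate([v * v for v in f_vals], ds)

m = 1000
ds = 1.0 / m
grid = [(j + 0.5) * ds for j in range(m)]
f_vals = [math.sin(2.0 * math.pi * s) for s in grid]
h_vals = [s * s for s in grid]

# Left-hand side of (3.4): directional derivative by central differences.
e = 1e-4
plus = functional([f + e * h for f, h in zip(f_vals, h_vals)], ds)
minus = functional([f - e * h for f, h in zip(f_vals, h_vals)], ds)
lhs = (plus - minus) / (2.0 * e)

# Right-hand side of (3.4) with the candidate variation dF/df = f.
rhs = integrate([f * h for f, h in zip(f_vals, h_vals)], ds)
print(lhs, rhs)
```

Because this $F$ is quadratic, the central difference agrees with the integral up to floating-point error. For a general functional the agreement would hold only up to terms of order $\varepsilon^{2}$.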

We initialize the parameters with $\theta_{0}^{i}\sim\mu_{0}$ and $\omega_{0}^{i}\sim\nu_{0}$, where $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$ is the standard Gaussian distribution on $\mathbb{R}^{D}$. In addition, to keep track of the evolution of the parameter distribution, we denote the empirical distributions of $\bm{\theta}$ and $\bm{\omega}$ at the $k$-th iteration by

$$\widehat{\mu}_{k}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta^{i}_{k}}(\theta),\quad\widehat{\nu}_{k}(\omega)=\frac{1}{N}\sum_{i=1}^{N}\delta_{\omega^{i}_{k}}(\omega),$$

where $\delta$ is the Dirac mass function.

Mean-Field (MF) Limit. To analyze the convergence of SGDA for solving functional conditional moment equations with neural networks, we study the mean-field limit (Mei et al., 2018, 2019) of the discrete-time dynamics described in (3.3). Here, by the mean-field limit, we refer to the infinite-width limit, i.e., $N\rightarrow\infty$ for the neural network width, together with the continuous-time limit, i.e., $t=k\epsilon$ with the step scale $\epsilon\rightarrow 0$ in (3.3). In what follows, we introduce the mean-field limit of the SGDA dynamics, that is, the infinite-width and continuous-time limit of (3.3). For $\bm{\theta}=\{\theta^{i}\}_{i=1}^{N}$ and $\bm{\omega}=\{\omega^{i}\}_{i=1}^{N}$ independently sampled from $\mu,\nu\in\mathscr{P}(\mathbb{R}^{D})$, respectively, we can write the infinite-width limit of the neural networks in (3.1) as

$$f(\cdot;\mu)=\alpha\int\phi(\cdot;\theta)\,\mu(\mathrm{d}\theta),\quad g(\cdot;\nu)=\alpha\int\psi(\cdot;\omega)\,\nu(\mathrm{d}\omega). \tag{3.5}$$

From now on, we denote by $\mu_{t}$ the distribution of $\theta_{t}^{i}$ and by $\nu_{t}$ the distribution of $\omega_{t}^{i}$ for the infinite-width and continuous-time limit of the neural networks at time $t$. For notational simplicity, we overload the notation of the objective function in (2.9) via $\mathcal{L}(\mu,\nu)=\mathcal{L}(f(\cdot;\mu),g(\cdot;\nu))$. This emphasizes the dependence of the objective $\mathcal{L}$ on $(\mu,\nu)$ when we parameterize the function pair $(f,g)$ using distributions on the parameter space. By Otto's calculus (Villani, 2008), the mean-field limit of the update direction takes the following form,

$$\begin{aligned}
v^{f}(\theta;\mu,\nu)&=-\nabla_{\theta}\frac{\delta\mathcal{L}(\mu,\nu)}{\delta\mu}(\theta)\\
&=\alpha\,\mathbb{E}_{\mathcal{D}}\biggl[-g(Z;\nu)\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\mu))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}-\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\mu))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\biggr],\\
v^{g}(\omega;\mu,\nu)&=\nabla_{\omega}\frac{\delta\mathcal{L}(\mu,\nu)}{\delta\nu}(\omega)\\
&=\alpha\,\mathbb{E}_{\mathcal{D}}\Bigl[\Phi(X,Z;f(\cdot;\mu))\cdot\nabla_{\omega}\psi(Z;\omega)-g(Z;\nu)\cdot\nabla_{\omega}\psi(Z;\omega)\Bigr].
\end{aligned} \tag{3.6}$$

Here $\langle\cdot,\cdot\rangle_{L^{2}}$ is the inner product on $L^{2}(\mathcal{X}\times\mathcal{Z})$ with respect to the Lebesgue measure. Recall that $\mathcal{D}$ is the data distribution of the random variables $(X,Z)\in\mathcal{X}\times\mathcal{Z}$. We denote by $\rho_{\mathcal{X},\mathcal{Z}}$ the density of $\mathcal{D}$ with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Z}$, and we use $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ to represent the inner product on $L^{2}(\mathcal{X}\times\mathcal{Z})$ with respect to the probability distribution $\mathcal{D}$.
That is to say, for any two functions $h_{1},h_{2}\in L^{2}(\mathcal{X}\times\mathcal{Z})$, $\langle h_{1},h_{2}\rangle_{\mathcal{D}}=\int_{\mathcal{X}\times\mathcal{Z}}h_{1}h_{2}\,\mathrm{d}\rho_{\mathcal{X},\mathcal{Z}}$.
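To make the two inner products concrete, the following minimal sketch compares $\langle\cdot,\cdot\rangle_{L^{2}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ on a one-dimensional toy problem. The domain $[0,1]$, the density $\rho(x)=2x$, the functions `h1`, `h2`, and all helper names are illustrative assumptions and do not come from the paper.

```python
def midpoint_integral(func, a=0.0, b=1.0, n=2000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))


# Hypothetical toy setup: one variable on [0, 1] with density rho(x) = 2x.
def rho(x):
    return 2.0 * x


def h1(x):
    return x


def h2(x):
    return 1.0


def inner_l2(u, v):
    """<u, v>_{L^2}: integrate u * v against the Lebesgue measure."""
    return midpoint_integral(lambda x: u(x) * v(x))


def inner_d(u, v):
    """<u, v>_D: integrate u * v against the density rho."""
    return midpoint_integral(lambda x: u(x) * v(x) * rho(x))


ip_l2 = inner_l2(h1, h2)  # exact value is 1/2
ip_d = inner_d(h1, h2)    # exact value is 2/3
# Moving the density into one argument turns <., .>_D into <., .>_{L^2}.
ip_l2_weighted = inner_l2(h1, lambda x: h2(x) * rho(x))
```

In this toy case the two inner products differ exactly by the density factor $\rho$ placed on one argument, so `ip_d` and `ip_l2_weighted` agree.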

In the sequel, we will also slightly abuse this notation and use $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ to denote the inner product on sub-spaces of $L^{2}(\mathcal{X}\times\mathcal{Z})$, with the measure being the marginals of $\mathcal{D}$ on these sub-spaces. In (3), $\delta\Phi/\delta f$ and $\delta\Psi/\delta f$ are the variations of $\Phi$ and $\Psi$ over $f$ under $\langle\cdot,\cdot\rangle_{L^{2}}$, where the test functions are chosen over the function class $\mathcal{F}$. In the same way, $\delta\mathcal{L}/\delta\mu$ and $\delta\mathcal{L}/\delta\nu$ respectively denote the variations of the objective $\mathcal{L}$ with respect to the distributions $\mu$ and $\nu$ under $\langle\cdot,\cdot\rangle_{L^{2}}$, following the definition in (3.4) with the test function chosen over $\mathscr{P}(\mathcal{X}\times\mathcal{Z})$.
We remark that one can also define the variation under $\langle\cdot,\cdot\rangle_{\mathcal{D}}$, which differs from the variation under $\langle\cdot,\cdot\rangle_{L^{2}}$ only by a multiplicative factor given by the density of the corresponding marginal of $\mathcal{D}$. Then, the mean-field limit of the SGDA update in (3) is characterized by the continuity equation, which is a system of PDEs given by,

$$
\partial_{t}\mu_{t}(\theta)=-\eta\cdot\mathrm{div}_{\theta}\bigl(\mu_{t}(\theta)\,v^{f}(\theta;\mu_{t},\nu_{t})\bigr),\qquad \partial_{t}\nu_{t}(\omega)=-\eta\cdot\mathrm{div}_{\omega}\bigl(\nu_{t}(\omega)\,v^{g}(\omega;\mu_{t},\nu_{t})\bigr),
\tag{3.7}
$$

where $\mathrm{div}_{\theta}$ and $\mathrm{div}_{\omega}$ denote the divergence with respect to $\theta$ and $\omega$, respectively. Note that the initializations $\mu_{0}$ and $\nu_{0}$ are the same as the initialization of the discrete-time dynamics in (3), i.e., $\mu_{0}=\mathcal{N}(0,I_{D})$ and $\nu_{0}=\mathcal{N}(0,I_{D})$ are taken to be the distribution of standard Gaussian random variables in $\mathbb{R}^{D}$.
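A continuity equation of the form (3.7) can be read as the law of particles that each move along the velocity field, $\mathrm{d}\theta_t/\mathrm{d}t=\eta\, v(\theta_t)$. The sketch below illustrates this particle view on a deliberately simplified example. The velocity field $v(\theta)=-\theta$ is a hypothetical stand-in that does not depend on $(\mu_t,\nu_t)$, unlike $v^f$ and $v^g$ in (3.6), and the step size, horizon, and function names are assumptions made for illustration only.

```python
import math
import random


def toy_velocity(theta):
    """Hypothetical velocity field v(theta) = -theta (not the paper's v^f)."""
    return -theta


def simulate_particles(particles, eta=1.0, dt=1e-3, steps=1000):
    """Forward-Euler integration of d(theta)/dt = eta * v(theta) per particle."""
    current = list(particles)
    for _ in range(steps):
        current = [th + dt * eta * toy_velocity(th) for th in current]
    return current


def second_moment(xs):
    """Empirical second moment of a list of scalar particles."""
    return sum(x * x for x in xs) / len(xs)


rng = random.Random(0)
initial = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # samples from N(0, 1)
final = simulate_particles(initial)  # total time T = steps * dt = 1

# For v(theta) = -theta the exact flow is theta_T = exp(-eta * T) * theta_0,
# so the second moment of the particle cloud shrinks by exp(-2 * eta * T).
ratio = second_moment(final) / second_moment(initial)
```

With these settings the ratio of second moments should be close to $e^{-2}$, up to the forward-Euler discretization error.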

4 Main Results

In this section, we introduce the main theoretical results on the stochastic gradient descent-ascent dynamics. We first present the assumptions in §4.1. Then in §4.2 we show that the SGDA dynamics converge to a mean-field limit when the network width $N$ goes to infinity and the stepsize scale $\epsilon$ goes to zero. Finally, in §4.3 we prove that the mean-field limiting dynamics converge to a globally optimal solution of the primal objective $J$ under proper assumptions. Moreover, we show that the mean-field dynamics learn a data-dependent representation that is $\mathcal{O}(\alpha^{-1})$ away from the initial representation.

4.1 Assumptions

We consider two types of assumptions in this work. The first type of assumption is about the function class in which we search for solutions to the minimax optimization problem. In this category, Assumption 4.1 and Assumption 4.2 discuss the richness and regularity of the two-layered neural network function class. The second type of assumption is about the feasible class of problems to apply our framework. In this category, Assumption 4.3 discusses several technical assumptions on the data space and the regularity/smoothness of the functionals.

We start with the discussion of the two-layered neural network function class. Consider the neuron functions $\phi$ and $\psi$ of the following form,

$$
\phi(w;\theta)=b(\beta)\cdot\sigma\bigl(\widetilde{\theta}^{\top}(w,1)\bigr),\qquad \psi(z;\omega)=b(\beta)\cdot\sigma\bigl(\widetilde{\omega}^{\top}(z,1)\bigr),
\tag{4.1}
$$

where $\theta=(\beta,\widetilde{\theta})\in\mathbb{R}\times\mathbb{R}^{1+\operatorname{dim}(\mathcal{W})}$ and $\omega=(\beta,\widetilde{\omega})\in\mathbb{R}\times\mathbb{R}^{1+\operatorname{dim}(\mathcal{Z})}$ contain the parameters in the output layer and the hidden layer, $b:\mathbb{R}\rightarrow\mathbb{R}$ is an odd re-scaling function, and $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ is the activation function. Note that such a form of activation function satisfies the condition of the universal function approximation theorem (Theorem 3.1 in Pinkus (1999)) if $\sigma$ is not a polynomial. For notational simplicity, we write $\sigma(w;\widetilde{\theta})=\sigma(\widetilde{\theta}^{\top}(w,1))$. The re-scaling function $b:\mathbb{R}\rightarrow\mathbb{R}$ is introduced to ensure that the value of the neural network is bounded. When $b(\mathbb{R})=(-B_{0},B_{0})$, the function class induced by the neural network in (3.5) is equivalent to the following class,

$$
\mathcal{F}=\Bigl\{f:\mathcal{W}\rightarrow\mathbb{R}\,\Big|\,f(w)=\int\beta^{\prime}\cdot\sigma(w;\widetilde{\theta})\,\mathrm{d}\mu(\beta^{\prime},\widetilde{\theta}),\ \mu\in\mathscr{P}_{2}\bigl((-B_{0},B_{0})\times\mathbb{R}^{d+1}\bigr)\Bigr\},
\tag{4.2}
$$

where $d=\operatorname{dim}(\mathcal{W})$. This captures a rich function class due to the universal function approximation theorem (Barron, 1993; Pinkus, 1999). We remark that we introduce the re-scaling function $b(\beta)$ in (4.1) to avoid the study of the space of probability measures over $(-B_{0},B_{0})\times\mathbb{R}^{d+1}$, which has a boundary and thus lacks regularity in the study of optimal transport. Moreover, note that a scaling hyperparameter $\alpha>0$ is introduced in the definition of the mean-field neural nets in (3.5). When $\alpha>1$, this causes an effect of overparameterization. In brief, $\alpha$ controls the error between $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ and the optimizer $(f^{*},g^{*})$ according to Theorem 4.7.
Furthermore, the overparameterization scale $\alpha$ has an influence through Lemma 4.6, which shows that the Wasserstein distance between the Gaussian initialization $(\mu_{0},\nu_{0})$ and the optimal distribution $(\mu^{*},\nu^{*})$ is upper-bounded by $\mathcal{O}(\alpha^{-1})$. Next, we impose the following regularity assumptions on the neural network functions $\phi$ and $\psi$.

Assumption 4.1 (Regularity of Neural Networks).

We assume that there exist absolute constants $B_{0}>0$, $B_{1}>0$ and $B_{2}>0$ such that

$$
\begin{aligned}
|\phi(w;\theta)|\leq B_{0},\quad \bigl\|\nabla_{\theta}\phi(w;\theta)\bigr\|\leq B_{1},\quad \bigl\|\nabla^{2}_{\theta\theta}\phi(w;\theta)\bigr\|_{F}\leq B_{2},\qquad &\forall w\in\mathcal{W},\;\theta\in\mathbb{R}^{D}, \\
|\psi(z;\omega)|\leq B_{0},\quad \bigl\|\nabla_{\omega}\psi(z;\omega)\bigr\|\leq B_{1},\quad \bigl\|\nabla^{2}_{\omega\omega}\psi(z;\omega)\bigr\|_{F}\leq B_{2},\qquad &\forall z\in\mathcal{Z},\;\omega\in\mathbb{R}^{D},
\end{aligned}
$$

where $\nabla^{2}_{\theta\theta}$ and $\nabla^{2}_{\omega\omega}$ denote the Hessians with respect to $\theta$ and $\omega$, respectively, $\|\cdot\|$ denotes the vector $2$-norm, and $\|\cdot\|_{F}$ denotes the matrix Frobenius norm. Moreover, we assume that the rescaling function $b:\mathbb{R}\rightarrow\mathbb{R}$ is odd and that its range satisfies $b(\mathbb{R})=(-B_{0},B_{0})$.

Assumption 4.1 is satisfied by a broad class of neuron functions. For example, it is satisfied when we set the activation function $\sigma(x)=\operatorname{sigmoid}(x)$ and the rescaling function $b(\beta)=\tanh(\beta)$.
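As a sanity check of this example, the sketch below evaluates the neuron $\phi(w;\theta)=\tanh(\beta)\cdot\operatorname{sigmoid}(\widetilde{\theta}^{\top}(w,1))$ and its gradient at random points and compares them with explicit candidates for $B_0$ and $B_1$. It assumes, purely for illustration, that $\mathcal{W}=[-1,1]^2$, so that $\|(w,1)\|^2\le 3$. Since $|\tanh|\le 1$, $0<\operatorname{sigmoid}<1$, and $\operatorname{sigmoid}'\le 1/4$, one may take $B_0=1$ and $B_1=\sqrt{1+3/16}$ in this toy setting. The sampling scales and function names are hypothetical, and the Hessian bound $B_2$ is not checked here.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def neuron(w, beta, theta_tilde):
    """phi(w; theta) = tanh(beta) * sigmoid(theta_tilde^T (w, 1))."""
    u = sum(t * x for t, x in zip(theta_tilde, list(w) + [1.0]))
    return math.tanh(beta) * sigmoid(u)


def neuron_grad(w, beta, theta_tilde):
    """Gradient of phi with respect to theta = (beta, theta_tilde)."""
    w1 = list(w) + [1.0]
    u = sum(t * x for t, x in zip(theta_tilde, w1))
    s = sigmoid(u)
    d_beta = (1.0 - math.tanh(beta) ** 2) * s
    d_tilde = [math.tanh(beta) * s * (1.0 - s) * x for x in w1]
    return [d_beta] + d_tilde


rng = random.Random(0)
dim_w = 2
# On the cube [-1, 1]^dim_w we have |(w, 1)|^2 <= dim_w + 1, and sigmoid' <= 1/4.
grad_bound = math.sqrt(1.0 + (dim_w + 1) / 16.0)
max_value, max_grad = 0.0, 0.0
for _ in range(5000):
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim_w)]
    beta = rng.gauss(0.0, 3.0)
    theta_tilde = [rng.gauss(0.0, 3.0) for _ in range(dim_w + 1)]
    max_value = max(max_value, abs(neuron(w, beta, theta_tilde)))
    g = neuron_grad(w, beta, theta_tilde)
    max_grad = max(max_grad, math.sqrt(sum(c * c for c in g)))
```

The observed maxima `max_value` and `max_grad` stay below $1$ and `grad_bound`, respectively. The bounds are uniform over $\theta\in\mathbb{R}^D$ only because $w$ ranges over a bounded set, which is why the compactness of the data space matters.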

We also make the following assumption regarding the realizability of the saddle point solution $(f^{*},g^{*})$ to (2.9).

Assumption 4.2 (Realizability).

We assume that the saddle point solution $(f^{*},g^{*})$ of (2.9) belongs to the function class defined in (4.2), i.e., $f^{*},g^{*}\in\mathcal{F}$.

In general, problem (2.9) may not admit a saddle point within the given neural network function class. Therefore, Assumption 4.2 is introduced to guarantee that the discussion in this paper is meaningful. By the universal function approximation theorem (Barron, 1993; Pinkus, 1999), the function class defined in (4.2) captures a rich class of functions. Therefore, such an assumption is quite general and does not limit the applicability of our results.

We impose the following assumptions on the integrability of the functionals $\Phi$ and $\Psi$ and their variations, as well as the compactness of the data spaces $\mathcal{X}$ and $\mathcal{Z}$.

Assumption 4.3 (Data regularity and Functional Integrability).

(i) For the data spaces $\mathcal{X}$ and $\mathcal{Z}$, we assume that $\mathcal{X}\times\mathcal{Z}$ is compact, in the sense that there exists a positive constant $C_{1}>0$ such that any data tuple $(x,z)\in\mathcal{X}\times\mathcal{Z}$ satisfies $\|(x,z)\|\leq C_{1}$. Moreover, the data distribution $\mathcal{D}$ admits a positive, smooth density $\rho_{\mathcal{D}}$ with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Z}$.

(ii) For the functionals $\Phi(x,z;f)$ and $\Psi(x,z;f)$, there exists a positive constant $C_{2}>0$ such that

$$
\int_{\mathcal{W}}\Big|\frac{\delta\Phi(x,z;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq C_{2},\quad \int_{\mathcal{W}}\Big|\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq C_{2},\quad \forall(x,z)\in\mathcal{X}\times\mathcal{Z}.
$$

(iii) We assume that $\int_{\mathcal{W}}\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\,\mathrm{d}w^{\prime}$, as a linear functional of $f$, is bounded above by a constant multiple of the value of $f$. That is to say, there exist $w\in\mathcal{W}$ as a part of the data tuple $(x,z)$ and a positive constant $C_{\Psi}>0$ such that

$$
\int_{\mathcal{W}}\Big|\frac{\delta\Psi(x,z;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq C_{\Psi}\cdot\big|f(w)\big|.
$$

(iv) We assume that the variations of the minimax objective $\mathcal{L}(f,g)$ with respect to $f$ and $g$ are continuous functions defined on $\mathcal{W}$ and $\mathcal{Z}$, respectively. That is to say,

$$
\frac{\delta\mathcal{L}(f,g)}{\delta f}\in\mathscr{C}(\mathcal{W}),\quad \frac{\delta\mathcal{L}(f,g)}{\delta g}\in\mathscr{C}(\mathcal{Z}).
$$

Item (i) of Assumption 4.3 restricts our scenarios to data spaces with bounded values and smooth densities for technical reasons. Items (ii) and (iii) of Assumption 4.3 are integrability conditions that we additionally require to avoid discussion of improper functionals that potentially have singularities with exploding values. Item (iv) is a smoothness condition that requires the variation of the minimax objective averaged over data to be continuous on the respective space. We also remark that a sufficient condition for item (iv) to hold is that the variations of $\Phi$ and $\Psi$ with respect to $f$, averaged under the marginal of $\mathcal{D}$ on $\mathcal{W}$, are continuous. We will also use this condition to verify item (iv) in practice. These are general and reasonable assumptions widely satisfied by various applications in machine learning, causal inference, and statistics.

4.2 Convergence of SGDA dynamics to the Mean-Field Limit

In the following proposition, we show that the empirical distributions of the parameters $\widehat{\mu}_{k}$ and $\widehat{\nu}_{k}$ converge weakly to the mean-field limit in (3.7) as the width $N$ goes to infinity and the stepsize scale $\epsilon$ goes to zero. Let $\rho_{t}(\theta,\omega)=\mu_{t}(\theta)\otimes\nu_{t}(\omega)$, where $(\mu_{t},\nu_{t})$ is the PDE solution to the continuous deterministic dynamics in (3.7), and let $\widehat{\rho}_{k}=N^{-1}\cdot\sum_{i=1}^{N}\delta_{\theta_{k}^{i}}\cdot\delta_{\omega_{k}^{i}}$ be the empirical distribution of $(\bm{\theta}_{k},\bm{\omega}_{k})$, which is the $k$-th iterate of the discrete-time stochastic dynamics in (3) with stepsize scale $\epsilon$. The following proposition proves that the PDE solution $\rho_{t}$ in (3.7) well approximates the discrete-time stochastic gradient descent-ascent dynamics in (3).

Proposition 4.4 (Convergence of SGDA to Mean-Field Limit).

Let $\{\rho_{t}\}_{t\geq 0}$ be the solution to (3.7) with $\rho_{0}=\mathcal{N}(0,I_{D})\otimes\mathcal{N}(0,I_{D})$, and let $\{\widehat{\rho}_{k}\}_{k\geq 0}$ be the solution to (3) with $\widehat{\rho}_{0}=\mathcal{N}(0,I_{D})\otimes\mathcal{N}(0,I_{D})$. Under Assumptions 4.1 and 4.3, $\widehat{\rho}_{\lfloor t/\epsilon\rfloor}$ converges weakly to $\rho_{t}$ as $\epsilon\rightarrow 0^{+}$ and $N\rightarrow\infty$. That is, for any Lipschitz continuous, bounded function $F:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$, it holds that

$$
\lim_{\epsilon\rightarrow 0^{+},\,N\rightarrow\infty}\int F(\theta,\omega)\,\mathrm{d}\widehat{\rho}_{\lfloor t/\epsilon\rfloor}(\theta,\omega)=\int F(\theta,\omega)\,\mathrm{d}\rho_{t}(\theta,\omega).
$$
Proof.

See §B for a detailed proof. ∎

The proof of Proposition 4.4 is based on the propagation of chaos (Mei et al., 2018, 2019; Araújo et al., 2019; Zhang et al., 2020; Sznitman, 1991). We defer the detailed proof of Proposition 4.4 to Appendix B. Proposition 4.4 allows us to convert the discrete-time SGDA dynamics over a finite-dimensional parameter space to its continuous-time, infinite-dimensional counterpart in Wasserstein space, in which the training is amenable to analysis since our infinitely wide neural networks $f(\cdot;\mu)$ and $g(\cdot;\nu)$ in (3.5) are linear in $\mu$ and $\nu$, respectively.
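The mode of convergence in Proposition 4.4 is stated through integrals of bounded Lipschitz test functions against the empirical measure $\widehat{\rho}_k$. The following sketch illustrates only the simplest instance of this statement, namely the initialization $t=0$, where the particles are i.i.d. Gaussian and the convergence as $N\rightarrow\infty$ reduces to the law of large numbers. It does not simulate the SGDA dynamics. The test function $F(\theta,\omega)=\cos(\theta_1)\cos(\omega_1)$, the dimension, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def bounded_lipschitz_f(theta, omega):
    """A bounded, Lipschitz function F (hypothetical choice for illustration)."""
    return math.cos(theta[0]) * math.cos(omega[0])


def empirical_integral(thetas, omegas):
    """Integral of F against rho_hat = (1/N) * sum_i delta_{(theta_i, omega_i)}."""
    n = len(thetas)
    return sum(bounded_lipschitz_f(t, o) for t, o in zip(thetas, omegas)) / n


def sample_initial_particles(n, dim, rng):
    """Draw n i.i.d. particle pairs from N(0, I_D) x N(0, I_D)."""
    thetas = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    return thetas, omegas


rng = random.Random(0)
# E[cos(Z)] = exp(-1/2) for Z ~ N(0, 1); independence gives the product exp(-1).
exact = math.exp(-1.0)
errors = {}
for n in (100, 20000):
    thetas, omegas = sample_initial_particles(n, dim=3, rng=rng)
    errors[n] = abs(empirical_integral(thetas, omegas) - exact)
```

The dictionary `errors` records the gap between the empirical integral and its limit for two particle counts. The gap is expected to shrink at the usual $N^{-1/2}$ Monte Carlo rate.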

4.3 Global Optimality and Convergence of the Mean-Field Limit

In this section, we introduce our main results that characterize the global optimality and convergence of the mean-field neural networks, parameterized by the parameter distribution $\rho_{t}=(\mu_{t},\nu_{t})$. The proof contains two steps. We first show that it is sufficient to find a stationary point of the Wasserstein gradient flow defined in (3.7) in order to solve the minimax optimization problem in (2.9). We then characterize the convergence of $\rho_{t}$ to the stationary point. Before presenting the two steps of the proof, we need to clarify the notion of stationarity for the Wasserstein gradient flow. We introduce the following definition.

Definition 4.5 (Stationary point of Wasserstein Gradient Flow).

A distribution pair $(\mu,\nu)$ is called a stationary point of the Wasserstein gradient flow (3.7) if it satisfies

$$v^{f}(\theta;\mu,\nu)=v^{g}(\omega;\mu,\nu)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}.$$

From Definition 4.5, a stationary point of the Wasserstein gradient flow (3.7) is a distribution pair $(\mu,\nu)$ at which the associated vector field $(v^{f}(\cdot;\mu,\nu),v^{g}(\cdot;\mu,\nu))$ is the zero function on the parameter space $\mathbb{R}^{D}\times\mathbb{R}^{D}$. Moreover, for the Wasserstein gradient flow following the vector field $(v^{f},v^{g})$ with initial condition $(\mu,\nu)$, the solution $(\mu_{t},\nu_{t})$ to its associated continuity equation is a constant flow, that is, $\mu_{t}=\mu$ and $\nu_{t}=\nu$ for all $t\geq 0$. We now state an important supporting lemma that characterizes the relation between stationary points of the Wasserstein gradient flow (3.7) and saddle points of (2.9).

Lemma 4.6.

Under Assumptions 4.1 and 4.2, the following two properties hold.

  • (i)

    Suppose that $(\mu^{*},\nu^{*})$ is a stationary point of the Wasserstein gradient flow as defined in Definition 4.5. Then, the corresponding function pair $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is the saddle point of the objective function $\mathcal{L}(f,g)$ defined in (2.9).

  • (ii)

    There exist a stationary distribution pair $(\mu^{*},\nu^{*})$ and a constant $\bar{D}>0$ such that

    $$W_{2}(\mu_{0},\mu^{*})\leq\alpha^{-1}\bar{D},\quad W_{2}(\nu_{0},\nu^{*})\leq\alpha^{-1}\bar{D}.$$

Lemma 4.6 demonstrates that a stationary point of the Wasserstein gradient flow in (3.7) achieves global optimality as a solution to the minimax objective (2.9). Lemma 4.6 thus allows us to bypass the hardness of solving the nonconvex-nonconcave optimization problem (2.9) of finding saddle points in the space of neural network parameters $(\bm{\theta},\bm{\omega})$ by searching for a stationary point of the Wasserstein gradient flow instead. Moreover, there exists a stationary pair that is close to the Gaussian initialization $(\mu_{0},\nu_{0})$, with Wasserstein distance upper bounded by $\mathcal{O}(\alpha^{-1})$.

Proof.

See §A.1 for a detailed proof. ∎

We are now ready to show our main results. The following theorem characterizes the global optimality and convergence of the Wasserstein gradient flow $\rho_{t}$.

Theorem 4.7 (Global Convergence to Saddle Point).

Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$, and let $(f^{*},g^{*})$ be the saddle point of the minimax objective $\mathcal{L}(f,g)$ in (2.9). Under Assumptions 4.1, 4.2, and 4.3, it holds that

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\big(X,Z;f(\cdot;\mu_{t})-f^{*}(\cdot)\big)+\bigl(g(Z;\nu_{t})-g^{*}(Z)\bigr)^{2}\Bigr]\leq\mathcal{O}(T^{-1}+\alpha^{-1}). \tag{4.3}$$
Proof.

See §A.2 for a detailed proof. ∎

Theorem 4.7 says that the optimality gap between $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ and $(f^{*},g^{*})$, quantified by the $\Psi$-induced distance and the $L^{2}$ distance respectively, decays to zero at a sublinear rate in terms of time $T$ up to an error of $\mathcal{O}(\alpha^{-1})$, where $\alpha>0$ is the scaling parameter in (3.1) and (3.5). To prove the convergence, we construct a potential $V(\mu,\nu)=\mathbb{E}_{\mathcal{D}}\bigl[\lambda\Psi\big(X,Z;f(\cdot;\mu)-f^{*}(\cdot)\big)+\bigl(g(Z;\nu)-g^{*}(Z)\bigr)^{2}\bigr]$, with $V(\mu,\nu)=0$ if and only if $(\mu,\nu)=(\mu^{*},\nu^{*})$. Such a potential characterizes the saddle point of the minimax objective. We show that the Wasserstein gradient flow decreases the potential at a sublinear rate, thus suggesting the convergence of the gradient flow to the saddle point. Moreover, varying $\alpha$ allows a trade-off between the error of order $\mathcal{O}(\alpha^{-1})$ in the optimality gap and the maximum deviation between $\rho_{t}$ and the Gaussian initialization $\rho_{0}$ over all $t$. In the proof of item (ii) of Lemma 4.6, we proved that the deviation of $\rho_{t}$ from $\rho_{0}$, quantified by the Wasserstein distance, is of order $\mathcal{O}(\alpha^{-1})$. Regarding representation learning, this suggests that SGDA induces a data-dependent representation that can deviate from the initialization by a magnitude of $\mathcal{O}(\alpha^{-1})$. Choosing a small $\alpha$ of order $\mathcal{O}(1)$ corresponds to the mean-field regime (Mei et al., 2018, 2019), which allows $\rho_{t}$ to move further away from the initialization, with the potential drawback of yielding a large error of order $\mathcal{O}(\alpha^{-1})$. On the other hand, choosing a large $\alpha$ of order $\mathcal{O}(\sqrt{N})$ corresponds to the NTK regime (Jacot et al., 2018), which causes the Wasserstein flow $\rho_{t}$ to stay close to the initial distribution $\rho_{0}$ along the trajectory, inducing a data-independent representation.

As we have commented before, an important class of regularizers $\Psi(X,Z;f)$ is the $L^{2}$ regularizer. In this scenario, the left-hand side of (4.3) should be understood as a weighted $L^{2}$ distance between the gradient flow iterate at time $t$ and the optimal solution $(f^{*},g^{*})$. As $T$ and $\alpha$ go to infinity, this distance shrinks to $0$, and thus the gradient flow converges globally, in the minimal-distance sense, to the optimal solution. Motivated by this observation, in the sequel we discuss several additional results for the case where the regularizer $\Psi(X,Z;f)$ is strongly convex, in the sense that it is bounded below by a quadratic function. We formalize this additional condition in the following assumption.

Assumption 4.8 (Strong Convexity).

We assume that the regularizer $\Psi(X,Z;f)$ is $c_{\Psi}$-strongly convex, in the sense that there exists a constant $c_{\Psi}>0$ such that for any $f\in\mathcal{F}$,

$$\Psi(x,z;f)\geq c_{\Psi}\cdot|f(w)|^{2},\quad\forall(x,z)\in\mathcal{X}\times\mathcal{Z},$$

where $w\in\mathcal{W}$ is part of the data tuple $(x,z)$.

Assumption 4.8 implies that the regularizer $\Psi(X,Z;f)$ is equivalent to a quadratic regularizer because $\Psi$ is simultaneously bounded above and below by quadratic functionals. In this case, we have the following strengthened version of Theorem 4.7:

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda c_{\Psi}\cdot\big(f(\cdot;\mu_{t})-f^{*}(\cdot)\big)^{2}+\bigl(g(Z;\nu_{t})-g^{*}(Z)\bigr)^{2}\Bigr]\leq\mathcal{O}(T^{-1}+\alpha^{-1}). \tag{4.4}$$

Equation (4.4) shows that the iterates $(f(\cdot;\mu_{t}),g(\cdot;\nu_{t}))$ converge to the saddle point solution $(f^{*},g^{*})$, in the sense that a weighted $L^{2}$ distance decays to zero at a sublinear rate up to an error of $\mathcal{O}(\alpha^{-1})$. With Assumption 4.2, the saddle point $f^{*}$ is the global optimizer of the primal functional $J(f)$ defined in (2.3). Therefore, as a direct consequence of Theorem 4.7, when the regularizer $\Psi$ is strongly convex, $f(\cdot;\mu_{t})$ converges globally to $f^{*}$ at a sublinear rate in terms of $T$ up to an error of $\mathcal{O}(\alpha^{-1})$.

Under Assumption 4.8, we can also quantify the optimality gap between $J(f_{t})$ and $J(f^{*})$ in terms of the minimal distance $\inf_{t\in[0,T]}J(f_{t})-J(f^{*})$. The following theorem characterizes the global convergence of $J(f_{t})$ to $J(f^{*})$.

Theorem 4.9 (Global Convergence to Primal Solution).

Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$, and let $f_{t}=f(\cdot;\mu_{t})$. Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

$$\inf_{t\in[0,T]}J(f_{t})-J(f^{*})\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}),$$

where $f^{*}$ is the global minimizer of the objective function defined in (2.3).

Proof.

See §A.3 for a detailed proof. ∎

Theorem 4.9 proves that, under the additional strong convexity assumption on the regularizer $\Psi(X,Z;f)$, the optimality gap of the primal objective $J(f_{t})$ defined in (2.3) decays to zero at a rate of $T^{-1/2}$ in terms of the time horizon $T$, up to an error of $\mathcal{O}(\alpha^{-1/2})$. Here we use $f^{*}$ to denote the global minimizer instead of the saddle point. However, this does not create any confusion since for each global minimizer $f^{*}$ of the primal objective (2.3), we can find $g^{*}\in\mathcal{F}$ such that $(f^{*},g^{*})$ is a saddle point of (2.9).

5 Applications

In this section, we present applications of Theorem 4.7 and Theorem 4.9 to several special cases of functional conditional moment equations, namely policy evaluation, instrumental variables regression, asset pricing, and adversarial Riesz representer estimation. In Section 2.2, we already discussed why these problems are special cases of functional conditional moment equations, so Theorem 4.7 and Theorem 4.9 can potentially be applied. We now recall the problem settings and examine the technical assumptions for these cases.

5.1 Application 1: Policy Evaluation

Let $\mathcal{D}$ denote the joint distribution of the state-action tuple $(S,A,S^{\prime})$ under policy $\pi$. In this scenario, the endogenous variable $X=S^{\prime}$ is the next state while the exogenous variable $Z=S$ is the current state. Therefore, $\mathcal{X}=\mathcal{S}$, $\mathcal{Z}=\mathcal{S}$, and $\mathcal{W}=\mathcal{S}$. We aim to estimate the value function $V:\mathcal{S}\rightarrow\mathbb{R}$, which is defined on $\mathcal{W}=\mathcal{S}$. The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

$$\Phi(s^{\prime},s;f)=r+\gamma\cdot f(s^{\prime})-f(s),\quad\Psi(s^{\prime},s;f)=f(s^{\prime})^{2}.$$

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator evaluated at the next state $s^{\prime}$. With these specific choices of the functional $\Phi$ and regularizer $\Psi$, the SGDA algorithm coincides with the Gradient Temporal Difference Learning (GTD) algorithm (Wai et al., 2020). Therefore, the application of our general framework to the problem of policy evaluation contributes to the reinforcement learning literature by providing an analysis of the neural GTD algorithm in the mean-field regime. Before presenting the theoretical results, we first verify that Assumption 4.3 and Assumption 4.8 hold.
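As an illustration of the SGDA update with these choices of $\Phi$ and $\Psi$, the following Python sketch performs one descent step on the parameters of $f$ and one ascent step on the parameters of $g$ for a single transition $(s,r,s^{\prime})$. It is a sketch under stated assumptions rather than the algorithm analyzed in the paper: the per-sample objective $g(s)\,\Phi(s^{\prime},s;f)-g(s)^{2}/2+\lambda f(s^{\prime})^{2}$, the tanh neurons, the generic step size `lr`, and the names `net`, `grad_net`, and `sgda_step` are all hypothetical, and the step-size scaling of (3.7) is not reproduced.

```python
import math

def net(s, W, alpha):
    """alpha * mean_i tanh(<w_i, s>): an assumed two-layer mean-field form."""
    return alpha * sum(math.tanh(sum(a * b for a, b in zip(w, s))) for w in W) / len(W)

def grad_net(s, W, alpha):
    """Gradient of net(s, W, alpha) with respect to each neuron w_i."""
    out = []
    for w in W:
        act = math.tanh(sum(a * b for a, b in zip(w, s)))
        out.append([alpha / len(W) * (1.0 - act * act) * sj for sj in s])
    return out

def sgda_step(theta, omega, s, r, s_next, gamma, lam, alpha, lr):
    """One descent step on theta (network f) and one ascent step on omega
    (network g) for the assumed per-sample objective
    g(s) * delta - g(s)^2 / 2 + lam * f(s')^2, delta = r + gamma f(s') - f(s)."""
    f_s, f_next = net(s, theta, alpha), net(s_next, theta, alpha)
    g_s = net(s, omega, alpha)
    delta = r + gamma * f_next - f_s  # temporal-difference residual
    df_s, df_next = grad_net(s, theta, alpha), grad_net(s_next, theta, alpha)
    dg_s = grad_net(s, omega, alpha)
    new_theta = [[w_j - lr * (g_s * (gamma * dn_j - ds_j) + 2.0 * lam * f_next * dn_j)
                  for w_j, ds_j, dn_j in zip(w, ds, dn)]
                 for w, ds, dn in zip(theta, df_s, df_next)]
    new_omega = [[w_j + lr * (delta - g_s) * dg_j for w_j, dg_j in zip(w, dg)]
                 for w, dg in zip(omega, dg_s)]
    return new_theta, new_omega

# Usage on one synthetic transition (all numbers are made up).
theta = [[0.2, -0.1], [0.4, 0.3]]
omega = [[-0.3, 0.5], [0.1, 0.2]]
theta, omega = sgda_step(theta, omega, s=[0.3, -0.2], r=1.0, s_next=[0.1, 0.5],
                         gamma=0.9, lam=0.1, alpha=2.0, lr=0.05)
```

The function returns new parameter lists instead of updating in place, so the $f$ and $g$ updates are both computed from the same pre-update parameters, which matches a simultaneous descent-ascent step.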

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, it is reasonable to assume that $\|(x,z)\|\leq 1$ since we can always rescale the state space without changing the nature of the problem. Therefore, the compactness assumption is inherently satisfied.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variations of the functionals $\Phi$ and $\Psi$,

$$\frac{\delta\Phi(s^{\prime},s;f)}{\delta f}(w^{\prime})=\gamma\delta_{s^{\prime}}(w^{\prime})-\delta_{s}(w^{\prime}),\quad\frac{\delta\Psi(s^{\prime},s;f)}{\delta f}(w^{\prime})=2f(s^{\prime})\delta_{s^{\prime}}(w^{\prime}).$$

Therefore, the desired integrability conditions hold since

$$\int_{\mathcal{W}}\Big|\frac{\delta\Phi(s^{\prime},s;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq\gamma+1,\quad\int_{\mathcal{W}}\Big|\frac{\delta\Psi(s^{\prime},s;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq 2\cdot|f(s^{\prime})|. \tag{5.1}$$

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=s^{\prime}$ and $C_{\Psi}=2$. The desired condition holds due to (5.1).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit form,

$$\begin{aligned}
\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})&=\mathbb{E}_{S|S^{\prime}}\Big[\gamma\cdot g(S)\,\big|\,S^{\prime}=w^{\prime}\Big]-g(w^{\prime})\rho_{S}(w^{\prime})+2\lambda\cdot f(w^{\prime})\rho_{S^{\prime}}(w^{\prime}),\quad\forall w^{\prime}\in\mathcal{S},\\
\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})&=\mathbb{E}_{S^{\prime}|S}\Big[r+\gamma\cdot f(S^{\prime})\,\big|\,S=z^{\prime}\Big]-f(z^{\prime})-g(z^{\prime})\rho_{S}(z^{\prime}),\quad\forall z^{\prime}\in\mathcal{S},
\end{aligned}$$

where $\rho_{S}$ and $\rho_{S^{\prime}}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the current state $S$ and the next state $S^{\prime}$, respectively. Due to item (i) of Assumption 4.3, the variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous since the densities of the conditional transitions $S^{\prime}\,|\,S$ and $S\,|\,S^{\prime}$ are both smooth and the functions $f,g$ are also continuous by construction. Therefore, item (iv) is satisfied.

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=s^{\prime}$. The desired condition holds by our choice of regularizer $\Psi(s^{\prime},s;f)=f(s^{\prime})^{2}$.

We have checked that the technical Assumption 4.3 and Assumption 4.8 hold for the case of policy evaluation. Assumption 4.3 allows us to apply Theorem 4.7. This implies the global convergence of the estimated value function to the minimizer of the primal objective (2.3) applied in this case. The convergence is quantified in a weighted $L^{2}$ distance. Additionally, Assumption 4.8 enables us to apply Theorem 4.9 and further characterize such convergence using the optimality gap between the values of the primal objective. We summarize the conclusions in the following corollary.

Corollary 5.1 (Global Convergence of Mean-field Neural Nets in Policy Evaluation).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(S^{\prime},S;f)=r+\gamma\cdot f(S^{\prime})-f(S)$ and $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(S^{\prime})^{2}]$. Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$. Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[(f(S';\mu_{t})-f^{*}(S'))^{2}\Bigr]\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$$
$$\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$$
Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of policy evaluation. As verified above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied, so the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.1 proves that in the setting of policy evaluation, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$. Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ in terms of primal objective values decays to zero at the rate of $\mathcal{O}(T^{-1/2})$, up to an error $\mathcal{O}(\alpha^{-1/2})$ caused by overparameterization. Corollary 5.1 allows us to efficiently and globally solve the policy evaluation problem using overparameterized two-layer neural networks. We also remark that in this scenario, the primal objective $J(f)$ is known as the regularized mean-squared Bellman error (MSBE) in the reinforcement learning literature. As we have commented before, in the setting of policy evaluation, applying the SGDA algorithm within neural network function classes is equivalent to applying the neural GTD algorithm.
Therefore, Corollary 5.1 states that, in the mean-field regime, the neural GTD algorithm converges globally to the minimizer at a sublinear rate up to an additional overparameterization error $\mathcal{O}(\alpha^{-1})$. The neural GTD algorithm also reduces the regularized MSBE at the rate of $\mathcal{O}(T^{-1/2})$ up to an additional overparameterization error $\mathcal{O}(\alpha^{-1/2})$. Moreover, the global convergence of mean-field neural networks also implies the global convergence of the discrete dynamics in (3) due to the proximity between the discrete and continuous dynamics, which is proved in Proposition 4.4.
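To make the SGDA update in the policy evaluation setting concrete, the following is a minimal single-sample sketch, not the algorithm as specified earlier in the paper. It assumes a per-sample objective of the form $g(S)\cdot\Phi(S',S;f)-\tfrac{1}{2}g(S)^{2}+\lambda f(S')^{2}$, a finite-width two-layer network with `tanh` activation and scaling $\alpha/N$, and a plain step size `lr`; these choices and all function names (`net`, `net_grad`, `sample_objective`, `sgda_step`) are hypothetical and introduced only for illustration.

```python
import numpy as np

def net(params, s, alpha):
    """Two-layer network (alpha / N) * sum_i tanh(params[i] . [s, 1]) (assumed form)."""
    s_aug = np.append(s, 1.0)
    return alpha * np.mean(np.tanh(params @ s_aug))

def net_grad(params, s, alpha):
    """Gradient of net(params, s, alpha) with respect to params, shape (N, d + 1)."""
    s_aug = np.append(s, 1.0)
    act = np.tanh(params @ s_aug)
    return (alpha / params.shape[0]) * np.outer(1.0 - act**2, s_aug)

def sample_objective(theta, omega, s, r, s_next, gamma, lam, alpha):
    """Assumed per-sample objective g(s) * Phi - 0.5 * g(s)^2 + lam * f(s')^2."""
    f_next = net(theta, s_next, alpha)
    phi = r + gamma * f_next - net(theta, s, alpha)  # Phi(S', S; f)
    g = net(omega, s, alpha)
    return g * phi - 0.5 * g**2 + lam * f_next**2

def sgda_step(theta, omega, s, r, s_next, gamma, lam, alpha, lr):
    """One descent step on the f-parameters theta and one ascent step on the g-parameters omega."""
    f_s, f_next = net(theta, s, alpha), net(theta, s_next, alpha)
    g = net(omega, s, alpha)
    phi = r + gamma * f_next - f_s
    grad_f_next = net_grad(theta, s_next, alpha)
    grad_theta = (g * (gamma * grad_f_next - net_grad(theta, s, alpha))
                  + 2.0 * lam * f_next * grad_f_next)
    grad_omega = (phi - g) * net_grad(omega, s, alpha)
    return theta - lr * grad_theta, omega + lr * grad_omega
```

The primal player descends while the dual player ascends on the same sampled objective, which is the descent-ascent structure the text identifies with a GTD-type update in this setting.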

5.2 Application 2: Nonparametric Instrumental Variables Regression

Let $\mathcal{D}$ denote the joint distribution of the endogenous variable $X$, the exogenous variable $Z$, and the observed outcome $Y$. In this scenario, the endogenous variable takes values in the space $\mathcal{X}$, the exogenous variable takes values in the space $\mathcal{Z}$, and $\mathcal{W}=\mathcal{X}$. We aim to estimate the model function $f_{0}:\mathcal{W}=\mathcal{X}\rightarrow\mathbb{R}$. The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

$$\Phi(x,z;f)=y-f(x),\quad\Psi(x,z;f)=f(x)^{2}.$$

Here, the regularizer is an $L^{2}$ regularizer that penalizes the squared value of the estimator of the model function evaluated at the endogenous variable $x$. We examine Assumption 4.3 and Assumption 4.8 in order to apply the results from Section 4.3.

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, the NPIV problem with a compact data space captures a large class of important applications, so the scenarios considered remain general under this assumption.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variations of the functionals $\Phi$ and $\Psi$,

$$\frac{\delta\Phi(x,z;f)}{\delta f}(w')=-\delta_{x}(w'),\quad\frac{\delta\Psi(x,z;f)}{\delta f}(w')=2f(x)\delta_{x}(w').$$

Therefore, the desired integrability conditions hold since

$$\int_{\mathcal{W}}\Big|\frac{\delta\Phi(x,z;f)}{\delta f}(w')\Big|\mathrm{d}w'\leq 1,\quad\int_{\mathcal{W}}\Big|\frac{\delta\Psi(x,z;f)}{\delta f}(w')\Big|\mathrm{d}w'\leq 2\cdot|f(x)|.\tag{5.2}$$
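For readers who want the intermediate step, the variations and the bounds in (5.2) can be recovered from a first-order perturbation argument. The sketch below uses a test function $h$ and a small scalar $\epsilon$, both introduced here only for illustration, and treats $\delta_{x}$ as a unit point mass at $x\in\mathcal{W}$.

```latex
\begin{align*}
\Phi(x,z;f+\epsilon h)-\Phi(x,z;f)
  &= -\epsilon\,h(x)
   = \epsilon\int_{\mathcal{W}}\bigl(-\delta_{x}(w')\bigr)\,h(w')\,\mathrm{d}w', \\
\Psi(x,z;f+\epsilon h)-\Psi(x,z;f)
  &= 2\epsilon\,f(x)\,h(x)+\epsilon^{2}h(x)^{2}
   = \epsilon\int_{\mathcal{W}}2f(x)\,\delta_{x}(w')\,h(w')\,\mathrm{d}w'+\mathcal{O}(\epsilon^{2}), \\
\int_{\mathcal{W}}\bigl|\delta_{x}(w')\bigr|\,\mathrm{d}w' &= 1,
\qquad
\int_{\mathcal{W}}\bigl|2f(x)\,\delta_{x}(w')\bigr|\,\mathrm{d}w' = 2\,|f(x)|.
\end{align*}
```

Reading off the coefficient of $h(w')$ in the first-order terms gives the two variations, and the last line gives the two integrability bounds.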

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=x$, $C_{\Psi}=2$. The desired condition holds due to (5.2).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

$$\frac{\delta\mathcal{L}(f,g)}{\delta f}(w')=\mathbb{E}_{Z|X}\Big[-g(Z)\,\big|\,X=w'\Big]+2\lambda\cdot f(w')\rho_{X}(w'),\quad\forall w'\in\mathcal{X},$$
$$\frac{\delta\mathcal{L}(f,g)}{\delta g}(z')=\mathbb{E}_{X|Z}\Big[Y-f(X)\,\big|\,Z=z'\Big]-g(z')\rho_{Z}(z'),\quad\forall z'\in\mathcal{Z},$$

where $\rho_{X}$ and $\rho_{Z}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the endogenous variable $X$ and the exogenous variable $Z$, respectively. By item (i) of Assumption 4.3, the variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional distributions $Z\,|\,X$ and $X\,|\,Z$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.
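As a hedged illustration of how these two variations drive descent-ascent, consider a tabular analogue in which $X$ and $Z$ take finitely many values, so that $f$ and $g$ are vectors and the variations become ordinary gradients weighted by empirical frequencies. The sketch assumes the objective has the form $\mathcal{L}(f,g)=\mathbb{E}[g(Z)(Y-f(X))-\tfrac{1}{2}g(Z)^{2}+\lambda f(X)^{2}]$, which is inferred from the signs of the terms displayed above rather than restated from its definition; the toy data, step size, and function names are hypothetical.

```python
import numpy as np

def objective(f, g, x_idx, z_idx, y, lam):
    """Empirical L(f, g) = mean[g(Z) (Y - f(X)) - 0.5 g(Z)^2 + lam f(X)^2] (assumed form)."""
    fx, gz = f[x_idx], g[z_idx]
    return np.mean(gz * (y - fx) - 0.5 * gz**2 + lam * fx**2)

def gradients(f, g, x_idx, z_idx, y, lam):
    """Gradients of the empirical objective with respect to the tables f and g."""
    fx, gz = f[x_idx], g[z_idx]
    n = len(y)
    grad_f = np.zeros_like(f)
    grad_g = np.zeros_like(g)
    # Each sample contributes to the entry of f at its X value and of g at its Z value.
    np.add.at(grad_f, x_idx, (-gz + 2.0 * lam * fx) / n)
    np.add.at(grad_g, z_idx, (y - fx - gz) / n)
    return grad_f, grad_g

def gda(f, g, x_idx, z_idx, y, lam, lr, steps):
    """Simultaneous gradient descent on f and gradient ascent on g."""
    for _ in range(steps):
        grad_f, grad_g = gradients(f, g, x_idx, z_idx, y, lam)
        f, g = f - lr * grad_f, g + lr * grad_g
    return f, g

# Hypothetical toy data: X and Z each take three values, and X depends on Z.
n = 60
i = np.arange(n)
z_idx = i % 3
x_idx = (z_idx + (i // 3) % 2) % 3
y = np.sin(x_idx) + 0.1 * np.cos(i)
f_hat, g_hat = gda(np.zeros(3), np.zeros(3), x_idx, z_idx, y, lam=0.5, lr=0.5, steps=500)
```

Because this toy objective is strongly convex in `f` and strongly concave in `g`, simultaneous updates with a small enough step size settle at a point where both gradients vanish. This is a finite-dimensional stand-in for a stationary point of the minimax objective, not a substitute for the mean-field analysis.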

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=x$. The desired condition holds by our choice of regularizer $\Psi(x,z;f)=f(x)^{2}$.

We have verified that Assumptions 4.3 and 4.8 hold for nonparametric instrumental variables regression. Since Assumption 4.3 holds, Theorem 4.7 applies, which implies the global convergence of the estimated model function to the minimizer of the primal objective, with convergence quantified in a weighted $L^{2}$ distance. The quadratic regularizer ensures that Assumption 4.8 holds, which further enables us to apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.2 (Global Convergence of Mean-field Neural Nets in NPIV).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(X,Z;f)=Y-f(X)$ and $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(X)^{2}]$. Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$. Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[(f(X;\mu_{t})-f^{*}(X))^{2}\Bigr]\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$$
$$\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$$
Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of NPIV. As verified above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied, so the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.2 proves that in the setting of NPIV, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$. Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$, up to an error $\mathcal{O}(\alpha^{-1/2})$. Corollary 5.2 allows us to solve the NPIV problem globally using overparameterized two-layer neural networks. We also remark that when the true model function is linear in the input, we recover instrumental variables (IV) regression as an important special case of NPIV. Therefore, Corollary 5.2 also implies that IV regression can be solved globally and efficiently using overparameterized two-layer neural networks.
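To see how the linear special case connects to the classical IV estimator, the following worked computation may help. It is a sketch under assumptions that go beyond the text: scalar $X$ and $Z$, linear classes $f(x)=\beta x$ and $g(z)=\omega z$ for both players (the paper instead uses neural network classes), no regularization ($\lambda=0$), $\mathbb{E}[ZX]\neq 0$, and an objective of the form $\mathcal{L}(f,g)=\mathbb{E}[g(Z)(Y-f(X))]-\tfrac{1}{2}\mathbb{E}[g(Z)^{2}]$.

```latex
\begin{align*}
\mathcal{L}(\beta,\omega)
  &= \omega\,\mathbb{E}\bigl[Z(Y-\beta X)\bigr]-\tfrac{1}{2}\,\omega^{2}\,\mathbb{E}[Z^{2}], \\
\omega^{*}(\beta)
  &= \frac{\mathbb{E}[Z(Y-\beta X)]}{\mathbb{E}[Z^{2}]},
\qquad
\max_{\omega}\mathcal{L}(\beta,\omega)
   = \frac{\bigl(\mathbb{E}[Z(Y-\beta X)]\bigr)^{2}}{2\,\mathbb{E}[Z^{2}]}, \\
\beta^{*}
  &= \frac{\mathbb{E}[ZY]}{\mathbb{E}[ZX]}.
\end{align*}
```

The minimizer $\beta^{*}$ sets the moment $\mathbb{E}[Z(Y-\beta X)]$ to zero, which is the population form of the classical IV estimator.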

5.3 Application 3: Asset Pricing

Let $\mathcal{D}$ denote the joint distribution of the growth-return tuple $(c_{t},\widetilde{r}_{t+1},c_{t+1})$. In this scenario, the exogenous variable $Z=c_{t}$ is the consumption growth at the current time $t$, and the endogenous variable $X=c_{t+1}$ is the consumption growth at the next time $t+1$. Therefore, $\mathcal{X}=\mathcal{Z}=\mathcal{C}$ and $\mathcal{W}=\mathcal{C}$, where $\mathcal{C}$ is the space of consumption growth and is a compact subset of $\mathbb{R}$. Here, we consider the scenario where the modified return $\widetilde{r}_{t+1}$ is also bounded for all $t\geq 0$, i.e., $\|\widetilde{r}_{t+1}\|\leq R$ for some $R>0$. We aim to estimate the function $f_{0}:\mathcal{W}=\mathcal{C}\to\mathbb{R}$. The functional $\Phi$ and regularizer $\Psi$ adopted in this case are

$$\Phi(c_{t+1},c_{t};f)=\widetilde{r}_{t+1}\cdot f(c_{t+1})-f(c_{t}),\quad\Psi(c_{t+1},c_{t};f)=f(c_{t+1})^{2}.$$

Here, the regularizer is an $L^{2}$ regularizer that penalizes the squared value of the estimator evaluated at the next-period consumption growth $c_{t+1}$. Before presenting the theoretical results, we first verify that Assumption 4.3 and Assumption 4.8 hold.
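As a small illustration of how a sampled growth-return tuple enters the two functionals, the sketch below evaluates $\Phi$ and $\Psi$ per sample for a candidate function. The helper name `asset_pricing_terms`, the candidate $f(c)=c$, and the numbers are hypothetical and are not taken from any data set.

```python
import numpy as np

def asset_pricing_terms(f, c_now, c_next, r_mod):
    """Return per-sample Phi = r_mod * f(c_next) - f(c_now) and Psi = f(c_next)^2."""
    f_next = f(c_next)
    phi = r_mod * f_next - f(c_now)
    psi = f_next**2
    return phi, psi

# Hypothetical samples of (c_t, modified return, c_{t+1}).
c_now = np.array([1.0, 2.0])
c_next = np.array([2.0, 3.0])
r_mod = np.array([0.5, 2.0])
phi, psi = asset_pricing_terms(lambda c: c, c_now, c_next, r_mod)
```

A sample with `phi` equal to zero satisfies the pricing relation exactly for the candidate function, while `psi` is the quantity penalized by the regularizer.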

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, since we assume that the space of consumption growth $\mathcal{C}$ is a compact subset of $\mathbb{R}$, there exists $C_{1}>0$ such that $\|(c_{t+1},c_{t})\|\leq C_{1}$ for all $t\geq 0$. Moreover, it is reasonable to assume that the consumption growth is bounded, since the data often fluctuate within a bounded range in practice.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variations of the functionals $\Phi$ and $\Psi$,

$$\frac{\delta\Phi(c_{t+1},c_{t};f)}{\delta f}(w')=\widetilde{r}_{t+1}\cdot\delta_{c_{t+1}}(w')-\delta_{c_{t}}(w'),\quad\frac{\delta\Psi(c_{t+1},c_{t};f)}{\delta f}(w')=2f(c_{t+1})\cdot\delta_{c_{t+1}}(w').$$

Therefore, the desired integrability condition holds since,

$$\int_{\mathcal{W}}\Big|\frac{\delta\Phi(c_{t+1},c_{t};f)}{\delta f}(w')\Big|\mathrm{d}w'\leq R+1,\quad\int_{\mathcal{W}}\Big|\frac{\delta\Psi(c_{t+1},c_{t};f)}{\delta f}(w')\Big|\mathrm{d}w'\leq 2\cdot|f(c_{t+1})|.\tag{5.3}$$
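The constant $R+1$ in (5.3) follows from the triangle inequality together with the bound on the modified return. A short sketch of the step, treating each $\delta$ as a unit point mass on $\mathcal{W}$:

```latex
\begin{align*}
\int_{\mathcal{W}}\Big|\widetilde{r}_{t+1}\,\delta_{c_{t+1}}(w')-\delta_{c_{t}}(w')\Big|\,\mathrm{d}w'
  &\leq |\widetilde{r}_{t+1}|\int_{\mathcal{W}}\delta_{c_{t+1}}(w')\,\mathrm{d}w'
      +\int_{\mathcal{W}}\delta_{c_{t}}(w')\,\mathrm{d}w' \\
  &= |\widetilde{r}_{t+1}|+1\leq R+1.
\end{align*}
```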

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=\widetilde{c}$, $C_{\Psi}=2$. The desired property holds due to (5.3).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit forms,

$$\frac{\delta\mathcal{L}(f,g)}{\delta f}(w')=\mathbb{E}_{c_{t}|c_{t+1}}\Big[\widetilde{r}_{t+1}\cdot g(c_{t})\,\big|\,\widetilde{c}_{t}=w'\Big]-g(w')\rho_{c_{t}}(w')+2\lambda\cdot f(w')\rho_{c_{t+1}}(w'),\quad\forall w'\in\mathcal{C},$$
$$\frac{\delta\mathcal{L}(f,g)}{\delta g}(z')=\mathbb{E}_{c_{t+1}|c_{t}}\Big[\widetilde{r}_{t+1}\cdot f(c_{t+1})\,\big|\,c_{t}=z'\Big]-f(z')-g(z')\rho_{c_{t}}(z'),\quad\forall z'\in\mathcal{C},$$

where $\rho_{c_{t}}$ and $\rho_{c_{t+1}}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the current consumption growth $c_{t}$ and the next-period consumption growth $c_{t+1}$, respectively. The variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional transitions $c_{t+1}\,|\,c_{t}$ and $c_{t}\,|\,c_{t+1}$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.

Verify Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=c_{t+1}$. The desired condition holds by our choice of regularizer $\Psi(c_{t+1},c_{t};f)=f(c_{t+1})^{2}$.

We have verified that Assumptions 4.3 and 4.8 hold for asset pricing with the CCAPM. Since Assumption 4.3 holds, Theorem 4.7 applies, which implies the global convergence of the estimated function to the minimizer of the primal objective, with convergence quantified in a weighted $L^{2}$ distance. Since Assumption 4.8 holds, we can also apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.3 (Global Convergence of Mean-field Neural Nets in Asset Pricing).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(c_{t+1},c_{t};f)=\widetilde{r}_{t+1}\cdot f(c_{t+1})-f(c_{t})$ and $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(c_{t+1})^{2}]$. Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$. Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Big[(f(c_{t+1};\mu_{t})-f^{*}(c_{t+1}))^{2}\Big]\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$$
$$\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$$
Proof.

We apply Theorem 4.7 and Theorem 4.9 to the setting of asset pricing. As we have examined above, Assumptions 4.1, 4.2, 4.3, and 4.8 are all satisfied. Thus, the desired results follow directly from Theorem 4.7 and Theorem 4.9. ∎

Corollary 5.3 proves that in the setting of asset pricing, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$. Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$, up to an error of order $\mathcal{O}(\alpha^{-1/2})$. Corollary 5.3 allows us to solve the CCAPM model globally by estimating the nonparametric structural demand function with overparameterized two-layer neural networks. Since the return on investment is linked to the marginal utility of consumption through the CCAPM equation, we can price assets fairly by accounting for consumption risk and using the marginal utility information.
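To make the two rates concrete, the following back-of-the-envelope calculation may help. It is only a sketch: it ignores the constants hidden in the $\mathcal{O}(\cdot)$ notation, and the tolerance $\varepsilon\in(0,1)$ is a symbol introduced here purely for illustration.

$$\begin{aligned}
T^{-1}+\alpha^{-1}\leq\varepsilon &\quad\text{holds if}\quad T\geq 2\varepsilon^{-1}\ \text{and}\ \alpha\geq 2\varepsilon^{-1},\\
T^{-1/2}+\alpha^{-1/2}\leq\varepsilon &\quad\text{holds if}\quad T\geq 4\varepsilon^{-2}\ \text{and}\ \alpha\geq 4\varepsilon^{-2}.
\end{aligned}$$

In this reading, reaching the same tolerance in the optimality gap asks for a quadratically longer time horizon and a quadratically larger scaling parameter than reaching it in the weighted $L^{2}$ distance.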

5.4 Application 4: Adversarial Riesz Representer Estimation

Let $\mathcal{D}$ denote the joint distribution of the endogenous variable $X$ and the random vector $V$. In this scenario, the exogenous variable $Z$ coincides with the endogenous variable $X$, so the problem is essentially unconditional. The endogenous variable takes values in the space $\mathcal{X}$, the exogenous variable takes values in $\mathcal{Z}=\mathcal{X}$, and $\mathcal{W}=\mathcal{X}$. We aim to estimate the Riesz representer $f_{0}$, which is a function from $\mathcal{W}=\mathcal{X}$ to $\mathbb{R}$. The functional $\Phi$ and the regularizer $\Psi$ adopted in this case are

$$\Phi(x,x;f)=f_{0}(x)-f(x),\quad\Psi(x,x;f)=f(x)^{2}.$$

Here, the regularizer we adopt is an $L^{2}$ regularizer that penalizes the squared value of the estimator of the Riesz representer evaluated at $x$. We examine Assumption 4.3 and Assumption 4.8 in order to apply the results from Section 4.3.

Verify item (i) of Assumption 4.3. For item (i) of Assumption 4.3, we restrict our attention to estimating Riesz representers of functionals defined on a compact space. In practice, this restriction is mild, since a data distribution on an unbounded space with exponentially decaying tails is often treated as a distribution on a compact space.

Verify item (ii) of Assumption 4.3. For item (ii) of Assumption 4.3, we first compute the variations of the functionals $\Phi$ and $\Psi$,

$$\frac{\delta\Phi(x,x;f)}{\delta f}(w^{\prime})=-\delta_{x}(w^{\prime}),\quad\frac{\delta\Psi(x,x;f)}{\delta f}(w^{\prime})=2f(x)\delta_{x}(w^{\prime}).$$

Therefore, the desired integrability conditions hold since

$$\int_{\mathcal{W}}\Big|\frac{\delta\Phi(x,x;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq 1,\quad\int_{\mathcal{W}}\Big|\frac{\delta\Psi(x,x;f)}{\delta f}(w^{\prime})\Big|\,\mathrm{d}w^{\prime}\leq 2\cdot|f(x)|.\tag{5.4}$$
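As a brief illustration of why (5.4) holds, assume the usual convention that the Dirac delta $\delta_{x}$ integrates to one over $\mathcal{W}$ whenever $x\in\mathcal{W}$. Under that convention the two integrals can be evaluated directly, and the bounds in (5.4) are attained with equality:

$$\begin{aligned}
\int_{\mathcal{W}}\big|-\delta_{x}(w^{\prime})\big|\,\mathrm{d}w^{\prime} &= \int_{\mathcal{W}}\delta_{x}(w^{\prime})\,\mathrm{d}w^{\prime} = 1,\\
\int_{\mathcal{W}}\big|2f(x)\,\delta_{x}(w^{\prime})\big|\,\mathrm{d}w^{\prime} &= 2\,|f(x)|\int_{\mathcal{W}}\delta_{x}(w^{\prime})\,\mathrm{d}w^{\prime} = 2\,|f(x)|.
\end{aligned}$$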

Verify item (iii) of Assumption 4.3. For item (iii) of Assumption 4.3, we choose $w=x$ and $C_{\Psi}=2$. The desired condition holds due to (5.4).

Verify item (iv) of Assumption 4.3. For item (iv) of Assumption 4.3, we first compute the variations of $\mathcal{L}(f,g)$ in explicit form,

$$\frac{\delta\mathcal{L}(f,g)}{\delta f}(w^{\prime})=\mathbb{E}_{Z|X}\Big[-g(Z)\,\big|\,X=w^{\prime}\Big]+2\lambda\cdot f(w^{\prime})\rho_{X}(w^{\prime}),\quad\forall w^{\prime}\in\mathcal{X},$$
$$\frac{\delta\mathcal{L}(f,g)}{\delta g}(z^{\prime})=\mathbb{E}_{X|Z}\Big[f_{0}(X)-f(X)\,\big|\,Z=z^{\prime}\Big]-g(z^{\prime})\rho_{Z}(z^{\prime}),\quad\forall z^{\prime}\in\mathcal{Z},$$

where $\rho_{X}$ and $\rho_{Z}$ denote the densities of the marginal distributions of $\mathcal{D}$ with respect to the endogenous variable $X$ and the exogenous variable $Z$, respectively. By item (i) of Assumption 4.3, the variations of $\mathcal{L}$ with respect to $f$ and $g$ are both continuous, since the densities of the conditional distributions $Z\,|\,X$ and $X\,|\,Z$ are both smooth and the functions $f,g$ are continuous by construction. Therefore, item (iv) is satisfied.

Assumption 4.8. For Assumption 4.8, we choose $c_{\Psi}=1$ and $w=x$. The desired condition holds by our choice of regularizer $\Psi(x,x;f)=f(x)^{2}$.

We have verified that Assumption 4.3 and Assumption 4.8 hold for adversarial Riesz representer estimation. Since Assumption 4.3 holds, Theorem 4.7 applies, which implies the global convergence of the estimated Riesz representer to the minimizer of the primal objective. The convergence is quantified in a weighted $L^{2}$ distance. The choice of a quadratic regularizer ensures that Assumption 4.8 holds, which further enables us to apply Theorem 4.9 and characterize the convergence in terms of the primal objective value. We summarize the conclusions in the following corollary.

Corollary 5.4 (Global Convergence of Mean-field Neural Nets in Adversarial Riesz Representer Estimation).

Let $f^{*}$ be the minimizer of the primal objective $J(f)$ defined in (2.8) with $\Phi(x,x;f)=f_{0}(x)-f(x)$ and $\mathcal{R}(f)=\mathbb{E}_{\mathcal{D}}[f(x)^{2}]$. Let $(\mu_{t},\nu_{t})$ be the solution to the Wasserstein gradient flow (3.7) at time $t$ with $\eta=\alpha^{-2}$ and initial condition $\mu_{0}=\nu_{0}=\mathcal{N}(0,I_{D})$. Under Assumptions 4.1, 4.2, 4.3, and 4.8, it holds that

$$\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[(f(X;\mu_{t})-f^{*}(X))^{2}\Bigr]\leq\mathcal{O}(T^{-1}+\alpha^{-1}),$$
$$\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))\leq\mathcal{O}(T^{-1/2}+\alpha^{-1/2}).$$

Corollary 5.4 proves that in the setting of adversarial Riesz representer estimation, the $L^{2}$ distance between the mean-field neural network $f(\cdot;\mu_{t})$ at time $t$ and the global minimizer $f^{*}$ decays to zero at a sublinear rate, up to an error of order $\mathcal{O}(\alpha^{-1})$. Moreover, the optimality gap $\inf_{t\in[0,T]}J(f(\cdot;\mu_{t}))-J(f^{*}(\cdot))$ decays to zero at the rate of $\mathcal{O}(T^{-1/2})$, up to an error of order $\mathcal{O}(\alpha^{-1/2})$. Corollary 5.4 allows us to estimate the Riesz representer of a given functional using overparameterized two-layer neural networks.
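The descent-ascent structure behind this setting can be illustrated with a deliberately simplified numerical sketch. It makes several assumptions that are not part of the analysis above: the objective is taken to be $\mathcal{L}(f,g)=\mathbb{E}_{\mathcal{D}}[g(X)(f_{0}(X)-f(X))-\tfrac{1}{2}g(X)^{2}+\lambda f(X)^{2}]$ (a form consistent with the variations computed for item (iv), but not restated in this section), the neural networks are replaced by free function values on a uniform grid, the Wasserstein gradient flow is replaced by plain simultaneous gradient descent-ascent, and the target `f0`, the helper `gda_riesz_toy`, and all numerical values are hypothetical.

```python
import numpy as np


def gda_riesz_toy(f0, lam=0.5, step=0.1, n_steps=2000):
    """Simultaneous gradient descent-ascent on grid values (toy sketch).

    Assumed objective, with uniform weights over the grid points:
        L(f, g) = mean(g * (f0 - f) - 0.5 * g**2 + lam * f**2)
    The gradients below are pointwise and omit the constant 1/n weight.
    """
    f = np.zeros_like(f0)  # minimizing player
    g = np.zeros_like(f0)  # maximizing player
    for _ in range(n_steps):
        grad_f = -g + 2.0 * lam * f  # derivative of L in f (pointwise)
        grad_g = (f0 - f) - g        # derivative of L in g (pointwise)
        # Descent step in f and ascent step in g, taken simultaneously.
        f, g = f - step * grad_f, g + step * grad_g
    return f, g


# Hypothetical target on a grid over a compact interval.
x = np.linspace(-1.0, 1.0, 21)
f0 = np.sin(np.pi * x)
lam = 0.5

f_hat, g_hat = gda_riesz_toy(f0, lam=lam)

# Under the assumed objective, maximizing over g gives g = f0 - f, so
# J(f) = 0.5 * mean((f0 - f)**2) + lam * mean(f**2), minimized at:
f_star = f0 / (1.0 + 2.0 * lam)

print("max |f_hat - f_star| =", float(np.max(np.abs(f_hat - f_star))))
```

In this toy problem the iterates approach the closed-form minimizer $f^{*}=f_{0}/(1+2\lambda)$ of the assumed primal objective, and the dual variable approaches $f_{0}-f^{*}$. The sketch says nothing about the rates in Corollary 5.4, which concern the mean-field limit of two-layer networks.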

6 Conclusion

In this paper, we focus on the minimax optimization problem derived from solving functional conditional moment equations using overparameterized two-layer neural networks. For such a problem, we first prove that the stochastic gradient descent-ascent algorithm converges to a mean-field limit as the stepsize goes to zero and the network width goes to infinity. In this mean-field limit, the optimization dynamics is characterized by a Wasserstein gradient flow in the space of probability distributions. We further establish the global convergence of the Wasserstein gradient flow, and prove that the feature representation induced by the neural networks is allowed to move by a considerable distance from the initial value. We then apply our general results to policy evaluation with high dimensional state space, nonparametric instrumental variables regression with high dimensional endogenous and exogenous variables, asset pricing with a nonparametric structural demand function, and general Riesz representer estimation. Our analysis opens avenues for studying functional minimax optimization problems with more complicated objectives, such as nonlinear functional conditional moment equations. We leave the study of the convergence properties of the algorithm in such a general setting to future research. This setting includes nonparametric quantile instrumental variables regression as a leading and important application.

References

  • Ai and Chen (2003) Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71 1795–1843.
  • Alkousa et al. (2019) Alkousa, M., Dvinskikh, D., Stonyakin, F., Gasnikov, A. and Kovalev, D. (2019). Accelerated methods for composite non-bilinear saddle point problem. arXiv preprint arXiv:1906.03620.
  • Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y. and Liang, Y. (2019a). Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32.
  • Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y. and Song, Z. (2019b). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning. PMLR.
  • Ambrosio and Gigli (2013) Ambrosio, L. and Gigli, N. (2013). A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks. Springer, 1–155.
  • Ambrosio et al. (2008) Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient flows: In metric spaces and in the space of probability measures. Springer.
  • Araújo et al. (2019) Araújo, D., Oliveira, R. I. and Yukimura, D. (2019). A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193.
  • Arora et al. (2019a) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R. and Wang, R. (2019a). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
  • Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019b). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
  • Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
  • Ben-Tal et al. (2009) Ben-Tal, A., El Ghaoui, L. and Nemirovski, A. (2009). Robust optimization, vol. 28. Princeton university press.
  • Bennett et al. (2019) Bennett, A., Kallus, N. and Schnabel, T. (2019). Deep generalized method of moments for instrumental variable analysis. Advances in neural information processing systems, 32.
  • Blundell et al. (2007) Blundell, R., Chen, X. and Kristensen, D. (2007). Semi-nonparametric iv estimation of shape-invariant engel curves. Econometrica, 75 1613–1669.
  • Cai et al. (2019) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
  • Chen et al. (2024) Chen, L., Pelger, M. and Zhu, J. (2024). Deep learning in asset pricing. Management Science, 70 714–750.
  • Chen et al. (2014) Chen, X., Chernozhukov, V., Lee, S. and Newey, W. K. (2014). Local identification of nonparametric and semiparametric models. Econometrica, 82 785–809.
  • Chen and Christensen (2018) Chen, X. and Christensen, T. M. (2018). Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric iv regression. Quantitative Economics, 9 39–84.
  • Chen and Ludvigson (2009) Chen, X. and Ludvigson, S. C. (2009). Land of addicts? an empirical investigation of habit-based asset pricing models. Journal of Applied Econometrics, 24 1057–1093.
  • Chen and Pouzo (2012) Chen, X. and Pouzo, D. (2012). Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80 277–321.
  • Chen and Qi (2022) Chen, X. and Qi, Z. (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. In International Conference on Machine Learning. PMLR.
  • Chen et al. (2020a) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020a). A generalized neural tangent kernel analysis for two-layer neural networks. Advances in Neural Information Processing Systems, 33 13363–13373.
  • Chen et al. (2020b) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020b). Mean-field analysis of two-layer neural networks: Non-asymptotic rates and generalization bounds. arXiv preprint arXiv:2002.04026.
  • Chen et al. (2019) Chen, Z., Cao, Y., Zou, D. and Gu, Q. (2019). How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360.
  • Chernozhukov et al. (2020) Chernozhukov, V., Newey, W., Singh, R. and Syrgkanis, V. (2020). Adversarial estimation of riesz representers. arXiv preprint arXiv:2101.00009.
  • Chizat (2022) Chizat, L. (2022). Mean-field langevin dynamics: Exponential convergence and annealing. arXiv preprint arXiv:2202.01009.
  • Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems.
  • Diakonikolas et al. (2021) Diakonikolas, J., Daskalakis, C. and Jordan, M. I. (2021). Efficient methods for structured nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Dikkala et al. (2020) Dikkala, N., Lewis, G., Mackey, L. and Syrgkanis, V. (2020). Minimax estimation of conditional moment models. Advances in Neural Information Processing Systems, 33 12248–12262.
  • Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In International conference on machine learning. PMLR.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
  • Duan et al. (2020) Duan, Y., Jia, Z. and Wang, M. (2020). Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning. PMLR.
  • Duan et al. (2021) Duan, Y., Jin, C. and Li, Z. (2021). Risk bounds and rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Fang et al. (2019) Fang, C., Dong, H. and Zhang, T. (2019). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
  • Fang et al. (2021a) Fang, C., Dong, H. and Zhang, T. (2021a). Mathematical models of overparameterized neural networks. Proceedings of the IEEE, 109 683–703.
  • Fang et al. (2021b) Fang, C., Lee, J., Yang, P. and Zhang, T. (2021b). Modeling from features: a mean-field framework for over-parameterized deep neural networks. In Conference on learning theory. PMLR.
  • Frei and Gu (2021) Frei, S. and Gu, Q. (2021). Proxy convexity: A unified framework for the analysis of neural networks trained by gradient descent. Advances in Neural Information Processing Systems, 34 7937–7949.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M. and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The journal of machine learning research, 17 2096–2030.
  • Goodfellow et al. (2020) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63 139–144.
  • Grimmer et al. (2022) Grimmer, B., Lu, H., Worah, P. and Mirrokni, V. (2022). Limiting behaviors of nonconvex-nonconcave minimax optimization via continuous-time systems. In International Conference on Algorithmic Learning Theory. PMLR.
  • Grimmer et al. (2023) Grimmer, B., Lu, H., Worah, P. and Mirrokni, V. (2023). The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201 373–407.
  • Hajizadeh et al. (2024) Hajizadeh, S., Lu, H. and Grimmer, B. (2024). On the linear convergence of extragradient methods for nonconvex–nonconcave minimax problems. INFORMS Journal on Optimization, 6 19–31.
  • Han et al. (2024) Han, Y., Xie, G. and Zhang, Z. (2024). Lower complexity bounds of finite-sum optimization problems: The results and construction. Journal of Machine Learning Research, 25 1–86.
  • Holte (2009) Holte, J. M. (2009). Discrete Gronwall lemma and applications. In MAA-NCS meeting at the University of North Dakota, vol. 24.
  • Hu et al. (2021) Hu, K., Ren, Z., Šiška, D. and Szpruch, Ł. (2021). Mean-field langevin dynamics and energy landscape of neural networks. In Annales de l’Institut Henri Poincare (B) Probabilites et statistiques, vol. 57. Institut Henri Poincaré.
  • Huang and Yau (2020) Huang, J. and Yau, H.-T. (2020). Dynamics of deep neural networks and neural tangent hierarchy. In International conference on machine learning. PMLR.
  • Huang et al. (2022) Huang, M., Chen, X., Ji, K., Ma, S. and Lai, L. (2022). Efficiently escaping saddle points in bilevel optimization. arXiv preprint arXiv:2202.03684.
  • Ibrahim et al. (2019) Ibrahim, A., Azizian, W., Gidel, G. and Mitliagkas, I. (2019). Lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, vol. 31.
  • Jin et al. (2019) Jin, C., Netrapalli, P. and Jordan, M. I. (2019). Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618.
  • Jin et al. (2022) Jin, Y., Sidford, A. and Tian, K. (2022). Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. In Conference on Learning Theory. PMLR.
  • Jin et al. (2021) Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning. PMLR.
  • Levy et al. (2020) Levy, D., Carmon, Y., Duchi, J. C. and Sidford, A. (2020). Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33 8847–8860.
  • Li et al. (2023) Li, C. J., Yuan, H., Gidel, G., Gu, Q. and Jordan, M. (2023). Nesterov meets optimism: rate-optimal separable minimax optimization. In International Conference on Machine Learning. PMLR.
  • Li et al. (2022) Li, J., Zhu, L. and So, A. M.-C. (2022). Nonsmooth nonconvex-nonconcave minimax optimization: Primal-dual balancing and iteration complexity analysis. arXiv preprint arXiv:2209.10825.
  • Liao et al. (2020) Liao, L., Chen, Y.-L., Yang, Z., Dai, B., Kolar, M. and Wang, Z. (2020). Provably efficient neural estimation of structural equation models: An adversarial approach. Advances in Neural Information Processing Systems, 33 8947–8958.
  • Lin et al. (2020a) Lin, T., Jin, C. and Jordan, M. (2020a). On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning. PMLR.
  • Lin et al. (2020b) Lin, T., Jin, C. and Jordan, M. I. (2020b). Near-optimal algorithms for minimax optimization. In Conference on Learning Theory. PMLR.
  • Lu et al. (2020a) Lu, S., Tsaknakis, I., Hong, M. and Chen, Y. (2020a). Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 68 3676–3691.
  • Lu et al. (2020b) Lu, Y., Ma, C., Lu, Y., Lu, J. and Ying, L. (2020b). A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth.
  • Luo et al. (2021) Luo, L., Xie, G., Zhang, T. and Zhang, Z. (2021). Near optimal stochastic algorithms for finite-sum unbalanced convex-concave minimax optimization. arXiv preprint arXiv:2106.01761.
  • Luo et al. (2020) Luo, L., Ye, H., Huang, Z. and Zhang, T. (2020). Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33 20566–20577.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • Mei et al. (2019) Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015.
  • Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 E7665–E7671.
  • Nitanda et al. (2022) Nitanda, A., Wu, D. and Suzuki, T. (2022). Convex analysis of the mean field langevin dynamics. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Nouiehed et al. (2019) Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D. and Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32.
  • Ostrovskii et al. (2021a) Ostrovskii, D. M., Barazandeh, B. and Razaviyayn, M. (2021a). Nonconvex-nonconcave min-max optimization with a small maximization domain. arXiv preprint arXiv:2110.03950.
  • Ostrovskii et al. (2021b) Ostrovskii, D. M., Lowy, A. and Razaviyayn, M. (2021b). Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems. SIAM Journal on Optimization, 31 2508–2538.
  • Otto and Villani (2000) Otto, F. and Villani, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173 361–400.
  • Ouyang and Xu (2021) Ouyang, Y. and Xu, Y. (2021). Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185 1–35.
  • Pinkus (1999) Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 143–195.
  • Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W. and Cheng, G. (2022). Online bootstrap inference for policy evaluation in reinforcement learning. Journal of the American Statistical Association 1–14.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X. (2016). Improved techniques for training gans. Advances in neural information processing systems, 29.
  • Sirignano and Spiliopoulos (2020a) Sirignano, J. and Spiliopoulos, K. (2020a). Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130 1820–1852.
  • Sirignano and Spiliopoulos (2020b) Sirignano, J. and Spiliopoulos, K. (2020b). Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80 725–752.
  • Sirignano and Spiliopoulos (2022) Sirignano, J. and Spiliopoulos, K. (2022). Mean field analysis of deep neural networks. Mathematics of Operations Research, 47 120–152.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Sznitman (1991) Sznitman, A.-S. (1991). Topics in propagation of chaos. In Ecole d’Été de Probabilités de Saint-Flour XIX—1989. Springer, 165–251.
  • Thekumparampil et al. (2019) Thekumparampil, K. K., Jain, P., Netrapalli, P. and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. Advances in Neural Information Processing Systems, 32.
  • Uehara et al. (2020) Uehara, M., Huang, J. and Jiang, N. (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning. PMLR.
  • Villani (2003) Villani, C. (2003). Topics in optimal transportation. American Mathematical Society.
  • Villani (2008) Villani, C. (2008). Optimal transport: Old and new. Springer.
  • Wai et al. (2020) Wai, H.-T., Yang, Z., Wang, Z. and Hong, M. (2020). Provably efficient neural GTD for off-policy learning. Advances in Neural Information Processing Systems, 33.
  • Wainwright (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press.
  • Wang et al. (2022) Wang, S., Yu, X. and Perdikaris, P. (2022). When and why pinns fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449 110768.
  • Xie et al. (2020a) Xie, G., Luo, L., Lian, Y. and Zhang, Z. (2020a). Lower complexity bounds for finite-sum convex-concave minimax optimization problems. In International Conference on Machine Learning. PMLR.
  • Xie et al. (2020b) Xie, Q., Chen, Y., Wang, Z. and Yang, Z. (2020b). Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on learning theory. PMLR.
  • Xu et al. (2020) Xu, L., Chen, Y., Srinivasan, S., de Freitas, N., Doucet, A. and Gretton, A. (2020). Learning deep features in instrumental variable regression. arXiv preprint arXiv:2010.07154.
  • Xu et al. (2021) Xu, L., Kanagawa, H. and Gretton, A. (2021). Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34 26264–26275.
  • Xu and Gu (2020) Xu, P. and Gu, Q. (2020). A finite-time analysis of q-learning with neural network function approximation. In International Conference on Machine Learning. PMLR.
  • Yang et al. (2020) Yang, J., Kiyavash, N. and He, N. (2020). Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33 1153–1165.
  • Yang et al. (2022) Yang, J., Orvieto, A., Lucchi, A. and He, N. (2022). Faster single-loop algorithms for minimax optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Zhang et al. (2021a) Zhang, S., Yang, J., Guzmán, C., Kiyavash, N. and He, N. (2021a). The complexity of nonconvex-strongly-concave minimax optimization. In Uncertainty in Artificial Intelligence. PMLR.
  • Zhang et al. (2020) Zhang, Y., Cai, Q., Yang, Z., Chen, Y. and Wang, Z. (2020). Can temporal-difference and q-learning learn representation? A mean-field theory. arXiv preprint arXiv:2006.04761.
  • Zhang et al. (2021b) Zhang, Y., Chen, S., Yang, Z., Jordan, M. and Wang, Z. (2021b). Wasserstein flow meets replicator dynamics: A mean-field analysis of representation learning in actor-critic. Advances in Neural Information Processing Systems, 34 15993–16006.
  • Zhao (2023) Zhao, R. (2023). A primal-dual smoothing framework for max-structured non-convex optimization. Mathematics of operations research.
  • Zhao et al. (2022) Zhao, Y., Tian, Y., Lee, J. and Du, S. (2022). Provably efficient policy optimization for two-player zero-sum markov games. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Zou and Gu (2019) Zou, D. and Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems.

Appendix A Proof of Main Results

In this section, we provide proofs for the main theorems and technical lemmas in our work.

A.1 Proof of Lemma 4.6

Proof of (i). The proof of Claim (i) proceeds in two stages. First, we show that if a function pair $(f^{*},g^{*})$ is a stationary point of $\mathcal{L}(f,g)$ with respect to $(f,g)$, then it is also a saddle point of the same objective. Second, we show that if the distribution pair $(\mu^{*},\nu^{*})$ is a stationary point of $\mathcal{L}(\mu,\nu)$, then the corresponding $(f^{*},g^{*})$ is a stationary point of $\mathcal{L}(f,g)$, which concludes the claim. We start with the first stage and define the following functionals $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$,

\[
\mathcal{L}_{1}(f,g)=\mathbb{E}_{\mathcal{D}}\Big[g(Z)\cdot\Phi(X,Z;f)\Big],\quad\mathcal{L}_{2}(f,g)=\mathbb{E}_{\mathcal{D}}\Big[-1/2\cdot g(Z)^{2}+\lambda\Psi(X,Z;f)\Big].
\]

We see that the minimax objective in (2.9) is the sum of these two functionals,

\[
\mathcal{L}(f,g)=\mathcal{L}_{1}(f,g)+\mathcal{L}_{2}(f,g).
\]

For any function pair $(f,g)$, we can verify that the following identity holds,

\[
\max_{g'}~\mathcal{L}(f,g')-\min_{f'}~\mathcal{L}(f',g)=\max_{g'}~\bigl(\mathcal{L}(f,g')-\mathcal{L}(f,g)\bigr)+\max_{f'}~\bigl(\mathcal{L}(f,g)-\mathcal{L}(f',g)\bigr). \tag{A.1}
\]
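As a short sketch of why (A.1) holds: the term $\mathcal{L}(f,g)$ depends on neither $f'$ nor $g'$, so it can be added and subtracted and then moved inside the two optimizations. The last step uses only $-\min_{f'}\mathcal{L}(f',g)=\max_{f'}\bigl(-\mathcal{L}(f',g)\bigr)$.

```latex
\begin{align*}
\max_{g'}~\mathcal{L}(f,g') - \min_{f'}~\mathcal{L}(f',g)
&= \Bigl(\max_{g'}~\mathcal{L}(f,g') - \mathcal{L}(f,g)\Bigr)
 + \Bigl(\mathcal{L}(f,g) - \min_{f'}~\mathcal{L}(f',g)\Bigr) \\
&= \max_{g'}~\bigl(\mathcal{L}(f,g') - \mathcal{L}(f,g)\bigr)
 + \max_{f'}~\bigl(\mathcal{L}(f,g) - \mathcal{L}(f',g)\bigr).
\end{align*}
```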

We consider the function spaces $L^{2}(\mathcal{W})$ and $L^{2}(\mathcal{Z})$ equipped with the inner product $\langle\cdot,\cdot\rangle_{L^{2}}$, which are Hilbert spaces. Since $\mathcal{X}\times\mathcal{Z}$ is compact, the continuous functions $f$ and $g$ parameterized in the form of (3.5) are square-integrable and thus belong to $L^{2}(\mathcal{W})$ and $L^{2}(\mathcal{Z})$, respectively.

For a fixed $f$, $\mathcal{L}_{1}(f,g)$ is a continuous linear functional in $g$ defined on $L^{2}(\mathcal{Z})$. Thus, by the Riesz representation theorem, there exists a function $h_{f}\in L^{2}(\mathcal{Z})$ such that $\mathcal{L}_{1}(f,g)=\langle h_{f},g\rangle_{L^{2}}$. Similarly, for a fixed $g$, $\mathcal{L}_{1}(f,g)$ is a continuous linear functional in $f$, so there exists a function $h_{g}\in L^{2}(\mathcal{W})$ such that $\mathcal{L}_{1}(f,g)=\langle h_{g},f\rangle_{L^{2}}$. In fact, $h_{f}$ and $h_{g}$ coincide with the variations of $\mathcal{L}_{1}$ with respect to $g$ and $f$, respectively,

\[
h_{f}=\frac{\delta\mathcal{L}_{1}(f,g)}{\delta g},\quad h_{g}=\frac{\delta\mathcal{L}_{1}(f,g)}{\delta f}.
\]

Since $\mathcal{L}_{1}$ is linear in $g$ and $\mathcal{L}_{2}$ is concave in $g$, the first-order inequality for concave functionals yields

\[
\begin{aligned}
\mathcal{L}(f,g')-\mathcal{L}(f,g)
&=\mathcal{L}_{1}\bigl(f,g'\bigr)-\mathcal{L}_{1}\bigl(f,g\bigr)+\mathcal{L}_{2}\bigl(f,g'\bigr)-\mathcal{L}_{2}\bigl(f,g\bigr)\\
&\leq\Big\langle\frac{\delta\mathcal{L}_{1}(f,g)}{\delta g},g'-g\Big\rangle_{L^{2}}+\Big\langle\frac{\delta\mathcal{L}_{2}(f,g)}{\delta g},g'-g\Big\rangle_{L^{2}}\\
&=\Big\langle\frac{\delta\mathcal{L}(f,g)}{\delta g},g'-g\Big\rangle_{L^{2}}.
\end{aligned}
\tag{A.2}
\]

Following similar reasoning, and using the fact that $\mathcal{L}_{1}$ is a linear functional with respect to $f$ and $\mathcal{L}_{2}$ is a convex functional with respect to $f$, it holds that

\[
\mathcal{L}(f,g)-\mathcal{L}(f',g)\leq\Big\langle\frac{\delta\mathcal{L}(f,g)}{\delta f},f-f'\Big\rangle_{L^{2}}. \tag{A.3}
\]

Plugging (A.2) and (A.3) into (A.1), we bound the minimax expression in (A.1) in terms of the variations of $\mathcal{L}(f,g)$,

\[
\max_{g'}~\mathcal{L}(f,g')-\min_{f'}~\mathcal{L}(f',g)\leq\max_{f',g'}~\Big\{\Big\langle\frac{\delta\mathcal{L}(f,g)}{\delta g},g'-g\Big\rangle_{L^{2}}+\Big\langle\frac{\delta\mathcal{L}(f,g)}{\delta f},f-f'\Big\rangle_{L^{2}}\Big\}. \tag{A.4}
\]

Thus, if $(f^{*},g^{*})$ is a stationary point, i.e.,

\[
\frac{\delta\mathcal{L}(f^{*},g^{*})}{\delta f}=\frac{\delta\mathcal{L}(f^{*},g^{*})}{\delta g}=0,\quad\text{a.s.}, \tag{A.5}
\]

then (A.4) implies that, at such a stationary point $(f^{*},g^{*})$, the following inequality holds,

\[
\max_{g'}~\mathcal{L}(f^{*},g')-\min_{f'}~\mathcal{L}(f',g^{*})\leq 0. \tag{A.6}
\]

Since the left-hand side of (A.6) is always nonnegative (take $g'=g^{*}$ and $f'=f^{*}$), (A.6) shows that $(f^{*},g^{*})$ is a saddle point of the minimax objective $\mathcal{L}(f,g)$. Therefore, every stationary point $(f^{*},g^{*})$ of $\mathcal{L}(f,g)$ is a saddle point.
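The first stage can be sanity-checked numerically on a toy problem. The sketch below is only a finite-dimensional analogue and not the functional setting of the paper: it assumes the hypothetical objective $L(x,y)=y^{\top}(Ax-b)-\tfrac{1}{2}\|y\|^{2}+\tfrac{\lambda}{2}\|x\|^{2}$, where the bilinear term plays the role of $\mathcal{L}_{1}$ and the remaining terms play the role of $\mathcal{L}_{2}$. The matrix `A`, the vector `b`, and the function name are illustrative assumptions.

```python
import numpy as np


def duality_gap_at_stationary_point(m=5, d=3, lam=0.5, seed=0):
    """Toy finite-dimensional analogue (an assumption, not the paper's setting).

    L(x, y) = y @ (A @ x - b) - 0.5 * y @ y + 0.5 * lam * x @ x,
    which is concave in y and convex in x.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, d))
    b = rng.standard_normal(m)

    # Stationarity: dL/dy = A x - b - y = 0 and dL/dx = A^T y + lam x = 0,
    # which reduces to the linear system (A^T A + lam I) x = A^T b.
    x_star = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
    y_star = A @ x_star - b

    # max over y' of L(x*, y') is attained at y' = A x* - b.
    primal = 0.5 * np.sum((A @ x_star - b) ** 2) + 0.5 * lam * (x_star @ x_star)
    # min over x' of L(x', y*) is attained at x' = -A^T y* / lam.
    dual = (
        -(y_star @ b)
        - 0.5 * (y_star @ y_star)
        - np.sum((A.T @ y_star) ** 2) / (2.0 * lam)
    )

    grad_x = A.T @ y_star + lam * x_star
    grad_y = A @ x_star - b - y_star
    return primal - dual, grad_x, grad_y
```

Calling `duality_gap_at_stationary_point()` returns the gap $\max_{y'}L(x^{*},y')-\min_{x'}L(x',y^{*})$ together with the two gradients at the stationary point. Up to floating-point error all three should vanish, mirroring the way (A.5) leads to (A.6).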

We now turn to the second stage of the proof. We show that if $(\mu^{*},\nu^{*})$ is a stationary point of $\mathcal{L}$, i.e., $v^{f}(\cdot;\mu^{*},\nu^{*})=v^{g}(\cdot;\mu^{*},\nu^{*})=0$, then the corresponding function pair $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is a stationary point of $\mathcal{L}(f,g)$ with respect to $(f,g)$. Recall that the correspondence between $(\mu^{*},\nu^{*})$ and $(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))$ is given by (3.5). Let $(\mu^{*},\nu^{*})$ be a stationary point of (2.9), that is,

\[
\nabla_{\theta}\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=\nabla_{\omega}\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}. \tag{A.7}
\]

We can also compute the variations of $\mathcal{L}(\mu,\nu)$ explicitly,

\[
\begin{aligned}
\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)
&=\mathbb{E}_{\mathcal{D}}\Bigl[\alpha\Big\langle g(Z;\nu^{*})\cdot\frac{\delta\Phi(X,Z;f(\cdot;\mu^{*}))}{\delta f}+\lambda\cdot\frac{\delta\Psi(X,Z;f(\cdot;\mu^{*}))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^{2}}\Bigr],\\
\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)
&=\mathbb{E}_{\mathcal{D}}\Bigl[\alpha\big(\Phi(X,Z;f(\cdot;\mu^{*}))-g(Z;\nu^{*})\big)\cdot\psi(Z;\omega)\Bigr].
\end{aligned}
\]

By the oddness of $b$ in Assumption 4.1, we have that $\phi(\cdot;\bm{0})=0$. This implies that the variations of $\mathcal{L}(\mu^{*},\nu^{*})$ with respect to $\mu$ and $\nu$ are $0$ when $\theta=\omega=\bm{0}$, i.e.,

\[
\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\bm{0})=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\bm{0})=0.
\]

By (A.7), both variations have vanishing gradient on $\mathbb{R}^{D}$ and are therefore constant. Since they vanish at the origin, we deduce that

\[
\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\nu}(\omega)=0,\quad\forall\theta,\omega\in\mathbb{R}^{D}.
\]

Note that we can expand the variation of $\mathcal{L}$ with respect to $\mu$ as

\[
\alpha\Big\langle\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^{2}}=\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)=0. \tag{A.8}
\]
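The first equality in (A.8) is a chain rule through the parameterization. As a hedged sketch, assume that (3.5), which is not restated in this appendix, takes the mean-field form $f(\cdot;\mu)=\alpha\int\phi(\cdot;\theta)\,\mu(\mathrm{d}\theta)$, and let $\chi$ be a hypothetical signed perturbation of $\mu^{*}$ with zero total mass. Because $f(\cdot;\mu)$ is linear in $\mu$, the directional derivative reads as follows.

```latex
\begin{align*}
f(\cdot;\mu^{*}+\epsilon\chi)
&= f(\cdot;\mu^{*}) + \epsilon\,\alpha\int\phi(\cdot;\theta)\,\chi(\mathrm{d}\theta), \\
\frac{\mathrm{d}}{\mathrm{d}\epsilon}\,
\mathcal{L}(\mu^{*}+\epsilon\chi,\nu^{*})\Big|_{\epsilon=0}
&= \Big\langle \frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f},\;
   \alpha\int\phi(\cdot;\theta)\,\chi(\mathrm{d}\theta) \Big\rangle_{L^{2}} \\
&= \int \alpha\,\Big\langle \frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f},\;
   \phi(\cdot;\theta) \Big\rangle_{L^{2}}\,\chi(\mathrm{d}\theta).
\end{align*}
```

Reading off the integrand against $\chi$ gives the identification of $\frac{\delta\mathcal{L}(\mu^{*},\nu^{*})}{\delta\mu}(\theta)$ with $\alpha\big\langle\frac{\delta\mathcal{L}}{\delta f},\phi(\cdot;\theta)\big\rangle_{L^{2}}$ used in (A.8).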

By the universal function approximation theorem (Lemma D.1), since $\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$ is in $\mathscr{C}(\mathcal{W})$, as assumed in item (iv) of Assumption 4.3, there exists a sequence $\{\phi_{n}\}_{n=1}^{\infty}\subset\mathcal{G}(\phi)$ such that $\phi_{n}\rightarrow\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$ uniformly. Here, $\mathcal{G}(\phi)$ denotes the linear span of the functions $\phi(\cdot;\theta)$. By (A.8) and the linearity of the inner product, it holds that

\[
\Big\langle\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}(\cdot),\phi_{n}(\cdot)\Big\rangle_{L^{2}}=0. \tag{A.9}
\]

Following a similar strategy, we can show that there exists a sequence $\{\psi_{n}\}_{n=1}^{\infty}\subset\mathcal{G}(\psi)$ such that $\psi_{n}\rightarrow\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}$, where for each $\psi_{n}$ it holds that

\[
\Big\langle\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}(\cdot),\psi_{n}(\cdot)\Big\rangle_{L^{2}}=0. \tag{A.10}
\]

Letting $n\rightarrow\infty$ in (A.9) and (A.10), we conclude that

\[
\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}=0,\quad\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta g}=0,\quad\text{a.s.} \tag{A.11}
\]
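The limiting step behind (A.11) can be spelled out as follows. This is a sketch that writes $u$ as shorthand (not the paper's notation) for the variation $\frac{\delta\mathcal{L}(f(\cdot;\mu^{*}),g(\cdot;\nu^{*}))}{\delta f}$, and that assumes uniform convergence on the compact set $\mathcal{W}$ implies convergence in $L^{2}$, which holds when the underlying measure is finite.

```latex
\begin{align*}
\|u\|_{L^{2}}^{2}
&= \langle u,\, u-\phi_{n}\rangle_{L^{2}} + \langle u,\, \phi_{n}\rangle_{L^{2}} \\
&= \langle u,\, u-\phi_{n}\rangle_{L^{2}}
   && \text{by (A.9)} \\
&\leq \|u\|_{L^{2}}\,\|u-\phi_{n}\|_{L^{2}}
   && \text{by the Cauchy--Schwarz inequality} \\
&\longrightarrow 0 \quad \text{as } n\rightarrow\infty .
\end{align*}
```

Hence $\|u\|_{L^{2}}=0$, i.e., $u=0$ almost surely. The same argument with $\psi_{n}$ and (A.10) handles the variation with respect to $g$.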

Equation (A.11) shows that if $(\mu^*,\nu^*)$ is a stationary point of the Wasserstein gradient flow, then the associated function pair $(f(\cdot;\mu^*),g(\cdot;\nu^*))$ is a stationary point of the minimax objective $\mathcal{L}(f,g)$, which matches the conditions derived in (A.5). Therefore, $(f(\cdot;\mu^*),g(\cdot;\nu^*))$ is a saddle point of the minimax objective $\mathcal{L}(f,g)$. This completes the proof of item (i).

Proof of (ii). We now show that there exists a good solution pair $(\mu^*,\nu^*)$ that is both optimal and close to the initialization $(\mu_0,\nu_0)$ in Wasserstein distance. By Assumption 4.2, there exist distributions $\mu^{\dagger},\nu^{\dagger}\in\mathscr{P}_2(\mathbb{R}^D)$ such that the optimal solution $(f^*,g^*)$ to the optimization problem (2.9) satisfies

$$f^*(w)=\int\phi(w;\theta)\,\mathrm{d}\mu^{\dagger}(\theta),\quad g^*(z)=\int\psi(z;\omega)\,\mathrm{d}\nu^{\dagger}(\omega),\quad\forall w\in\mathcal{W},\ z\in\mathcal{Z}.$$

Recall that $\alpha>0$ is the scaling parameter in the neural network parameterization. We construct $(\mu^*,\nu^*)$ as a convex combination of $(\mu^{\dagger},\nu^{\dagger})$ and the initialization $(\mu_0,\nu_0)$,

$$\mu^*(\theta)=\alpha^{-1}\mu^{\dagger}(\theta)+(1-\alpha^{-1})\mu_0(\theta),\quad\nu^*(\omega)=\alpha^{-1}\nu^{\dagger}(\omega)+(1-\alpha^{-1})\nu_0(\omega).\tag{A.12}$$

We claim that $(\mu^*,\nu^*)$ constructed in (A.12) satisfies all the desired requirements. Since $\mu_0,\nu_0$ are standard Gaussian distributions, the integrals of $\phi(\cdot;\theta)$ with respect to $\mu_0$ and of $\psi(\cdot;\omega)$ with respect to $\nu_0$ are identically $0$ due to the oddness of the neuron functions,

$$\int\phi(w;\theta)\,\mathrm{d}\mu_0(\theta)=0,\quad\int\psi(z;\omega)\,\mathrm{d}\nu_0(\omega)=0,\quad\forall w\in\mathcal{W},\ z\in\mathcal{Z}.$$
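The vanishing integrals above can be checked numerically in a toy setting. The following is a minimal sketch, assuming a scalar parameter and the hypothetical odd neuron $\phi(w;\theta)=\tanh(\theta w)$, which is chosen only for illustration and is not the paper's neuron; the expectation under the standard Gaussian is approximated by Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gaussian_expectation(func, degree=80):
    """Approximate E[func(theta)] for theta ~ N(0, 1) by Gauss-Hermite quadrature."""
    # hermegauss uses the weight function exp(-x^2 / 2), so divide by sqrt(2*pi).
    nodes, weights = hermegauss(degree)
    return np.sum(weights * func(nodes)) / np.sqrt(2.0 * np.pi)


def neuron(w, theta):
    """Hypothetical odd neuron phi(w; theta) = tanh(theta * w) (an assumption)."""
    return np.tanh(theta * w)


# Evaluate the integral of phi(w; theta) against N(0, 1) for several inputs w.
inputs = np.linspace(-3.0, 3.0, 7)
values = np.array(
    [gaussian_expectation(lambda th, w=w: neuron(w, th)) for w in inputs]
)
print(np.max(np.abs(values)))  # should be at the level of floating-point round-off
```

Because the quadrature nodes are symmetric about zero and the integrand is odd in $\theta$, the contributions cancel pairwise, mirroring the symmetry argument in the text.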

Thus, the expressions for $(f^*,g^*)$ simplify to

$$f^*(w)=\alpha\int\phi(w;\theta)\,\mathrm{d}\mu^*(\theta),\quad g^*(z)=\alpha\int\psi(z;\omega)\,\mathrm{d}\nu^*(\omega).$$
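For completeness, the simplification can be spelled out in one line. Substituting the convex combination (A.12) and using the fact that the integral of $\phi(w;\cdot)$ against $\mu_0$ vanishes, one obtains for every $w\in\mathcal{W}$

$$\alpha\int\phi(w;\theta)\,\mathrm{d}\mu^*(\theta)=\alpha\Bigl[\alpha^{-1}\int\phi(w;\theta)\,\mathrm{d}\mu^{\dagger}(\theta)+(1-\alpha^{-1})\int\phi(w;\theta)\,\mathrm{d}\mu_0(\theta)\Bigr]=\int\phi(w;\theta)\,\mathrm{d}\mu^{\dagger}(\theta)=f^*(w),$$

and the same computation with $(\psi,\nu^*,\nu^{\dagger},\nu_0)$ in place of $(\phi,\mu^*,\mu^{\dagger},\mu_0)$ gives the identity for $g^*$.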

By Talagrand’s inequality (Lemma D.5), the following chain of inequalities holds,

$$\begin{aligned}
\mathcal{W}_2(\mu_0,\mu^*)^2&\leq 2D_{\mathrm{KL}}(\mu^*\,\|\,\mu_0)\leq D_{\chi^2}(\mu^*\,\|\,\mu_0)\\
&=\int\biggl(\frac{\mu^*(\theta)}{\mu_0(\theta)}-1\biggr)^2\,\mathrm{d}\mu_0(\theta)=\int\biggl(\frac{(1-\alpha^{-1})\cdot\mu_0(\theta)+\alpha^{-1}\cdot\mu^{\dagger}(\theta)}{\mu_0(\theta)}-1\biggr)^2\,\mathrm{d}\mu_0(\theta)\\
&=\alpha^{-2}D_{\chi^2}(\mu^{\dagger}\,\|\,\mu_0).
\end{aligned}\tag{A.13}$$
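The final equality in (A.13), namely that the $\chi^2$-divergence of the mixture from $\mu_0$ equals $\alpha^{-2}$ times that of $\mu^{\dagger}$, can be sanity-checked numerically. The sketch below assumes a one-dimensional parameter, takes $\mu_0=N(0,1)$, and uses a hypothetical target $\mu^{\dagger}=N(0.5,0.8^2)$ chosen only so that the divergence is finite; it checks the scaling identity alone, not the Talagrand or KL steps of the chain.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated on the array x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def chi2_divergence(p, q, dx):
    """Riemann-sum approximation of the integral of (p/q - 1)^2 q on a uniform grid."""
    return np.sum((p / q - 1.0) ** 2 * q) * dx


grid = np.linspace(-12.0, 12.0, 200001)
dx = grid[1] - grid[0]
mu0 = gaussian_pdf(grid, 0.0, 1.0)        # initialization (standard Gaussian)
mu_dagger = gaussian_pdf(grid, 0.5, 0.8)  # hypothetical target, an assumption


def mixture_chi2(alpha):
    """Chi-square divergence of mu* = mu_dagger / alpha + (1 - 1/alpha) mu0 from mu0."""
    mu_star = mu_dagger / alpha + (1.0 - 1.0 / alpha) * mu0
    return chi2_divergence(mu_star, mu0, dx)


base = chi2_divergence(mu_dagger, mu0, dx)
for alpha in (2.0, 10.0, 100.0):
    # The two printed numbers should agree up to round-off.
    print(alpha, mixture_chi2(alpha), base / alpha ** 2)
```

The agreement is exact up to floating-point error because $\mu^*/\mu_0-1=\alpha^{-1}(\mu^{\dagger}/\mu_0-1)$ holds pointwise, so the square pulls out a factor of $\alpha^{-2}$.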

A similar bound on $\mathcal{W}_2(\nu_0,\nu^*)$ also applies,

$$\mathcal{W}_2(\nu_0,\nu^*)^2\leq\alpha^{-2}D_{\chi^2}(\nu^{\dagger}\,\|\,\nu_0).\tag{A.14}$$

Letting $\bar{D}=\max\{D_{\chi^2}(\mu^{\dagger}\,\|\,\mu_0)^{1/2},D_{\chi^2}(\nu^{\dagger}\,\|\,\nu_0)^{1/2}\}$, we conclude the proof of item (ii).

A.2 Proof of Theorem 4.7

By Lemma 4.6, there exists a pair of distributions $(\mu^*,\nu^*)$ that is a stationary point of the Wasserstein gradient flow (3.7) and simultaneously satisfies the distance bound in item (ii) of Lemma 4.6. For such $(\mu^*,\nu^*)$, we denote by $\rho^*(\theta,\omega)=\mu^*(\theta)\nu^*(\omega)$ their product measure. Moreover, for any distribution pair $(\mu,\nu)$, we write $\rho(\theta,\omega)=\mu(\theta)\nu(\omega)$ for their product measure for simplicity. To rewrite the Wasserstein gradient flow for $(\mu,\nu)$ as a flow for $\rho$, we define the stacked vector field $v$ as

$$v(\theta,\omega;\mu,\nu)=\bigl(v^f(\theta;\mu,\nu),v^g(\omega;\mu,\nu)\bigr).\tag{A.15}$$

By Lemma D.2, (A.1), and (A.14), it holds that $\mathcal{W}_2(\rho^*,\rho_0)\leq\alpha^{-1}\bar{D}$, where $\bar{D}$ is defined in Lemma 4.6. Note that

$$\begin{aligned}
f(w;\mu)&=\alpha\int\phi(w;\theta)\mu(\theta)\,\mathrm{d}\theta=\alpha\int\phi(w;\theta)\rho(\theta,\omega)\,\mathrm{d}(\theta,\omega),\quad\forall w\in\mathcal{W},\\
g(z;\nu)&=\alpha\int\psi(z;\omega)\nu(\omega)\,\mathrm{d}\omega=\alpha\int\psi(z;\omega)\rho(\theta,\omega)\,\mathrm{d}(\theta,\omega),\quad\forall z\in\mathcal{Z}.
\end{aligned}$$

Thus, we overload the notation and write $f(\cdot;\rho)=f(\cdot;\mu)$ and $g(\cdot;\rho)=g(\cdot;\nu)$ for $\rho\in\mathscr{P}_2(\mathbb{R}^D\times\mathbb{R}^D)$. Writing $\rho_t=(\mu_t,\nu_t)$, the update in (3.7) takes the following form

$$\partial_t\rho_t(\theta,\omega)=-\mathrm{div}\bigl(\rho_t(\theta,\omega)v(\theta,\omega;\rho_t)\bigr),\quad\rho_0=(\mu_0,\nu_0).$$

Before we prove Theorem 4.7, we first show the following important technical lemma.

Lemma A.1.

Assume that $\mathcal{W}_2(\rho_t,\rho^*)\leq 2\mathcal{W}_2(\rho_0,\rho^*)$. Under Assumptions 4.1, 4.2, and 4.3, it holds that

$$\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t}\leq-\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\big(X,Z;f(\cdot;\mu_t)-f^*(\cdot)\big)+\bigl(g(Z;\nu_t)-g^*(Z)\bigr)^2\Bigr]+C_*\cdot\eta\alpha^{-1},\tag{A.16}$$

where $C_*>0$ is a constant depending on $B_0,B_1,B_2$, $\lambda$, and $\bar{D}$.

Proof.

Let $\{\beta_s\}_{s\in[0,1]}$ be the geodesic connecting $\rho_t$ and $\rho^*$ with $\beta_0=\rho_t$ and $\beta_1=\rho^*$. Let $u$ be the corresponding velocity field such that $\partial_s\beta_s=-\mathrm{div}(\beta_s u_s)$. By the first variation formula of the Wasserstein distance in Lemma D.3, it holds that

$$\begin{aligned}
\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t}&=-\eta\big\langle v(\cdot;\rho_t),u_0\big\rangle_{\rho_t}=-\eta\big\langle v(\cdot;\rho^*),u_1\big\rangle_{\rho^*}+\eta\int_0^1\partial_s\big\langle v(\cdot;\beta_s),u_s\big\rangle_{\beta_s}\,\mathrm{d}s\\
&=\eta\underbrace{\int_0^1\big\langle\partial_s v(\cdot;\beta_s),u_s\big\rangle_{\beta_s}\,\mathrm{d}s}_{\displaystyle\mathrm{(i)}}+\eta\underbrace{\int_0^1\int\big\langle v(\theta,\omega;\beta_s),\partial_s\bigl(u_s(\theta,\omega)\beta_s(\theta,\omega)\bigr)\big\rangle\,\mathrm{d}(\theta,\omega)\,\mathrm{d}s}_{\displaystyle\mathrm{(ii)}},
\end{aligned}\tag{A.17}$$

where we use the notation $\big\langle h_1,h_2\big\rangle_{\rho}=\int h_1\cdot h_2\,\mathrm{d}\rho$ for any distribution $\rho$ and functions $h_1,h_2$. We bound terms (i) and (ii) separately in the sequel.
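As a reading aid, the two steps in (A.17) can be unpacked as follows. The first line is the fundamental theorem of calculus applied to $s\mapsto\langle v(\cdot;\beta_s),u_s\rangle_{\beta_s}$ on $[0,1]$. The boundary term $\langle v(\cdot;\rho^*),u_1\rangle_{\rho^*}$ is dropped in the second line, presumably because $\rho^*$ is a stationary point of the flow; this justification is our reading and is not stated explicitly here. The remaining integrand is then expanded by the product rule:

$$\partial_s\big\langle v(\cdot;\beta_s),u_s\big\rangle_{\beta_s}=\partial_s\int\big\langle v(\theta,\omega;\beta_s),u_s(\theta,\omega)\big\rangle\beta_s(\theta,\omega)\,\mathrm{d}(\theta,\omega)=\big\langle\partial_s v(\cdot;\beta_s),u_s\big\rangle_{\beta_s}+\int\big\langle v(\theta,\omega;\beta_s),\partial_s\bigl(u_s(\theta,\omega)\beta_s(\theta,\omega)\bigr)\big\rangle\,\mathrm{d}(\theta,\omega).$$

Integrating the two summands over $s\in[0,1]$ gives terms (i) and (ii), respectively.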

Upper bounding term (i). For term (i) of (A.17), by the definitions of $v$, $v^f$, and $v^g$ in (A.15) and (3), we have that

$$\begin{aligned}
\partial_s v^f(\theta,\omega;\beta_s)&=\alpha\partial_s\mathbb{E}_{\mathcal{D}}\Bigl[-g(Z;\beta_s)\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\beta_s))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^2}-\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\beta_s))}{\delta f},\nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^2}\Bigr]\\
&=\alpha\nabla_{\theta}\mathbb{E}_{\mathcal{D}}\Bigl[-g(Z;\partial_s\beta_s)\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\beta_s))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^2}-\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_s\beta_s))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^2}\Bigr],
\end{aligned}$$

where the second equality holds since $\frac{\delta\Phi(X,Z;f)}{\delta f}$ is a constant, $s$-independent function, $\frac{\delta\Psi(X,Z;f)}{\delta f}$ is linear in $f$, and $\partial_s f(\cdot;\beta_s),\partial_s g(\cdot;\beta_s)$ satisfy

$$\begin{aligned}
\partial_s f(w;\beta_s)&=\alpha\int\partial_s\big(\phi(w;\theta)\beta_s(\theta,\omega)\big)\,\mathrm{d}(\theta,\omega)=\alpha\int\phi(w;\theta)\partial_s\beta_s\,\mathrm{d}(\theta,\omega)=f(w;\partial_s\beta_s),\quad\forall w\in\mathcal{W},\\
\partial_s g(z;\beta_s)&=\alpha\int\partial_s\big(\psi(z;\omega)\beta_s(\theta,\omega)\big)\,\mathrm{d}(\theta,\omega)=\alpha\int\psi(z;\omega)\partial_s\beta_s\,\mathrm{d}(\theta,\omega)=g(z;\partial_s\beta_s),\quad\forall z\in\mathcal{Z}.
\end{aligned}$$

A similar computation for $\partial_s v^g(\theta,\omega;\beta_s)$ gives

$$\partial_s v^g(\theta,\omega;\beta_s)=\alpha\nabla_{\omega}\mathbb{E}_{\mathcal{D}}\Bigl[\widetilde{\Phi}(X,Z;f(\cdot;\partial_s\beta_s))\cdot\psi(Z;\omega)-g(Z;\partial_s\beta_s)\cdot\psi(Z;\omega)\Bigr].$$

We recall that $\widetilde{\Phi}(x,z;f)=\Phi(x,z;f)-\Phi(x,z;\bm{0})$ is the linear component in $\Phi$. We note that the variation of $\widetilde{\Phi}$ is the same as the variation of $\Phi$ with respect to $f$: $\frac{\delta\Phi(X,Z;f)}{\delta f}=\frac{\delta\widetilde{\Phi}(X,Z;f)}{\delta f}$.

We define the potential $\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})$ as

$$\begin{aligned}
\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})
&=\mathbb{E}_{\mathcal{D}}\Bigl[g(Z;\partial_{s}\beta_{s})\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^{2}}+\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},\phi(\cdot;\theta)\Big\rangle_{L^{2}}\Bigr]\\
&\qquad-\mathbb{E}_{\mathcal{D}}\Bigl[\widetilde{\Phi}(X,Z;f(\cdot;\partial_{s}\beta_{s}))\cdot\psi(Z;\omega)-g(Z;\partial_{s}\beta_{s})\cdot\psi(Z;\omega)\Bigr].
\end{aligned}$$

Then the vector field $\partial_{s}v(\theta,\omega;\beta_{s})$ is given by the gradient of this potential $\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})$,

$$\partial_{s}v(\theta,\omega;\beta_{s})=\begin{pmatrix}\partial_{s}v^{f}(\theta;\beta_{s})\\ \partial_{s}v^{g}(\omega;\beta_{s})\end{pmatrix}=-\alpha\nabla\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s}),$$

where $\nabla=(\nabla_{\theta},\nabla_{\omega})$ denotes the gradient operator. Then, by Stokes' formula and integration by parts, we have

$$\begin{aligned}
\big\langle\partial_{s}v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}
&=-\int\alpha\nabla\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\cdot u_{s}(\theta,\omega)\,\beta_{s}(\theta,\omega)\,\mathrm{d}(\theta,\omega)\\
&=\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\,\mathrm{div}(u_{s}\beta_{s})\,\mathrm{d}(\theta,\omega)=-\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\,\partial_{s}\beta_{s}\,\mathrm{d}(\theta,\omega).
\end{aligned}$$

Integrating the potential $\mathcal{V}$ against $\partial_{s}\beta_{s}$ simplifies the expression to

$$\begin{aligned}
&\int\alpha\mathcal{V}(\theta,\omega;\partial_{s}\beta_{s})\,\partial_{s}\beta_{s}\,\mathrm{d}(\theta,\omega)\\
&\qquad=\mathbb{E}_{\mathcal{D}}\Big[g(Z;\partial_{s}\beta_{s})\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f},\int\alpha\phi(\cdot;\theta)\,\partial_{s}\beta_{s}(\mathrm{d}\theta)\Big\rangle_{L^{2}}\Big]\\
&\qquad\qquad+\mathbb{E}_{\mathcal{D}}\Big[\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},\int\alpha\phi(\cdot;\theta)\,\partial_{s}\beta_{s}(\mathrm{d}\theta)\Big\rangle_{L^{2}}\Big]\\
&\qquad\qquad-\mathbb{E}_{\mathcal{D}}\Bigl[\widetilde{\Phi}(X,Z;f(\cdot;\partial_{s}\beta_{s}))\cdot\int\alpha\psi(Z;\omega)\,\partial_{s}\beta_{s}(\mathrm{d}\omega)-g(Z;\partial_{s}\beta_{s})\cdot\int\alpha\psi(Z;\omega)\,\partial_{s}\beta_{s}(\mathrm{d}\omega)\Bigr]\\
&\qquad=\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;\partial_{s}\beta_{s})\Big\rangle_{L^{2}}+g(Z;\partial_{s}\beta_{s})^{2}\Bigr].
\end{aligned}\tag{A.18}$$

By the convexity of $\Psi(x,z;f)$ in $f$ and the fact that $\Psi(x,z;\bm{0})=0$ for all $(x,z)\in\mathcal{X}\times\mathcal{Z}$, it holds that

$$\Psi(x,z;f(\cdot;\partial_{s}\beta_{s}))\leq\Big\langle\frac{\delta\Psi(x,z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;\partial_{s}\beta_{s})\Big\rangle_{L^{2}},\quad\forall(x,z)\in\mathcal{X}\times\mathcal{Z}.\tag{A.19}$$
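For completeness, one way to see (A.19) is sketched below. It assumes that $\Psi(x,z;\cdot)$ is convex and differentiable in the sense of the variation $\delta\Psi/\delta f$; the shorthand $h$ is introduced here only for readability. The first-order condition for convexity, written at the base point $h$ and evaluated at $\bm{0}$, gives

```latex
\begin{aligned}
0 = \Psi(x,z;\bm{0})
  &\geq \Psi(x,z;h)
     + \Big\langle \frac{\delta\Psi(x,z;h)}{\delta f},\, \bm{0} - h \Big\rangle_{L^{2}} \\
  &= \Psi(x,z;h)
     - \Big\langle \frac{\delta\Psi(x,z;h)}{\delta f},\, h \Big\rangle_{L^{2}},
  \qquad h = f(\cdot;\partial_{s}\beta_{s}).
\end{aligned}
```

Rearranging the last line yields the stated inequality.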

Combining the last two displays and integrating with respect to $s\in[0,1]$, we have that

$$\begin{aligned}
\int_{0}^{1}\big\langle\partial_{s}v(\cdot;\beta_{s}),u_{s}\big\rangle_{\beta_{s}}\,\mathrm{d}s
&=-\int_{0}^{1}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))}{\delta f},f(\cdot;\partial_{s}\beta_{s})\Big\rangle_{L^{2}}+g(Z;\partial_{s}\beta_{s})^{2}\Bigr]\,\mathrm{d}s\\
&\leq-\int_{0}^{1}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\partial_{s}\beta_{s}))+g(Z;\partial_{s}\beta_{s})^{2}\Bigr]\,\mathrm{d}s\\
&\leq-\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\rho_{t})-f(\cdot;\rho^{*}))+\bigl(g(Z;\rho_{t})-g(Z;\rho^{*})\bigr)^{2}\Bigr]\\
&=-\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\cdot\Psi(X,Z;f(\cdot;\rho_{t})-f^{*}(\cdot))+\bigl(g(Z;\rho_{t})-g^{*}(Z)\bigr)^{2}\Bigr].
\end{aligned}\tag{A.20}$$

Here the first inequality holds due to (A.19), and the second holds by Jensen's inequality applied to the integral over $s\in[0,1]$.
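To make the Jensen step concrete for the $g$-term, the following sketch uses only the identity $\partial_{s}g(z;\beta_{s})=g(z;\partial_{s}\beta_{s})$ established above. The endpoint notation $\beta_{0},\beta_{1}$ is an assumption made here: the two endpoints are $\rho_{t}$ and $\rho^{*}$, and the orientation of the geodesic does not matter because the difference is squared.

```latex
\begin{aligned}
\int_{0}^{1} g(Z;\partial_{s}\beta_{s})^{2}\,\mathrm{d}s
  &\geq \Big(\int_{0}^{1} g(Z;\partial_{s}\beta_{s})\,\mathrm{d}s\Big)^{2}
     = \Big(\int_{0}^{1} \partial_{s} g(Z;\beta_{s})\,\mathrm{d}s\Big)^{2} \\
  &= \big(g(Z;\beta_{1}) - g(Z;\beta_{0})\big)^{2}
     = \big(g(Z;\rho_{t}) - g(Z;\rho^{*})\big)^{2}.
\end{aligned}
```

The $\Psi$-term follows the same pattern, with the convexity of $\Psi(x,z;\cdot)$ playing the role of the square.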

Upper bounding term (ii). By Lemma D.6, for term (ii) in (A.17), it holds that

$$\begin{aligned}
&\int\big\langle v(\theta,\omega;\beta_{s}),\partial_{s}\big(u_{s}(\theta,\omega)\beta_{s}(\theta,\omega)\big)\big\rangle\,\mathrm{d}(\theta,\omega)\\
&\qquad=\int\big\langle\nabla v(\theta,\omega;\beta_{s}),u_{s}(\theta,\omega)\otimes u_{s}(\theta,\omega)\,\beta_{s}(\theta,\omega)\big\rangle\,\mathrm{d}(\theta,\omega)\\
&\qquad\leq\sup_{\theta,\omega}\bigl\|\nabla v(\theta,\omega;\beta_{s})\bigr\|_{F}\cdot\|u_{s}\|_{\beta_{s}}^{2},
\end{aligned}\tag{A.21}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm. Since $u_{s}$ is the velocity field corresponding to the geodesic connecting $\rho_{t}$ and $\rho^{*}$, it holds by assumption that

$$\|u_{s}\|_{\beta_{s}}^{2}=\mathcal{W}_{2}(\rho_{t},\rho^{*})^{2}\leq 4\mathcal{W}_{2}(\rho_{0},\rho^{*})^{2}=4\alpha^{-2}\bar{D}^{2}=\mathcal{O}(\alpha^{-2}).\tag{A.22}$$

On the other hand, by the definition of $v$ in (A.15), we have that

$$\bigl\|\nabla v(\theta,\omega;\beta_{s})\bigr\|_{F}^{2}=\bigl\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\bigr\|_{F}^{2}+\bigl\|\nabla_{\omega}v^{g}(\omega;\beta_{s})\bigr\|_{F}^{2}.\tag{A.23}$$

By the definition of $v^{f}$ in (3), we have that

$$\begin{aligned}
\big\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\big\|_{F}
&\leq\alpha\cdot\mathbb{E}_{\mathcal{D}}\Big[\Big|g(Z;\beta_{s})\cdot\int_{\mathcal{W}}\frac{\delta\Phi(X,Z;f(\cdot;\beta_{s}))}{\delta f}(w^{\prime})\,\mathrm{d}w^{\prime}\Big|\Big]\cdot\sup_{w\in\mathcal{W}}\big\|\nabla^{2}_{\theta,\theta}\phi(w;\theta)\big\|_{F}^{2}\\
&\qquad+\alpha\cdot\mathbb{E}_{\mathcal{D}}\Big[\lambda\cdot\Big|\int_{\mathcal{W}}\frac{\delta\Psi(X,Z;f(\cdot;\beta_{s}))}{\delta f}(w^{\prime})\,\mathrm{d}w^{\prime}\Big|\Big]\cdot\sup_{w\in\mathcal{W}}\big\|\nabla^{2}_{\theta,\theta}\phi(w;\theta)\big\|_{F}^{2}\\
&\leq\alpha B_{2}\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\lambda C_{\Psi}\bigl|f(W;\beta_{s})\bigr|+C_{2}\big|g(Z;\beta_{s})\big|\Bigr],
\end{aligned}\tag{A.24}$$

where the first inequality follows from Assumption 4.1, and the second inequality follows from the integrability conditions in Assumption 4.3. Thus, it suffices to upper bound $f(w;\beta_{s})$ and $g(z;\beta_{s})$ for all $(w,z)\in\mathcal{W}\times\mathcal{Z}$. For $f(w;\beta_{s})$, we have that

$$\begin{aligned}
\bigl|f(w;\beta_{s})\bigr|
&=\alpha\cdot\Bigl|\int\phi(w;\theta)\,\mathrm{d}\beta_{s}(\theta,\omega)\Bigr|=\alpha\cdot\Bigl|\int\phi(w;\theta)\,\mathrm{d}(\beta_{s}-\rho_{0})(\theta,\omega)\Bigr|\\
&\leq\alpha B_{1}\cdot\mathcal{W}_{1}(\beta_{s},\rho_{0})\leq\alpha B_{1}\cdot\mathcal{W}_{2}(\beta_{s},\rho_{0}).
\end{aligned}\tag{A.25}$$
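The two inequalities in (A.25) can be checked numerically in a toy setting. The sketch below is an illustration under stated assumptions, not the construction used in the paper: the parameter $\theta$ is one-dimensional, the neuron is a hypothetical `phi(w, theta) = tanh(w * theta)` whose Lipschitz constant in $\theta$ is $|w|$ (standing in for $B_1$), both measures are replaced by equal-size empirical samples, and the scaling $\alpha$ is omitted. In one dimension the optimal coupling between equal-size samples matches sorted values, which is what `wasserstein_1d` uses.

```python
import numpy as np


def wasserstein_1d(x, y, p):
    """W_p between two equal-size 1D empirical measures.

    In one dimension the optimal coupling matches sorted samples.
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))


def phi(w, theta):
    """Toy neuron (assumption): tanh(w * theta), Lipschitz in theta with constant |w|."""
    return np.tanh(w * theta)


rng = np.random.default_rng(0)
n = 2000
theta_0 = rng.normal(size=n)                          # samples standing in for rho_0
theta_s = theta_0 + 0.05 + 0.1 * rng.normal(size=n)   # samples standing in for beta_s

w = 0.7          # hypothetical fixed input
lip = abs(w)     # plays the role of B_1 in this toy setting

# |int phi d(beta_s - rho_0)| estimated from the two samples
gap = abs(float(np.mean(phi(w, theta_s)) - np.mean(phi(w, theta_0))))
w1 = wasserstein_1d(theta_s, theta_0, p=1)
w2 = wasserstein_1d(theta_s, theta_0, p=2)

print(f"|int phi d(beta_s - rho_0)| = {gap:.4f}")
print(f"B_1 * W_1 = {lip * w1:.4f}")
print(f"B_1 * W_2 = {lip * w2:.4f}")
```

The printed values are nondecreasing from top to bottom: the first comparison is the Lipschitz (Kantorovich-Rubinstein) bound, and the second is $\mathcal{W}_1 \leq \mathcal{W}_2$.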

Moreover, it holds that

$$\mathcal{W}_{2}(\beta_{s},\rho_{0})\leq\mathcal{W}_{2}(\beta_{s},\rho^{*})+\mathcal{W}_{2}(\rho^{*},\rho_{0})\leq\mathcal{W}_{2}(\rho_{t},\rho^{*})+\mathcal{W}_{2}(\rho_{0},\rho^{*})\leq 3\alpha^{-1}\bar{D},\tag{A.26}$$

where the first inequality is the triangle inequality, the second follows from the fact that $\beta_{s}$, $s\in[0,1]$, is the geodesic connecting $\rho_{t}$ and $\rho^{*}$, and the last follows from (ii) in Lemma 4.6. Plugging (A.26) into (A.25), we have that

$$\bigl|f(w;\beta_{s})\bigr|\leq\mathcal{O}(1),\quad\forall w\in\mathcal{W}.\tag{A.27}$$

Through a similar argument, such an upper bound can also be established for $g(z;\beta_{s})$ for all $z\in\mathcal{Z}$,

$$\bigl|g(z;\beta_{s})\bigr|\leq\mathcal{O}(1),\quad z\in\mathcal{Z}.\tag{A.28}$$

Plugging (A.27) and (A.28) into (A.24), we establish an upper bound for $\big\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\big\|_{F}$,

$$\big\|\nabla_{\theta}v^{f}(\theta;\beta_{s})\big\|_{F}\leq\mathcal{O}(\alpha).\tag{A.29}$$

Similarly, by the definition of $v^g$ in (3) we have that

\begin{align}
\bigl\|\nabla_\omega v^g(\omega;\beta_s)\bigr\|_F
&\leq \alpha\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\bigl|\Phi(X,Z;f(\cdot;\beta_s))\bigr| + \bigl|g(Z;\beta_s)\bigr|\Bigr]\cdot\sup_{z\in\mathcal{Z}}\bigl\|\nabla^2_{\omega,\omega}\psi(z;\omega)\bigr\|_F^2 \notag\\
&\leq \alpha B_2\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\bigl|\Phi(X,Z;\bm{0})\bigr| + C_2\bigl|f(W;\beta_s)\bigr| + \bigl|g(Z;\beta_s)\bigr|\Bigr] = \mathcal{O}(\alpha). \tag{A.30}
\end{align}

Combining the bounds from (A.29) and (A.2) and plugging them into (A.23), it holds that

\begin{align}
\bigl\|\nabla v(\theta,\omega;\beta_s)\bigr\|_F^2 = \bigl\|\nabla_\theta v^f(\theta;\beta_s)\bigr\|_F^2 + \bigl\|\nabla_\omega v^g(\omega;\beta_s)\bigr\|_F^2 \leq \mathcal{O}(\alpha^2). \tag{A.31}
\end{align}

Equations (A.22) and (A.31) provide upper bounds on the two terms involved in (A.2). Plugging in these upper bounds, it holds that

\begin{align}
\int \Bigl\langle v(\theta,\omega;\beta_s), \partial_s\bigl(u_s(\theta,\omega)\beta_s(\theta,\omega)\bigr)\Bigr\rangle \,\mathrm{d}(\theta,\omega) \leq \mathcal{O}(\alpha^{-1}). \tag{A.32}
\end{align}

Now combining (A.2) and (A.32), we have that

\begin{align*}
\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t}
&\leq -\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\rho_t)-f(\cdot;\rho^*)\bigr) + \bigl(g(Z;\rho_t)-g(Z;\rho^*)\bigr)^2\Bigr] + C_*\cdot\eta\cdot\alpha^{-1},
\end{align*}

where $C_* = C_*\bigl(B_0,B_1,B_2,C,\lambda,\bar{D}\bigr) > 0$ is a constant. This completes the proof of Lemma A.1. ∎

We are now ready to present the proof of Theorem 4.7 with the help of Lemma A.1.

Proof.

We define

\begin{align}
t^* = \inf\Bigl\{\tau\in\mathbb{R}_+ \,\Big|\, \mathbb{E}_{\mathcal{D}}\bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\rho_\tau)-f(\cdot;\rho^*)\bigr) + \bigl(g(Z;\rho_\tau)-g(Z;\rho^*)\bigr)^2\bigr] < C_*\cdot\alpha^{-1}\Bigr\}. \tag{A.33}
\end{align}

Also, we define

\begin{align}
t_* = \inf\Bigl\{\tau\in\mathbb{R}_+ \,\Big|\, \mathcal{W}_2(\rho_\tau,\rho^*) > 2\mathcal{W}_2(\rho_0,\rho^*)\Bigr\}. \tag{A.34}
\end{align}

In other words, (A.16) of Lemma A.1 holds for $t\leq t_*$, and for $0\leq t\leq\min\{t_*,t^*\}$, we have

\begin{align*}
\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t} \leq -\eta\cdot\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\rho_t)-f(\cdot;\rho^*)\bigr) + \bigl(g(Z;\rho_t)-g(Z;\rho^*)\bigr)^2\Bigr] + C_*\cdot\eta\alpha^{-1} \leq 0.
\end{align*}

We now show that $t_* > t^*$ by contradiction. By the continuity of $\mathcal{W}_2(\rho_t,\rho^*)^2$ with respect to $t$ (Ambrosio et al., 2008), since $\mathcal{W}_2(\rho_0,\rho^*) < 2\mathcal{W}_2(\rho_0,\rho^*)$, it holds that $t_* > 0$. Suppose that $t_*\leq t^*$, so that $t_* = \min\{t_*,t^*\}$. Then, by (A.16), (A.33), and (A.34), it holds for $0\leq t\leq t_*$ that

\begin{align*}
\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t} \leq 0,
\end{align*}

which further implies that $\mathcal{W}_2(\rho_{t_*},\rho^*) \leq \mathcal{W}_2(\rho_0,\rho^*)$. This contradicts the definition of $t_*$ in (A.34). Thus, it holds that $t_* > t^*$, which implies that (A.16) of Lemma A.1 holds for any $0\leq t\leq t^*$. We now discuss two different situations.
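Before turning to the two scenarios, the contradiction step above can be spelled out as follows. This is only a sketch of the reasoning and not part of the original argument; it assumes $\mathcal{W}_2(\rho_0,\rho^*) > 0$ and $t_* < \infty$, neither of which is stated explicitly here.

```latex
% Sketch, assuming W_2(\rho_0,\rho^*) > 0 and t_* < \infty.
\begin{align*}
\mathcal{W}_2(\rho_{t_*},\rho^*)^2
  &= \mathcal{W}_2(\rho_0,\rho^*)^2
   + \int_0^{t_*} \frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t}\,\mathrm{d}t
   \;\leq\; \mathcal{W}_2(\rho_0,\rho^*)^2
  && \text{(nonpositive derivative on $[0,t_*]$)}, \\
\mathcal{W}_2(\rho_{t_*},\rho^*)
  &\geq 2\,\mathcal{W}_2(\rho_0,\rho^*)
  && \text{(continuity in $t$ and the infimum in (A.34))}.
\end{align*}
```

Taken together, the two lines would give $2\mathcal{W}_2(\rho_0,\rho^*) \leq \mathcal{W}_2(\rho_{t_*},\rho^*) \leq \mathcal{W}_2(\rho_0,\rho^*)$, which is impossible when $\mathcal{W}_2(\rho_0,\rho^*) > 0$.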

Scenario (i) If $t_*\leq T$, then it holds that

\begin{align}
&\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_t)-f^*\bigr) + \bigl(g(Z;\nu_t)-g^*\bigr)^2\Bigr] \notag\\
&\quad\leq \mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_{t_*})-f^*\bigr) + \bigl(g(Z;\nu_{t_*})-g^*\bigr)^2\Bigr] \notag\\
&\quad< C_*\alpha^{-1} = \mathcal{O}(T^{-1}+\alpha^{-1}). \tag{A.35}
\end{align}

Therefore, (A.2) implies Theorem 4.7 in this scenario.

Scenario (ii) If $t_* > T$, then (A.16) in Lemma A.1 holds for $0\leq t\leq T$. Rearranging the terms, we have the following inequality for all $0\leq t\leq T$,

\begin{align}
\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_t)-f^*\bigr) + \bigl(g(Z;\nu_t)-g^*\bigr)^2\Bigr] \leq -\eta^{-1}\cdot\frac{1}{2}\frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t} + C_*\cdot\alpha^{-1}. \tag{A.36}
\end{align}

This further implies the following upper bound,

\begin{align}
&\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_t)-f^*\bigr) + \bigl(g(Z;\nu_t)-g^*\bigr)^2\Bigr] \notag\\
&\quad\leq T^{-1}\cdot\int_0^T \mathbb{E}_{\mathcal{D}}\Bigl[\lambda\Psi\bigl(X,Z;f(\cdot;\mu_t)-f^*\bigr) + \bigl(g(Z;\nu_t)-g^*\bigr)^2\Bigr]\,\mathrm{d}t \notag\\
&\quad\leq 1/2\cdot\eta^{-1}\cdot T^{-1}\cdot\mathcal{W}_2(\rho_0,\rho^*)^2 + C_*\cdot\alpha^{-1} \notag\\
&\quad\leq 1/2\cdot\alpha^{-2}\cdot\bar{D}^2\cdot\eta^{-1}\cdot T^{-1} + C_*\cdot\alpha^{-1} = \mathcal{O}(T^{-1}+\alpha^{-1}), \tag{A.37}
\end{align}

where the first inequality holds since the infimum over $[0,T]$ is at most the time average, the second inequality comes from integrating (A.36) over $t\in[0,T]$, the third inequality comes from (ii) in Lemma 4.6, and the last equality comes from setting $\eta$ to $\alpha^{-2}$. Therefore, (A.2) implies Theorem 4.7 in this scenario.
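As a worked illustration of the integration step, the display below spells out how (A.36) leads to the second inequality in (A.37). It is a sketch under the assumptions already used in the text; the shorthand $\mathcal{E}_t$ for the expectation on the left-hand side of (A.36) is introduced here only for illustration and does not appear in the original.

```latex
% \mathcal{E}_t is hypothetical shorthand for the left-hand side of (A.36).
\begin{align*}
\int_0^T \mathcal{E}_t\,\mathrm{d}t
  &\leq -\frac{1}{2\eta}\int_0^T \frac{\mathrm{d}\mathcal{W}_2(\rho_t,\rho^*)^2}{\mathrm{d}t}\,\mathrm{d}t
     + C_*\alpha^{-1}\,T \\
  &= \frac{1}{2\eta}\Bigl(\mathcal{W}_2(\rho_0,\rho^*)^2 - \mathcal{W}_2(\rho_T,\rho^*)^2\Bigr)
     + C_*\alpha^{-1}\,T \\
  &\leq \frac{1}{2\eta}\,\mathcal{W}_2(\rho_0,\rho^*)^2 + C_*\alpha^{-1}\,T .
\end{align*}
```

Dividing by $T$ recovers the second inequality in (A.37). With $\eta = \alpha^{-2}$ and the bound $\mathcal{W}_2(\rho_0,\rho^*) \leq \alpha^{-1}\bar{D}$ used in the third inequality, the first term becomes $\bar{D}^2/(2T)$, which is where the $\mathcal{O}(T^{-1})$ part of the rate comes from.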

Based on the discussion of scenarios (i) and (ii) above, we finish the proof of Theorem 4.7. ∎

A.3 Proof of Theorem 4.9

Proof.

We now prove Theorem 4.9. For notational simplicity, we denote by $f_t = f(\cdot;\mu_t)$ the estimator at time $t$. Recall the definitions of $J(f)$ from (2.3) and $\bar{\delta}(z;f)$ from (2.2):

\begin{align*}
J(f) = \mathbb{E}_{\mathcal{D}}\bigl[1/2\cdot\bar{\delta}(Z;f)^2 + \lambda\cdot\Psi(X,Z;f)\bigr], \quad \bar{\delta}(z;f) = \mathbb{E}_{X|Z}\bigl[\Phi(X,Z;f)\,\big|\,Z=z\bigr].
\end{align*}

Plugging in the definition of $J(f)$, it holds that

\begin{align}
&\inf_{t\in[0,T]} J(f_t) - J(f^*) \notag\\
&\qquad= \inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Bigl[1/2\cdot\Bigl(\bar{\delta}(Z;f_t)^2 - \bar{\delta}(Z;f^*)^2\Bigr) + \lambda\Bigl(\Psi(X,Z;f_t) - \Psi(X,Z;f^*)\Bigr)\Bigr]. \tag{A.38}
\end{align}

Similar to the proof of Theorem 4.7, we define $t_*$ as

\begin{align*}
t_* = \inf\Bigl\{\tau\in\mathbb{R}_+ \,\Big|\, \mathcal{W}_2(\rho_\tau,\rho^*) > 2\mathcal{W}_2(\rho_0,\rho^*)\Bigr\}.
\end{align*}

We will upper-bound the term in (A.3) separately in two different scenarios, depending on the value of $t_*$ compared with $T$.

Scenario (i) If $t_*\leq T$, then we have that

\begin{align}
\inf_{t\in[0,T]} J(f_t) - J(f^*) \leq J(f_{t_*}) - J(f^*). \tag{A.39}
\end{align}

In order to upper-bound the right-hand side of (A.39), we need to uniformly upper-bound $f_{t_*}(w)$ and $f^*(w)$ for all $w\in\mathcal{W}$. For $f_{t_*}(w) = f(w;\mu_{t_*})$, we have that

\begin{align}
\sup_{w\in\mathcal{W}}\bigl|f(w;\mu_{t_*})\bigr|
&= \alpha\cdot\sup_{w\in\mathcal{W}}\Bigl|\int\phi(w;\theta)\,\mathrm{d}\mu_{t_*}(\theta)\Bigr| = \alpha\cdot\sup_{w\in\mathcal{W}}\Bigl|\int\phi(w;\theta)\,\mathrm{d}(\mu_{t_*}-\mu_0)(\theta)\Bigr| \notag\\
&\leq \alpha B_1\cdot\mathcal{W}_1(\mu_{t_*},\mu_0) \leq \alpha B_1\cdot\mathcal{W}_2(\mu_{t_*},\mu_0) \leq \alpha B_1\cdot\Bigl(\mathcal{W}_2(\rho_{t_*},\rho^*) + \mathcal{W}_2(\rho_0,\rho^*)\Bigr) \notag\\
&\leq 3B_1\cdot\bar{D} = \mathcal{O}(1), \tag{A.40}
\end{align}

where the first inequality follows from Lemma D.7 and the second inequality follows from Lemma D.2. The third inequality follows from the triangle inequality, together with the fact that the Wasserstein distance between the marginals $\mu_{t_*}$ and $\mu_0$ is at most that between $\rho_{t_*}$ and $\rho_0$. The last inequality follows from (ii) in Lemma 4.6 and the definition of $t_*$, which give $\mathcal{W}_2(\rho_{t_*},\rho^*) \leq 2\mathcal{W}_2(\rho_0,\rho^*) \leq 2\alpha^{-1}\bar{D}$. For $f^*$, a similar chain of inequalities applies,

\begin{align}
\sup_{w\in\mathcal{W}}\bigl|f(w;\mu^*)\bigr|
&= \alpha\cdot\sup_{w\in\mathcal{W}}\Bigl|\int\phi(w;\theta)\,\mathrm{d}\mu^*(\theta)\Bigr| = \alpha\cdot\sup_{w\in\mathcal{W}}\Bigl|\int\phi(w;\theta)\,\mathrm{d}(\mu^*-\mu_0)(\theta)\Bigr| \notag\\
&\leq \alpha B_1\cdot\mathcal{W}_1(\mu^*,\mu_0) \leq \alpha B_1\cdot\mathcal{W}_2(\mu^*,\mu_0) \leq \alpha B_1\cdot\mathcal{W}_2(\rho^*,\rho_0) \notag\\
&\leq B_1\cdot\bar{D} = \mathcal{O}(1). \tag{A.41}
\end{align}

With uniform bounds on $f_{t_{*}}$ and $f^{*}$, we are now ready to upper-bound $\inf_{t\in[0,T]} J(f_{t}) - J(f^{*})$ through upper-bounding $J(f_{t_{*}}) - J(f^{*})$,

\begin{align}
J(f_{t_{*}}) - J(f_{*})
&\le \mathbb{E}_{\mathcal{D}}\Big[\bar{\delta}(Z;f_{t_{*}})\cdot\mathbb{E}_{X|Z}\big[\widetilde{\Phi}(X,Z;f_{t_{*}}-f^{*})\,\big|\,Z\big] + \lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f_{t_{*}})}{\delta f}, f_{t_{*}}-f^{*}\Big\rangle_{L^{2}}\Big] \notag\\
&\le \Big(\sup_{x,z}|\Phi(x,z;0)| + C_{\Phi}\cdot\sup_{w\in\mathcal{W}}|f(w;\mu_{t_{*}})|\Big)\cdot\mathbb{E}_{\mathcal{D}}\Big[C_{\Phi}\cdot|f(W;\mu_{t_{*}})-f(W;\mu^{*})|\Big] \notag\\
&\qquad + \lambda C_{\Psi}\cdot\mathbb{E}_{\mathcal{D}}\Big[|f(W;\mu_{t_{*}})-f(W;\mu^{*})|\Big] \notag\\
&\le B_{*}\cdot\mathbb{E}_{\mathcal{D}}\Big[|f(W;\mu_{t_{*}})-f(W;\mu^{*})|\Big] \le B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\lambda|f(W;\mu_{t_{*}})-f(W;\mu^{*})|^{2}\Big]\Big)^{1/2} \notag\\
&\le B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t_{*}}-f^{*})\Big]\Big)^{1/2} \le B_{*}\cdot\alpha^{-1/2}, \tag{A.42}
\end{align}

where $B_{*} = B_{*}(\Phi, c_{\phi}, C_{\Phi}, C_{\Psi}, \lambda, C, B_{1}, \bar{D}, C_{*}) > 0$ is a constant whose value may change from line to line. The second inequality follows from (A.3) and (A.3). The last inequality follows from (A.2) in the proof of Theorem 4.7. Therefore, in this scenario, we have that

\begin{align}
\inf_{t\in[0,T]} J(f_{t}) - J(f_{*}) \le J(f_{t_{*}}) - J(f_{*}) \le \mathcal{O}(T^{-1/2}+\alpha^{-1/2}). \tag{A.43}
\end{align}

Equation (A.43) concludes the proof of Theorem 4.9 in the scenario of $t_{*}\le T$.
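As a reading aid, the fourth inequality in (A.42) can be spelled out as follows. The step uses only Jensen's inequality; the shorthand $h$ is introduced here for illustration and is not notation from the paper.

```latex
% Illustrative shorthand (an assumption of this note): h(W) = f(W;\mu_{t_*}) - f(W;\mu^*), with \lambda > 0.
\mathbb{E}_{\mathcal{D}}\bigl[|h(W)|\bigr]
  \le \Bigl(\mathbb{E}_{\mathcal{D}}\bigl[|h(W)|^{2}\bigr]\Bigr)^{1/2}
  = \lambda^{-1/2}\cdot\Bigl(\mathbb{E}_{\mathcal{D}}\bigl[\lambda\,|h(W)|^{2}\bigr]\Bigr)^{1/2}.
```

The factor $\lambda^{-1/2}$ is then absorbed into $B_{*}$, which is consistent with $B_{*}$ depending on $\lambda$ and changing from line to line.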

Scenario (ii) If $t_{*}>T$, by definition of $t_{*}$, we have that

\begin{align*}
\mathcal{W}_{2}(\mu_{t},\mu^{*}) \le \mathcal{W}_{2}(\rho_{t},\rho^{*}) \le 2\mathcal{W}_{2}(\rho_{0},\rho^{*}) = 2\alpha^{-1}\cdot\bar{D}, \quad \forall\; t\in[0,T].
\end{align*}

Following the same arguments as in (A.3) and (A.3), we obtain a uniform upper bound on $f_{t}$ for all $t\in[0,T]$ and on $f^{*}$,

\begin{align*}
\sup_{w\in\mathcal{W}} \bigl|f(w;\mu_{t})\bigr| + \bigl|f(w;\mu^{*})\bigr| \le 4B_{1}\cdot\bar{D} = \mathcal{O}(1), \quad \forall\; t\in[0,T].
\end{align*}

Following the same derivation as in (A.3), we have that

\begin{align}
\inf_{t\in[0,T]} J(f_{t}) - J(f_{*})
&\le B_{*}\cdot\inf_{t\in[0,T]} B_{*}\cdot\Big(\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\Big)^{1/2} \notag\\
&\le B_{*}\cdot\Big(\inf_{t\in[0,T]}\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\Big)^{1/2} \notag\\
&\le B_{*}\cdot\Bigl(T^{-1}\cdot\int_{0}^{T}\mathbb{E}_{\mathcal{D}}\Big[\Psi(X,Z;f_{t}-f^{*})\Big]\,\mathrm{d}t\Bigr)^{1/2} \notag\\
&\le B_{*}\cdot\sqrt{1/2\cdot\bar{D}^{2}\cdot T^{-1} + C_{*}\cdot\alpha^{-1}} = \mathcal{O}(T^{-1/2}+\alpha^{-1/2}), \tag{A.44}
\end{align}

where the last inequality follows from (A.36) and (A.2) in the proof of Theorem 4.7. Equation (A.3) concludes the proof of Theorem 4.9 in the scenario of $t_{*}>T$.
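Two elementary facts sit behind the middle steps of the bound in scenario (ii), and a short worked version may help. The symbols $u$, $a$ and $b$ below are illustrative shorthand, not notation from the paper.

```latex
% For a nonnegative integrable u on [0,T] and scalars a, b >= 0:
\inf_{t\in[0,T]} u(t) \;\le\; \frac{1}{T}\int_{0}^{T} u(t)\,\mathrm{d}t,
\qquad
\sqrt{a+b} \;\le\; \sqrt{a}+\sqrt{b}.
% Taking u(t) = \mathbb{E}_{\mathcal{D}}[\Psi(X,Z;f_t - f^*)], a = \bar{D}^2/(2T), b = C_*/\alpha:
\sqrt{\tfrac{1}{2}\,\bar{D}^{2}\,T^{-1} + C_{*}\,\alpha^{-1}}
  \;\le\; \tfrac{1}{\sqrt{2}}\,\bar{D}\,T^{-1/2} + \sqrt{C_{*}}\,\alpha^{-1/2}.
```

The first fact replaces the infimum by the time average; the second turns the square root of a sum into the stated $\mathcal{O}(T^{-1/2}+\alpha^{-1/2})$ rate. Moving the infimum inside the square root is justified because $s\mapsto\sqrt{s}$ is nondecreasing on $[0,\infty)$.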

Based on the discussion of scenarios (i) and (ii) above, we finish the proof of Theorem 4.9. ∎

Appendix B Mean Field Limit of Neural Networks

In this section, we prove Proposition 4.4. The formal version is presented as follows. Let $\rho_{t}(\theta,\omega) = \mu_{t}(\theta)\otimes\nu_{t}(\omega)$, where $(\mu_{t},\nu_{t})$ is the PDE solution in (3.7), and let $\widehat{\rho}_{k}(\theta,\omega) = N^{-1}\cdot\sum_{i=1}^{N}\delta_{\theta_{k}^{i}}(\theta)\cdot\delta_{\omega_{k}^{i}}(\omega)$ be the empirical distribution of $(\bm{\theta}_{k},\bm{\omega}_{k})$.
Here we omit the dependence of the empirical distribution $\widehat{\rho}_{k}$ on $N$ and the stepsize scale $\epsilon$ for notational simplicity.

Proposition B.1 (Formal Version of Proposition 4.4).

Let $h:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ be any continuous function such that $\|h\|_{\infty}\le 1$ and $\operatorname{Lip}(h)\le 1$. Under Assumptions 4.1 and 4.3, with probability at least $1-5\delta$, it holds that

\begin{align*}
\sup_{\substack{k\le T/\epsilon\\ (k\in\mathbb{N})}} \left|\int h(\theta,\omega)\,\mathrm{d}\rho_{k\epsilon}(\theta,\omega) - \int h(\theta,\omega)\,\mathrm{d}\widehat{\rho}_{k}(\theta,\omega)\right| \le B\cdot e^{BT}\cdot\Bigl(\sqrt{\log(N/\delta)/N} + \sqrt{\epsilon\cdot(D+\log(N/\delta))}\Bigr).
\end{align*}

Here $B$ is a constant that depends on $\alpha,\eta,\lambda,B_{0},B_{1}$ and $B_{2}$.
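To get a feel for how the bound in Proposition B.1 scales, the sketch below evaluates only the bracketed factor $\sqrt{\log(N/\delta)/N}+\sqrt{\epsilon\cdot(D+\log(N/\delta))}$. The prefactor $B\cdot e^{BT}$ is left out because the value of $B$ is not specified here; the function name and the sample values of $N$, $\epsilon$, $D$ and $\delta$ are illustrative assumptions.

```python
import math


def mean_field_error_scale(N, eps, D, delta):
    """Bracketed factor of the bound in Proposition B.1.

    Returns sqrt(log(N/delta)/N) + sqrt(eps * (D + log(N/delta))).
    The prefactor B * exp(B * T) is deliberately omitted (B is unspecified).
    """
    log_term = math.log(N / delta)
    width_term = math.sqrt(log_term / N)          # finite-width error
    step_term = math.sqrt(eps * (D + log_term))   # discretization error
    return width_term + step_term


if __name__ == "__main__":
    # Hypothetical settings: widen the network and shrink the stepsize together.
    for N, eps in [(10**3, 1e-3), (10**4, 1e-4), (10**5, 1e-5)]:
        print(N, eps, mean_field_error_scale(N, eps, D=10, delta=0.01))
```

Both terms shrink only if the width $N$ grows and the stepsize scale $\epsilon$ decreases; fixing either one leaves a floor on the bracketed factor.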

The proof of Proposition B.1 is based heavily on Mei et al. (2018, 2019); Araújo et al. (2019); Zhang et al. (2020), which make use of the propagation of chaos arguments in Sznitman (1991). Recall that $(v^{f}(\cdot;\rho), v^{g}(\cdot;\rho))$ is the vector field defined as,

\begin{align}
v^{f}(\theta;\rho) &= \alpha\,\mathbb{E}_{\mathcal{D}}\Bigl[-g(Z;\rho)\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\rho))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}} - \lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\rho))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\Bigr], \notag\\
v^{g}(\omega;\rho) &= \alpha\,\mathbb{E}_{\mathcal{D}}\Bigl[\Phi(X,Z;f(\cdot;\rho))\cdot\nabla_{\omega}\psi(Z;\omega) - g(Z;\rho)\cdot\nabla_{\omega}\psi(Z;\omega)\Bigr]. \tag{B.1}
\end{align}

From now on, we equivalently write $\theta_{k}^{i}=\theta_{i}(k)$, $\omega_{k}^{i}=\omega_{i}(k)$ to emphasize the dependence on iterations. For brevity, we denote $\theta^{(N)}(k)=\{\theta_{i}(k)\}_{i=1}^{N}$ and $\omega^{(N)}(k)=\{\omega_{i}(k)\}_{i=1}^{N}$. Recall that the finite-width representations $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are

\begin{align*}
f(\cdot;\theta^{(N)}) = \frac{\alpha}{N}\cdot\sum_{i=1}^{N}\phi(\cdot;\theta_{i}), \qquad g(\cdot;\omega^{(N)}) = \frac{\alpha}{N}\cdot\sum_{i=1}^{N}\psi(\cdot;\omega_{i}).
\end{align*}
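The finite-width representation above can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the neuron $\tanh(\langle x,\theta\rangle)$, the Gaussian initialization, and the names `neuron` and `mean_field_net` are assumptions made for this example.

```python
import numpy as np


def neuron(x, params):
    """Hypothetical neuron phi(x; theta_i) = tanh(<x, theta_i>) for every row of params.

    x has shape (d,), params has shape (N, d); the result has shape (N,).
    """
    return np.tanh(params @ x)


def mean_field_net(x, params, alpha):
    """Finite-width network (alpha / N) * sum_i phi(x; theta_i)."""
    return alpha * neuron(x, params).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, N, alpha = 3, 1000, 5.0
    thetas = rng.standard_normal((N, d))   # assumed i.i.d. initialization
    x = np.array([0.5, -1.0, 2.0])
    print(mean_field_net(x, thetas, alpha))
```

With an odd activation and a symmetric initialization, as assumed in this sketch, the neurons average to roughly zero at initialization, so the output is small relative to $\alpha$. This mirrors the identity $\int\phi(w;\theta)\,\mathrm{d}\mu_{0}(\theta)=0$ that is implicit in the second equality of (A.41).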

Correspondingly, we define the finite-width counterparts of $v^{f}$ and $v^{g}$ as follows,

\begin{align}
\widehat{v}^{f}(\theta;\theta^{(N)},\omega^{(N)}) &= \alpha\,\mathbb{E}_{\mathcal{D}}\bigg[-g(Z;\omega^{(N)})\cdot\Big\langle\frac{\delta\Phi(X,Z;f(\cdot;\theta^{(N)}))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}} \notag\\
&\quad\quad - \lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;f(\cdot;\theta^{(N)}))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\bigg], \notag\\
\widehat{v}^{g}(\omega;\theta^{(N)},\omega^{(N)}) &= \alpha\,\mathbb{E}_{\mathcal{D}}\bigg[\Phi(X,Z;f(\cdot;\theta^{(N)}))\cdot\nabla_{\omega}\psi(Z;\omega) - g(Z;\omega^{(N)})\cdot\nabla_{\omega}\psi(Z;\omega)\bigg]. \tag{B.2}
\end{align}
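To make (B.2) concrete, the sketch below estimates both vector fields by a Monte Carlo average over a batch of samples. It relies on assumptions that are not taken from the paper: the instance $\Phi(X,Z;f)=f(X)-Y$ with a response $Y$ carried alongside $X$, the regularizer $\Psi(X,Z;f)=\tfrac{1}{2}f(X)^{2}$, and $\tanh$ neurons for both networks. Under these choices the two pairings reduce to $\nabla_{\theta}\phi(X;\theta)$ and $f(X)\cdot\nabla_{\theta}\phi(X;\theta)$. All function names are hypothetical.

```python
import numpy as np


def net(inputs, params, alpha):
    """(alpha / N) * sum_i tanh(<x, p_i>) for each row x; inputs (n, d), params (N, d)."""
    return alpha * np.tanh(inputs @ params.T).mean(axis=1)


def grad_neuron(inputs, p):
    """Gradient of tanh(<x, p>) with respect to p, for each row x; shape (n, d)."""
    slope = 1.0 - np.tanh(inputs @ p) ** 2
    return slope[:, None] * inputs


def vector_fields(theta, omega, thetas, omegas, X, Y, Z, alpha, lam):
    """Batch estimate of (v^f(theta), v^g(omega)) for the assumed Phi and Psi."""
    f_vals = net(X, thetas, alpha)          # f(X; theta^(N))
    g_vals = net(Z, omegas, alpha)          # g(Z; omega^(N))
    phi_vals = f_vals - Y                   # assumed Phi(X, Z; f) = f(X) - Y
    grad_phi = grad_neuron(X, theta)        # pairing with delta Phi / delta f
    grad_psi = grad_neuron(Z, omega)
    # v^f: -g * grad_phi - lam * f * grad_phi, averaged over the batch
    vf = alpha * np.mean((-g_vals - lam * f_vals)[:, None] * grad_phi, axis=0)
    # v^g: (Phi - g) * grad_psi, averaged over the batch
    vg = alpha * np.mean((phi_vals - g_vals)[:, None] * grad_psi, axis=0)
    return vf, vg


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, N, dx, dz = 256, 100, 3, 2
    X = rng.standard_normal((n, dx))
    Z = rng.standard_normal((n, dz))
    Y = rng.standard_normal(n)
    thetas = rng.standard_normal((N, dx))
    omegas = rng.standard_normal((N, dz))
    vf, vg = vector_fields(thetas[0], omegas[0], thetas, omegas, X, Y, Z, alpha=2.0, lam=0.1)
    print(vf.shape, vg.shape)
```

Using a batch of size one in place of the average gives a single-sample version of the same quantities, which is the form a stochastic update would use.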

We also define their stochastic counterparts,

\begin{align}
\widehat{V}_{k}^{f}(\theta;\theta^{(N)},\omega^{(N)}) &= \alpha\bigg[-g(z_{k};\omega^{(N)})\cdot\Big\langle\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}} \notag\\
&\quad\quad - \lambda\cdot\Big\langle\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}, \nabla_{\theta}\phi(\cdot;\theta)\Big\rangle_{L^{2}}\bigg], \notag\\
\widehat{V}_{k}^{g}(\omega;\theta^{(N)},\omega^{(N)}) &= \alpha\Big(\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))\cdot\nabla_{\omega}\psi(z_{k};\omega) - g(z_{k};\omega^{(N)})\cdot\nabla_{\omega}\psi(z_{k};\omega)\Big), \tag{B.3}
\end{align}

where $(x_{k},z_{k})\sim\mathcal{D}$. Following Mei et al. (2019) and Araújo et al. (2019), we consider the following four dynamics.

  • Stochastic Gradient Descent Ascent (SGDA). We consider the following SGDA dynamics for $\theta^{(N)}(k)$ and $\omega^{(N)}(k)$, where $k\in\mathbb{N}$, with $\theta_{i}(0)\stackrel{\mathrm{i.i.d.}}{\sim}\mu_{0}$, $\omega_{i}(0)\stackrel{\mathrm{i.i.d.}}{\sim}\nu_{0}$ $(i\in[N])$ as their initialization,

    \begin{align}
    \theta_{i}(k+1) &= \theta_{i}(k) + \eta\epsilon\cdot\widehat{V}_{k}^{f}(\theta_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)), \notag\\
    \omega_{i}(k+1) &= \omega_{i}(k) + \eta\epsilon\cdot\widehat{V}_{k}^{g}(\omega_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)). \tag{B.4}
    \end{align}

    Note that these dynamics are equivalent to (3).

  • Population Gradient Descent Ascent (PGDA). We consider the following population gradient descent ascent dynamics for $\breve{\theta}^{(N)}(k)$ and $\breve{\omega}^{(N)}(k)$, where $k\in\mathbb{N}$, with $\breve{\theta}_i(0)=\theta_i(0)$, $\breve{\omega}_i(0)=\omega_i(0)$ $(i\in[N])$ as initialization,

    \begin{align}
    \breve{\theta}_i(k+1) &= \breve{\theta}_i(k)+\eta\epsilon\cdot\widehat{v}^f\bigl(\breve{\theta}_i(k);\breve{\theta}^{(N)}(k),\breve{\omega}^{(N)}(k)\bigr), \notag\\
    \breve{\omega}_i(k+1) &= \breve{\omega}_i(k)+\eta\epsilon\cdot\widehat{v}^g\bigl(\breve{\omega}_i(k);\breve{\theta}^{(N)}(k),\breve{\omega}^{(N)}(k)\bigr). \tag{B.5}
    \end{align}
  • Continuous-time Population Gradient Descent Ascent (CTPGDA). We consider the following continuous-time population gradient descent ascent dynamics for $\widetilde{\theta}^{(N)}(t)$ and $\widetilde{\omega}^{(N)}(t)$, where $t\in\mathbb{R}_+$, with $\widetilde{\theta}_i(0)=\theta_i(0)$, $\widetilde{\omega}_i(0)=\omega_i(0)$ $(i\in[N])$ as initialization,

    \begin{align}
    \frac{\mathrm{d}}{\mathrm{d}t}\widetilde{\theta}_i(t) &= \eta\cdot\widehat{v}^f\bigl(\widetilde{\theta}_i(t);\widetilde{\theta}^{(N)}(t),\widetilde{\omega}^{(N)}(t)\bigr), \notag\\
    \frac{\mathrm{d}}{\mathrm{d}t}\widetilde{\omega}_i(t) &= \eta\cdot\widehat{v}^g\bigl(\widetilde{\omega}_i(t);\widetilde{\theta}^{(N)}(t),\widetilde{\omega}^{(N)}(t)\bigr). \tag{B.6}
    \end{align}
  • Ideal particle (IP). We consider the following ideal particle dynamics for $\bar{\theta}^{(N)}(t)$ and $\bar{\omega}^{(N)}(t)$, where $t\in\mathbb{R}_+$, with $\bar{\theta}_i(0)=\theta_i(0)$, $\bar{\omega}_i(0)=\omega_i(0)$ $(i\in[N])$ as initialization,

    \begin{align}
    \frac{\mathrm{d}}{\mathrm{d}t}\bar{\theta}_i(t)=\eta\cdot v^f\bigl(\bar{\theta}_i(t);\rho_t\bigr),\qquad \frac{\mathrm{d}}{\mathrm{d}t}\bar{\omega}_i(t)=\eta\cdot v^g\bigl(\bar{\omega}_i(t);\rho_t\bigr). \tag{B.7}
    \end{align}
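
To make the difference between the stochastic update (B.4) and the population update (B.5) concrete, the following is a minimal sketch on a toy one-dimensional particle system. The velocity fields `population_velocity` and `stochastic_velocity`, the noise model with $\mathbb{E}[z]=1$, and all numerical values are assumptions chosen for illustration; they are not the paper's $\widehat{v}^f,\widehat{v}^g,\widehat{V}_k^f,\widehat{V}_k^g$.

```python
import numpy as np

def population_velocity(theta, omega):
    """Toy stand-in for the population fields: each particle feels itself and the ensemble mean."""
    return -(theta + omega.mean()), theta.mean() - omega

def stochastic_velocity(theta, omega, z):
    """Toy stand-in for the stochastic fields: the coupling is scaled by a sample z with E[z] = 1."""
    return -(theta + z * omega.mean()), z * theta.mean() - omega

def run(theta0, omega0, eta, eps, steps, noise_std, rng):
    """Run the SGDA-style and PGDA-style updates from the same initialization."""
    th, om = theta0.copy(), omega0.copy()      # SGDA-style particles
    th_b, om_b = theta0.copy(), omega0.copy()  # PGDA-style particles
    for _ in range(steps):
        z = 1.0 + noise_std * rng.standard_normal()  # one fresh sample per step
        vf, vg = stochastic_velocity(th, om, z)
        th, om = th + eta * eps * vf, om + eta * eps * vg
        vf_b, vg_b = population_velocity(th_b, om_b)
        th_b, om_b = th_b + eta * eps * vf_b, om_b + eta * eps * vg_b
    # Largest Euclidean distance between paired particles (the sup norm over particles).
    return float(np.max(np.hypot(th - th_b, om - om_b)))

rng = np.random.default_rng(0)
N = 100
theta0, omega0 = rng.standard_normal(N), rng.standard_normal(N)
gap_noisy = run(theta0, omega0, eta=1.0, eps=0.01, steps=100, noise_std=1.0, rng=rng)
gap_clean = run(theta0, omega0, eta=1.0, eps=0.01, steps=100, noise_std=0.0, rng=rng)
print(gap_noisy, gap_clean)
```

With the noise switched off the two trajectories coincide, so the gap between the two particle systems in this toy model is driven entirely by the sampling noise in the velocity field.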

We aim to prove that $\widehat{\rho}_k=N^{-1}\cdot\sum_{i=1}^N\delta_{\theta_i(k)}\cdot\delta_{\omega_i(k)}$ converges weakly to $\rho_{k\epsilon}$. For any continuous function $h$ that satisfies the assumptions of Proposition B.1, using the IP, CTPGDA, and PGDA dynamics as interpolating dynamics, we have

\begin{align}
&\overbrace{\Bigl|\int h(\theta,\omega)\,\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-\int h(\theta,\omega)\,\mathrm{d}\widehat{\rho}_k(\theta,\omega)\Bigr|}^{\mathrm{PDE}-\mathrm{SGDA}} \notag\\
&\qquad\le\underbrace{\Bigl|\int h(\theta,\omega)\,\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-N^{-1}\cdot\sum_{i=1}^N h\bigl(\bar{\theta}_i(k\epsilon),\bar{\omega}_i(k\epsilon)\bigr)\Bigr|}_{\mathrm{PDE}-\mathrm{IP}}+\underbrace{\Bigl\|(\bar{\theta},\bar{\omega})^{(N)}(k\epsilon)-(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)\Bigr\|_{(N)}}_{\mathrm{IP}-\mathrm{CTPGDA}} \notag\\
&\qquad\quad+\underbrace{\Bigl\|(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)-(\breve{\theta},\breve{\omega})^{(N)}(k)\Bigr\|_{(N)}}_{\mathrm{CTPGDA}-\mathrm{PGDA}}+\underbrace{\Bigl\|(\breve{\theta},\breve{\omega})^{(N)}(k)-(\theta,\omega)^{(N)}(k)\Bigr\|_{(N)}}_{\mathrm{PGDA}-\mathrm{SGDA}}. \tag{B.8}
\end{align}

The inequality follows from the triangle inequality and the fact that $\operatorname{Lip}(h)\le 1$. Here the norm $\|\cdot\|_{(N)}$ denotes the supremum norm over the sequence of vectors $(\theta,\omega)^{(N)}=\{(\theta_i,\omega_i)\}_{i=1}^N$,

\begin{align}
\Bigl\|(\theta,\omega)^{(N)}\Bigr\|_{(N)}=\sup_{i\in[N]}\Bigl\|(\theta_i,\omega_i)\Bigr\|. \tag{B.9}
\end{align}
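
As a brief worked step, the reduction from differences of empirical averages to the $(N)$-norm can be written out as follows. This is a sketch assuming that $\operatorname{Lip}(h)\le 1$ is measured in the same norm $\|\cdot\|$ on $(\theta,\omega)$ as in the definition of $\|\cdot\|_{(N)}$; the primed particles $(\theta_i',\omega_i')$ are hypothetical notation for a second particle system.

```latex
\begin{align*}
\Bigl|N^{-1}\sum_{i=1}^N h(\theta_i,\omega_i)-N^{-1}\sum_{i=1}^N h(\theta_i',\omega_i')\Bigr|
&\le N^{-1}\sum_{i=1}^N\bigl|h(\theta_i,\omega_i)-h(\theta_i',\omega_i')\bigr| \\
&\le N^{-1}\sum_{i=1}^N\bigl\|(\theta_i,\omega_i)-(\theta_i',\omega_i')\bigr\| \\
&\le \sup_{i\in[N]}\bigl\|(\theta_i,\omega_i)-(\theta_i',\omega_i')\bigr\|
=\Bigl\|(\theta,\omega)^{(N)}-(\theta',\omega')^{(N)}\Bigr\|_{(N)}.
\end{align*}
```

Applying this bound to each consecutive pair of particle systems (IP and CTPGDA, CTPGDA and PGDA, PGDA and SGDA) gives the three norm terms in the decomposition.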

In what follows, $B>0$ denotes a constant whose value may vary from line to line. We establish the following lemmas, which upper bound the four terms on the right-hand side of (B).

Lemma B.2 (Upper Bound of $\mathrm{PDE}-\mathrm{IP}$).

Under Assumptions 4.1 and 4.3, with probability at least $1-\delta$, it holds that

\begin{align}
\sup_{t\in[0,T]}\Bigl|\int h(\theta,\omega)\,\mathrm{d}\rho_t(\theta,\omega)-N^{-1}\sum_{i=1}^N h\bigl(\bar{\theta}_i(t),\bar{\omega}_i(t)\bigr)\Bigr|\le B\cdot\sqrt{\log(NT/\delta)/N}. \tag{B.10}
\end{align}
Lemma B.3 (Upper Bound of $\mathrm{IP}-\mathrm{CTPGDA}$).

Under Assumptions 4.1 and 4.3, with probability at least $1-2\delta$, it holds that

\begin{align}
\sup_{t\in[0,T]}\Bigl\|(\bar{\theta},\bar{\omega})^{(N)}(t)-(\widetilde{\theta},\widetilde{\omega})^{(N)}(t)\Bigr\|_{(N)}\le B\cdot e^{BT}\cdot\sqrt{\log(N/\delta)/N}. \tag{B.11}
\end{align}
Lemma B.4 (Upper Bound of $\mathrm{CTPGDA}-\mathrm{PGDA}$).

Under Assumptions 4.1 and 4.3, it holds that

\begin{align}
\sup_{k\le T/\epsilon}\Bigl\|(\widetilde{\theta},\widetilde{\omega})^{(N)}(k\epsilon)-(\breve{\theta},\breve{\omega})^{(N)}(k)\Bigr\|_{(N)}\le B\cdot e^{BT}\cdot\epsilon. \tag{B.12}
\end{align}
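
The linear-in-$\epsilon$ scaling in Lemma B.4 is the familiar global error of a forward Euler discretization. The sketch below checks this numerically on a toy particle system; the velocity field `velocity`, the use of a very fine Euler run as a proxy for the continuous-time flow, and all parameter values are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def velocity(theta, omega):
    """Toy population velocity field (an assumption, not the paper's fields)."""
    return -(theta + omega.mean()), theta.mean() - omega

def euler(theta0, omega0, eta, eps, T):
    """Iterate gamma(k+1) = gamma(k) + eta * eps * v(gamma(k)) for T / eps steps."""
    theta, omega = theta0.copy(), omega0.copy()
    for _ in range(int(round(T / eps))):
        vf, vg = velocity(theta, omega)
        theta, omega = theta + eta * eps * vf, omega + eta * eps * vg
    return theta, omega

def n_norm(theta, omega):
    """Sup norm over particles: largest Euclidean norm of (theta_i, omega_i)."""
    return float(np.max(np.hypot(theta, omega)))

rng = np.random.default_rng(0)
theta0, omega0 = rng.standard_normal(50), rng.standard_normal(50)
eta, T = 1.0, 1.0

# A very fine Euler run stands in for the continuous-time dynamics.
ref_theta, ref_omega = euler(theta0, omega0, eta, 2.0 ** -14, T)

gaps = {}
for eps in (2.0 ** -4, 2.0 ** -5, 2.0 ** -6):
    th, om = euler(theta0, omega0, eta, eps, T)
    gaps[eps] = n_norm(th - ref_theta, om - ref_omega)
    print(f"eps = {eps:.5f}   gap = {gaps[eps]:.3e}")
```

In this toy setting, halving `eps` should roughly halve the gap at time `T`, which is the behavior one would expect from a bound of the form $B\cdot e^{BT}\cdot\epsilon$.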
Lemma B.5 (Upper Bound of $\mathrm{PGDA}-\mathrm{SGDA}$).

Under Assumptions 4.1 and 4.3, with probability at least $1-2\delta$, it holds that

\begin{align}
\sup_{k\le T/\epsilon}\Bigl\|(\breve{\theta},\breve{\omega})^{(N)}(k)-(\theta,\omega)^{(N)}(k)\Bigr\|_{(N)}\le B\cdot e^{BT}\cdot\sqrt{\epsilon\cdot\bigl(D+\log(N/\delta)\bigr)}. \tag{B.13}
\end{align}

With these lemmas, we are now ready to present the proof of Proposition B.1.

Proof.

See §B.1.1, B.1.2, B.1.3, and B.1.4 for detailed proofs of Lemmas B.2 to B.5.

Plugging (B.10), (B.11), (B.12), and (B.13) into (B) and conditioning on the intersection of the events in Lemmas B.2, B.3, B.4, and B.5, we have that

\begin{align*}
\Bigl|\int h(\theta,\omega)\,\mathrm{d}\rho_{k\epsilon}(\theta,\omega)-\int h(\theta,\omega)\,\mathrm{d}\widehat{\rho}_k(\theta,\omega)\Bigr|\le B\cdot e^{BT}\cdot\Bigl(\sqrt{\log(N/\delta)/N}+\sqrt{\epsilon\cdot\bigl(D+\log(N/\delta)\bigr)}\Bigr),
\end{align*}

with probability at least $1-5\delta$. Thus, we complete the proof of Proposition B.1. ∎
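
The probability count in this last step can be made explicit. The following is a brief sketch; the event names $\mathcal{E}_{\mathrm{B.2}}$, $\mathcal{E}_{\mathrm{B.3}}$, and $\mathcal{E}_{\mathrm{B.5}}$ are hypothetical labels for the events on which the bounds of Lemmas B.2, B.3, and B.5 hold. Lemma B.4 is deterministic and contributes no failure probability.

```latex
\begin{align*}
\mathbb{P}\bigl(\mathcal{E}_{\mathrm{B.2}}\cap\mathcal{E}_{\mathrm{B.3}}\cap\mathcal{E}_{\mathrm{B.5}}\bigr)
&\ge 1-\mathbb{P}\bigl(\mathcal{E}_{\mathrm{B.2}}^{c}\bigr)-\mathbb{P}\bigl(\mathcal{E}_{\mathrm{B.3}}^{c}\bigr)-\mathbb{P}\bigl(\mathcal{E}_{\mathrm{B.5}}^{c}\bigr) \\
&\ge 1-\delta-2\delta-2\delta=1-5\delta.
\end{align*}
```

On this intersection, the $\epsilon$ term from Lemma B.4 can be absorbed into the $\sqrt{\epsilon\cdot(D+\log(N/\delta))}$ term whenever $\epsilon\le D+\log(N/\delta)$, which holds, for instance, under the assumption that $\epsilon\le 1\le D$.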

B.1 Proofs of Lemmas B.2-B.5

In this section, we present the proofs of Lemmas B.2-B.5, which are based heavily on Mei et al. (2018, 2019); Araújo et al. (2019); Zhang et al. (2020). The required supporting technical lemmas are in §C. For notational simplicity, $B$ denotes a positive constant whose value varies from line to line.

B.1.1 Proof of Lemma B.2

Proof.

We first consider the ideal particle dynamics in (B.7). It holds that $\bar{\theta}_i(t)\sim\mu_t$, $\bar{\omega}_i(t)\sim\nu_t$ $(i\in[N])$ (Proposition 8.1.8 in Ambrosio et al. (2008)). Since the randomness of $\bar{\theta}_i(t)$ and $\bar{\omega}_i(t)$ comes from $\theta_i(0)$ and $\omega_i(0)$, respectively, while $\theta_i(0)$ and $\omega_i(0)$ $(i\in[N])$ are independent, we have $\bar{\theta}_i(t)\stackrel{\mathrm{i.i.d.}}{\sim}\mu_t$, $\bar{\omega}_i(t)\stackrel{\mathrm{i.i.d.}}{\sim}\nu_t$ $(i\in[N])$. Due to the independence of $\bar{\theta}_i(t)$ and $\bar{\omega}_i(t)$, we also have $(\bar{\theta}_i(t),\bar{\omega}_i(t))\stackrel{\mathrm{i.i.d.}}{\sim}\rho_t$ $(i\in[N])$. This implies the following,

\begin{align*}
\mathbb{E}_{\rho_t}\Bigl[N^{-1}\cdot\sum_{i=1}^N h\bigl(\bar{\theta}_i(t),\bar{\omega}_i(t)\bigr)\Bigr]=\int h(\theta,\omega)\,\mathrm{d}\rho_t(\theta,\omega).
\end{align*}

For notational simplicity, we denote $\gamma_i=(\theta_i,\omega_i)$; similar notation applies to $\bar{\gamma}_i,\widetilde{\gamma}_i,\breve{\gamma}_i$. Let $\gamma^{1,(N)}=\{\gamma_1,\dots,\gamma_i^1,\dots,\gamma_N\}$ and $\gamma^{2,(N)}=\{\gamma_1,\dots,\gamma_i^2,\dots,\gamma_N\}$ be two sets of variables that differ only in the $i$-th element. Then, by the assumption that $\|h\|_\infty\le 1$, we have the following bounded difference property,

\begin{align*}
\Bigl|N^{-1}\sum_{j=1}^N h(\gamma_j^1)-N^{-1}\sum_{j=1}^N h(\gamma_j^2)\Bigr|=N^{-1}\cdot\bigl|h(\gamma_i^1)-h(\gamma_i^2)\bigr|\le 2/N.
\end{align*}

Applying McDiarmid's inequality (Wainwright, 2019), we have for a fixed $t\in[0,T]$ that

\begin{align}
\mathbb{P}\Bigl(\Bigl|N^{-1}\sum_{i=1}^N h\bigl(\bar{\gamma}_i(t)\bigr)-\int h(\gamma)\,\mathrm{d}\rho_t(\gamma)\Bigr|\ge p\Bigr)\le\exp\bigl(-Np^2/4\bigr). \tag{B.14}
\end{align}
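
As a numerical sanity check of the tail bound $\exp(-Np^2/4)$, the following Monte Carlo sketch compares it with an empirical tail frequency. The choices $h=\tanh$ (bounded by 1 and 1-Lipschitz), a standard Gaussian in place of $\rho_t$, and the values of `N`, `trials`, and `p` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, p = 100, 20000, 0.2

# Hypothetical test function h = tanh and a standard Gaussian standing in for rho_t.
# By symmetry the population mean of tanh under this law is 0.
samples = rng.standard_normal((trials, N))
deviation = np.abs(np.tanh(samples).mean(axis=1))

empirical_tail = float((deviation >= p).mean())  # fraction of trials deviating by at least p
bound = float(np.exp(-N * p ** 2 / 4))           # right-hand side of the tail bound

print(f"empirical tail = {empirical_tail:.4f}, bound = {bound:.4f}")
```

In this toy setting the empirical frequency should sit well below the bound, which is typical of concentration inequalities that hold uniformly over all bounded test functions.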

Moreover, we have for any $s,t\in[0,T]$ that

\begin{align*}
&\biggl|\,\Bigl|N^{-1}\sum_{i=1}^N h\bigl(\bar{\gamma}_i(t)\bigr)-\int h(\gamma)\,\mathrm{d}\rho_t(\gamma)\Bigr|-\Bigl|N^{-1}\sum_{i=1}^N h\bigl(\bar{\gamma}_i(s)\bigr)-\int h(\gamma)\,\mathrm{d}\rho_s(\gamma)\Bigr|\,\biggr| \\
&\qquad\le\Bigl|N^{-1}\sum_{i=1}^N h\bigl(\bar{\gamma}_i(t)\bigr)-N^{-1}\sum_{i=1}^N h\bigl(\bar{\gamma}_i(s)\bigr)\Bigr|+\Bigl|\int h(\gamma)\,\mathrm{d}\rho_t(\gamma)-\int h(\gamma)\,\mathrm{d}\rho_s(\gamma)\Bigr| \\
&\qquad\le\Bigl\|\bar{\gamma}^{(N)}(t)-\bar{\gamma}^{(N)}(s)\Bigr\|_{(N)}+\mathcal{W}_1(\rho_t,\rho_s)\le\Bigl\|\bar{\gamma}^{(N)}(t)-\bar{\gamma}^{(N)}(s)\Bigr\|_{(N)}+\mathcal{W}_2(\rho_t,\rho_s) \\
&\qquad\le\Bigl\|\bar{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(s)\Bigr\|_{(N)}+\Bigl\|\bar{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(s)\Bigr\|_{(N)}+\mathcal{W}_2(\mu_t,\mu_s)+\mathcal{W}_2(\nu_t,\nu_s).
\end{align*}

where the second inequality follows from the fact that $\operatorname{Lip}(h)\leq 1$ and Lemma D.7. The last inequality follows from the definition of $\gamma^{(N)}$, (B.9) and Lemma D.2. Applying (C.12), (C.14) of Lemma C.2, we have for any $s,t\in[0,T]$ that

\[
\left|\Bigl|N^{-1}\sum_{i=1}^{N}h(\bar{\gamma}_{i}(t))-\int h(\gamma)\,\mathrm{d}\rho_{t}\Bigr|-\Bigl|N^{-1}\cdot\sum_{i=1}^{N}h(\bar{\gamma}_{i}(s))-\int h(\gamma)\,\mathrm{d}\rho_{s}\Bigr|\right|\leq B\cdot\bigl|t-s\bigr|.
\]

Applying the union bound to (B.14) over the grid $t\in\iota\cdot\{0,1,\ldots,\lfloor T/\iota\rfloor\}$, we have that

\[
\mathbb{P}\left(\sup_{t\in[0,T]}\left|N^{-1}\sum_{i=1}^{N}h\left(\bar{\gamma}_{i}(t)\right)-\int h(\gamma)\,\mathrm{d}\rho_{t}(\gamma)\right|\geq p+B\cdot\iota\right)\leq(T/\iota+1)\cdot\exp\left(-Np^{2}/4\right).
\]

Setting $\iota=N^{-1/2}$ and $p=B\cdot\sqrt{\log(NT/\delta)/N}$, we have that

\[
\sup_{t\in[0,T]}\left|N^{-1}\sum_{i=1}^{N}h\left(\bar{\theta}_{i}(t),\bar{\omega}_{i}(t)\right)-\int h(\theta,\omega)\,\mathrm{d}\rho_{t}\right|\leq B\cdot\sqrt{\log(NT/\delta)/N}
\]

with probability at least $1-\delta$. This completes the proof of Lemma B.2. ∎
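
As a sanity check on the parameter choice above, the following worked arithmetic sketches how $\iota=N^{-1/2}$ and $p=B\cdot\sqrt{\log(NT/\delta)/N}$ turn the union bound into a failure probability of at most $\delta$. The conditions $B\geq 2$, $N\geq 4$, $T\geq 1$ and $\delta\in(0,1)$ are assumptions made here for illustration only; they are not stated in the proof, which absorbs such constants into $B$.

```latex
% Illustrative assumptions (not in the original): B >= 2, N >= 4, T >= 1, 0 < delta < 1,
% so that delta/(NT) < 1 and log(NT/delta) >= log 4 > 1.
\begin{align*}
\exp\left(-Np^{2}/4\right)
  &= \exp\Bigl(-\tfrac{B^{2}}{4}\cdot\log(NT/\delta)\Bigr)
   = \bigl(\delta/(NT)\bigr)^{B^{2}/4}
   \leq \delta/(NT), \\
(T/\iota+1)\cdot\exp\left(-Np^{2}/4\right)
  &\leq \bigl(T\sqrt{N}+1\bigr)\cdot\frac{\delta}{NT}
   = \delta\cdot\bigl(N^{-1/2}+(NT)^{-1}\bigr)
   \leq \delta, \\
p+B\cdot\iota
  &= B\cdot N^{-1/2}\cdot\bigl(\sqrt{\log(NT/\delta)}+1\bigr)
   \leq 2B\cdot\sqrt{\log(NT/\delta)/N}.
\end{align*}
```

The factor $2$ in the last line is absorbed into $B$, whose value is allowed to change from line to line.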

B.1.2 Proof of Lemma B.3

By the definitions of $\widetilde{\theta}_{i}(t)$, $\widetilde{\omega}_{i}(t)$ and $\bar{\theta}_{i}(t)$, $\bar{\omega}_{i}(t)$ in (B.6) and (B.7), we have for any $i\in[N]$ and $t\in[0,T]$ that

\begin{align*}
\bigl\|\bar{\theta}_{i}(t)-\widetilde{\theta}_{i}(t)\bigr\|
&\leq\int_{0}^{t}\Bigl\|\frac{\mathrm{d}\widetilde{\theta}_{i}(s)}{\mathrm{d}s}-\frac{\mathrm{d}\bar{\theta}_{i}(s)}{\mathrm{d}s}\Bigr\|\,\mathrm{d}s\\
&\leq\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\widetilde{\theta}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s))-\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))\Bigr\|\,\mathrm{d}s\\
&\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|\,\mathrm{d}s\\
&\leq B\cdot\int_{0}^{t}\Bigl\|\bar{\theta}^{(N)}(s)-\widetilde{\theta}^{(N)}(s)\Bigr\|_{(N)}+\Bigl\|\bar{\omega}^{(N)}(s)-\widetilde{\omega}^{(N)}(s)\Bigr\|_{(N)}\,\mathrm{d}s\\
&\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|\,\mathrm{d}s, \tag{B.15}
\end{align*}

where the last inequality follows from (C.8) of Lemma C.1. Similarly, we have that

\begin{align*}
\bigl\|\bar{\omega}_{i}(t)-\widetilde{\omega}_{i}(t)\bigr\|
&\leq B\cdot\int_{0}^{t}\Bigl\|\bar{\theta}^{(N)}(s)-\widetilde{\theta}^{(N)}(s)\Bigr\|_{(N)}+\Bigl\|\bar{\omega}^{(N)}(s)-\widetilde{\omega}^{(N)}(s)\Bigr\|_{(N)}\,\mathrm{d}s\\
&\qquad+\eta\cdot\int_{0}^{t}\Bigl\|\widehat{v}^{g}(\bar{\omega}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{g}(\bar{\omega}_{i}(s);\rho_{s})\Bigr\|\,\mathrm{d}s, \tag{B.16}
\end{align*}

where the inequality follows from (C.9). We now upper-bound the second terms of (B.15) and (B.16), starting with (B.15). By the definitions of $v^{f}$ and $\widehat{v}^{f}$ in (B) and (B), we have for any $s\in[0,T]$ and $i\in[N]$ that

\[
\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|=\alpha^{2}\cdot\Bigl\|N^{-1}\cdot\sum_{j=1}^{N}Z_{i}^{j}(s)\Bigr\|, \tag{B.17}
\]

where $Z_{i}^{j}(s)$ is given by

\begin{align*}
Z_{i}^{j}(s)
&=\mathbb{E}_{\mathcal{D}}\bigg[\Big\langle\Big(\int\psi(Z;\omega)\,\mathrm{d}\nu_{s}(\omega)-\psi(Z;\bar{\omega}_{j}(s))\Big)\cdot\frac{\delta\Phi(X,Z;f)}{\delta f},\nabla_{\theta}\phi(\cdot;\bar{\theta}_{i}(s))\Big\rangle_{L^{2}}\\
&\qquad\qquad+\lambda\cdot\Big\langle\frac{\delta\Psi(X,Z;\int\phi(\cdot;\theta)\,\mathrm{d}\mu_{s}(\theta))}{\delta f}-\frac{\delta\Psi(X,Z;\phi(\cdot;\bar{\theta}_{j}(s)))}{\delta f},\nabla_{\theta}\phi(\cdot;\bar{\theta}_{i}(s))\Big\rangle_{L^{2}}\bigg].
\end{align*}

By Assumptions 4.1 and 4.3, we have that $\|Z_{i}^{j}(s)\|\leq B$. When $j\neq i$, since $\bar{\theta}_{j}(s)\stackrel{\mathrm{i.i.d.}}{\sim}\mu_{s}$ and $\bar{\omega}_{j}(s)\stackrel{\mathrm{i.i.d.}}{\sim}\nu_{s}$ $(j\in[N])$, it holds that $\mathbb{E}[Z_{i}^{j}(s)\,\big|\,\bar{\theta}_{i}(s)]=0$. By Lemma C.3, we have for fixed $s\in[0,T]$ and $i\in[N]$ that

\begin{align*}
\mathbb{P}\bigg(\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|\geq B\cdot\left(N^{-1/2}+p\right)\bigg)
&=\mathbb{E}\Bigl[\mathbb{P}\Bigl(\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|\geq B\cdot\left(N^{-1/2}+p\right)\,\Big|\,\bar{\theta}_{i}(s)\Bigr)\Bigr]\\
&\leq\exp\left(-Np^{2}\right). \tag{B.18}
\end{align*}

From Lemma D.7 and (C.14) of Lemma C.2, we have that

\[
\sup_{w\in\mathcal{W}}\Bigl|\int\phi(w;\theta)\,\mathrm{d}\mu_{s}(\theta)-\int\phi(w;\theta)\,\mathrm{d}\mu_{t}(\theta)\Bigr|\leq B\cdot\mathcal{W}_{1}(\mu_{s},\mu_{t})\leq B\cdot\mathcal{W}_{2}(\mu_{s},\mu_{t})\leq B\cdot\bigl|s-t\bigr|.
\]

By Assumptions 4.1 and 4.3 and Lemma C.2, we have for any $s,t\in[0,T]$ that

\[
\left|\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|-\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(t)\Bigr\|\right|\leq B\cdot\bigl|t-s\bigr|.
\]

Applying the union bound to (B.18) for $i\in[N]$ and $t\in\iota\cdot\{0,1,\ldots,\lfloor T/\iota\rfloor\}$, we have that

\[
\mathbb{P}\Bigl(\sup_{\substack{i\in[N]\\ s\in[0,T]}}\bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\bigr\|\geq B\cdot\left(N^{-1/2}+p\right)+B\iota\Bigr)\leq N\cdot(T/\iota+1)\cdot\exp\left(-Np^{2}\right).
\]
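
The passage from the time grid to the supremum over $[0,T]$ can be spelled out as follows. This is a sketch of the standard discretization step, which the proof leaves implicit; the shorthand $F_{i}$, $a$ and $t_{k}$ is introduced here for illustration and does not appear in the original. It uses the fact that $s\mapsto\|N^{-1}\sum_{j\neq i}Z_{i}^{j}(s)\|$ is $B$-Lipschitz, hence continuous, so the supremum over the compact set $[N]\times[0,T]$ is attained.

```latex
% Shorthand (assumed for this sketch only):
%   F_i(s) := \| N^{-1} \sum_{j \neq i} Z_i^j(s) \|,   a := B (N^{-1/2} + p),   t_k := k \iota.
\begin{align*}
&\text{For } s\in[0,T] \text{ and } k=\lfloor s/\iota\rfloor:\quad
  0\leq s-t_{k}<\iota
  \;\Longrightarrow\;
  F_{i}(s)\leq F_{i}(t_{k})+B\cdot\iota, \\
&\Bigl\{\sup_{i\in[N],\,s\in[0,T]}F_{i}(s)\geq a+B\cdot\iota\Bigr\}
  \subseteq\bigcup_{i=1}^{N}\,\bigcup_{k=0}^{\lfloor T/\iota\rfloor}
  \bigl\{F_{i}(t_{k})\geq a\bigr\}, \\
&\mathbb{P}\Bigl(\sup_{i\in[N],\,s\in[0,T]}F_{i}(s)\geq a+B\cdot\iota\Bigr)
  \leq N\cdot\bigl(\lfloor T/\iota\rfloor+1\bigr)\cdot\exp\left(-Np^{2}\right)
  \leq N\cdot(T/\iota+1)\cdot\exp\left(-Np^{2}\right).
\end{align*}
```

The last line applies (B.18) to each of the $N\cdot(\lfloor T/\iota\rfloor+1)$ grid events.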

Setting $\iota=N^{-1/2}$ and $p=B\cdot\sqrt{\log(NT/\delta)/N}$, we have that

\[
\sup_{\substack{i\in[N]\\ s\in[0,T]}}\Bigl\|N^{-1}\cdot\sum_{j\neq i}Z_{i}^{j}(s)\Bigr\|\leq B\cdot\sqrt{\log(NT/\delta)/N} \tag{B.19}
\]

with probability at least $1-\delta$. By Assumption 4.1, the diagonal term $i=j$ in (B.17) satisfies $\|N^{-1}Z_{i}^{i}(s)\|\leq B/N$. Plugging (B.19) into (B.17), with probability at least $1-\delta$, we have that

\begin{align*}
\sup_{\substack{i\in[N]\\ s\in[0,T]}}\Bigl\|\widehat{v}^{f}(\bar{\theta}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{f}(\bar{\theta}_{i}(s);\rho_{s})\Bigr\|
&\leq\sup_{i\in[N],\,s\in[0,T]}\alpha^{2}\cdot\Bigl\|N^{-1}\sum_{j=1}^{N}Z_{i}^{j}(s)\Bigr\|\\
&\leq B\cdot\sqrt{\log(NT/\delta)/N}. \tag{B.20}
\end{align*}
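
The $\sqrt{\log(N/\delta)/N}$ scaling of such uniform bounds can be illustrated numerically. The sketch below is a toy stand-in, not the setting of the proof: it replaces the vector-valued $Z_{i}^{j}(s)$ and Lemma C.3 with scalar i.i.d. draws from Uniform$[-1,1]$ and the scalar Hoeffding inequality, and it fixes a single time point. The function names `max_row_mean_deviation` and `hoeffding_union_threshold` are hypothetical.

```python
import math
import random


def max_row_mean_deviation(n: int, seed: int = 0) -> float:
    """Return max over i of |n^{-1} * sum_{j != i} z_ij| for i.i.d. z_ij ~ Uniform[-1, 1].

    Toy stand-in (an assumption of this sketch): scalar, bounded, zero-mean
    variables replace the vector-valued Z_i^j(s) of the proof.
    """
    rng = random.Random(seed)
    worst = 0.0
    for i in range(n):
        # Leave out the diagonal term j == i, as in the sum over j != i.
        total = sum(rng.uniform(-1.0, 1.0) for j in range(n) if j != i)
        worst = max(worst, abs(total / n))
    return worst


def hoeffding_union_threshold(n: int, delta: float) -> float:
    """Return p solving 2 * n * exp(-n * p^2 / 2) = delta.

    Scalar Hoeffding for variables in [-1, 1] gives
    P(|n^{-1} * sum| >= p) <= 2 * exp(-n * p^2 / 2) per row;
    a union bound over the n rows multiplies this by n.
    """
    return math.sqrt(2.0 * math.log(2.0 * n / delta) / n)


if __name__ == "__main__":
    for n in (50, 200, 800):
        dev = max_row_mean_deviation(n, seed=0)
        thr = hoeffding_union_threshold(n, delta=0.01)
        print(f"N={n:4d}  max deviation={dev:.4f}  threshold={thr:.4f}")
```

The threshold decays like $\sqrt{\log(N/\delta)/N}$, matching the form of (B.19) and (B.20) up to constants; extending the sketch to all $s\in[0,T]$ would require the grid argument used in the proof.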

By similar arguments, with probability at least $1-\delta$, the second term of (B.16) satisfies

\[
\sup_{\substack{i\in[N]\\ s\in[0,T]}}\Bigl\|\widehat{v}^{g}(\bar{\omega}_{i}(s);\bar{\theta}^{(N)}(s),\bar{\omega}^{(N)}(s))-v^{g}(\bar{\omega}_{i}(s);\rho_{s})\Bigr\|\leq B\cdot\sqrt{\log(NT/\delta)/N}. \tag{B.21}
\]

Now, conditioning on the intersection of the events in (B.20) and (B.21), the following hold simultaneously for any $t\in[0,T]$:

\begin{align*}
\left\|\widetilde{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(t)\right\|_{(N)}&\leq B\cdot\int_{0}^{t}\left\|\widetilde{\theta}^{(N)}(s)-\bar{\theta}^{(N)}(s)\right\|_{(N)}\mathrm{d}s+BT\cdot\sqrt{\log(NT/\delta)/N}, \tag{B.22}\\
\left\|\widetilde{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(t)\right\|_{(N)}&\leq B\cdot\int_{0}^{t}\left\|\widetilde{\omega}^{(N)}(s)-\bar{\omega}^{(N)}(s)\right\|_{(N)}\mathrm{d}s+BT\cdot\sqrt{\log(NT/\delta)/N}. \tag{B.23}
\end{align*}

Summing (B.22) and (B.23) and applying Gronwall's Lemma (Holte, 2009), with probability at least $1-2\delta$, for any $t\in[0,T]$, it holds that

\begin{align*}
\left\|\widetilde{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(t)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(t)\right\|_{(N)}
&\leq B\cdot e^{Bt}\cdot 2BT\cdot\sqrt{\log(NT/\delta)/N}\\
&\leq B\cdot e^{BT}\cdot\sqrt{\log(N/\delta)/N}. \tag{B.24}
\end{align*}

The last inequality holds since $B$ denotes a constant whose value may change from line to line. Therefore, (B.24) implies (B.11), which completes the proof of Lemma B.3.
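For reference, the Gronwall step can be spelled out as follows. This is only a sketch: it assumes that (B.22) has the same form as (B.23) with $\theta$ in place of $\omega$, and the shorthand $u(t)$ and $c$ is introduced here purely for illustration.

```latex
\begin{aligned}
u(t) &:= \left\|\widetilde{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(t)\right\|_{(N)}
       +\left\|\widetilde{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(t)\right\|_{(N)},
\qquad c := 2BT\cdot\sqrt{\log(NT/\delta)/N},\\
u(t) &\leq c + B\cdot\int_{0}^{t}u(s)\,\mathrm{d}s \quad\text{for all } t\in[0,T]
\;\;\Longrightarrow\;\;
u(t)\leq c\cdot e^{Bt}\leq c\cdot e^{BT}.
\end{aligned}
```

The remaining factors are then absorbed into the constant $B$, as in the last line of (B.24).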

B.1.3 Proof of Lemma B.4

By the definition of $\widehat{v}^{f},\widehat{v}^{g}$ in (B), $\breve{\theta}_{i}(t),\breve{\omega}_{i}(t)$ in (B), and $\widetilde{\theta}_{i}(t),\widetilde{\omega}_{i}(t)$ in (B.6), it holds that the distances $\left\|\widetilde{\theta}_{i}(k\epsilon)-\breve{\theta}_{i}(k)\right\|$ and $\left\|\widetilde{\omega}_{i}(k\epsilon)-\breve{\omega}_{i}(k)\right\|$ satisfy

$$\begin{aligned}
&\left\|\widetilde{\theta}_{i}(k\epsilon)-\breve{\theta}_{i}(k)\right\|\\
&\qquad\leq\eta\cdot\int_{0}^{k\epsilon}\left\|\widehat{v}^{f}\left(\widetilde{\theta}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s)\right)-\widehat{v}^{f}\left(\widetilde{\theta}_{i}(\lfloor s/\epsilon\rfloor\epsilon);\widetilde{\theta}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon),\widetilde{\omega}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon)\right)\right\|\mathrm{d}s\\
&\qquad\qquad+\eta\cdot\sum_{\ell=0}^{k-1}\left\|\widehat{v}^{f}\left(\widetilde{\theta}_{i}(\ell\epsilon);\widetilde{\theta}^{(N)}(\ell\epsilon),\widetilde{\omega}^{(N)}(\ell\epsilon)\right)-\widehat{v}^{f}\left(\breve{\theta}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\right)\right\|\\
&\qquad\leq B\cdot k\cdot\epsilon^{2}+B\cdot\sum_{\ell=0}^{k-1}\Big(\left\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\right\|_{(N)}\Big).
\end{aligned}\tag{B.25}$$
$$\begin{aligned}
&\left\|\widetilde{\omega}_{i}(k\epsilon)-\breve{\omega}_{i}(k)\right\|\\
&\qquad\leq\eta\cdot\int_{0}^{k\epsilon}\left\|\widehat{v}^{g}\left(\widetilde{\omega}_{i}(s);\widetilde{\theta}^{(N)}(s),\widetilde{\omega}^{(N)}(s)\right)-\widehat{v}^{g}\left(\widetilde{\omega}_{i}(\lfloor s/\epsilon\rfloor\epsilon);\widetilde{\theta}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon),\widetilde{\omega}^{(N)}(\lfloor s/\epsilon\rfloor\epsilon)\right)\right\|\mathrm{d}s\\
&\qquad\qquad+\eta\cdot\sum_{\ell=0}^{k-1}\left\|\widehat{v}^{g}\left(\widetilde{\omega}_{i}(\ell\epsilon);\widetilde{\theta}^{(N)}(\ell\epsilon),\widetilde{\omega}^{(N)}(\ell\epsilon)\right)-\widehat{v}^{g}\left(\breve{\omega}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\right)\right\|\\
&\qquad\leq B\cdot k\cdot\epsilon^{2}+B\cdot\sum_{\ell=0}^{k-1}\Big(\left\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\right\|_{(N)}\Big).
\end{aligned}\tag{B.26}$$

Here, (B.25) follows from (C.8) of Lemma C.1 and (C.13) of Lemma C.2, and (B.26) follows from (C.9) of Lemma C.1 and (C.13) of Lemma C.2. Combining the inequalities in (B.25) and (B.26), taking the supremum over $i\in[N]$, and using $k\epsilon\leq T$, it holds for any $k\leq T/\epsilon$ $(k\in\mathbb{N})$ that

$$\begin{aligned}
&\left\|\widetilde{\theta}^{(N)}(k\epsilon)-\breve{\theta}^{(N)}(k)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(k\epsilon)-\breve{\omega}^{(N)}(k)\right\|_{(N)}\\
&\qquad\leq 2BT\epsilon+B\cdot\sum_{\ell=0}^{k-1}\left\|\widetilde{\theta}^{(N)}(\ell\epsilon)-\breve{\theta}^{(N)}(\ell)\right\|_{(N)}+B\cdot\sum_{\ell=0}^{k-1}\left\|\widetilde{\omega}^{(N)}(\ell\epsilon)-\breve{\omega}^{(N)}(\ell)\right\|_{(N)}.
\end{aligned}\tag{B.27}$$

Applying the discrete Gronwall’s lemma (Holte, 2009) to (B.27), we have that

$$\sup_{\substack{k\leq T/\epsilon\\ (k\in\mathbb{N})}}\left\|\widetilde{\theta}^{(N)}(k\epsilon)-\breve{\theta}^{(N)}(k)\right\|_{(N)}+\left\|\widetilde{\omega}^{(N)}(k\epsilon)-\breve{\omega}^{(N)}(k)\right\|_{(N)}\leq 2B^{2}\cdot T\cdot\epsilon\cdot e^{BT}\leq B\cdot e^{BT}\cdot\epsilon,$$

where the inequalities hold since we allow the value of $B$ to vary from line to line. Thus, we complete the proof of Lemma B.4.
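The discrete Gronwall step can be checked numerically. The sketch below is illustrative only: the function name `gronwall_worst_case` and the values of `B`, `T`, and `eps` are hypothetical, and it assumes a per-step weight of order $\epsilon$ (as in (B.31)), so that the accumulated weight over $k\leq T/\epsilon$ steps is at most $BT$. It builds the extremal sequence $y_k = c + g\sum_{\ell<k} y_\ell$, which saturates the hypothesis of the lemma, and compares it with the bound $c\cdot e^{gk}$.

```python
import math


def gronwall_worst_case(c, g, n):
    """Return y_0..y_n for the extremal recursion y_k = c + g * sum_{l<k} y_l."""
    ys = []
    running_sum = 0.0
    for _ in range(n + 1):
        y = c + g * running_sum  # equality case of the Gronwall hypothesis
        ys.append(y)
        running_sum += y
    return ys


# Hypothetical illustrative values; they are not taken from the paper.
B, T, eps = 2.0, 1.0, 1e-3
n = round(T / eps)       # number of discrete steps up to time T
c = 2 * B * T * eps      # additive term, of the same form as in (B.27)
g = B * eps              # assumed per-step weight of order eps

ys = gronwall_worst_case(c, g, n)
bounds = [c * math.exp(g * k) for k in range(n + 1)]

# Discrete Gronwall: y_k <= c * exp(g * k) <= c * exp(B * T) for all k <= T / eps.
assert all(y <= b * (1 + 1e-12) for y, b in zip(ys, bounds))
print(max(ys), c * math.exp(B * T))
```

In closed form the extremal sequence is $y_k = c(1+g)^k$, so the check reduces to the elementary inequality $1+g\leq e^{g}$.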

B.1.4 Proof of Lemma B.5

Proof.

Let $\mathcal{G}_{k}=\sigma(\theta^{(N)}(0),\omega^{(N)}(0),u_{0},\dots,u_{k})$ be the $\sigma$-algebra generated by $\theta^{(N)}(0),\omega^{(N)}(0)$ and $u_{\ell}=(x_{\ell},z_{\ell})$ $(\ell\leq k)$. By the definitions of $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ and $\widehat{v}^{f},\widehat{v}^{g}$ in (B) and (B), we have for any $i\in[N]$ and $k\in\mathbb{N}_{+}$ that

$$\begin{aligned}
\mathbb{E}\Bigl[\widehat{V}_{k}^{f}(\theta_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k))\,\big|\,\mathcal{G}_{k-1}\Bigr]&=\widehat{v}^{f}(\theta_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)),\\
\mathbb{E}\Bigl[\widehat{V}_{k}^{g}(\omega_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k))\,\big|\,\mathcal{G}_{k-1}\Bigr]&=\widehat{v}^{g}(\omega_{i}(k);\theta^{(N)}(k),\omega^{(N)}(k)).
\end{aligned}$$

Recall that $\theta^{(N)},\omega^{(N)}$ and $\breve{\theta}^{(N)},\breve{\omega}^{(N)}$ denote the SGDA and PGDA dynamics defined in (B) and (B), respectively. We have for any $i\in[N]$ and $k\in\mathbb{N}_{+}$ that

$$\begin{aligned}
&\left\|\breve{\theta}_{i}(k)-\theta_{i}(k)\right\|\\
&\qquad\leq\eta\epsilon\cdot\left\|\sum_{\ell=0}^{k-1}X_{i}(\ell)\right\|+\eta\epsilon\cdot\sum_{\ell=0}^{k-1}\left\|\widehat{v}^{f}\left(\breve{\theta}_{i}(\ell);\breve{\theta}^{(N)}(\ell),\breve{\omega}^{(N)}(\ell)\right)-\widehat{v}^{f}\left(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\right)\right\|\\
&\qquad\leq\eta\epsilon\cdot\left\|A_{i}(k)\right\|+B\epsilon\cdot\sum_{\ell=0}^{k-1}\Big(\left\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\right\|_{(N)}+\left\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\right\|_{(N)}\Big),
\end{aligned}\tag{B.28}$$

where the last inequality follows from (C.8) of Lemma C.1. Here $X_{i}(\ell)$ and $A_{i}(k)$ are defined as

$$\begin{aligned}
&X_{i}(\ell)=\widehat{V}^{f}_{\ell}\left(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\right)-\mathbb{E}\left[\widehat{V}^{f}_{\ell}\left(\theta_{i}(\ell);\theta^{(N)}(\ell),\omega^{(N)}(\ell)\right)\,\big|\,\mathcal{G}_{\ell-1}\right]\quad\forall\ell\geq 1,\\
&X_{i}(0)=0,\quad A_{i}(k)=\sum_{\ell=0}^{k-1}X_{i}(\ell).
\end{aligned}$$

By (C.7) of Lemma C.1, it holds that $\|X_{i}(\ell)\|\leq B$. Since each $X_{i}(\ell)$ has zero conditional mean given $\mathcal{G}_{\ell-1}$, the stochastic process $\{A_{i}(k)\}_{k\in\mathbb{N}_{+}}$ is a martingale with $\|A_{i}(k)-A_{i}(k-1)\|\leq B$. Applying the Azuma-Hoeffding bound in Lemma C.4, we have that

$$\mathbb{P}\Bigl(\max_{\substack{k\leq T/\epsilon\\ (k\in\mathbb{N}_{+})}}\left\|A_{i}(k)\right\|\geq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}+p)\Bigr)\leq\exp\left(-p^{2}\right). \tag{B.29}$$

Applying the union bound to (B.29) over $i\in[N]$, we have that

$$\mathbb{P}\Bigl(\max_{\substack{i\in[N]\\ k\leq T/\epsilon,\,(k\in\mathbb{N}_{+})}}\left\|A_{i}(k)\right\|\geq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}+p)\Bigr)\leq N\cdot\exp\left(-p^{2}\right).$$

Setting $p=\sqrt{\log(N/\delta)}$, with probability at least $1-\delta$, it holds that

$$\left\|A_{i}(k)\right\|\leq B\cdot\sqrt{T/\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)}),\quad\forall i\in[N],\;k\leq T/\epsilon\;(k\in\mathbb{N}_{+}). \tag{B.30}$$
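The choice of $p$ can be verified directly: substituting it into the right-hand side of the union bound gives a failure probability of exactly $\delta$. A short worked computation, using only quantities defined above:

```latex
N\cdot\exp\left(-p^{2}\right)
= N\cdot\exp\bigl(-\log(N/\delta)\bigr)
= N\cdot\frac{\delta}{N}
= \delta .
```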

Plugging (B.30) into (B.28) and taking the supremum over $i\in[N]$, we have that

$$\begin{aligned}
\Bigl\|\breve{\theta}^{(N)}(k)-\theta^{(N)}(k)\Bigr\|_{(N)}&\leq B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigg(\left\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\right\|_{(N)}+\left\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\right\|_{(N)}\bigg)\\
&\qquad+B\cdot\sqrt{T\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)}).
\end{aligned}\tag{B.31}$$
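The $\sqrt{T\epsilon}$ factor in (B.31) comes from multiplying the martingale bound (B.30) by the prefactor $\eta\epsilon$ in (B.28). As a worked step, with the stepsize $\eta$ absorbed into the constant $B$ in the final expression:

```latex
\eta\epsilon\cdot\left\|A_{i}(k)\right\|
\leq \eta\epsilon\cdot B\cdot\sqrt{T/\epsilon}\cdot\bigl(\sqrt{D}+\sqrt{\log(N/\delta)}\bigr)
= \eta B\cdot\sqrt{T\epsilon}\cdot\bigl(\sqrt{D}+\sqrt{\log(N/\delta)}\bigr).
```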

Through similar arguments, for $\breve{\omega}_{i}(k)$ and $\omega_{i}(k)$, with probability at least $1-\delta$,

$$\begin{aligned}
\Bigl\|\breve{\omega}^{(N)}(k)-\omega^{(N)}(k)\Bigr\|_{(N)}&\leq B\epsilon\cdot\sum_{\ell=0}^{k-1}\bigg(\left\|\breve{\theta}^{(N)}(\ell)-\theta^{(N)}(\ell)\right\|_{(N)}+\left\|\breve{\omega}^{(N)}(\ell)-\omega^{(N)}(\ell)\right\|_{(N)}\bigg)\\
&\qquad+B\cdot\sqrt{T\epsilon}\cdot(\sqrt{D}+\sqrt{\log(N/\delta)}).
\end{aligned}\tag{B.32}$$

On the intersection of the events in (B.31) and (B.32), which has probability at least $1-2\delta$ by a union bound, summing (B.31) and (B.32) and applying the discrete Gronwall's lemma (Holte, 2009) shows that, for any $k\in\mathbb{N}_{+}$ with $k\leq T/\epsilon$,

\begin{align*}
\Bigl\|\breve{\theta}^{(N)}(k)-\theta^{(N)}(k)\Bigr\|_{(N)}+\Bigl\|\breve{\omega}^{(N)}(k)-\omega^{(N)}(k)\Bigr\|_{(N)}
&\leq B\cdot e^{Bk\epsilon}\cdot B\cdot\sqrt{T\epsilon}\cdot\bigl(\sqrt{D}+\sqrt{\log(N/\delta)}\bigr) \\
&\leq B\cdot e^{BT}\cdot\sqrt{\epsilon\cdot\bigl(D+\log(N/\delta)\bigr)}.
\end{align*}

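Here the last inequality uses $k\epsilon\leq T$, $\sqrt{D}+\sqrt{\log(N/\delta)}\leq\sqrt{2(D+\log(N/\delta))}$, and $\sqrt{T}\leq e^{T}$, and absorbs the resulting constants into $B$, whose value we allow to vary from line to line. Thus, we complete the proof of Lemma B.5. ∎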
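The Gronwall step above can be checked numerically. The following is a minimal sketch, not part of the original proof: it iterates the extremal sequence $a_k = c + \beta\sum_{\ell<k} a_\ell$, which equals $c(1+\beta)^k$, and confirms that it stays below $c\,e^{\beta k}$. The numerical values of `B`, `T`, `eps`, `D`, `N`, and `delta` are hypothetical and chosen only for illustration.

```python
import math


def gronwall_worst_case(c, beta, steps):
    """Return a_0..a_steps for the extremal recursion a_k = c + beta * sum_{l<k} a_l."""
    a, running_sum = [], 0.0
    for _ in range(steps + 1):
        a_k = c + beta * running_sum
        a.append(a_k)
        running_sum += a_k
    return a


# Hypothetical constants, for illustration only (not taken from the paper).
B, T, eps, D, N, delta = 2.0, 1.0, 1e-3, 10, 1000, 0.05

# Summing (B.31) and (B.32) doubles both the coefficient and the additive term.
beta = 2 * B * eps
c = 2 * B * math.sqrt(T * eps) * (math.sqrt(D) + math.sqrt(math.log(N / delta)))
steps = int(T / eps)

a = gronwall_worst_case(c, beta, steps)
bounds = [c * math.exp(beta * k) for k in range(steps + 1)]

# Discrete Gronwall: a_k <= c * (1 + beta)^k <= c * exp(beta * k) for every k.
gronwall_holds = all(a_k <= b * (1 + 1e-12) for a_k, b in zip(a, bounds))
print(gronwall_holds, a[-1], bounds[-1])
```

Since $\beta k = 2Bk\epsilon \leq 2BT$, the exponential factor is bounded uniformly over $k\leq T/\epsilon$, which is what allows the final bound to depend on $T$ only through $e^{BT}$.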

Appendix C Supporting Lemmas

C.1 Supporting Lemmas for §B

In what follows, we present the technical lemmas heavily used in §B. We recall the definitions of $v^{f},v^{g}$, $\widehat{v}^{f},\widehat{v}^{g}$, and $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ in (B), (B), and (B), respectively. Let $B>0$ be a constant depending on $\alpha,\eta,B_{0},B_{1},B_{2},C$, whose value may vary from line to line. Recall that $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are the finite-width representations with parameters $\theta^{(N)},\omega^{(N)}$, defined by

\begin{align*}
f(\cdot;\theta^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}^{N}\phi(\cdot;\theta_{i}),\quad g(\cdot;\omega^{(N)})=\frac{\alpha}{N}\cdot\sum_{i=1}^{N}\psi(\cdot;\omega_{i}).
\end{align*}
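As an illustration of the finite-width representation above, the following minimal sketch evaluates $f(w;\theta^{(N)})=\frac{\alpha}{N}\sum_{i=1}^{N}\phi(w;\theta_{i})$ in plain Python. The neuron $\phi(w;\theta)=\tanh(\theta^{\top}w)$ is a hypothetical choice made here only because it is bounded and smooth; the source does not specify it, and the dimensions, width, and scaling are likewise illustrative.

```python
import math
import random


def phi(w, theta):
    # Hypothetical neuron (an assumption): tanh of an inner product, so |phi| <= 1.
    return math.tanh(sum(t * x for t, x in zip(theta, w)))


def finite_width_f(w, thetas, alpha):
    """Evaluate f(w; theta^(N)) = (alpha / N) * sum_i phi(w; theta_i)."""
    n = len(thetas)
    return alpha / n * sum(phi(w, theta) for theta in thetas)


rng = random.Random(0)
D, N, alpha = 5, 1000, 10.0  # illustrative dimension, width, and scaling
thetas = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
w = [rng.uniform(-1.0, 1.0) for _ in range(D)]

value = finite_width_f(w, thetas, alpha)
print(value)
```

Because each neuron is bounded by one in this sketch, the output is bounded by $\alpha$ regardless of the width $N$; the same structure applies to $g(\cdot;\omega^{(N)})$ with $\psi$ in place of $\phi$.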
Lemma C.1.

Under Assumptions 4.1 and 4.3, for any $\theta^{(N)}=\{\theta_{i}\}_{i=1}^{N}$, $\underline{\theta}^{(N)}=\{\underline{\theta}_{i}\}_{i=1}^{N}$, $\omega^{(N)}=\{\omega_{i}\}_{i=1}^{N}$, and $\underline{\omega}^{(N)}=\{\underline{\omega}_{i}\}_{i=1}^{N}$, the functions $f(\cdot;\theta^{(N)})$ and $g(\cdot;\omega^{(N)})$ are uniformly bounded and Lipschitz in $\theta^{(N)}$ and $\omega^{(N)}$, respectively. That is,

\begin{align*}
\sup_{w\in\mathcal{W}}\bigl|f(w;\theta^{(N)})\bigr|+\sup_{z\in\mathcal{Z}}\bigl|g(z;\omega^{(N)})\bigr| &\leq B, \tag{C.1}\\
\sup_{w\in\mathcal{W}}\bigl|f(w;\theta^{(N)})-f(w;\underline{\theta}^{(N)})\bigr| &\leq B\cdot\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}, \tag{C.2}\\
\sup_{z\in\mathcal{Z}}\bigl|g(z;\omega^{(N)})-g(z;\underline{\omega}^{(N)})\bigr| &\leq B\cdot\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}. \tag{C.3}
\end{align*}

Recall the definitions of $\widehat{v}^{f},\widehat{v}^{g}$ and $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$ in (B), (B). The finite-width representation of the velocity field and its stochastic counterpart, when evaluated at arbitrary $\theta_{i},\omega_{i}$, are also uniformly bounded and Lipschitz in $\theta,\omega$, respectively. That is, for $\widehat{V}^{f}_{k},\widehat{V}^{g}_{k}$, the following inequalities hold,

\begin{align*}
\bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|+\bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\| &\leq B, \tag{C.4}\\
\bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\| &\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr), \tag{C.5}\\
\bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\| &\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr). \tag{C.6}
\end{align*}

A similar series of inequalities also holds for $\widehat{v}^{f},\widehat{v}^{g}$,

\begin{align*}
\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|+\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\| &\leq B, \tag{C.7}\\
\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{f}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\| &\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr), \tag{C.8}\\
\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{g}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\| &\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr). \tag{C.9}
\end{align*}

As a corollary of the inequalities stated above, the uniform bounds in fact hold for any $f,g\in\mathcal{F}$, that is,

\begin{align*}
\sup_{w\in\mathcal{W}}\bigl|f(w)\bigr|+\sup_{z\in\mathcal{Z}}\bigl|g(z)\bigr|\leq B. \tag{C.10}
\end{align*}

Similarly, the uniform bounds also hold for the velocity fields $v^{f},v^{g}$: for any $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D}\times\mathbb{R}^{D})$, it holds that

\begin{align*}
\bigl\|v^{f}(\theta;\rho)\bigr\|+\bigl\|v^{g}(\omega;\rho)\bigr\|\leq B. \tag{C.11}
\end{align*}
Proof.

We will prove these results separately.

(i) Proof of (C.1), (C.2), and (C.3)

For (C.1) of Lemma C.1, since $\phi$ and $\psi$ are bounded by Assumption 4.1, we have for any $w\in\mathcal{W}$, $z\in\mathcal{Z}$, and any $\theta^{(N)}$ and $\omega^{(N)}$ that

\begin{align*}
\bigl|f(w;\theta^{(N)})\bigr|+\bigl|g(z;\omega^{(N)})\bigr|\leq\alpha\cdot N^{-1}\sum_{i=1}^{N}\Bigl(\bigl|\phi(w;\theta_{i})\bigr|+\bigl|\psi(z;\omega_{i})\bigr|\Bigr)\leq B.
\end{align*}

For (C.2) and (C.3) of Lemma C.1, note that for any $w\in\mathcal{W}$ and $z\in\mathcal{Z}$, $\phi(w;\theta)$ has a bounded gradient in $\theta$ and $\psi(z;\omega)$ has a bounded gradient in $\omega$. Since a uniform upper bound on the gradient controls the Lipschitz constant of the function, it holds for any $w\in\mathcal{W}$, $z\in\mathcal{Z}$, any $\theta^{(N)},\underline{\theta}^{(N)}$, and any $\omega^{(N)},\underline{\omega}^{(N)}$ that

\begin{align*}
\bigl|f(w;\theta^{(N)})-f(w;\underline{\theta}^{(N)})\bigr| &\leq\alpha N^{-1}\cdot B_{1}\sum_{i=1}^{N}\bigl\|\theta_{i}-\underline{\theta}_{i}\bigr\|\leq B\cdot\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)},\\
\bigl|g(z;\omega^{(N)})-g(z;\underline{\omega}^{(N)})\bigr| &\leq\alpha N^{-1}\cdot B_{1}\sum_{i=1}^{N}\bigl\|\omega_{i}-\underline{\omega}_{i}\bigr\|\leq B\cdot\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}.
\end{align*}
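The Lipschitz bound (C.2) can be sanity-checked numerically. The sketch below rests on assumptions that are not taken from the source: the neuron is $\phi(w;\theta)=\tanh(\theta^{\top}w)$ with $\|w\|\leq 1$, so that the gradient bound is $B_{1}=1$, and the norm $\|\cdot\|_{(N)}$ is taken to be $\max_{i}\|\theta_{i}\|$ (its definition lies outside this section). The check relies only on the average over neurons being at most the maximum.

```python
import math
import random


def f(w, thetas, alpha):
    # Finite-width network with a hypothetical tanh neuron (an assumption).
    return alpha / len(thetas) * sum(
        math.tanh(sum(t * x for t, x in zip(theta, w))) for theta in thetas
    )


def sup_norm_gap(thetas, thetas_bar):
    # Assumed (N)-norm of the difference: the largest per-neuron Euclidean distance.
    return max(math.dist(a, b) for a, b in zip(thetas, thetas_bar))


rng = random.Random(1)
D, N, alpha, B1 = 4, 200, 5.0, 1.0  # illustrative values; B1 = 1 holds for ||w|| <= 1

# Draw an input on the unit sphere so that the gradient of tanh(theta . w) has norm <= 1.
raw = [rng.gauss(0.0, 1.0) for _ in range(D)]
scale = math.sqrt(sum(x * x for x in raw))
w = [x / scale for x in raw]

thetas = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
thetas_bar = [[t + rng.uniform(-0.1, 0.1) for t in theta] for theta in thetas]

lhs = abs(f(w, thetas, alpha) - f(w, thetas_bar, alpha))
rhs = alpha * B1 * sup_norm_gap(thetas, thetas_bar)
lipschitz_holds = lhs <= rhs + 1e-12
print(lipschitz_holds, lhs, rhs)
```

In this sketch the constant $B$ in (C.2) is $\alpha B_{1}$, which does not depend on the width $N$.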

(ii) Proof of (C.4), (C.5) and (C.6)

For (C.4) of Lemma C.1, recall the definitions of $\widehat{V}^{f}_{k}$ and $\widehat{V}^{g}_{k}$ in (B). For any $\theta^{(N)}$ and $\omega^{(N)}$,

\begin{align*}
\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})\Bigr\|
&\leq\alpha\cdot\sup_{w\in\mathcal{W}}\bigl\|\nabla_{\theta}\phi(w;\theta_{i})\bigr\|\cdot\sup_{z\in\mathcal{Z}}\bigl|g(z;\omega^{(N)})\bigr|\cdot\int_{\mathcal{W}}\Bigl|\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}(w^{\prime})\Bigr|\,\mathrm{d}w^{\prime} \\
&\qquad+\alpha\cdot\sup_{w\in\mathcal{W}}\bigl\|\nabla_{\theta}\phi(w;\theta_{i})\bigr\|\cdot\lambda\cdot\int_{\mathcal{W}}\Bigl|\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}(w^{\prime})\Bigr|\,\mathrm{d}w^{\prime}\leq B,\\
\Bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})\Bigr\|
&\leq\alpha\cdot\Bigl(\bigl|\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))\bigr|+\sup_{z\in\mathcal{Z}}\bigl|g(z;\omega^{(N)})\bigr|\Bigr)\cdot\sup_{z\in\mathcal{Z}}\bigl\|\nabla_{\omega}\psi(z;\omega_{i})\bigr\|\leq B.
\end{align*}

For notational simplicity, we further define

\begin{align*}
u^{f}(\theta^{(N)},\omega^{(N)}) &=-\alpha g(z_{k};\omega^{(N)})\cdot\frac{\delta\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f}-\alpha\lambda\cdot\frac{\delta\Psi(x_{k},z_{k};f(\cdot;\theta^{(N)}))}{\delta f},\\
u^{g}(\theta^{(N)},\omega^{(N)}) &=\alpha\Phi(x_{k},z_{k};f(\cdot;\theta^{(N)}))-\alpha g(z_{k};\omega^{(N)}).
\end{align*}

For (C.5) of Lemma C.1, following from Assumption 4.3 and the definition of $\widehat{V}^{f}_{k}$ in (B), we have for any $\theta^{(N)},\underline{\theta}^{(N)}$ and $\omega^{(N)},\underline{\omega}^{(N)}$ that

\begin{align*}
&\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|\\
&\qquad\leq\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\theta_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|+\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|\\
&\qquad\leq\bigl|u^{f}(\theta^{(N)},\omega^{(N)})-u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr|\cdot\sup_{w\in\mathcal{W}}\|\nabla_{\theta}\phi(w;\theta_{i})\|+\Bigl\|\Big\langle u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)}),\nabla_{\theta}\phi(\cdot;\theta_{i})-\nabla_{\theta}\phi(\cdot;\underline{\theta}_{i})\Big\rangle_{L^{2}}\Bigr\|.
\end{align*}

Moreover, $u^{f}(\theta^{(N)},\omega^{(N)})$ is also Lipschitz in $(\theta^{(N)},\omega^{(N)})$ since

\begin{align*}
\bigl|u^{f}(\theta^{(N)},\omega^{(N)})-u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr|
&\leq B\cdot\bigl|f(w_{k};\theta^{(N)})-f(w_{k};\underline{\theta}^{(N)})\bigr|+B\cdot\bigl|g(z_{k};\omega^{(N)})-g(z_{k};\underline{\omega}^{(N)})\bigr|\\
&\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr),
\end{align*}

where the second inequality follows from (C.2) and (C.3). Therefore, $\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})$ is Lipschitz in $(\theta^{(N)},\omega^{(N)})$, because $\|\nabla_{\theta}\phi(w;\theta_{i})\|$ and $\big|\int u^{f}(\theta^{(N)},\omega^{(N)})(w')\,\mathrm{d}w'\big|$ are uniformly bounded.
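To spell out how the two terms of the decomposition combine, the following display is a sketch under stated assumptions rather than a quotation of the paper's argument. It assumes that Assumption 4.3 makes $\nabla_{\theta}\phi(w;\cdot)$ uniformly bounded by $B_{1}$ and $L$-Lipschitz uniformly in $w$, and that $\int |u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})(w')|\,\mathrm{d}w'\leq B_{2}$. The symbols $B_{1}$, $B_{2}$, and $L$ are illustrative names introduced here, not the paper's notation.

\begin{align*}
\Bigl\|\Big\langle u^{f}(\underline{\theta}^{(N)},\underline{\omega}^{(N)}),\nabla_{\theta}\phi(\cdot;\theta_{i})-\nabla_{\theta}\phi(\cdot;\underline{\theta}_{i})\Big\rangle_{L^{2}}\Bigr\|
&\leq B_{2}\cdot\sup_{w\in\mathcal{W}}\bigl\|\nabla_{\theta}\phi(w;\theta_{i})-\nabla_{\theta}\phi(w;\underline{\theta}_{i})\bigr\|
\leq B_{2}L\cdot\|\theta_{i}-\underline{\theta}_{i}\|,\\
\Bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|
&\leq B_{1}B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr)+B_{2}L\cdot\|\theta_{i}-\underline{\theta}_{i}\|.
\end{align*}

Under these assumptions all constants can be absorbed into a single generic constant, which is how the paper uses $B$.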

For (C.6) of Lemma C.1, following from Assumption 4.3 and the definition of $\widehat{V}^{g}_{k}$ in (B), and by an argument similar to the proof of (C.5), we have for any $\theta^{(N)},\underline{\theta}^{(N)}$ and $\omega^{(N)},\underline{\omega}^{(N)}$ that

\begin{align*}
&\Bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|\\
&\qquad\leq\Bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\omega_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|+\Bigl\|\widehat{V}^{g}_{k}(\omega_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\Bigr\|\\
&\qquad\leq\bigl|u^{g}(\theta^{(N)},\omega^{(N)})-u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr|\cdot\sup_{z\in\mathcal{Z}}\|\nabla_{\omega}\psi(z;\omega_{i})\|+\Bigl\|\Big\langle u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)}),\nabla_{\omega}\psi(\cdot;\omega_{i})-\nabla_{\omega}\psi(\cdot;\underline{\omega}_{i})\Big\rangle_{L^{2}}\Bigr\|.
\end{align*}

Again, $u^{g}(\theta^{(N)},\omega^{(N)})$ is Lipschitz in $(\theta^{(N)},\omega^{(N)})$ since

\begin{align*}
\bigl|u^{g}(\theta^{(N)},\omega^{(N)})-u^{g}(\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr|
&\leq B\cdot\bigl|f(w_{k};\theta^{(N)})-f(w_{k};\underline{\theta}^{(N)})\bigr|+B\cdot\bigl|g(z_{k};\omega^{(N)})-g(z_{k};\underline{\omega}^{(N)})\bigr|\\
&\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr).
\end{align*}

Therefore, the Lipschitzness of $\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})$ in $(\theta^{(N)},\omega^{(N)})$ follows because $\|\nabla_{\omega}\psi(z;\omega_{i})\|$ and $\big|\int u^{g}(\theta^{(N)},\omega^{(N)})(z')\,\mathrm{d}z'\big|$ are uniformly bounded.

(iii) Proof of (C.7), (C.8), and (C.9)

Equations (C.7), (C.8), and (C.9) of Lemma C.1 for $\widehat{v}^{f}$ and $\widehat{v}^{g}$ follow from the fact that

\begin{align*}
\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})=\mathbb{E}_{\mathcal{D}}\Big[\widehat{V}_{k}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\Big],\quad\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})=\mathbb{E}_{\mathcal{D}}\Big[\widehat{V}_{k}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\Big].
\end{align*}

Therefore, (C.7) follows from (C.4) and the triangle inequality,

\begin{align*}
\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|+\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\|
&\leq\mathbb{E}_{\mathcal{D}}\Bigl[\bigl\|\widehat{V}_{k}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})\bigr\|\Bigr]+\mathbb{E}_{\mathcal{D}}\Bigl[\bigl\|\widehat{V}_{k}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})\bigr\|\Bigr]\leq B.
\end{align*}
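The step that moves the norm inside the expectation is the integral form of the triangle inequality, that is, Jensen's inequality for the convex map $x\mapsto\|x\|$. A short worked version for the Euclidean norm is sketched below, where $X$ is a generic integrable random vector and $a$ a unit-ball test vector; both are placeholder symbols introduced here for illustration.

\begin{align*}
\bigl\|\mathbb{E}[X]\bigr\|
=\sup_{\|a\|\leq 1}\bigl\langle a,\mathbb{E}[X]\bigr\rangle
=\sup_{\|a\|\leq 1}\mathbb{E}\bigl[\langle a,X\rangle\bigr]
\leq\mathbb{E}\Bigl[\sup_{\|a\|\leq 1}\langle a,X\rangle\Bigr]
=\mathbb{E}\bigl[\|X\|\bigr].
\end{align*}

Taking $X=\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})$ and $X=\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})$ under $\mathbb{E}_{\mathcal{D}}$ gives the first inequality in the display above.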

Equations (C.8) and (C.9) follow from (C.5), (C.6), and the triangle inequality,

\begin{align*}
\bigl\|\widehat{v}^{f}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{f}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|
&\leq\mathbb{E}_{\mathcal{D}}\Big[\bigl\|\widehat{V}^{f}_{k}(\theta_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{f}_{k}(\underline{\theta}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\Big]\\
&\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr),\\
\bigl\|\widehat{v}^{g}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{v}^{g}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|
&\leq\mathbb{E}_{\mathcal{D}}\Big[\bigl\|\widehat{V}^{g}_{k}(\omega_{i};\theta^{(N)},\omega^{(N)})-\widehat{V}^{g}_{k}(\underline{\omega}_{i};\underline{\theta}^{(N)},\underline{\omega}^{(N)})\bigr\|\Big]\\
&\leq B\cdot\Bigl(\bigl\|\theta^{(N)}-\underline{\theta}^{(N)}\bigr\|_{(N)}+\bigl\|\omega^{(N)}-\underline{\omega}^{(N)}\bigr\|_{(N)}\Bigr).
\end{align*}

(iv) Proof of (C.10) and (C.11)

Equation (C.10) follows from the definition of $\mathcal{F}$ in (4.2) and the uniform bounds on the neuron functions $\phi$ and $\psi$. For any $f, g \in \mathcal{F}$, there exist probability measures $\widehat{\mu}, \widehat{\nu}$ over the parameter space such that

$$f(w)=\int\phi(w;\theta)\,\widehat{\mu}(\mathrm{d}\theta),\quad g(z)=\int\psi(z;\omega)\,\widehat{\nu}(\mathrm{d}\omega),\quad\forall w\in\mathcal{W},\ z\in\mathcal{Z}.$$

Applying the triangle inequality, we obtain

$$\sup_{w\in\mathcal{W}}|f(w)|+\sup_{z\in\mathcal{Z}}|g(z)|\leq\int\sup_{w\in\mathcal{W}}|\phi(w;\theta)|\,\widehat{\mu}(\mathrm{d}\theta)+\int\sup_{z\in\mathcal{Z}}|\psi(z;\omega)|\,\widehat{\nu}(\mathrm{d}\omega)\leq B.$$

Equation (C.11) follows from the definition of $v^{f}, v^{g}$ in (B) and the proofs of (C.4) and (C.7). The argument is identical to that for (C.4) and (C.7), except that it requires a uniform bound on the infinite-width representations of $f$ and $g$, which is provided by (C.10).

Combining the proofs of items (i), (ii), (iii), and (iv) above, we complete the proof of Lemma C.1. ∎

Now, recall that $\rho_{t}$ is the PDE solution to (3.7), $\bar{\theta}^{(N)}(t), \bar{\omega}^{(N)}(t)$ is the IP dynamics defined in (B.7), and $\widetilde{\theta}^{(N)}(t), \widetilde{\omega}^{(N)}(t)$ is the CTPGDA dynamics defined in (B.6). The following lemma bounds how far the iterates of the IP and CTPGDA dynamics move between times $s$ and $t$.

Lemma C.2.

Under Assumptions 4.1 and 4.3, it holds for any $s, t \in [0, T]$ that

$$\bigl\|\bar{\theta}^{(N)}(t)-\bar{\theta}^{(N)}(s)\bigr\|_{(N)}+\bigl\|\bar{\omega}^{(N)}(t)-\bar{\omega}^{(N)}(s)\bigr\|_{(N)}\leq B\cdot|t-s|, \tag{C.12}$$
$$\bigl\|\widetilde{\theta}^{(N)}(t)-\widetilde{\theta}^{(N)}(s)\bigr\|_{(N)}+\bigl\|\widetilde{\omega}^{(N)}(t)-\widetilde{\omega}^{(N)}(s)\bigr\|_{(N)}\leq B\cdot|t-s|, \tag{C.13}$$
$$\mathcal{W}_{2}(\mu_{t},\mu_{s})+\mathcal{W}_{2}(\nu_{t},\nu_{s})\leq B\cdot|t-s|. \tag{C.14}$$
Proof.

For (C.12) of Lemma C.2, by the definition of $\bar{\theta}_{i}(t)$ and $\bar{\omega}_{i}(t)$ in (B.7) and (C.11) of Lemma C.1, we have for any $s, t \in [0, T]$ and $i \in [N]$ that

$$\bigl\|\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr\|\leq\eta\cdot\int_{s}^{t}\bigl\|v^{f}(\bar{\theta}_{i}(\tau);\rho_{\tau})\bigr\|\,\mathrm{d}\tau\leq B\cdot|t-s|,$$
$$\bigl\|\bar{\omega}_{i}(t)-\bar{\omega}_{i}(s)\bigr\|\leq\eta\cdot\int_{s}^{t}\bigl\|v^{g}(\bar{\omega}_{i}(\tau);\rho_{\tau})\bigr\|\,\mathrm{d}\tau\leq B\cdot|t-s|.$$

Similarly, for (C.13) of Lemma C.2, by the definition of $\widetilde{\theta}_{i}(t)$ and $\widetilde{\omega}_{i}(t)$ in (B.6) and (C.7) of Lemma C.1, we have for any $s, t \in [0, T]$ and $i \in [N]$ that

$$\bigl\|\widetilde{\theta}_{i}(t)-\widetilde{\theta}_{i}(s)\bigr\|\leq B\cdot|t-s|,\quad\bigl\|\widetilde{\omega}_{i}(t)-\widetilde{\omega}_{i}(s)\bigr\|\leq B\cdot|t-s|.$$

For (C.14) of Lemma C.2, since $\bar{\theta}_{i}(t) \stackrel{\mathrm{i.i.d.}}{\sim} \mu_{t}$ and $\bar{\omega}_{i}(t) \stackrel{\mathrm{i.i.d.}}{\sim} \nu_{t}$, the definition of $\mathcal{W}_{2}$ in (2.18) implies that, for any $s, t \in [0, T]$,

$$\mathcal{W}_{2}(\mu_{t},\mu_{s})\leq\mathbb{E}\Bigl[\bigl\|\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr\|^{2}\Bigr]^{1/2}\leq B\cdot|t-s|,$$
$$\mathcal{W}_{2}(\nu_{t},\nu_{s})\leq\mathbb{E}\Bigl[\bigl\|\bar{\omega}_{i}(t)-\bar{\omega}_{i}(s)\bigr\|^{2}\Bigr]^{1/2}\leq B\cdot|t-s|.$$
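
The first inequality in each line is a coupling argument. As a short sketch, assuming that (2.18) is the standard coupling-based definition of $\mathcal{W}_{2}$ and writing $\Gamma(\mu_{t},\mu_{s})$ for the set of couplings of $\mu_{t}$ and $\mu_{s}$ (a symbol introduced here only for illustration): the joint law of $(\bar{\theta}_{i}(t),\bar{\theta}_{i}(s))$ has marginals $\mu_{t}$ and $\mu_{s}$, so it is one admissible coupling, and

$$\mathcal{W}_{2}(\mu_{t},\mu_{s})^{2}=\inf_{\gamma\in\Gamma(\mu_{t},\mu_{s})}\int\|x-y\|^{2}\,\gamma(\mathrm{d}x,\mathrm{d}y)\leq\mathbb{E}\Bigl[\bigl\|\bar{\theta}_{i}(t)-\bar{\theta}_{i}(s)\bigr\|^{2}\Bigr]\leq\bigl(B\cdot|t-s|\bigr)^{2}.$$

The last step uses the pathwise bound obtained for (C.12). The same reasoning applies to $\nu_{t}$ and $\nu_{s}$, and summing the two bounds yields (C.14) up to a constant factor, which we read as being absorbed into $B$.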

Therefore, we complete the proof of Lemma C.2. ∎

Lemma C.3.

Let $\{X_{i}\}_{i=1}^{N}$ be i.i.d. random variables with $\|X_{i}\| \leq \xi$ and $\mathbb{E}[X_{i}] = 0$. Then there exists an absolute constant $C > 0$ such that, for any $p > 0$,

$$\mathbb{P}\left(\left\|N^{-1}\cdot\sum_{i=1}^{N}X_{i}\right\|\geq C\xi\cdot\left(N^{-1/2}+p\right)\right)\leq\exp\left(-Np^{2}\right).$$
Proof.

See Lemma 30 in Mei et al. (2019). ∎
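
The $N^{-1/2}$ scaling in Lemma C.3 can be illustrated numerically. The following is a minimal sketch, not part of the cited proof: it assumes the $X_{i}$ are uniform on the sphere of radius $\xi$ in $\mathbb{R}^{D}$ (so that $\|X_{i}\| = \xi$ and $\mathbb{E}[X_{i}] = 0$), and the function name `mean_norm_sq` and all default parameters are hypothetical choices. For this distribution $\mathbb{E}\|N^{-1}\sum_{i} X_{i}\|^{2} = \xi^{2}/N$, so the rescaled estimate should stay close to $\xi^{2}$. The constant $C$ in the lemma is not specified in the text and is not estimated here.

```python
import numpy as np

def mean_norm_sq(N, D=3, xi=1.0, trials=200, seed=0):
    """Monte Carlo estimate of E|| N^{-1} sum_i X_i ||^2.

    Assumption for illustration: X_i is uniform on the sphere of radius xi
    in R^D, which is bounded by xi and has mean zero.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, N, D))
    # Normalize Gaussian draws to obtain uniform samples on the sphere.
    x = xi * g / np.linalg.norm(g, axis=-1, keepdims=True)
    avg = x.mean(axis=1)  # empirical mean per trial, shape (trials, D)
    return float(np.mean(np.sum(avg ** 2, axis=-1)))

for N in (100, 400, 1600):
    est = mean_norm_sq(N)
    # est * N should remain close to xi^2 = 1 as N grows.
    print(N, est, est * N)
```

Since the second moment of the empirical mean is $\xi^{2}/N$, Jensen's inequality gives a typical norm of at most $\xi N^{-1/2}$, which matches the first term in the threshold $C\xi(N^{-1/2}+p)$.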

Lemma C.4 (Azuma-Hoeffding bound).

Let $X_{k} \in \mathbb{R}^{D}$ be a martingale with respect to the filtration $\mathcal{G}_{k}$ ($k \geq 0$) with $X_{0} = 0$. Assume that, for some $\xi > 0$ and any $\lambda \in \mathbb{R}^{D}$,

$$\mathbb{E}\left[\exp\left(\left\langle\lambda,X_{k}-X_{k-1}\right\rangle\right)\mid\mathcal{G}_{k-1}\right]\leq\exp\left(\xi^{2}\cdot\|\lambda\|^{2}/2\right).$$

Then, for an absolute constant $C > 0$, it holds that

$$\mathbb{P}\left(\max_{\substack{k\leq n\\ (k\in\mathbb{N})}}\left\|X_{k}\right\|\geq C\xi\cdot\sqrt{n}\cdot(\sqrt{D}+p)\right)\leq\exp\left(-p^{2}\right).$$
Proof.

See Lemma 31 in Mei et al. (2019) and Lemma A.3 in Araújo et al. (2019). ∎

Appendix D Technical Results

D.1 Universal Function Approximation Theorem

In what follows, we introduce the universal function approximation theorem (Pinkus, 1999). For any given activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$, we consider the following function class,

$$\mathcal{G}(\sigma)=\Bigl\{\sum_{i=1}^{r}c_{i}\sigma(x^{\top}w^{i}+\theta_{i})\,\Big|\,c_{i},\theta_{i}\in\mathbb{R},\ w^{i}\in\mathbb{R}^{d}\Bigr\}.$$

We denote by $\mathscr{C}(\mathbb{R}^{d})$ the class of continuous functions over $\mathbb{R}^{d}$. Then, the following theorem holds.

Lemma D.1 (Universal Function Approximation Theorem, Theorem 3.1 in Pinkus (1999)).

If the activation function $\sigma \in \mathscr{C}(\mathbb{R})$ is not a polynomial, the function class $\mathcal{G}(\sigma)$ is dense in $\mathscr{C}(\mathbb{R}^{d})$ in the topology of uniform convergence on compact sets.
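
As a numerical illustration of Lemma D.1, the sketch below fits elements of $\mathcal{G}(\sigma)$ with $\sigma = \tanh$ (continuous and not a polynomial) to the target $x \mapsto |x|$ on the compact set $[-1, 1]$ with $d = 1$. Everything here is an assumption made for illustration: the helper name `fit_errors`, the target, the random inner parameters $w^{i}, \theta_{i}$, and the choice to fit only the outer coefficients $c_{i}$ by least squares on a grid. This is a restricted search within $\mathcal{G}(\sigma)$, so it gives an upper bound on the best achievable error rather than the error itself.

```python
import numpy as np

def fit_errors(r, seed=0, n_grid=400, pool=64):
    """Least-squares fit of |x| on [-1, 1] by sum_i c_i * tanh(x * w_i + theta_i).

    Returns (rms_error, sup_error) on the grid. Inner parameters are random
    and fixed; only the outer coefficients c_i are fitted.
    """
    assert 1 <= r <= pool
    rng = np.random.default_rng(seed)
    # Draw a fixed pool of inner parameters and keep the first r of them,
    # so the feature sets are nested as r grows.
    w = rng.normal(scale=4.0, size=pool)[:r]
    theta = rng.normal(scale=2.0, size=pool)[:r]
    x = np.linspace(-1.0, 1.0, n_grid)
    target = np.abs(x)
    feats = np.tanh(np.outer(x, w) + theta)  # shape (n_grid, r)
    c, *_ = np.linalg.lstsq(feats, target, rcond=None)
    resid = feats @ c - target
    return float(np.sqrt(np.mean(resid ** 2))), float(np.max(np.abs(resid)))

for r in (2, 8, 32, 64):
    rms, sup = fit_errors(r)
    print(f"r={r:2d}  rms={rms:.2e}  sup={sup:.2e}")
```

Because the feature sets are nested, the root-mean-square error cannot increase with $r$ up to numerical error. The sup error is the quantity that corresponds to uniform convergence in the lemma, and it is reported for reference without any guarantee of monotonicity under a least-squares fit.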

D.2 Wasserstein Space

We use the definition of absolutely continuous curves in $\mathscr{P}_{2}(\mathbb{R}^{D})$ in Ambrosio et al. (2008) and introduce the following lemmas.

Lemma D.2.

For any probability measures $\mu, \nu, \mu^{\prime}, \nu^{\prime} \in \mathscr{P}_{2}(\mathbb{R}^{D})$, it holds that

$$\mathcal{W}_{2}(\mu\otimes\nu,\mu^{\prime}\otimes\nu^{\prime})^{2}\leq\mathcal{W}_{2}(\mu,\mu^{\prime})^{2}+\mathcal{W}_{2}(\nu,\nu^{\prime})^{2}.$$
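
One standard way to see Lemma D.2, sketched here under the usual coupling definition of $\mathcal{W}_{2}$ (the symbols $\gamma_{1}, \gamma_{2}, \gamma$ are introduced only for this illustration), is to use a product coupling. Let $\gamma_{1}$ be an optimal coupling of $(\mu, \mu^{\prime})$ and $\gamma_{2}$ an optimal coupling of $(\nu, \nu^{\prime})$. Let $\gamma$ be the law of $\bigl((X, Y), (X^{\prime}, Y^{\prime})\bigr)$, where $(X, X^{\prime}) \sim \gamma_{1}$ is independent of $(Y, Y^{\prime}) \sim \gamma_{2}$. Then $\gamma$ is a coupling of $\mu\otimes\nu$ and $\mu^{\prime}\otimes\nu^{\prime}$, and since the squared Euclidean cost splits across the two blocks of coordinates,

$$\mathcal{W}_{2}(\mu\otimes\nu,\mu^{\prime}\otimes\nu^{\prime})^{2}\leq\int\bigl(\|x-x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2}\bigr)\,\mathrm{d}\gamma=\mathcal{W}_{2}(\mu,\mu^{\prime})^{2}+\mathcal{W}_{2}(\nu,\nu^{\prime})^{2}.$$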
Lemma D.3 (First Variation Formula, Theorem 8.4.7 in Ambrosio et al. (2008)).

Given $\nu \in \mathscr{P}_{2}(\mathbb{R}^{D})$ and an absolutely continuous curve $\mu: [0, T] \rightarrow \mathscr{P}_{2}(\mathbb{R}^{D})$, let $\beta: [0, 1] \rightarrow \mathscr{P}_{2}(\mathbb{R}^{D})$ be the geodesic connecting $\mu_{t}$ and $\nu$. It holds that

$$\frac{\mathrm{d}}{\mathrm{d}t}\frac{\mathcal{W}_{2}(\mu_{t},\nu)^{2}}{2}=-\langle\dot{\mu}_{t},\dot{\beta}_{0}\rangle_{\mu_{t}},$$

where $\dot{\mu}_{t} = \partial_{t}\mu_{t}$ and $\dot{\beta}_{0} = \partial_{s}\beta_{s}|_{s=0}$.

Lemma D.4 (Benamou-Brenier formula, Proposition 2.30 in Ambrosio and Gigli (2013)).

Let $\mu^{0}, \mu^{1} \in \mathscr{P}_{2}(\mathbb{R}^{D})$. Then, it holds that

$$\mathcal{W}_{2}(\mu^{0},\mu^{1})=\inf\biggl\{\int_{0}^{1}\|\dot{\mu}_{t}\|_{\mu_{t}}\,\mathrm{d}t\,\bigg|\,\mu:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}),\ \mu_{0}=\mu^{0},\ \mu_{1}=\mu^{1}\biggr\}.$$
Lemma D.5 (Talagrand’s Inequality, Corollary 2.1 in Otto and Villani (2000)).

Let $\nu$ be $N(0, \kappa \cdot I_{D})$. It holds for any $\mu \in \mathscr{P}_{2}(\mathbb{R}^{D})$ that

$$\mathcal{W}_{2}(\mu,\nu)^{2}\leq 2D_{\rm KL}(\mu\,\|\,\nu)/\kappa.$$
Lemma D.6 (Eulerian Representation of Geodesics, Proposition 5.38 in Villani (2003)).

Let $\beta: [0, 1] \rightarrow \mathscr{P}_{2}(\mathbb{R}^{D})$ be a geodesic and $u$ be the corresponding vector field such that $\partial_{t}\beta_{t} = -\mathrm{div}(\beta_{t}\cdot u_{t})$. It holds that

$$\partial_{t}(\beta_{t}\cdot u_{t})=-\mathrm{div}(\beta_{t}\cdot u_{t}\otimes u_{t}),$$

where $\otimes$ is the outer product of two vectors.

Lemma D.7 (Dual Representation of the First-Order Wasserstein Distance, Villani (2008)).

The first-order Wasserstein distance admits the dual representation

$$\mathcal{W}_{1}(\mu,\nu)=\sup\biggl\{\int f(x)\,\mathrm{d}(\mu-\nu)(x)\,\bigg|\,f:\mathbb{R}^{D}\rightarrow\mathbb{R}\text{ is 1-Lipschitz continuous}\biggr\}$$

for any two probability measures $\mu, \nu \in \mathscr{P}_{1}(\mathbb{R}^{D})$.
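
The "easy" direction of Lemma D.7, namely that every 1-Lipschitz test function gives a lower bound on $\mathcal{W}_{1}$, can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions rather than part of the cited result: it takes $D = 1$, uses equal-size empirical measures, and relies on the fact that in this setting $\mathcal{W}_{1}$ equals the mean absolute difference of the sorted samples. The helper names `w1_empirical_1d` and `dual_value`, the sample sizes, and the test functions are hypothetical choices.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """W1 between two equal-size empirical measures on R.

    In one dimension the monotone (sorted) matching is optimal.
    """
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def dual_value(f, x, y):
    """Value of the dual objective, the integral of f against (mu - nu)."""
    return float(np.mean(f(x)) - np.mean(f(y)))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, size=500)  # samples defining mu
y = rng.normal(loc=1.0, size=500)  # samples defining nu
w1 = w1_empirical_1d(x, y)

# Each 1-Lipschitz f yields a lower bound on W1.
for name, f in (("sin", np.sin), ("abs", np.abs), ("neg", lambda t: -t)):
    val = dual_value(f, x, y)
    print(name, val, val <= w1 + 1e-12)
```

The supremum in the lemma is attained by a particular 1-Lipschitz function, so none of the candidates above needs to be tight. For a pure location shift such as this one, a linear function with slope of magnitude one is expected to come closest.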