Tuning-free coreset Markov chain Monte Carlo


 


Naitong Chen                        Jonathan H. Huggins                        Trevor Campbell

Department of Statistics University of British Columbia [email protected]                        Department of Mathematics & Statistics Boston University [email protected]                        Department of Statistics University of British Columbia [email protected]

Abstract

A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. Empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively to optimally-tuned ADAM.

1 INTRODUCTION

Refer to caption
Refer to caption
Figure 1: Relative Coreset MCMC posterior approximation error (average squared coordinate-wise z-score) using ADAM with different learning rates versus the proposed Hot DoG method (with fixed r=0.001𝑟0.001r=0.001italic_r = 0.001). Median values after 200,000 optimization iterations across 10 trials are used for the relative comparison for a variety of datasets, models, and coreset sizes. Above the horizontal black line (100superscript10010^{0}10 start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT) indicates that the proposed Hot DoG method outperformed ADAM.

Bayesian inference provides a flexible framework for parameter estimation and uncertainty quantification in statistical models. Markov chain Monte Carlo (Robert and Casella, 2004; Robert and Casella, 2011; Gelman et al., 2013, Chs. 11 and 12), the standard methodology for performing Bayesian inference, involves simulating carefully constructed Markov chains whose stationary distribution is the target Bayesian posterior. In the large-scale data setting, this procedure can become prohibitively expensive, as it requires iterating over the entire data set to simulate the next state.

Bayesian coresets (Huggins et al., 2016) are a popular approach for speeding up Bayesian inference in the large-scale data setting. A Bayesian coreset is a weighted subset of data that replaces the full data set during inference, leveraging the insight that large datasets often exhibit a significant degree of redundancy. With a carefully constructed coreset, one can significantly reduce the computational cost of inference while still obtaining samples from a high quality approximation of the full Bayesian posterior. In fact, given a data set of N𝑁Nitalic_N points, a coreset of size 𝒪(logN)𝒪𝑁{\mathcal{O}}\left(\log N\right)caligraphic_O ( roman_log italic_N ) is sufficient for providing a near-exact posterior approximation in exponential family and other sufficiently simple models (Naik et al., 2022, Thms. 4.1 and 4.2; Chen et al., 2022, Prop. 3.1) and 𝒪(polylogN)𝒪polylog𝑁{\mathcal{O}}\mathopen{}\mathclose{{}\left(\operatorname{polylog}N}\right)caligraphic_O ( roman_polylog italic_N ) is sufficient for more general cases (Campbell, 2024, Cor. 6.1).

Constructing a coreset involves picking the data points to include in the coreset and assigning each data point its corresponding weight. The state-of-the-art method, Coreset MCMC (Chen and Campbell, 2024), selects coreset points by sampling them uniformly from the full data set, and learns the weights using stochastic gradient optimization techniques, e.g., ADAM (Kingma and Ba, 2014), where the gradients are estimated using MCMC draws targeting the current coreset posterior. However, as we demonstrate in this paper, there are two issues with this approach. First, the quality of the constructed coreset is sensitive to the learning rate of the stochastic optimization algorithm. And second, gradient estimates using MCMC draws are affected strongly in early iterations by initialization bias, leading to poor optimization performance.

To address these challenges, we first propose Hot-start Distance over Gradient (Hot DoG), a tuning-free stochastic gradient optimization procedure that can be used for learning coreset weights in Coreset MCMC. Hot DoG is a stochastic gradient method combining techniques from Do(W)G (Ivgi et al., 2023; Khaled et al., 2023), ADAM (Kingma and Ba, 2014), and RMSProp (Hinton et al., 2012) to set learning rates automatically. Hot DoG also includes an automated warm-up phase prior to weight optimization, which guards against usage of low quality MCMC draws when estimating the objective function gradients. Fig. 1 demonstrates that Hot DoG performs competitively to optimally-tuned ADAM across a wide range of models, datasets, and coreset sizes, and can be multiple orders of magnitude more accurate than ADAM using other learning rates. Beyond the result in Fig. 1, we provide an extensive empirical investigation of the reliability of Hot DoG in comparison to other methods across many different synthetic and real experiments.

2 BACKGROUND

2.1 Bayesian Coresets

We are given a data set (Xn)n=1Nsuperscriptsubscriptsubscript𝑋𝑛𝑛1𝑁(X_{n})_{n=1}^{N}( italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT of N𝑁Nitalic_N observations, a log-likelihood nlogp(xnθ)subscript𝑛𝑝conditionalsubscript𝑥𝑛𝜃\ell_{n}\coloneqq\log p(x_{n}\mid\theta)roman_ℓ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≔ roman_log italic_p ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∣ italic_θ ) for observation n𝑛nitalic_n given θΘ𝜃Θ\theta\in\Thetaitalic_θ ∈ roman_Θ, and a prior density π0(θ)subscript𝜋0𝜃\pi_{0}(\theta)italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ ). We would like to sample from the Bayesian posterior with density

π(θ)1Zexp(n=1Nn(θ))π0(θ),𝜋𝜃1𝑍superscriptsubscript𝑛1𝑁subscript𝑛𝜃subscript𝜋0𝜃\displaystyle\pi(\theta)\coloneqq\frac{1}{Z}\exp\left(\sum_{n=1}^{N}\ell_{n}(% \theta)\right)\pi_{0}(\theta),italic_π ( italic_θ ) ≔ divide start_ARG 1 end_ARG start_ARG italic_Z end_ARG roman_exp ( ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_θ ) ) italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ ) , (2)

where Z𝑍Zitalic_Z is the unknown normalizing constant. A Bayesian coreset replaces the sum over N𝑁Nitalic_N log-likelihood terms with a weighted sum over a subset of size M𝑀Mitalic_M, where MNmuch-less-than𝑀𝑁M\ll Nitalic_M ≪ italic_N. Without loss of generality, we assume that these are the first M𝑀Mitalic_M points. The coreset posterior can then be written as

πw(θ)1Z(w)exp(m=1Mwmm(θ))π0(θ),subscript𝜋𝑤𝜃1𝑍𝑤superscriptsubscript𝑚1𝑀subscript𝑤𝑚subscript𝑚𝜃subscript𝜋0𝜃\displaystyle\pi_{w}(\theta)\coloneqq\frac{1}{Z(w)}\exp\left(\sum_{m=1}^{M}w_{% m}\ell_{m}(\theta)\right)\pi_{0}(\theta),italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( italic_θ ) ≔ divide start_ARG 1 end_ARG start_ARG italic_Z ( italic_w ) end_ARG roman_exp ( ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_θ ) ) italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ ) , (3)

where w+M𝑤subscriptsuperscript𝑀w\in\mathbb{R}^{M}_{+}italic_w ∈ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is a vector of coreset weights. Recent coreset construction methods uniformly select M𝑀Mitalic_M points to include in the coreset (Naik et al., 2022; Chen et al., 2022; Chen and Campbell, 2024), and then optimize the weights of those M𝑀Mitalic_M points as a variational inference problem (Campbell and Beronov, 2019),

w=argminwMDKL(πw||π)s.t.w0,\displaystyle w^{\star}=\operatornamewithlimits{arg\,min}_{w\in\mathbb{R}^{M}}% \mathrm{D_{KL}}(\pi_{w}||\pi)\quad\text{s.t.}\quad w\geq 0,italic_w start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_w ∈ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | | italic_π ) s.t. italic_w ≥ 0 , (4)

with objective function gradient

wDKL(πw||π)\displaystyle\nabla_{w}\mathrm{D_{KL}}(\pi_{w}||\pi)∇ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | | italic_π ) (5)
=\displaystyle== Covπw([1(θ)M(θ)],mwmm(θ)nn(θ)).subscriptCovsubscript𝜋𝑤matrixsubscript1𝜃subscript𝑀𝜃subscript𝑚subscript𝑤𝑚subscript𝑚𝜃subscript𝑛subscript𝑛𝜃\displaystyle\operatorname{Cov}_{\pi_{w}}\mathopen{}\mathclose{{}\left(\begin{% bmatrix}\ell_{1}(\theta)\\ \vdots\\ \ell_{M}(\theta)\end{bmatrix},\sum_{m}w_{m}\ell_{m}(\theta)-\sum_{n}\ell_{n}(% \theta)}\right).roman_Cov start_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( [ start_ARG start_ROW start_CELL roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ ) end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL roman_ℓ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_θ ) end_CELL end_ROW end_ARG ] , ∑ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_θ ) - ∑ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_θ ) ) . (6)
Algorithm 1 CoresetMCMC
θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, κwsubscript𝜅𝑤\kappa_{w}italic_κ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, S𝑆Sitalic_S, M𝑀Mitalic_M
\triangleright Initialize coreset weights
w0m=NM,m=1,,Mformulae-sequencesubscript𝑤0𝑚𝑁𝑀𝑚1𝑀w_{0m}=\frac{N}{M},\quad m=1,\cdots,Mitalic_w start_POSTSUBSCRIPT 0 italic_m end_POSTSUBSCRIPT = divide start_ARG italic_N end_ARG start_ARG italic_M end_ARG , italic_m = 1 , ⋯ , italic_M
for t=0,,T𝑡0𝑇t=0,\dots,Titalic_t = 0 , … , italic_T do
\triangleright Subsample the data
     𝒮tUnif(S,[N])subscript𝒮𝑡Unif𝑆delimited-[]𝑁{\mathcal{S}}_{t}\leftarrow{\mathrm{Unif}}\mathopen{}\mathclose{{}\left(S,[N]}\right)caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← roman_Unif ( italic_S , [ italic_N ] ) (without replacement)
     \triangleright Compute gradient estimate
     g^tg(wt,θt,𝒮t)subscript^𝑔𝑡𝑔subscript𝑤𝑡subscript𝜃𝑡subscript𝒮𝑡\hat{g}_{t}\leftarrow g(w_{t},\theta_{t},{\mathcal{S}}_{t})over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_g ( italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) (Eq. 7)
     wt+1subscript𝑤𝑡1absentw_{t+1}\leftarrowitalic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← stochastic_gradient_step(wt,g^tsubscript𝑤𝑡subscript^𝑔𝑡w_{t},\hat{g}_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT)
     \triangleright Step each Markov chain
     for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K do
         θ(t+1)kκwt+1(θtk)\theta_{(t+1)k}\sim\kappa_{w_{t+1}}(\cdot\mid\theta_{tk})italic_θ start_POSTSUBSCRIPT ( italic_t + 1 ) italic_k end_POSTSUBSCRIPT ∼ italic_κ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( ⋅ ∣ italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT )
     end for
end for

2.2 Coreset MCMC

The key challenge in solving Eq. 4 is that πwsubscript𝜋𝑤\pi_{w}italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT does not admit tractable i.i.d. draws, and so unbiased estimates of the gradient in Eq. 5 are not readily available. Coreset MCMC (Chen and Campbell, 2024) is an adaptive algorithm that addresses this issue. The method first initializes weights w0Msubscript𝑤0superscript𝑀w_{0}\in\mathbb{R}^{M}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and K2𝐾2K\geq 2italic_K ≥ 2 samples θ0=(θ01,,θ0K)ΘKsubscript𝜃0subscript𝜃01subscript𝜃0𝐾superscriptΘ𝐾\theta_{0}=\mathopen{}\mathclose{{}\left(\theta_{01},\dots,\theta_{0K}}\right)% \in\Theta^{K}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = ( italic_θ start_POSTSUBSCRIPT 01 end_POSTSUBSCRIPT , … , italic_θ start_POSTSUBSCRIPT 0 italic_K end_POSTSUBSCRIPT ) ∈ roman_Θ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT. At iteration t𝑡t\in\mathbb{N}italic_t ∈ blackboard_N, given coreset weights wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and samples θtΘKsubscript𝜃𝑡superscriptΘ𝐾\theta_{t}\in\Theta^{K}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ roman_Θ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, it then updates the weights wtwt+1subscript𝑤𝑡subscript𝑤𝑡1w_{t}\to w_{t+1}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT → italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT using the stochastic gradient estimate based on the draws θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT,

g(wt,θt,𝒮t)=𝑔subscript𝑤𝑡subscript𝜃𝑡subscript𝒮𝑡absent\displaystyle g(w_{t},\theta_{t},{\mathcal{S}}_{t})=italic_g ( italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = (7)
1K1k=1K[¯1(θtk)¯M(θtk)](mwtm¯m(θtk)NSs𝒮t¯s(θtk)),1𝐾1superscriptsubscript𝑘1𝐾matrixsubscript¯1subscript𝜃𝑡𝑘subscript¯𝑀subscript𝜃𝑡𝑘subscript𝑚subscript𝑤𝑡𝑚subscript¯𝑚subscript𝜃𝑡𝑘𝑁𝑆subscript𝑠subscript𝒮𝑡subscript¯𝑠subscript𝜃𝑡𝑘\displaystyle\frac{1}{K-1}\sum_{k=1}^{K}\!\!\begin{bmatrix}\bar{\ell}_{1}(% \theta_{tk})\\ \vdots\\ \bar{\ell}_{M}(\theta_{tk})\end{bmatrix}\!\!\mathopen{}\mathclose{{}\left(\sum% _{m}w_{tm}\bar{\ell}_{m}(\theta_{tk})\!-\!\frac{N}{S}\sum_{s\in{\mathcal{S}}_{% t}}\bar{\ell}_{s}(\theta_{tk})}\right),divide start_ARG 1 end_ARG start_ARG italic_K - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT [ start_ARG start_ROW start_CELL over¯ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL over¯ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] ( ∑ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_t italic_m end_POSTSUBSCRIPT over¯ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) - divide start_ARG italic_N end_ARG start_ARG italic_S end_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT over¯ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) ) , (8)

where 𝒮t[N]subscript𝒮𝑡delimited-[]𝑁{\mathcal{S}}_{t}\subseteq[N]caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊆ [ italic_N ] is a uniform subsample of indices of size S𝑆Sitalic_S, and ¯n(θtk)=n(θtk)1Kj=1Kn(θtj)subscript¯𝑛subscript𝜃𝑡𝑘subscript𝑛subscript𝜃𝑡𝑘1𝐾superscriptsubscript𝑗1𝐾subscript𝑛subscript𝜃𝑡𝑗\bar{\ell}_{n}(\theta_{tk})=\ell_{n}(\theta_{tk})-\frac{1}{K}\sum_{j=1}^{K}% \ell_{n}(\theta_{tj})over¯ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) = roman_ℓ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_j end_POSTSUBSCRIPT ). To complete the iteration, the method updates the samples by independently drawing θ(t+1)kκwt+1(θtk,)similar-tosubscript𝜃𝑡1𝑘subscript𝜅subscript𝑤𝑡1subscript𝜃𝑡𝑘\theta_{(t+1)k}\sim\kappa_{w_{t+1}}(\theta_{tk},\cdot)italic_θ start_POSTSUBSCRIPT ( italic_t + 1 ) italic_k end_POSTSUBSCRIPT ∼ italic_κ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT , ⋅ ) for each k[K]𝑘delimited-[]𝐾k\in[K]italic_k ∈ [ italic_K ], where κwsubscript𝜅𝑤\kappa_{w}italic_κ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT is a family of Markov kernels that have invariant distribution πwsubscript𝜋𝑤\pi_{w}italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT. The pseudocode for Coreset MCMC is outlined in Algorithm 1.

Refer to caption
Refer to caption
Figure 2: Coreset MCMC posterior approximation error (average squared coordinate-wise z-score) using ADAM with different learning rates for a variety of datasets, models, and coreset sizes. The lines indicate median values after 200,000 optimization iterations across 10 trials.
Refer to caption
(a) DoG
Refer to caption
(b) DoWG
Refer to caption
(c) D-Adaptation SGD
Refer to caption
(d) prodigy ADAM
Figure 3: Traces of average squared coordinate-wise z-scores between the true and approximated posterior for a Bayesian linear regression example with M=1,000𝑀1000M=1{,}000italic_M = 1 , 000 coreset points. We evaluate four learning-rate-free SGD methods: DoG and DoWG (with varying initial learning rate parameter), and D-Adaptation SGD and prodigy ADAM (with default initial lower bound 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT). The optimally-tuned ADAM baseline is shown in green. Results display the median after 200,000 optimization iterations across 10 trials.

3 TUNING-FREE CORESET MCMC

A key design choice when using Coreset MCMC is to specify how gradient estimates are used to optimize the weights. One can use ADAM (Kingma and Ba, 2014), which is used as the default optimizer for Coreset MCMC (Chen and Campbell, 2024): at iteration t𝑡titalic_t, with γt>0subscript𝛾𝑡0\gamma_{t}>0italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > 0 being the user-specified learning rate, we set

wt+1subscript𝑤𝑡1\displaystyle w_{t+1}italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT proj0(wtγtm^tv^t +ϵ),absentsubscriptprojabsent0subscript𝑤𝑡subscript𝛾𝑡subscript^𝑚𝑡v^t italic-ϵ\displaystyle\leftarrow\operatorname{proj}_{\geq 0}\mathopen{}\mathclose{{}% \left(w_{t}-\gamma_{t}\frac{\hat{m}_{t}}{\mathchoice{{\hbox{$\displaystyle% \sqrt{\hat{v}_{t}\,}$}\lower 0.4pt\hbox{\vrule height=7.22223pt,depth=-5.77782% pt}}}{{\hbox{$\textstyle\sqrt{\hat{v}_{t}\,}$}\lower 0.4pt\hbox{\vrule height=% 7.22223pt,depth=-5.77782pt}}}{{\hbox{$\scriptstyle\sqrt{\hat{v}_{t}\,}$}\lower 0% .4pt\hbox{\vrule height=7.22223pt,depth=-5.77782pt}}}{{\hbox{$% \scriptscriptstyle\sqrt{\hat{v}_{t}\,}$}\lower 0.4pt\hbox{\vrule height=7.2222% 3pt,depth=-5.77782pt}}}+\epsilon}}\right),← roman_proj start_POSTSUBSCRIPT ≥ 0 end_POSTSUBSCRIPT ( italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT divide start_ARG over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG ^vt + italic_ϵ end_ARG ) , (9)

where m^tsubscript^𝑚𝑡\hat{m}_{t}over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and v^tsubscript^𝑣𝑡\hat{v}_{t}over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are exponential averages of past gradients (g^i)i=0tsuperscriptsubscriptsubscript^𝑔𝑖𝑖0𝑡(\hat{g}_{i})_{i=0}^{t}( over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and their element-wise squares, and ϵitalic-ϵ\epsilonitalic_ϵ is a small constant. There are a wide range of other first-order stochastic methods available that could be used (e.g., vanilla stochastic gradient descent, AdaGrad (Duchi et al., 2011), etc.). However, like ADAM, most of these algorithms require setting a learning rate γtsubscript𝛾𝑡\gamma_{t}italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. And as we show in Fig. 2, the quality of samples obtained from Coreset MCMC can be highly sensitive to the selected learning rate. In particular, Fig. 2 shows that when using ADAM, no single learning rate applies well across all problems and coreset sizes; and for a given problem, the performance can vary by orders of magnitude as one varies the learning rate. Furthermore, the default ADAM learning rate of 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT (Kingma and Ba, 2014) provides poor results in most of the problems tested. As a result, careful tuning of the learning rate is required to obtain high quality posterior approximations. This usually involves a search on a log-scaled grid, which is computationally wasteful as the results for all but one of the parameter values are thrown out. Moreover, in practice determining which learning rate provides the best posterior approximation may not be straightforward.

A number of recent works in the literature propose learning-rate-free stochastic optimization methods to address this issue (Carmon and Hinder, 2022; Ivgi et al., 2023; Khaled et al., 2023; Defazio and Mishchenko, 2023; Mishchenko and Defazio, 2024). Many of these methods are shown empirically to work competitively compared to optimally-tuned SGD on a wide range of large-scale, non-convex deep learning problems. Although different at first glance, all of these methods arise from the same insight. Suppose one would like to solve the stochastic optimization problem

minxd𝔼[f(x,ξ)],subscript𝑥superscript𝑑𝔼delimited-[]𝑓𝑥𝜉\displaystyle\min_{x\in\mathbb{R}^{d}}\mathbb{E}\left[f(x,\xi)\right],roman_min start_POSTSUBSCRIPT italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_E [ italic_f ( italic_x , italic_ξ ) ] , (10)

where for all ξ𝜉\xiitalic_ξ, f(,ξ)𝑓𝜉f(\cdot,\xi)italic_f ( ⋅ , italic_ξ ) is convex and we only have access to unbiased stochastic gradient gt=f(xt,ξt)subscript𝑔𝑡𝑓subscript𝑥𝑡subscript𝜉𝑡g_{t}=\partial f(x_{t},\xi_{t})italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∂ italic_f ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_ξ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Define the initial optimality gap d0=x0xsubscript𝑑0normsubscript𝑥0superscript𝑥d_{0}=\|x_{0}-x^{\star}\|italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = ∥ italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ∥ and the sum of all gradient norms GT=tTgt2subscript𝐺𝑇subscript𝑡𝑇superscriptnormsubscript𝑔𝑡2G_{T}=\sum_{t\leq T}\|g_{t}\|^{2}italic_G start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t ≤ italic_T end_POSTSUBSCRIPT ∥ italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. By setting the SGD learning rate

γ=d0GT ,superscript𝛾subscript𝑑0GT \displaystyle\gamma^{\star}=\frac{d_{0}}{\mathchoice{{\hbox{$\displaystyle% \sqrt{G_{T}\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{% {\hbox{$\textstyle\sqrt{G_{T}\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,de% pth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{G_{T}\,}$}\lower 0.4pt\hbox{\vrule h% eight=4.78333pt,depth=-3.82668pt}}}{{\hbox{$\scriptscriptstyle\sqrt{G_{T}\,}$}% \lower 0.4pt\hbox{\vrule height=3.41666pt,depth=-2.73334pt}}}},italic_γ start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = divide start_ARG italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG start_ARG roman_GT end_ARG , (11)

the average iterate x¯=1TtTxt¯𝑥1𝑇subscript𝑡𝑇subscript𝑥𝑡\bar{x}=\frac{1}{T}\sum_{t\leq T}x_{t}over¯ start_ARG italic_x end_ARG = divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t ≤ italic_T end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT satisfies the optimal error bound

𝔼[f(x¯,ξ)]𝔼[f(x,ξ)]d0GT T𝔼delimited-[]𝑓¯𝑥𝜉𝔼delimited-[]𝑓superscript𝑥𝜉subscript𝑑0GT 𝑇\displaystyle\mathbb{E}\left[f(\bar{x},\xi)\right]-\mathbb{E}\left[f(x^{\star}% ,\xi)\right]\leq\frac{d_{0}\mathchoice{{\hbox{$\displaystyle\sqrt{G_{T}\,}$}% \lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$% \textstyle\sqrt{G_{T}\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.4% 6667pt}}}{{\hbox{$\scriptstyle\sqrt{G_{T}\,}$}\lower 0.4pt\hbox{\vrule height=% 4.78333pt,depth=-3.82668pt}}}{{\hbox{$\scriptscriptstyle\sqrt{G_{T}\,}$}\lower 0% .4pt\hbox{\vrule height=3.41666pt,depth=-2.73334pt}}}}{T}blackboard_E [ italic_f ( over¯ start_ARG italic_x end_ARG , italic_ξ ) ] - blackboard_E [ italic_f ( italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT , italic_ξ ) ] ≤ divide start_ARG italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT roman_GT end_ARG start_ARG italic_T end_ARG (12)

after T𝑇Titalic_T iterations (Carmon and Hinder, 2022; Orabona and Cutkosky, 2020). Learning-rate-free methods therefore essentially try to estimate or bound the initial optimality gap d0subscript𝑑0d_{0}italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, which is unknown in practice. To the best of our knowledge, there are four state-of-the-art methods that do this in a manner that does not require multiple optimization runs, knowledge of unknown constants, or the ability to query the objective function: DoG (Ivgi et al., 2023), DoWG (Khaled et al., 2023), D-Adaptation (Defazio and Mishchenko, 2023) and prodigy (Mishchenko and Defazio, 2024). DoG and DoWG run vanilla stochastic gradient descent (SGD),

wt+1subscript𝑤𝑡1\displaystyle w_{t+1}italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT proj0(wtγtgt),absentsubscriptprojabsent0subscript𝑤𝑡subscript𝛾𝑡subscript𝑔𝑡\displaystyle\leftarrow\operatorname{proj}_{\geq 0}\mathopen{}\mathclose{{}% \left(w_{t}-\gamma_{t}g_{t}}\right),← roman_proj start_POSTSUBSCRIPT ≥ 0 end_POSTSUBSCRIPT ( italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , (13)

with learning rate schedules

γt=rtGt (DoG),γt=rt2tTrt2gt2 (DoWG),formulae-sequencesubscript𝛾𝑡subscript𝑟𝑡Gt (DoG)subscript𝛾𝑡subscriptsuperscript𝑟2𝑡tTrt2gt2 (DoWG)\displaystyle\gamma_{t}=\frac{r_{t}}{\mathchoice{{\hbox{$\displaystyle\sqrt{G_% {t}\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$% \textstyle\sqrt{G_{t}\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.4% 6667pt}}}{{\hbox{$\scriptstyle\sqrt{G_{t}\,}$}\lower 0.4pt\hbox{\vrule height=% 4.78333pt,depth=-3.82668pt}}}{{\hbox{$\scriptscriptstyle\sqrt{G_{t}\,}$}\lower 0% .4pt\hbox{\vrule height=3.41666pt,depth=-2.73334pt}}}}\text{(DoG)},\,\gamma_{t% }=\frac{r^{2}_{t}}{\mathchoice{{\hbox{$\displaystyle\sqrt{\sum_{t\leq T}r_{t}^% {2}\|g_{t}\|^{2}\,}$}\lower 0.4pt\hbox{\vrule height=9.30444pt,depth=-7.44359% pt}}}{{\hbox{$\textstyle\sqrt{\sum_{t\leq T}r_{t}^{2}\|g_{t}\|^{2}\,}$}\lower 0% .4pt\hbox{\vrule height=9.30444pt,depth=-7.44359pt}}}{{\hbox{$\scriptstyle% \sqrt{\sum_{t\leq T}r_{t}^{2}\|g_{t}\|^{2}\,}$}\lower 0.4pt\hbox{\vrule height% =6.53888pt,depth=-5.23112pt}}}{{\hbox{$\scriptscriptstyle\sqrt{\sum_{t\leq T}r% _{t}^{2}\|g_{t}\|^{2}\,}$}\lower 0.4pt\hbox{\vrule height=5.03888pt,depth=-4.0% 3113pt}}}}\text{(DoWG)},italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG roman_Gt end_ARG (DoG) , italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_r start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG ∑t≤Trt2∥gt∥2 end_ARG (DoWG) , (14)

where r0subscript𝑟0r_{0}italic_r start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is set to some small constant and, for t1𝑡1t\geq 1italic_t ≥ 1,

rt=maxitwtw0.subscript𝑟𝑡subscript𝑖𝑡normsubscript𝑤𝑡subscript𝑤0\displaystyle r_{t}=\max_{i\leq t}\|w_{t}-w_{0}\|.italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = roman_max start_POSTSUBSCRIPT italic_i ≤ italic_t end_POSTSUBSCRIPT ∥ italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ . (15)

For D-Adaptation and prodigy, rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in Eq. 14 is replaced with a lower bound dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT on d0subscript𝑑0d_{0}italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, which is updated using estimated correlations between the gradient gtsubscript𝑔𝑡g_{t}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and step direction w0wtsubscript𝑤0subscript𝑤𝑡w_{0}-w_{t}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT:

dt+1=max{i=0tdigi,w0wii=0tdigi,dt}.subscript𝑑𝑡1superscriptsubscript𝑖0𝑡subscript𝑑𝑖subscript𝑔𝑖subscript𝑤0subscript𝑤𝑖normsuperscriptsubscript𝑖0𝑡subscript𝑑𝑖subscript𝑔𝑖subscript𝑑𝑡\displaystyle d_{t+1}=\max\left\{\frac{\sum_{i=0}^{t}d_{i}\left\langle g_{i},w% _{0}-w_{i}\right\rangle}{\left\|\sum_{i=0}^{t}d_{i}g_{i}\right\|},d_{t}\right\}.italic_d start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = roman_max { divide start_ARG ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟨ italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟩ end_ARG start_ARG ∥ ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ end_ARG , italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } . (16)

D-Adaptation replaces rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in Eq. 14 (DoG) with dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, while prodigy replaces rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in Eq. 14 (DoWG) with dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Both D-Adaptation have SGD and ADAM-based variants. All four methods have been shown empirically to match the performance of optimally-tuned SGD.

Fig. 3 shows the results from direct applications of DoG, DoWG, D-Adaptation (SGD), and prodigy (ADAM) to Coreset MCMC. We see that the quality of posterior approximation from all of four methods are orders of magnitude worse than optimally-tuned ADAM. With θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT initialized far away from high density regions of πw0subscript𝜋subscript𝑤0\pi_{w_{0}}italic_π start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, the initial gradient estimates are large in magnitude, which leads to small learning rates. The accumulation of these large gradient norms in the learning rate denominator eventually causes the learning rate to vanish, halting the progress of coreset weight optimization. We address these problems in the next section.

Before concluding this section, we note that there are other approaches for making SGD free of learning rate tuning: some methods involve using stochastic versions of line search (Vaswani et al., 2019; Paquette and Scheinberg, 2020), and others do the same for the Polyak step size (Loizou et al., 2021). These methods are not applicable in our setting as they require evaluating the objective function. Recall that due to the unknown Z(w)𝑍𝑤Z(w)italic_Z ( italic_w ) term in Eq. 3, we do not have access to estimates of the objective function.

Algorithm 2 HotDoG
β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9, β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999, ϵ=108italic-ϵsuperscript108\epsilon=10^{-8}italic_ϵ = 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT, r=103𝑟superscript103r=10^{-3}italic_r = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT
T𝑇\quad\quad\quad\quad Titalic_T, θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, w0subscript𝑤0w_{0}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
v0𝟎subscript𝑣00v_{0}\leftarrow\bm{0}italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← bold_0, m0𝟎subscript𝑚00m_{0}\leftarrow\bm{0}italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← bold_0, d0𝟎subscript𝑑00d_{0}\leftarrow\bm{0}italic_d start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← bold_0, c0𝑐0c\leftarrow 0italic_c ← 0, hfalsefalseh\leftarrow\texttt{false}italic_h ← false
for t=1,,T𝑡1𝑇t=1,\dots,Titalic_t = 1 , … , italic_T do
     if h then
         cc+1𝑐𝑐1c\leftarrow c+1italic_c ← italic_c + 1
         𝒮tUnif(S,[N])subscript𝒮𝑡Unif𝑆delimited-[]𝑁{\mathcal{S}}_{t}\leftarrow{\mathrm{Unif}}\mathopen{}\mathclose{{}\left(S,[N]}\right)caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← roman_Unif ( italic_S , [ italic_N ] ) (without replacement)
         g^t=g(wt1,θt1,𝒮t)subscript^𝑔𝑡𝑔subscript𝑤𝑡1subscript𝜃𝑡1subscript𝒮𝑡\hat{g}_{t}=g(w_{t-1},\theta_{t-1},{\mathcal{S}}_{t})over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_g ( italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , caligraphic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) (Eq. 7)
         vtβ2vt1+(1β2)g^t2subscript𝑣𝑡subscript𝛽2subscript𝑣𝑡11subscript𝛽2superscriptsubscript^𝑔𝑡2v_{t}\leftarrow\beta_{2}v_{t-1}+(1-\beta_{2})\hat{g}_{t}^{2}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
         mtβ1mt1+(1β1)g^tsubscript𝑚𝑡subscript𝛽1subscript𝑚𝑡11subscript𝛽1subscript^𝑔𝑡m_{t}\leftarrow\beta_{1}m_{t-1}+(1-\beta_{1})\hat{g}_{t}italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
         dtβ1dt1+(1β1)max{|wt1w0|,dt1}subscript𝑑𝑡subscript𝛽1subscript𝑑𝑡11subscript𝛽1subscript𝑤𝑡1subscript𝑤0subscript𝑑𝑡1d_{t}\!\leftarrow\!\beta_{1}d_{t-1}\!+\!(1\!-\!\beta_{1})\max\left\{\left|w_{t% -1}\!-\!w_{0}\right|,d_{t-1}\right\}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) roman_max { | italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT - italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | , italic_d start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT }
         v^tvt/(1β2c)subscript^𝑣𝑡subscript𝑣𝑡1superscriptsubscript𝛽2𝑐\hat{v}_{t}\leftarrow v_{t}/(1-\beta_{2}^{c})over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT / ( 1 - italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT )
         m^tmt/(1β1c)subscript^𝑚𝑡subscript𝑚𝑡1superscriptsubscript𝛽1𝑐\hat{m}_{t}\leftarrow m_{t}/(1-\beta_{1}^{c})over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT / ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT )
         dt^^subscript𝑑𝑡absent\hat{d_{t}}\leftarrowover^ start_ARG italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ← ( r𝟏𝑟1r\mathbf{1}italic_r bold_1 if t==1 else dt/(1β1c1)subscript𝑑𝑡1superscriptsubscript𝛽1𝑐1d_{t}/(1-\beta_{1}^{c-1})italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT / ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c - 1 end_POSTSUPERSCRIPT ) )
         wtwt1d^t(diag((cv^t)12)+ϵI)1m^tsubscript𝑤𝑡subscript𝑤𝑡1direct-productsubscript^𝑑𝑡superscriptdiagsuperscript𝑐subscript^𝑣𝑡12italic-ϵ𝐼1subscript^𝑚𝑡w_{t}\leftarrow w_{t-1}\!-\!\hat{d}_{t}\left(\operatorname{diag}\left(\left(c% \hat{v}_{t}\right)^{\frac{1}{2}}\right)+\epsilon I\right)^{-1}\!\odot\hat{m}_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT - over^ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( roman_diag ( ( italic_c over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ) + italic_ϵ italic_I ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⊙ over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     else
         wtwt1subscript𝑤𝑡subscript𝑤𝑡1w_{t}\!\leftarrow\!w_{t-1}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, vtvt1subscript𝑣𝑡subscript𝑣𝑡1v_{t}\!\leftarrow\!v_{t-1}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_v start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, mtmt1subscript𝑚𝑡subscript𝑚𝑡1m_{t}\!\leftarrow\!m_{t-1}italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, dtdt1subscript𝑑𝑡subscript𝑑𝑡1d_{t}\!\leftarrow\!d_{t-1}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_d start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT
     end if
     for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K do
         θtkκwt(θ(t1)k)\theta_{tk}\sim\kappa_{w_{t}}(\cdot\mid\theta_{(t-1)k})italic_θ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT ∼ italic_κ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( ⋅ ∣ italic_θ start_POSTSUBSCRIPT ( italic_t - 1 ) italic_k end_POSTSUBSCRIPT ) \triangleright record tksubscript𝑡𝑘\ell_{tk}roman_ℓ start_POSTSUBSCRIPT italic_t italic_k end_POSTSUBSCRIPT
     end for
     \triangleright Hot-start test
     habsenth\!\leftarrow\!italic_h ← (true if hhitalic_h else HotStartTest((ik)i=1,k=1t,K,t)HotStartTestsuperscriptsubscriptsubscript𝑖𝑘formulae-sequence𝑖1𝑘1𝑡𝐾𝑡\texttt{HotStartTest}\!\left(\!\left(\!\ell_{ik}\!\right)_{i\!=\!1\!,\!k\!=\!1% }^{t\!,\!K}\!,\!t\!\right)HotStartTest ( ( roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 , italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_K end_POSTSUPERSCRIPT , italic_t ))
end for
return wTsubscript𝑤𝑇w_{T}italic_w start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT

4 HOT DOG

In this section, we develop our novel Markovian optimization method, Hot-start DoG (Hot DoG), presented in Algorithm 2. Our method extends the original DoG optimizer in two ways: (1) we add a tuning-free hot-start test that automatically detects when the Markov chains have properly mixed and stochastic gradient estimates are stable, at which point we start coreset weight optimization; and (2) we apply acceleration techniques to DoG.

4.1 Hot-start test

Poorly initialized Markov chain states θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT can be detrimental to the performance of learning-rate-free methods in Coreset MCMC. Fig. 5, and especially Figs. 5(c), 5(d) and 5(e) show that this is likely due to the bias of initial gradient estimates. In particular, the initial gradient estimates often have a norm orders of magnitude larger than they would if they had been computed using i.i.d. draws, resulting in a quickly vanishing learning rate in Eq. 14. Therefore, it is crucial to hot-start the Markov chains to ensure they are properly mixed before training the coreset weights. There are MCMC convergence diagnostics that could be used for this purpose (e.g, R^^𝑅{\widehat{R}}over^ start_ARG italic_R end_ARG (Vehtari et al., 2021)); many work only with real-valued variables, and are overly stringent for our application. We require a test that works for general coreset posteriors of the form Eq. 3 and checks only that gradient estimates have stabilized reasonably.

To address this challenge, we propose keeping the weights fixed at their initialization (i.e., wt+1wtsubscript𝑤𝑡1subscript𝑤𝑡w_{t+1}\leftarrow w_{t}italic_w start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) until a hot-start test passes. For the test, for each Markov chain k[K]𝑘delimited-[]𝐾k\in[K]italic_k ∈ [ italic_K ], we split the iterates i=1,,t𝑖1𝑡i=1,\dots,titalic_i = 1 , … , italic_t into 3 segments, each of equal length n=t/3𝑛𝑡3n=\lceil t/3\rceilitalic_n = ⌈ italic_t / 3 ⌉. We compute the average log-potentials for the two latter segments m1subscript𝑚1m_{1}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, m2subscript𝑚2m_{2}italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and the standard deviations of residual errors s1,s2subscript𝑠1subscript𝑠2s_{1},s_{2}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT from a linear fit:

m1subscript𝑚1\displaystyle m_{1}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =1ni=n+12nikm2=1ni=2n+1tikformulae-sequenceabsent1𝑛superscriptsubscript𝑖𝑛12𝑛subscript𝑖𝑘subscript𝑚21𝑛superscriptsubscript𝑖2𝑛1𝑡subscript𝑖𝑘\displaystyle=\frac{1}{n}\sum_{i=n+1}^{2n}\ell_{ik}\quad m_{2}=\frac{1}{n}\sum% _{i=2n+1}^{t}\ell_{ik}= divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 2 italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT (17)
s12superscriptsubscript𝑠12\displaystyle s_{1}^{2}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT =1n2mina,bi=n+12n(a+biik)2absent1𝑛2subscript𝑎𝑏superscriptsubscript𝑖𝑛12𝑛superscript𝑎𝑏𝑖subscript𝑖𝑘2\displaystyle=\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=n+1}^{2n}(a+bi-\ell_% {ik})^{2}= divide start_ARG 1 end_ARG start_ARG italic_n - 2 end_ARG roman_min start_POSTSUBSCRIPT italic_a , italic_b ∈ blackboard_R end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ( italic_a + italic_b italic_i - roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (18)
s22superscriptsubscript𝑠22\displaystyle s_{2}^{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT =1n2mina,bi=2n+1t(a+biik)2,absent1𝑛2subscript𝑎𝑏superscriptsubscript𝑖2𝑛1𝑡superscript𝑎𝑏𝑖subscript𝑖𝑘2\displaystyle=\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=2n+1}^{t}(a+bi-\ell_% {ik})^{2},= divide start_ARG 1 end_ARG start_ARG italic_n - 2 end_ARG roman_min start_POSTSUBSCRIPT italic_a , italic_b ∈ blackboard_R end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 2 italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_a + italic_b italic_i - roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (19)

where iksubscript𝑖𝑘\ell_{ik}roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT is the log-potential for chain k𝑘kitalic_k at iteration i𝑖iitalic_i,

ik=m=1Mw0mm(θik).subscript𝑖𝑘superscriptsubscript𝑚1𝑀subscript𝑤0𝑚subscript𝑚subscript𝜃𝑖𝑘\displaystyle\ell_{ik}=\sum_{m=1}^{M}w_{0m}\ell_{m}(\theta_{ik}).roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 0 italic_m end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) . (20)

Our test monitors the difference between m1subscript𝑚1m_{1}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and m2subscript𝑚2m_{2}italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT relative to s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. A small difference between m1subscript𝑚1m_{1}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and m2subscript𝑚2m_{2}italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT indicates that the Markov chain states have stabilized. The residual standard errors from the linear fits allows us to remove trends from the noise computation. We define, for each k[K]𝑘delimited-[]𝐾k\in[K]italic_k ∈ [ italic_K ],

uk=|m1m2|max{s1,s2},subscript𝑢𝑘subscript𝑚1subscript𝑚2subscript𝑠1subscript𝑠2\displaystyle u_{k}=\frac{\mathopen{}\mathclose{{}\left|m_{1}-m_{2}}\right|}{% \max\{s_{1},s_{2}\}},italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG | italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | end_ARG start_ARG roman_max { italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } end_ARG , (21)

and use the median of (uk)k=1Ksuperscriptsubscriptsubscript𝑢𝑘𝑘1𝐾\left(u_{\ell k}\right)_{k=1}^{K}( italic_u start_POSTSUBSCRIPT roman_ℓ italic_k end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT as our test statistic. This test statistic is checked against a threshold c𝑐citalic_c; the test passes when the median test statistic becomes smaller than c𝑐citalic_c. The pseudocode for the hot-start test is given in Algorithm 3. From our experiments, we find that setting c=0.5𝑐0.5c=0.5italic_c = 0.5 works well in general.

Algorithm 3 HotStartTest
(ik)i=1,k=1t,Ksuperscriptsubscriptsubscript𝑖𝑘formulae-sequence𝑖1𝑘1𝑡𝐾\left(\ell_{ik}\right)_{i=1,k=1}^{t,K}( roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 , italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_K end_POSTSUPERSCRIPT, t𝑡titalic_t, c=0.5𝑐0.5c=0.5italic_c = 0.5
n=ceil(t/3)𝑛ceil𝑡3n=\texttt{ceil}(t/3)italic_n = ceil ( italic_t / 3 )
for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K do
     s121n2mina,bi=n+12n(a+biik)2subscriptsuperscript𝑠211𝑛2subscript𝑎𝑏superscriptsubscript𝑖𝑛12𝑛superscript𝑎𝑏𝑖subscript𝑖𝑘2s^{2}_{1}\leftarrow\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=n+1}^{2n}\left(% a+bi-\ell_{ik}\right)^{2}italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ← divide start_ARG 1 end_ARG start_ARG italic_n - 2 end_ARG roman_min start_POSTSUBSCRIPT italic_a , italic_b ∈ blackboard_R end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ( italic_a + italic_b italic_i - roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
     s221n2mina,bi=2n+1t(a+biik)2subscriptsuperscript𝑠221𝑛2subscript𝑎𝑏superscriptsubscript𝑖2𝑛1𝑡superscript𝑎𝑏𝑖subscript𝑖𝑘2s^{2}_{2}\leftarrow\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=2n+1}^{t}\left(% a+bi-\ell_{ik}\right)^{2}italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ← divide start_ARG 1 end_ARG start_ARG italic_n - 2 end_ARG roman_min start_POSTSUBSCRIPT italic_a , italic_b ∈ blackboard_R end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 2 italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_a + italic_b italic_i - roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
     uk|(1ni=n+12nik)(1ni=2n+1tik)|max{s1,s2}subscript𝑢𝑘1𝑛superscriptsubscript𝑖𝑛12𝑛subscript𝑖𝑘1𝑛superscriptsubscript𝑖2𝑛1𝑡subscript𝑖𝑘subscript𝑠1subscript𝑠2u_{k}\leftarrow\frac{\left|\left(\frac{1}{n}\sum_{i=n+1}^{2n}\ell_{ik}\right)-% \left(\frac{1}{n}\sum_{i=2n+1}^{t}\ell_{ik}\right)\right|}{\max\{s_{1},s_{2}\}}italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← divide start_ARG | ( divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) - ( divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 2 italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ) | end_ARG start_ARG roman_max { italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } end_ARG
end for
return (true if median(u1,,uK)<csubscript𝑢1subscript𝑢𝐾𝑐\left(u_{1},\dots,u_{K}\right)<c( italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_u start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) < italic_c else  false)

4.2 Acceleration

To accelerate DoG, we begin by noting that the denominator of the DoG learning rate in Eq. 14 is similar to that of AdaGrad (Duchi et al., 2011) in that it is a cumulative sum of some function of the gradient. Therefore, we can leverage the idea used in RMSProp (Hinton et al., 2012) for accelerating AdaGrad to accelerate DoG. In particular, at iteration t𝑡titalic_t, we can replace itg^i2subscript𝑖𝑡superscriptnormsubscript^𝑔𝑖2\sum_{i\leq t}\|\hat{g}_{i}\|^{2}∑ start_POSTSUBSCRIPT italic_i ≤ italic_t end_POSTSUBSCRIPT ∥ over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT with tv^t𝑡subscript^𝑣𝑡t\hat{v}_{t}italic_t over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the bias-corrected exponential moving average of the squared gradient. This allows us to exponentially decrease the weights of past gradient norms. As a result, the effect of the early g^t2superscriptnormsubscript^𝑔𝑡2\|\hat{g}_{t}\|^{2}∥ over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT terms on the learning rate diminishes over time, resulting in less conservative learning rates. To account for situations where the gradient estimates differ in scale across dimensions, we apply the above acceleration technique in a coordinate-wise fashion and obtain the following update rule for v^tsubscript^𝑣𝑡\hat{v}_{t}over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT:

vt=βvt1+(1β)g^t2,v^t=vt1βt,formulae-sequencesubscript𝑣𝑡𝛽subscript𝑣𝑡11𝛽superscriptsubscript^𝑔𝑡2subscript^𝑣𝑡subscript𝑣𝑡1superscript𝛽𝑡\displaystyle v_{t}=\beta v_{t-1}+(1-\beta)\hat{g}_{t}^{2},\quad\hat{v}_{t}=% \frac{v_{t}}{1-\beta^{t}},italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_β italic_v start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + ( 1 - italic_β ) over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG 1 - italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG , (22)

where β(0,1)𝛽01\beta\in(0,1)italic_β ∈ ( 0 , 1 ), v0=0subscript𝑣00v_{0}=0italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0, and g^t2superscriptsubscript^𝑔𝑡2\hat{g}_{t}^{2}over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT denotes the vector with each entry of g^tsubscript^𝑔𝑡\hat{g}_{t}over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT squared. We further apply the same idea to rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the maximum distance traveled from w0subscript𝑤0w_{0}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and g^tsubscript^𝑔𝑡\hat{g}_{t}over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the gradient estimate itself, thereby arriving at our proposed optimization procedure. Note that in Algorithm 2, the max\maxroman_max operator denotes coordinate-wise maximum and the direct-product\odot operater denotes coordinate-wise multiplication.

In Hot DoG, we set the exponential decay rates, β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, to be the same as those in Kingma and Ba (2014), and we set the initial learning rate r𝑟ritalic_r to a small constant (default 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) following the recommendation of Ivgi et al. (2023).

5 EXPERIMENTS

In this section, we demonstrate the effectiveness of Hot DoG and compare our method against other learning-rate-free stochastic gradient methods: optimally-tuned ADAM from a log scale grid search, as well as prodigy ADAM (Mishchenko and Defazio, 2024), DoG (Ivgi et al., 2023), and DoWG (Khaled et al., 2023) over different initial parameters. We compare the quality of posterior the approximations over different coreset sizes M𝑀Mitalic_M and weight optimization procedures for each experiment. Following Chen and Campbell (2024), for all optimization methods, we set the number of Markov chains to K=2𝐾2K=2italic_K = 2 and subsample size to S=M𝑆𝑀S=Mitalic_S = italic_M in Eq. 7. We set the Markov kernel κwsubscript𝜅𝑤\kappa_{w}italic_κ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT to the hit-and-run slice sampler with doubling (Bélisle et al., 1993; Neal, 2003) for all real data experiments. For the Gaussian location model, we use a kernel that directly samples from πwsubscript𝜋𝑤\pi_{w}italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT (Chen and Campbell, 2024, Sec. 3.4); for the sparse regression example, we use Gibbs sampling (George and McCulloch, 1993).

We compare these algorithms using six different Bayesian models (two synthetic, four real): a synthetic Gaussian location, a synthetic sparse regression, a (non-conjugate) linear regression, a logistic regression, a Poisson regression, and a Bradley-Terry model. See Section A.1 for details. We use Stan (Carpenter et al., 2017) to obtain full data inference results for real data experiments, and Gibbs sampling (George and McCulloch, 1993) for the sparse regression model with discrete variables. For all experiments, we measure the posterior approximation quality using the average squared z-score, defined as 1Di=1D(μiμ^iσi)21𝐷superscriptsubscript𝑖1𝐷superscriptsubscript𝜇𝑖subscript^𝜇𝑖subscript𝜎𝑖2\frac{1}{D}\sum_{i=1}^{D}(\frac{\mu_{i}-\hat{\mu}_{i}}{\sigma_{i}})^{2}divide start_ARG 1 end_ARG start_ARG italic_D end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( divide start_ARG italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, where μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and σisubscript𝜎𝑖\sigma_{i}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are, respectively, the coordinate-wise mean and standard deviation estimated using the full data posterior, and μ^isubscript^𝜇𝑖\hat{\mu}_{i}over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the coordinate-wise mean estimated using draws from Coreset MCMC. This estimate is computed in a streaming fashion using the second half of all draws simulated at the time; note that this includes draws from πw0subscript𝜋subscript𝑤0\pi_{w_{0}}italic_π start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT before the hot-start test passes.

Each algorithm was run on 8 single-threaded cores of a 2.1GHz Intel Xeon Gold 6130 processor with 32GB memory. Code for these experiments is available at https://github.com/NaitongChen/automated-coreset-mcmc-experiments. More experimental details and additional plots are in Sections A.1 and A.2.

Effect of hot-start test. Fig. 4 compares Hot DoG with and without the hot-start test for M=1000𝑀1000M=1000italic_M = 1000 across all experiments; the same plots for other coreset sizes can be found in Section A.2. Without the hot-start test, the traces often hit a long plateau, before the effect of exponentially-weighted averaging is able to decay early large gradient norms. On the other hand, with burn-in, we begin by simulating from Markov chains targeting πw0subscript𝜋subscript𝑤0\pi_{w_{0}}italic_π start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and start optimizing the coreset weights only after the hot-start test has passed. In terms of the number of log potential evaluations, Hot DoG with burn-in leaves the plateau sooner than without burn-in.

Fig. 5 examines the behaviour of the hot-start test in more detail, showing the traces of the gradient estimate norms g^tnormsubscript^𝑔𝑡\|\hat{g}_{t}\|∥ over^ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ and test statistics median(u1,,uK)subscript𝑢1subscript𝑢𝐾(u_{1},\dots,u_{K})( italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_u start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) across optimization iterations when using Hot DoG. Note that here we only show plots for M=1000𝑀1000M=1000italic_M = 1000; the same plots for other coreset sizes can be found in Section A.2. In some experiments, the Markov chains are initialized reasonably well where the gradient norms are already stabilized, and the test passes almost immediately. In others, the Markov chains are initialized poorly and the gradient norms are large, but nevertheless, the hot-start test passes shortly after they stabilize. Across all of these experiments, a test statistic threshold of 0.5 worked well.

Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 4: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained using Hot DoG with and without hot-start test. All figures share the legend in Fig. 4(c). The coreset size M𝑀Mitalic_M is 1000100010001000 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs. Orange lines indicate runs with hot-start test and blue lines without.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 5: Trace of gradient estimate norms (blue) and hot-start test statistics (green) before weight optimization across all experiments with M=1000𝑀1000M=1000italic_M = 1000. The orange horizontal line is the test statistic threshold c=0.5𝑐0.5c=0.5italic_c = 0.5.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 6: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained from Hot DoG and optimally-tuned ADAM. All figures share the legend in Fig. 6(c). The coreset size M=1000𝑀1000M=1000italic_M = 1000 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs.

Robustness to fixed parameter r𝑟ritalic_r. Figure 6 provides an examination of the robustness of the proposed method to the fixed initial learning rate parameter r𝑟ritalic_r. Across all experiments, different values of r𝑟ritalic_r spanning multiple orders of magnitude result in similar posterior approximations across optimization iterations. Note that M𝑀Mitalic_M is 1000100010001000 for all plots in Fig. 6. The same trends can be observed over different coreset sizes (see Section A.2).

Comparison with other related methods. Figure 7 shows a comparison between our method and DoG, DoWG, ADAM, as well as prodigy ADAM. We fix r=0.001𝑟0.001r=0.001italic_r = 0.001 and c=0.5𝑐0.5c=0.5italic_c = 0.5 for Hot DoG. Since the hot-start test itself can be applied to all methods, Hot DoG is compared against others both with and without burn-in. The posterior approximation quality of Hot DoG is orders of magnitude better than all other methods in many settings tested, and remain competitive otherwise. In particular, Hot DoG is capable of matching the performance of optimally-tuned ADAM without tuning.

Refer to caption
Refer to caption
(a) DoG
Refer to caption
Refer to caption
(b) DoWG
Refer to caption
Refer to caption
(c) ADAM
Refer to caption
Refer to caption
(d) prodigy ADAM
Refer to caption
Refer to caption
(e) DoG with hot-start
Refer to caption
Refer to caption
(f) DoWG with hot-start
Refer to caption
Refer to caption
(g) ADAM with hot-start
Refer to caption
Refer to caption
(h) prodigy ADAM with hot-start
Figure 7: Relative Coreset MCMC posterior approximation error (average squared coordinate-wise z-scores) using different optimization algorithms (labeled in the subfigure captions) versus the proposed Hot DoG method (with fixed r=0.001𝑟0.001r=0.001italic_r = 0.001 and c=0.5𝑐0.5c=0.5italic_c = 0.5). Median values after 200,000200000200,000200 , 000 optimization iterations across 10101010 trials are used for the relative comparison for a variety of datasets, models, and coreset sizes. Above the horizontal black line (100superscript10010^{0}10 start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT) indicates that the proposed Hot DoG method outperformed the method it compared to.

6 CONCLUSION

This paper introduced Hot DoG, a learning-rate-free stochastic gradient method designed for learning coreset weights within the framework of Coreset MCMC. Our method extends DoG, but includes adjustments tailored to the Markovian setting of Coreset MCMC. In particular, Hot DoG includes a hot-start test for detecting when to start training coreset weights, as well as acceleration techniques. The quality of constructed coresets by Hot DoG and their corresponding posterior approximation is robust to all of its input parameters. Empirically, Hot DoG produces better posterior approximations than other learning-rate-free stochastic gradient methods, and is competitive to those of optimally-tuned ADAM.

Acknowledgements

T.C. was supported by NSERC Discovery Grant RGPIN-2019-03962, and J.H.H. was partially supported by National Science Foundation CAREER award IIS-2340586. We acknowledge the use of the ARC Sockeye computing platform from the University of British Columbia.

References

  • Robert and Casella (2004) Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2ndsuperscript2nd2^{\text{nd}}2 start_POSTSUPERSCRIPT nd end_POSTSUPERSCRIPT edition, 2004.
  • Robert and Casella (2011) Christian Robert and George Casella. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 26(1):102–115, 2011.
  • Gelman et al. (2013) Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian Data Analysis. CRC Press, 3rdsuperscript3rd3^{\text{rd}}3 start_POSTSUPERSCRIPT rd end_POSTSUPERSCRIPT edition, 2013.
  • Huggins et al. (2016) Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.
  • Naik et al. (2022) Cian Naik, Judith Rousseau, and Trevor Campbell. Fast Bayesian coresets via subsampling and quasi-Newton refinement. In Advances in Neural Information Processing Systems, 2022.
  • Chen et al. (2022) Naitong Chen, Zuheng Xu, and Trevor Campbell. Bayesian inference via sparse Hamiltonian flows. In Advances in Neural Information Processing Systems, 2022.
  • Campbell (2024) Trevor Campbell. General bounds on the quality of Bayesian coresets. In Advances in Neural Information Processing Systems, 2024.
  • Chen and Campbell (2024) Naitong Chen and Trevor Campbell. Coreset Markov chain Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2024.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Ivgi et al. (2023) Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is SGD’s best friend: a parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
  • Khaled et al. (2023) Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Processing Systems, 2023.
  • Hinton et al. (2012) Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a: overview of mini-batch gradient descent, 2012.
  • Campbell and Beronov (2019) Trevor Campbell and Boyan Beronov. Sparse variational inference: Bayesian coresets from scratch. In Advances in Neural Information Processing Systems, 2019.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
  • Carmon and Hinder (2022) Yair Carmon and Oliver Hinder. Making SGD parameter-free. In Conference on Learning Theory, 2022.
  • Defazio and Mishchenko (2023) Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, 2023.
  • Mishchenko and Defazio (2024) Konstantin Mishchenko and Aaron Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In International Conference on Machine Learning, 2024.
  • Orabona and Cutkosky (2020) Francesco Orabona and Ashok Cutkosky. International Conference on Machine Learning tutorial on parameter-free stochastic optimization, 2020.
  • Vaswani et al. (2019) Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, 2019.
  • Paquette and Scheinberg (2020) Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. Society for Industrial and Applied Mathematics Journal on Optimization, 30(1):349–376, 2020.
  • Loizou et al. (2021) Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for SGD: an adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, 2021.
  • Vehtari et al. (2021) Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: an improved R^^𝑅\widehat{R}over^ start_ARG italic_R end_ARG for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2):667–718, 2021.
  • Bélisle et al. (1993) Claude Bélisle, Edwin Romeijn, and Robert Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255–266, 1993.
  • Neal (2003) Radford Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
  • George and McCulloch (1993) Edward George and Robert McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
  • Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1):1––32, 2017.
  • Elo (1978) Aprad Elo. The Rating of Chessplayers, Past and Present. Arco, 1stsuperscript1st1^{\text{st}}1 start_POSTSUPERSCRIPT st end_POSTSUPERSCRIPT edition, 1978.

Appendix A APPENDIX

A.1 Details of Experiments

A.1.1 Model Specification

In this subsection, we describe the six examples (two synthetic and four real data) that we used for our experiments. Processed versions of all datasets used for the experiments are available at https://github.com/NaitongChen/automated-coreset-mcmc-experiments. For each of the regression models, we are given a set of points (xn,yn)n=1Nsubscriptsuperscriptsubscript𝑥𝑛subscript𝑦𝑛𝑁𝑛1(x_{n},y_{n})^{N}_{n=1}( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT, each consisting of features xnpsubscript𝑥𝑛superscript𝑝x_{n}\in\mathbb{R}^{p}italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and response ynsubscript𝑦𝑛y_{n}italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT.

Bayesian sparse linear regression: This is based on Example 4.1 from George and McCulloch (1993). We use the model

σ2superscript𝜎2\displaystyle\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT InvGam(ν/2,νλ/2),similar-toabsentInvGam𝜈2𝜈𝜆2\displaystyle\sim{\mathrm{InvGam}}\left(\nu/2,\nu\lambda/2\right),∼ roman_InvGam ( italic_ν / 2 , italic_ν italic_λ / 2 ) , (23)
i[p],γifor-all𝑖delimited-[]𝑝subscript𝛾𝑖\displaystyle\forall i\in[p],\quad\gamma_{i}∀ italic_i ∈ [ italic_p ] , italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT iidBern(q),iidsimilar-toBern𝑞\displaystyle\overset{\text{iid}}{\sim}{\mathrm{Bern}}(q),overiid start_ARG ∼ end_ARG roman_Bern ( italic_q ) , (24)
βiγiconditionalsubscript𝛽𝑖subscript𝛾𝑖\displaystyle\beta_{i}\mid\gamma_{i}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ind𝒩(0,(𝟙(γi=0)τ+𝟙(γi=1)cτ)2),indsimilar-to𝒩0superscript1subscript𝛾𝑖0𝜏1subscript𝛾𝑖1𝑐𝜏2\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(0,\left(\mathds{1}(% \gamma_{i}=0)\tau+\mathds{1}(\gamma_{i}=1)c\tau\right)^{2}\right),overind start_ARG ∼ end_ARG caligraphic_N ( 0 , ( blackboard_1 ( italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ) italic_τ + blackboard_1 ( italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) italic_c italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (25)
n[N],ynxn,β,σ2for-all𝑛delimited-[]𝑁conditionalsubscript𝑦𝑛subscript𝑥𝑛𝛽superscript𝜎2\displaystyle\forall n\in[N],\quad y_{n}\mid x_{n},\beta,\sigma^{2}∀ italic_n ∈ [ italic_N ] , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_β , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ind𝒩(xnβ,σ2),indsimilar-to𝒩superscriptsubscript𝑥𝑛top𝛽superscript𝜎2\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(x_{n}^{\top}\beta,% \sigma^{2}\right),overind start_ARG ∼ end_ARG caligraphic_N ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (26)

where we set ν=0.1,λ=1,q=0.1,τ=0.1formulae-sequence𝜈0.1formulae-sequence𝜆1formulae-sequence𝑞0.1𝜏0.1\nu=0.1,\lambda=1,q=0.1,\tau=0.1italic_ν = 0.1 , italic_λ = 1 , italic_q = 0.1 , italic_τ = 0.1, and c=10𝑐10c=10italic_c = 10. Here we model the variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, the vector of regression coefficients β=[β1βp]p𝛽superscriptmatrixsubscript𝛽1subscript𝛽𝑝topsuperscript𝑝\beta=\begin{bmatrix}\beta_{1}&\dots&\beta_{p}\end{bmatrix}^{\top}\in\mathbb{R% }^{p}italic_β = [ start_ARG start_ROW start_CELL italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_β start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and the vector of binary variables γ=[γ1γp]{0,1}p𝛾superscriptmatrixsubscript𝛾1subscript𝛾𝑝topsuperscript01𝑝\gamma=\begin{bmatrix}\gamma_{1}&\dots&\gamma_{p}\end{bmatrix}^{\top}\in\left% \{0,1\right\}^{p}italic_γ = [ start_ARG start_ROW start_CELL italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT indicating the inclusion of the pthsuperscript𝑝thp^{\text{th}}italic_p start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT feature in the model. We set N=50,000𝑁50000N=50{,}000italic_N = 50 , 000, p=10𝑝10p=10italic_p = 10, β=[0000055555]superscript𝛽superscriptmatrix0000055555top\beta^{\star}=\begin{bmatrix}0&0&0&0&0&5&5&5&5&5\end{bmatrix}^{\top}italic_β start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 5 end_CELL start_CELL 5 end_CELL start_CELL 5 end_CELL start_CELL 5 end_CELL start_CELL 5 end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, and generate a synthetic dataset by

n[N],xnfor-all𝑛delimited-[]𝑁subscript𝑥𝑛\displaystyle\forall n\in[N],\quad x_{n}∀ italic_n ∈ [ italic_N ] , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT iid𝒩(0,I),iidsimilar-to𝒩0𝐼\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}\left(0,I\right),overiid start_ARG ∼ end_ARG caligraphic_N ( 0 , italic_I ) , (27)
ϵnsubscriptitalic-ϵ𝑛\displaystyle\epsilon_{n}italic_ϵ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT iid𝒩(0,252),iidsimilar-to𝒩0superscript252\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}\left(0,25^{2}\right),overiid start_ARG ∼ end_ARG caligraphic_N ( 0 , 25 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (28)
ynsubscript𝑦𝑛\displaystyle y_{n}italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT =xnβ+ϵn.absentsuperscriptsubscript𝑥𝑛topsuperscript𝛽subscriptitalic-ϵ𝑛\displaystyle=x_{n}^{\top}\beta^{\star}+\epsilon_{n}.= italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT . (29)

Bayesian linear regression: We use the model

[βlogσ2]superscriptmatrix𝛽superscript𝜎2top\displaystyle\begin{bmatrix}\beta&\log\sigma^{2}\end{bmatrix}^{\top}[ start_ARG start_ROW start_CELL italic_β end_CELL start_CELL roman_log italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT 𝒩(0,I),similar-toabsent𝒩0𝐼\displaystyle\sim\mathcal{N}(0,I),∼ caligraphic_N ( 0 , italic_I ) , (30)
n[N],ynxn,β,σ2for-all𝑛delimited-[]𝑁conditionalsubscript𝑦𝑛subscript𝑥𝑛𝛽superscript𝜎2\displaystyle\forall n\in[N],y_{n}\mid x_{n},\beta,\sigma^{2}∀ italic_n ∈ [ italic_N ] , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_β , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ind𝒩([1xn]β,σ2),indsimilar-to𝒩matrix1superscriptsubscript𝑥𝑛top𝛽superscript𝜎2\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(\begin{bmatrix}1&x_{n}% ^{\top}\end{bmatrix}\beta,\sigma^{2}\right),overind start_ARG ∼ end_ARG caligraphic_N ( [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] italic_β , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (31)

where βp+1𝛽superscript𝑝1\beta\in\mathbb{R}^{p+1}italic_β ∈ blackboard_R start_POSTSUPERSCRIPT italic_p + 1 end_POSTSUPERSCRIPT is a vector of regression coefficients and σ2+superscript𝜎2subscript\sigma^{2}\in\mathbb{R}_{+}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is the noise variance. Note that the prior here is not conjugate for the likelihood. The dataset consists of flight delay information from N=98,673𝑁98673N=98{,}673italic_N = 98 , 673 observations and was constructed using flight delay data from https://www.transtats.bts.gov/Homepage.asp and historical weather information from https://www.wunderground.com/. We study the difference, in minutes, between the scheduled and actual departure times against p=10𝑝10p=10italic_p = 10 features including flight-specific and meteorological information.

Bayesian logistic regression: We use the model

i[p+1],βifor-all𝑖delimited-[]𝑝1subscript𝛽𝑖\displaystyle\forall i\in[p+1],\quad\beta_{i}∀ italic_i ∈ [ italic_p + 1 ] , italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT iidCauchy(0,1),iidsimilar-toCauchy01\displaystyle\overset{\text{iid}}{\sim}{\mathrm{Cauchy}}(0,1),overiid start_ARG ∼ end_ARG roman_Cauchy ( 0 , 1 ) , (32)
n[N],ynfor-all𝑛delimited-[]𝑁subscript𝑦𝑛\displaystyle\forall n\in[N],\quad y_{n}∀ italic_n ∈ [ italic_N ] , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT indBern((1+exp([1xn]β))1),indsimilar-toBernsuperscript1matrix1superscriptsubscript𝑥𝑛top𝛽1\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Bern}}\left(\left(1+\exp\left(% -\begin{bmatrix}1&x_{n}^{\top}\end{bmatrix}\beta\right)\right)^{-1}\right),overind start_ARG ∼ end_ARG roman_Bern ( ( 1 + roman_exp ( - [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] italic_β ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) , (33)

where β=[β1βp+1]p+1𝛽superscriptmatrixsubscript𝛽1subscript𝛽𝑝1topsuperscript𝑝1\beta=\begin{bmatrix}\beta_{1}&\dots&\beta_{p+1}\end{bmatrix}^{\top}\in\mathbb% {R}^{p+1}italic_β = [ start_ARG start_ROW start_CELL italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_β start_POSTSUBSCRIPT italic_p + 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p + 1 end_POSTSUPERSCRIPT is a vector of regression coefficients. Here we use the same dataset as in linear regression, but instead model the relationship between whether a flight is cancelled using the same set of features. Note that of all flights included, only 0.058%percent0.0580.058\%0.058 % were cancelled.

Bayesian Poisson regression: We use the model

β𝛽\displaystyle\betaitalic_β 𝒩(0,I),similar-toabsent𝒩0𝐼\displaystyle\sim\mathcal{N}(0,I),∼ caligraphic_N ( 0 , italic_I ) , (34)
n[N],ynxn,βfor-all𝑛delimited-[]𝑁conditionalsubscript𝑦𝑛subscript𝑥𝑛𝛽\displaystyle\forall n\in[N],y_{n}\mid x_{n},\beta∀ italic_n ∈ [ italic_N ] , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_β indPoiss(log(1+e[1xn]β)),indsimilar-toPoiss1superscript𝑒matrix1superscriptsubscript𝑥𝑛top𝛽\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Poiss}}\left(\log\left(1+e^{% \begin{bmatrix}1&x_{n}^{\top}\end{bmatrix}\beta}\right)\right),overind start_ARG ∼ end_ARG roman_Poiss ( roman_log ( 1 + italic_e start_POSTSUPERSCRIPT [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] italic_β end_POSTSUPERSCRIPT ) ) , (35)

where βp+1𝛽superscript𝑝1\beta\in\mathbb{R}^{p+1}italic_β ∈ blackboard_R start_POSTSUPERSCRIPT italic_p + 1 end_POSTSUPERSCRIPT is a vector of regression coefficients. The dataset consists of N=15,641𝑁15641N=15{,}641italic_N = 15 , 641 observations, and we model the hourly count of rental bikes against p=8𝑝8p=8italic_p = 8 features (e.g., temperature, humidity at the time, and whether or not the day is a workday). The original bike share dataset is available at https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset.

The remaining two non-regression models are specified as follows.

Gaussian location: We use the model

θ𝜃\displaystyle\thetaitalic_θ 𝒩(0,I),similar-toabsent𝒩0𝐼\displaystyle\sim\mathcal{N}(0,I),∼ caligraphic_N ( 0 , italic_I ) , (36)
n[N],Xnfor-all𝑛delimited-[]𝑁subscript𝑋𝑛\displaystyle\forall n\in[N],X_{n}∀ italic_n ∈ [ italic_N ] , italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT iid𝒩(θ,I),iidsimilar-to𝒩𝜃𝐼\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}(\theta,I),overiid start_ARG ∼ end_ARG caligraphic_N ( italic_θ , italic_I ) , (37)

where θ,Xnd𝜃subscript𝑋𝑛superscript𝑑\theta,X_{n}\in\mathbb{R}^{d}italic_θ , italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. Here we model the mean θ𝜃\thetaitalic_θ. We set N=10,000,d=20formulae-sequence𝑁10000𝑑20N=10{,}000,d=20italic_N = 10 , 000 , italic_d = 20 and generate a synthetic dataset by

n[N],xniid𝒩(0,I).for-all𝑛delimited-[]𝑁subscript𝑥𝑛iidsimilar-to𝒩0𝐼\displaystyle\forall n\in[N],x_{n}\overset{\text{iid}}{\sim}\mathcal{N}(0,I).∀ italic_n ∈ [ italic_N ] , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT overiid start_ARG ∼ end_ARG caligraphic_N ( 0 , italic_I ) . (38)

Bradley-Terry model: We use the model

θ𝜃\displaystyle\thetaitalic_θ iid𝒩(0,I),iidsimilar-to𝒩0𝐼\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}(0,I),overiid start_ARG ∼ end_ARG caligraphic_N ( 0 , italic_I ) , (39)
n[N],ynhn,vn,θfor-all𝑛delimited-[]𝑁conditionalsubscript𝑦𝑛subscript𝑛subscript𝑣𝑛𝜃\displaystyle\forall n\in[N],y_{n}\mid h_{n},v_{n},\theta∀ italic_n ∈ [ italic_N ] , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∣ italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_θ indBern((1+exp((θvnθhn)/400))1),indsimilar-toBernsuperscript1subscript𝜃subscript𝑣𝑛subscript𝜃subscript𝑛4001\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Bern}}\left(\left(1+\exp\left(% (\theta_{v_{n}}-\theta_{h_{n}})/400\right)\right)^{-1}\right),overind start_ARG ∼ end_ARG roman_Bern ( ( 1 + roman_exp ( ( italic_θ start_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) / 400 ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) , (40)

where θd𝜃superscript𝑑\theta\in\mathbb{R}^{d}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. The dataset was constructed using games statistics from https://www.nba.com/stats and consists of data of N=26,651𝑁26651N=26,651italic_N = 26 , 651 NBA games between the 2004 and 2022 seasons. hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and vnsubscript𝑣𝑛v_{n}italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT are the home team and visitor team IDs for the nthsuperscript𝑛thn^{\text{th}}italic_n start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT game in the dataset, and ynsubscript𝑦𝑛y_{n}italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT denotes the outcome of the game (yn=1subscript𝑦𝑛1y_{n}=1italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1 if the home team won and yn=0subscript𝑦𝑛0y_{n}=0italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0 if the visitor team won). θd𝜃superscript𝑑\theta\in\mathbb{R}^{d}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT represents the Elo ratings or relative skill levels (Elo, 1978, Ch. 1) for each of the d=30𝑑30d=30italic_d = 30 teams. We model the Elo ratings using outcomes of pairwise comparisons between teams using game outcomes.

A.1.2 Parameter Settings

For full-data inference results of all examples except for the sparse linear regression model, we ran Stan (Carpenter et al., 2017) with 10 parallel chains, each taking 100,000100000100{,}000100 , 000 steps with the first 50,0005000050{,}00050 , 000 discarded, for a combined 500,000500000500{,}000500 , 000 draws. For full-data inference result of the sparse linear regression example, we use the Gibbs sampler developed by George and McCulloch (1993) to generate 200,000200000200{,}000200 , 000 draws, with the first half discarded as burn-in.

To account for changes in w𝑤witalic_w, for all real data experiments, we use the hit-and-run slice sampler with doubling (Bélisle et al., 1993; Neal, 2003); for the Gaussian location model, we use a kernel that directly samples from πwsubscript𝜋𝑤\pi_{w}italic_π start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT (Chen and Campbell, 2024, Sec. 3.4). for the sparse regression, we use the Gibbs sampler developed by George and McCulloch (1993).

We use Stan (Carpenter et al., 2017) to obtain full data inference results for real data experiments, and Gibbs sampling (George and McCulloch, 1993) for the sparse regression model with discrete variables. The true posterior distribution for the Gaussian location model is available in closed form.

For ADAM, we test multiple learning rates over a log scale grid (10k)superscript10𝑘\left(10^{k}\right)( 10 start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) for k=3,2,,1𝑘321k=-3,-2,\dots,1italic_k = - 3 , - 2 , … , 1. For each experiment under each coreset size, the optimally-tuned ADAM is the one that obtained the lowest average squared z-score after 200,000200000200,000200 , 000 iterations of weight optimization. For all learning-rate-free methods, we test different initial parameters (initial lower bound for prodigy ADAM and r0subscript𝑟0r_{0}italic_r start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for Hot DoG, DoG, and DoWG) over a log scaled grid (10k)superscript10𝑘\left(10^{k}\right)( 10 start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) for k=3,2,,1𝑘321k=-3,-2,\dots,1italic_k = - 3 , - 2 , … , 1.

For the logistic regression example, to account for the class imbalance problem, we include all observations from the rare positive class if the coreset size is more than twice as big as the total number of observations with positive labels. Otherwise we sample our coreset to have 50%percent5050\%50 % positive labels and 50%percent5050\%50 % negative labels. Coreset points are uniformly subsampled for all other models.

A.2 Additional Results

Figs. 4 to 6 in the main text show the traces of average squared coordinate-wise z-scores, as well as the gradient estimate norms and hot-start test statistics for Hot DoG when M=1000𝑀1000M=1000italic_M = 1000. In this subsection, we show the same sets of plots for M=100𝑀100M=100italic_M = 100 and M=500𝑀500M=500italic_M = 500. Similarly to Fig. 4, Figs. 8 and 9 compare Hot DoG with and without hot-start test. Similarly to Fig. 5, Figs. 10 and 11 show the gradient estimate norms and hot-start test statistics during burn-in. Similarly to Fig. 6, Figs. 12 and 13 compare Hot DoG (with hot-start test) and optimally-tuned ADAM. We see that all plots show the same trends as the ones in Section 5, where M=1000𝑀1000M=1000italic_M = 1000. As a result, we arrive at similar observations as in Section 5. In particular, Hot DoG with burn-in leaves the plateau sooner than without burn-in; the hot-start test passes and thus burn-in terminates shortly after gradient norms are stabilized; Hot DoG is robust to the fixed parameter r𝑟ritalic_r.

Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 8: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained using Hot DoG with and without hot-start test. All figures share the legend in Fig. 8(c). The coreset size M𝑀Mitalic_M is 100100100100 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs. Orange lines indicate runs with hot-start test and blue lines without.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 9: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained using Hot DoG with and without hot-start test. All figures share the legend in Fig. 9(c). The coreset size M𝑀Mitalic_M is 500500500500 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs. Orange lines indicate runs with hot-start test and blue lines without.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 10: Trace of gradient estimate norms (blue) and hot-start test statistics (green) before weight optimization across all experiments with M=100𝑀100M=100italic_M = 100. The orange horizontal line is the test statistic threshold c=0.5𝑐0.5c=0.5italic_c = 0.5.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 11: Trace of gradient estimate norms (blue) and hot-start test statistics (green) before weight optimization across all experiments with M=500𝑀500M=500italic_M = 500. The orange horizontal line is the test statistic threshold c=0.5𝑐0.5c=0.5italic_c = 0.5.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 12: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained from Hot DoG and optimally-tuned ADAM. All figures share the legend in Fig. 12(c). The coreset size M=100𝑀100M=100italic_M = 100 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs.
Refer to caption
(a) Gaussian location
Refer to caption
(b) Sparse regression
Refer to caption
(c) Linear regression
Refer to caption
(d) Logistic regression
Refer to caption
(e) Poisson regression
Refer to caption
(f) Bradley-Terry
Figure 13: Traces of average squared coordinate-wise z-scores between the true and approximated posterior across all experiments, obtained from Hot DoG and optimally-tuned ADAM. All figures share the legend in Fig. 13(c). The coreset size M=500𝑀500M=500italic_M = 500 and each line represents a different initial learning rate parameter. The lines indicate the median from 10101010 runs.