Tuning-free coreset Markov chain Monte Carlo

Naitong Chen, Department of Statistics, University of British Columbia

Jonathan H. Huggins, Department of Mathematics & Statistics, Boston University

Trevor Campbell, Department of Statistics, University of British Columbia

Abstract

A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. Empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively with optimally-tuned ADAM.

1 INTRODUCTION

Figure 1: Relative Coreset MCMC posterior approximation error (average squared coordinate-wise z-score) using ADAM with different learning rates versus the proposed Hot DoG method (with fixed $r=0.001$). Median values after 200,000 optimization iterations across 10 trials are used for the relative comparison for a variety of datasets, models, and coreset sizes. Values above the horizontal black line ($10^{0}$) indicate that the proposed Hot DoG method outperformed ADAM.

Bayesian inference provides a flexible framework for parameter estimation and uncertainty quantification in statistical models. Markov chain Monte Carlo (Robert and Casella, 2004; Robert and Casella, 2011; Gelman et al., 2013, Chs. 11 and 12), the standard methodology for performing Bayesian inference, involves simulating carefully constructed Markov chains whose stationary distribution is the target Bayesian posterior. In the large-scale data setting, this procedure can become prohibitively expensive, as it requires iterating over the entire data set to simulate the next state.

Bayesian coresets (Huggins et al., 2016) are a popular approach for speeding up Bayesian inference in the large-scale data setting. A Bayesian coreset is a weighted subset of data that replaces the full data set during inference, leveraging the insight that large datasets often exhibit a significant degree of redundancy. With a carefully constructed coreset, one can significantly reduce the computational cost of inference while still obtaining samples from a high quality approximation of the full Bayesian posterior. In fact, given a data set of $N$ points, a coreset of size $\mathcal{O}(\log N)$ is sufficient for providing a near-exact posterior approximation in exponential family and other sufficiently simple models (Naik et al., 2022, Thms. 4.1 and 4.2; Chen et al., 2022, Prop. 3.1) and a coreset of size $\mathcal{O}(\operatorname{polylog}N)$ is sufficient for more general cases (Campbell, 2024, Cor. 6.1).

Constructing a coreset involves picking the data points to include in the coreset and assigning each data point its corresponding weight. The state-of-the-art method, Coreset MCMC (Chen and Campbell, 2024), selects coreset points by sampling them uniformly from the full data set, and learns the weights using stochastic gradient optimization techniques, e.g., ADAM (Kingma and Ba, 2014), where the gradients are estimated using MCMC draws targeting the current coreset posterior. However, as we demonstrate in this paper, there are two issues with this approach. First, the quality of the constructed coreset is sensitive to the learning rate of the stochastic optimization algorithm. And second, gradient estimates using MCMC draws are affected strongly in early iterations by initialization bias, leading to poor optimization performance.

To address these challenges, we propose Hot-start Distance over Gradient (Hot DoG), a tuning-free stochastic gradient optimization procedure that can be used for learning coreset weights in Coreset MCMC. Hot DoG is a stochastic gradient method combining techniques from Do(W)G (Ivgi et al., 2023; Khaled et al., 2023), ADAM (Kingma and Ba, 2014), and RMSProp (Hinton et al., 2012) to set learning rates automatically. Hot DoG also includes an automated warm-up phase prior to weight optimization, which guards against the use of low quality MCMC draws when estimating the objective function gradients. Fig. 1 demonstrates that Hot DoG performs competitively with optimally-tuned ADAM across a wide range of models, datasets, and coreset sizes, and can be multiple orders of magnitude more accurate than ADAM using other learning rates. Beyond the result in Fig. 1, we provide an extensive empirical investigation of the reliability of Hot DoG in comparison to other methods across many different synthetic and real experiments.

2 BACKGROUND

2.1 Bayesian Coresets

We are given a data set $(x_{n})_{n=1}^{N}$ of $N$ observations, a log-likelihood $\ell_{n}(\theta)\coloneqq\log p(x_{n}\mid\theta)$ for observation $n$ given $\theta\in\Theta$, and a prior density $\pi_{0}(\theta)$. We would like to sample from the Bayesian posterior with density

\displaystyle\pi(\theta)\coloneqq\frac{1}{Z}\exp\left(\sum_{n=1}^{N}\ell_{n}(\theta)\right)\pi_{0}(\theta),

(2)

where $Z$ is the unknown normalizing constant. A Bayesian coreset replaces the sum over $N$ log-likelihood terms with a weighted sum over a subset of size $M$ , where $M\ll N$ . Without loss of generality, we assume that these are the first $M$ points. The coreset posterior can then be written as

\displaystyle\pi_{w}(\theta)\coloneqq\frac{1}{Z(w)}\exp\left(\sum_{m=1}^{M}w_{m}\ell_{m}(\theta)\right)\pi_{0}(\theta),

(3)

where $w\in\mathbb{R}^{M}_{+}$ is a vector of coreset weights. Recent coreset construction methods uniformly select $M$ points to include in the coreset (Naik et al., 2022; Chen et al., 2022; Chen and Campbell, 2024), and then optimize the weights of those $M$ points as a variational inference problem (Campbell and Beronov, 2019),

\displaystyle w^{\star}=\operatornamewithlimits{arg\,min}_{w\in\mathbb{R}^{M}}\mathrm{D_{KL}}(\pi_{w}\|\pi)\quad\text{s.t.}\quad w\geq 0,

(4)

with objective function gradient

\displaystyle\nabla_{w}\mathrm{D_{KL}}(\pi_{w}\|\pi)=\operatorname{Cov}_{\pi_{w}}\left(\begin{bmatrix}\ell_{1}(\theta)\\ \vdots\\ \ell_{M}(\theta)\end{bmatrix},\sum_{m}w_{m}\ell_{m}(\theta)-\sum_{n}\ell_{n}(\theta)\right).

(5)

Algorithm 1 CoresetMCMC
Require: $\theta_{0}$, $\kappa_{w}$, $S$, $M$
$\triangleright$ Initialize coreset weights
$w_{0m}=\frac{N}{M},\quad m=1,\dots,M$
for $t=0,\dots,T$ do
  $\triangleright$ Subsample the data
  ${\mathcal{S}}_{t}\leftarrow{\mathrm{Unif}}\left(S,[N]\right)$ (without replacement)
  $\triangleright$ Compute gradient estimate
  $\hat{g}_{t}\leftarrow g(w_{t},\theta_{t},{\mathcal{S}}_{t})$ (Eq. 7)
  $w_{t+1}\leftarrow$ stochastic_gradient_step($w_{t},\hat{g}_{t}$)
  $\triangleright$ Step each Markov chain
  for $k=1,\dots,K$ do
    $\theta_{(t+1)k}\sim\kappa_{w_{t+1}}(\cdot\mid\theta_{tk})$
  end for
end for

2.2 Coreset MCMC

The key challenge in solving Eq. 4 is that $\pi_{w}$ does not admit tractable i.i.d. draws, and so unbiased estimates of the gradient in Eq. 5 are not readily available. Coreset MCMC (Chen and Campbell, 2024) is an adaptive algorithm that addresses this issue. The method first initializes weights $w_{0}\in\mathbb{R}^{M}$ and $K\geq 2$ samples $\theta_{0}=\left(\theta_{01},\dots,\theta_{0K}\right)\in\Theta^{K}$. At iteration $t\in\mathbb{N}$, given coreset weights $w_{t}$ and samples $\theta_{t}\in\Theta^{K}$, it then updates the weights $w_{t}\to w_{t+1}$ using the stochastic gradient estimate based on the draws $\theta_{t}$,

\displaystyle g(w_{t},\theta_{t},{\mathcal{S}}_{t})=\frac{1}{K-1}\sum_{k=1}^{K}\begin{bmatrix}\bar{\ell}_{1}(\theta_{tk})\\ \vdots\\ \bar{\ell}_{M}(\theta_{tk})\end{bmatrix}\left(\sum_{m}w_{tm}\bar{\ell}_{m}(\theta_{tk})-\frac{N}{S}\sum_{s\in{\mathcal{S}}_{t}}\bar{\ell}_{s}(\theta_{tk})\right),

(7)

where ${\mathcal{S}}_{t}\subseteq[N]$ is a uniform subsample of indices of size $S$, and $\bar{\ell}_{n}(\theta_{tk})=\ell_{n}(\theta_{tk})-\frac{1}{K}\sum_{j=1}^{K}\ell_{n}(\theta_{tj})$ is the log-likelihood centered across the $K$ chains. To complete the iteration, the method updates the samples by independently drawing $\theta_{(t+1)k}\sim\kappa_{w_{t+1}}(\cdot\mid\theta_{tk})$ for each $k\in[K]$, where $\kappa_{w}$ is a family of Markov kernels that have invariant distribution $\pi_{w}$. The pseudocode for Coreset MCMC is outlined in Algorithm 1.
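As an illustration of the gradient estimate in Eq. 7, the following is a minimal pure-Python sketch. The function name `coreset_gradient_estimate` and the table layout (`ell_core[k][m]` for coreset log-likelihoods, `ell_sub[k][s]` for subsampled log-likelihoods) are assumptions made for this example and are not taken from the Coreset MCMC code base.

```python
from typing import List, Sequence


def coreset_gradient_estimate(
    w: Sequence[float],
    ell_core: List[List[float]],
    ell_sub: List[List[float]],
    n_data: int,
) -> List[float]:
    """Sketch of the Coreset MCMC gradient estimate g(w_t, theta_t, S_t).

    ell_core[k][m] holds ell_m(theta_tk) for the M coreset points and
    ell_sub[k][s] holds ell_s(theta_tk) for the S subsampled points,
    for each of the K chains. n_data is the full data set size N.
    """
    K, M, S = len(ell_core), len(w), len(ell_sub[0])
    if K < 2:
        raise ValueError("need at least K = 2 chains")

    def center(table: List[List[float]]) -> List[List[float]]:
        # Subtract the across-chain mean from each column (the bar-ell terms).
        cols = len(table[0])
        means = [sum(row[j] for row in table) / K for j in range(cols)]
        return [[row[j] - means[j] for j in range(cols)] for row in table]

    core_c, sub_c = center(ell_core), center(ell_sub)
    grad = [0.0] * M
    for k in range(K):
        # Scalar residual: weighted coreset sum minus rescaled subsample sum.
        resid = sum(w[m] * core_c[k][m] for m in range(M))
        resid -= (n_data / S) * sum(sub_c[k])
        for m in range(M):
            grad[m] += core_c[k][m] * resid / (K - 1)
    return grad
```

Each chain contributes the product of its centered coreset log-likelihood vector and a scalar residual, so the cost per iteration is on the order of $K(M+S)$ log-likelihood evaluations rather than $KN$.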

3 TUNING-FREE CORESET MCMC

A key design choice when using Coreset MCMC is to specify how gradient estimates are used to optimize the weights. One can use ADAM (Kingma and Ba, 2014), which is used as the default optimizer for Coreset MCMC (Chen and Campbell, 2024): at iteration $t$ , with $\gamma_{t}>0$ being the user-specified learning rate, we set

\displaystyle w_{t+1}\leftarrow\operatorname{proj}_{\geq 0}\left(w_{t}-\gamma_{t}\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}\right),

(9)

where $\hat{m}_{t}$ and $\hat{v}_{t}$ are exponential averages of past gradients $(\hat{g}_{i})_{i=0}^{t}$ and their element-wise squares, and $\epsilon$ is a small constant. A wide range of other first-order stochastic methods could be used instead (e.g., vanilla stochastic gradient descent, AdaGrad (Duchi et al., 2011), etc.). However, like ADAM, most of these algorithms require setting a learning rate $\gamma_{t}$. As we show in Fig. 2, the quality of samples obtained from Coreset MCMC can be highly sensitive to the selected learning rate. In particular, Fig. 2 shows that when using ADAM, no single learning rate works well across all problems and coreset sizes; and for a given problem, the performance can vary by orders of magnitude as one varies the learning rate. Furthermore, the default ADAM learning rate of $10^{-3}$ (Kingma and Ba, 2014) provides poor results in most of the problems tested. As a result, careful tuning of the learning rate is required to obtain high quality posterior approximations. This usually involves a search on a log-scaled grid, which is computationally wasteful as the results for all but one of the parameter values are thrown out. Moreover, in practice, determining which learning rate provides the best posterior approximation may not be straightforward.
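For concreteness, the projected ADAM update in Eq. 9 can be sketched as below. This is a minimal illustration rather than the optimizer used in the experiments; the function name `projected_adam_step`, the argument layout, and the use of the standard ADAM bias corrections are assumptions of this example.

```python
import math
from typing import List, Tuple


def projected_adam_step(
    w: List[float],
    grad: List[float],
    m: List[float],
    v: List[float],
    t: int,
    lr: float,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """One ADAM step followed by projection onto w >= 0 (t counts from 1)."""
    # Exponential averages of the gradient and its element-wise square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Standard ADAM bias corrections.
    m_hat = [mi / (1 - beta1**t) for mi in m]
    v_hat = [vi / (1 - beta2**t) for vi in v]
    # Coordinate-wise step, then clamp at zero to keep the weights feasible.
    w_new = [
        max(0.0, wi - lr * mh / (math.sqrt(vh) + eps))
        for wi, mh, vh in zip(w, m_hat, v_hat)
    ]
    return w_new, m, v
```

Note that the user-chosen `lr` directly sets the size of every step, which is the sensitivity that the rest of this section aims to remove.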

A number of recent works in the literature propose learning-rate-free stochastic optimization methods to address this issue (Carmon and Hinder, 2022; Ivgi et al., 2023; Khaled et al., 2023; Defazio and Mishchenko, 2023; Mishchenko and Defazio, 2024). Many of these methods are shown empirically to work competitively compared to optimally-tuned SGD on a wide range of large-scale, non-convex deep learning problems. Although different at first glance, all of these methods arise from the same insight. Suppose one would like to solve the stochastic optimization problem

\displaystyle\min_{x\in\mathbb{R}^{d}}\mathbb{E}\left[f(x,\xi)\right],

(10)

where for all $\xi$, $f(\cdot,\xi)$ is convex and we only have access to unbiased stochastic gradients $g_{t}=\partial f(x_{t},\xi_{t})$. Define the initial optimality gap $d_{0}=\|x_{0}-x^{\star}\|$, where $x^{\star}$ is a solution of Eq. 10, and the sum of squared gradient norms $G_{T}=\sum_{t\leq T}\|g_{t}\|^{2}$. By setting the SGD learning rate

\displaystyle\gamma^{\star}=\frac{d_{0}}{\sqrt{G_{T}}},

(11)

the average iterate $\bar{x}=\frac{1}{T}\sum_{t\leq T}x_{t}$ satisfies the optimal error bound

\displaystyle\mathbb{E}\left[f(\bar{x},\xi)\right]-\mathbb{E}\left[f(x^{\star},\xi)\right]\leq\frac{d_{0}\sqrt{G_{T}}}{T}

(12)

after $T$ iterations (Carmon and Hinder, 2022; Orabona and Cutkosky, 2020). Learning-rate-free methods therefore essentially try to estimate or bound the initial optimality gap $d_{0}$ , which is unknown in practice. To the best of our knowledge, there are four state-of-the-art methods that do this in a manner that does not require multiple optimization runs, knowledge of unknown constants, or the ability to query the objective function: DoG (Ivgi et al., 2023), DoWG (Khaled et al., 2023), D-Adaptation (Defazio and Mishchenko, 2023) and prodigy (Mishchenko and Defazio, 2024). DoG and DoWG run vanilla stochastic gradient descent (SGD),

\displaystyle w_{t+1}\leftarrow\operatorname{proj}_{\geq 0}\left(w_{t}-\gamma_{t}g_{t}\right),

(13)

with learning rate schedules

\displaystyle\gamma_{t}=\frac{r_{t}}{\sqrt{G_{t}}}\ \text{(DoG)},\quad\gamma_{t}=\frac{r^{2}_{t}}{\sqrt{\sum_{i\leq t}r_{i}^{2}\|g_{i}\|^{2}}}\ \text{(DoWG)},

(14)

where $r_{0}$ is set to some small constant and, for $t\geq 1$ ,

\displaystyle r_{t}=\max_{i\leq t}\|w_{i}-w_{0}\|.

(15)
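The DoG and DoWG schedules in Eqs. 13 to 15 can be sketched as a short loop. This is an illustrative sketch only: the function name `run_dog`, the `grad_fn` callback, and the guard that sets the step to zero when the denominator is zero are assumptions of this example, and the sums in the denominators are taken over iterations up to and including the current one.

```python
import math
from typing import Callable, List


def run_dog(
    grad_fn: Callable[[List[float]], List[float]],
    w0: List[float],
    steps: int,
    r0: float = 1e-3,
    weighted: bool = False,
) -> List[float]:
    """Projected SGD with the DoG (or, if weighted, DoWG) step-size schedule."""
    w = list(w0)
    r_t = r0        # r_0 is a small user-chosen constant
    max_dist = 0.0  # running max of ||w_i - w_0||
    denom = 0.0     # running sum inside the square root
    for t in range(steps):
        g = grad_fn(w)
        g_norm2 = sum(gi * gi for gi in g)
        if t >= 1:
            dist = math.sqrt(sum((wi - w0i) ** 2 for wi, w0i in zip(w, w0)))
            max_dist = max(max_dist, dist)
            r_t = max_dist
        if weighted:
            denom += r_t**2 * g_norm2  # DoWG: distance-weighted gradient norms
            gamma = r_t**2 / math.sqrt(denom) if denom > 0 else 0.0
        else:
            denom += g_norm2           # DoG: plain sum of squared norms
            gamma = r_t / math.sqrt(denom) if denom > 0 else 0.0
        # Projected SGD step onto the nonnegative orthant.
        w = [max(0.0, wi - gamma * gi) for wi, gi in zip(w, g)]
    return w
```

In both variants the first step has length exactly $r_{0}$ regardless of the gradient scale, and later steps grow with the distance traveled from $w_{0}$. Because the denominators only accumulate, a few very large early gradients can keep the step size small for a long time.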

For D-Adaptation and prodigy, $r_{t}$ in Eq. 14 is replaced with a lower bound $d_{t}$ on $d_{0}$ , which is updated using estimated correlations between the gradient $g_{t}$ and step direction $w_{0}-w_{t}$ :

\displaystyle d_{t+1}=\max\left\{\frac{\sum_{i=0}^{t}d_{i}\left\langle g_{i},w_{0}-w_{i}\right\rangle}{\left\|\sum_{i=0}^{t}d_{i}g_{i}\right\|},d_{t}\right\}.

(16)

D-Adaptation replaces $r_{t}$ in Eq. 14 (DoG) with $d_{t}$, while prodigy replaces $r_{t}$ in Eq. 14 (DoWG) with $d_{t}$. Both D-Adaptation and prodigy have SGD and ADAM-based variants. All four methods have been shown empirically to match the performance of optimally-tuned SGD.
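The lower-bound update in Eq. 16 can be written out directly. The sketch below is a literal transcription under assumed names (`d_adaptation_estimate`, `d_hist`, `w_hist`); it is not the reference implementation of D-Adaptation or prodigy, and keeping the previous bound when the denominator is zero is an assumption of this example.

```python
import math
from typing import List


def d_adaptation_estimate(
    d_hist: List[float],
    grads: List[List[float]],
    w_hist: List[List[float]],
    w0: List[float],
    d_t: float,
) -> float:
    """Return d_{t+1} from Eq. 16 given d_i, g_i, w_i for i = 0..t."""
    dim = len(w0)
    num = 0.0
    weighted_grad_sum = [0.0] * dim
    for d_i, g_i, w_i in zip(d_hist, grads, w_hist):
        # Accumulate d_i * <g_i, w_0 - w_i> for the numerator.
        num += d_i * sum(g * (a - b) for g, a, b in zip(g_i, w0, w_i))
        # Accumulate d_i * g_i for the denominator norm.
        for j in range(dim):
            weighted_grad_sum[j] += d_i * g_i[j]
    den = math.sqrt(sum(s * s for s in weighted_grad_sum))
    if den == 0.0:
        return d_t  # assumption: keep the old bound when the ratio is undefined
    return max(num / den, d_t)
```

The outer `max` makes the sequence $d_{t}$ nondecreasing, so the estimate can only grow as more gradient information arrives.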

Fig. 3 shows the results from direct applications of DoG, DoWG, D-Adaptation (SGD), and prodigy (ADAM) to Coreset MCMC. We see that the posterior approximation quality of all four methods is orders of magnitude worse than that of optimally-tuned ADAM. With $\theta_{0}$ initialized far away from high density regions of $\pi_{w_{0}}$, the initial gradient estimates are large in magnitude, which leads to small learning rates. The accumulation of these large gradient norms in the learning rate denominator eventually causes the learning rate to vanish, halting the progress of coreset weight optimization. We address these problems in the next section.

Before concluding this section, we note that there are other approaches for making SGD free of learning rate tuning: some methods involve using stochastic versions of line search (Vaswani et al., 2019; Paquette and Scheinberg, 2020), and others do the same for the Polyak step size (Loizou et al., 2021). These methods are not applicable in our setting as they require evaluating the objective function. Recall that due to the unknown $Z(w)$ term in Eq. 3, we do not have access to estimates of the objective function.

Algorithm 2 HotDoG
Require: $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, $r=10^{-3}$, $T$, $\theta_{0}$, $w_{0}$
$v_{0}\leftarrow\bm{0}$, $m_{0}\leftarrow\bm{0}$, $d_{0}\leftarrow\bm{0}$, $c\leftarrow 0$, $h\leftarrow\texttt{false}$
for $t=1,\dots,T$ do
  if $h$ then
    $c\leftarrow c+1$
    ${\mathcal{S}}_{t}\leftarrow{\mathrm{Unif}}\left(S,[N]\right)$ (without replacement)
    $\hat{g}_{t}\leftarrow g(w_{t-1},\theta_{t-1},{\mathcal{S}}_{t})$ (Eq. 7)
    $v_{t}\leftarrow\beta_{2}v_{t-1}+(1-\beta_{2})\hat{g}_{t}^{2}$
    $m_{t}\leftarrow\beta_{1}m_{t-1}+(1-\beta_{1})\hat{g}_{t}$
    $d_{t}\leftarrow\beta_{1}d_{t-1}+(1-\beta_{1})\max\left\{\left|w_{t-1}-w_{0}\right|,d_{t-1}\right\}$
    $\hat{v}_{t}\leftarrow v_{t}/(1-\beta_{2}^{c})$
    $\hat{m}_{t}\leftarrow m_{t}/(1-\beta_{1}^{c})$
    $\hat{d}_{t}\leftarrow$ ($r\mathbf{1}$ if $c==1$ else $d_{t}/(1-\beta_{1}^{c-1})$)
    $w_{t}\leftarrow w_{t-1}-\hat{d}_{t}\left(\operatorname{diag}\left(\left(c\hat{v}_{t}\right)^{\frac{1}{2}}\right)+\epsilon I\right)^{-1}\odot\hat{m}_{t}$
  else
    $w_{t}\leftarrow w_{t-1}$, $v_{t}\leftarrow v_{t-1}$, $m_{t}\leftarrow m_{t-1}$, $d_{t}\leftarrow d_{t-1}$
  end if
  for $k=1,\dots,K$ do
    $\theta_{tk}\sim\kappa_{w_{t}}(\cdot\mid\theta_{(t-1)k})$ $\triangleright$ record $\ell_{tk}$
  end for
  $\triangleright$ Hot-start test
  $h\leftarrow$ (true if $h$ else $\texttt{HotStartTest}\left(\left(\ell_{ik}\right)_{i=1,k=1}^{t,K},t\right)$)
end for
return $w_{T}$

4 HOT DOG

In this section, we develop our novel Markovian optimization method, Hot-start DoG (Hot DoG), presented in Algorithm 2. Our method extends the original DoG optimizer in two ways: (1) we add a tuning-free hot-start test that automatically detects when the Markov chains have properly mixed and stochastic gradient estimates are stable, at which point we start coreset weight optimization; and (2) we apply acceleration techniques to DoG.

4.1 Hot-start test

Poorly initialized Markov chain states $\theta_{0}$ can be detrimental to the performance of learning-rate-free methods in Coreset MCMC. Fig. 5, and especially Figs. 5(c), 5(d) and 5(e), show that this is likely due to the bias of initial gradient estimates. In particular, the initial gradient estimates often have a norm orders of magnitude larger than they would if they had been computed using i.i.d. draws, resulting in a quickly vanishing learning rate in Eq. 14. Therefore, it is crucial to hot-start the Markov chains to ensure they are properly mixed before training the coreset weights. There are MCMC convergence diagnostics that could be used for this purpose (e.g., $\widehat{R}$ (Vehtari et al., 2021)), but many work only with real-valued variables and are overly stringent for our application. We require a test that works for general coreset posteriors of the form Eq. 3 and checks only that gradient estimates have stabilized reasonably.

To address this challenge, we propose keeping the weights fixed at their initialization (i.e., $w_{t+1}\leftarrow w_{t}$ ) until a hot-start test passes. For the test, for each Markov chain $k\in[K]$ , we split the iterates $i=1,\dots,t$ into 3 segments, each of equal length $n=\lceil t/3\rceil$ . We compute the average log-potentials for the two latter segments $m_{1}$ , $m_{2}$ , and the standard deviations of residual errors $s_{1},s_{2}$ from a linear fit:

\displaystyle m_{1}=\frac{1}{n}\sum_{i=n+1}^{2n}\ell_{ik},\quad m_{2}=\frac{1}{n}\sum_{i=2n+1}^{t}\ell_{ik},

(17)

\displaystyle s_{1}^{2}=\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=n+1}^{2n}(a+bi-\ell_{ik})^{2},

(18)

\displaystyle s_{2}^{2}=\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=2n+1}^{t}(a+bi-\ell_{ik})^{2},

(19)

where $\ell_{ik}$ is the log-potential for chain $k$ at iteration $i$ ,

\displaystyle\ell_{ik}=\sum_{m=1}^{M}w_{0m}\ell_{m}(\theta_{ik}).

(20)

Our test monitors the difference between $m_{1}$ and $m_{2}$ relative to $s_{1}$ and $s_{2}$. A small difference between $m_{1}$ and $m_{2}$ indicates that the Markov chain states have stabilized. Using the residual standard errors from the linear fits allows us to remove trends from the noise computation. We define, for each $k\in[K]$,

\displaystyle u_{k}=\frac{\left|m_{1}-m_{2}\right|}{\max\{s_{1},s_{2}\}},

(21)

and use the median of $\left(u_{k}\right)_{k=1}^{K}$ as our test statistic. This test statistic is checked against a threshold $c$; the test passes when the median test statistic becomes smaller than $c$. The pseudocode for the hot-start test is given in Algorithm 3. From our experiments, we find that setting $c=0.5$ works well in general.

Algorithm 3 HotStartTest
Require: $\left(\ell_{ik}\right)_{i=1,k=1}^{t,K}$, $t$, $c=0.5$
$n=\texttt{ceil}(t/3)$
for $k=1,\dots,K$ do
  $s^{2}_{1}\leftarrow\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=n+1}^{2n}\left(a+bi-\ell_{ik}\right)^{2}$
  $s^{2}_{2}\leftarrow\frac{1}{n-2}\min_{a,b\in\mathbb{R}}\sum_{i=2n+1}^{t}\left(a+bi-\ell_{ik}\right)^{2}$
  $u_{k}\leftarrow\frac{\left|\left(\frac{1}{n}\sum_{i=n+1}^{2n}\ell_{ik}\right)-\left(\frac{1}{n}\sum_{i=2n+1}^{t}\ell_{ik}\right)\right|}{\max\{s_{1},s_{2}\}}$
end for
return (true if median$\left(u_{1},\dots,u_{K}\right)<c$ else false)
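The hot-start test of Algorithm 3 can be sketched in pure Python as follows. The function names `residual_std` and `hot_start_test` and the layout `ell[k][i-1]` are assumptions of this example. Two edge cases that Algorithm 3 does not spell out are also handled by assumption: the test is reported as failed when there are too few iterates to fit both lines, and a zero noise scale yields a statistic of zero only if the two segment means coincide.

```python
import math
import statistics
from typing import List, Sequence


def residual_std(idx: Sequence[int], y: Sequence[float], n: int) -> float:
    """Residual std. dev. of the least-squares line a + b*i, with n - 2 dof."""
    i_bar, y_bar = sum(idx) / len(idx), sum(y) / len(y)
    sxx = sum((i - i_bar) ** 2 for i in idx)
    sxy = sum((i - i_bar) * (yi - y_bar) for i, yi in zip(idx, y))
    b = sxy / sxx
    a = y_bar - b * i_bar
    rss = sum((a + b * i - yi) ** 2 for i, yi in zip(idx, y))
    return math.sqrt(rss / (n - 2))


def hot_start_test(ell: List[List[float]], c: float = 0.5) -> bool:
    """ell[k][i-1] is the log-potential of chain k at iteration i = 1..t."""
    t = len(ell[0])
    n = math.ceil(t / 3)
    if n < 3 or t - 2 * n < 2:
        return False  # assumption: too few iterates to fit the two lines
    idx1, idx2 = range(n + 1, 2 * n + 1), range(2 * n + 1, t + 1)
    stats = []
    for chain in ell:
        seg1, seg2 = chain[n : 2 * n], chain[2 * n : t]
        m1, m2 = sum(seg1) / n, sum(seg2) / n  # segment means as in Eq. 17
        scale = max(residual_std(idx1, seg1, n), residual_std(idx2, seg2, n))
        gap = abs(m1 - m2)
        if scale > 0:
            stats.append(gap / scale)
        else:
            stats.append(0.0 if gap == 0 else math.inf)  # assumption
    return statistics.median(stats) < c
```

A chain whose log-potential is still drifting has segment means that differ by much more than the detrended noise level, so its statistic stays well above the threshold until the drift stops.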

4.2 Acceleration

To accelerate DoG, we begin by noting that the denominator of the DoG learning rate in Eq. 14 is similar to that of AdaGrad (Duchi et al., 2011) in that it is a cumulative sum of some function of the gradient. Therefore, we can leverage the idea used in RMSProp (Hinton et al., 2012) for accelerating AdaGrad to accelerate DoG. In particular, at iteration $t$, we can replace $\sum_{i\leq t}\|\hat{g}_{i}\|^{2}$ with $t\hat{v}_{t}$, where $\hat{v}_{t}$ is the bias-corrected exponential moving average of the squared gradient. This allows us to exponentially decrease the weights of past gradient norms. As a result, the effect of the early $\|\hat{g}_{t}\|^{2}$ terms on the learning rate diminishes over time, resulting in less conservative learning rates. To account for situations where the gradient estimates differ in scale across dimensions, we apply the above acceleration technique in a coordinate-wise fashion and obtain the following update rule for $\hat{v}_{t}$:

\displaystyle v_{t}=\beta v_{t-1}+(1-\beta)\hat{g}_{t}^{2},\quad\hat{v}_{t}=\frac{v_{t}}{1-\beta^{t}},

(22)

where $\beta\in(0,1)$, $v_{0}=0$, and $\hat{g}_{t}^{2}$ denotes the vector with each entry of $\hat{g}_{t}$ squared. We further apply the same idea to $r_{t}$, the maximum distance traveled from $w_{0}$, and $\hat{g}_{t}$, the gradient estimate itself, thereby arriving at our proposed optimization procedure. Note that in Algorithm 2, the $\max$ operator denotes coordinate-wise maximum and the $\odot$ operator denotes coordinate-wise multiplication.

In Hot DoG, we set the exponential decay rates, $\beta_{1}$ and $\beta_{2}$ , to be the same as those in Kingma and Ba (2014), and we set the initial learning rate $r$ to a small constant (default $10^{-3}$ ) following the recommendation of Ivgi et al. (2023).
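The accelerated update applied once the hot-start test has passed can be sketched coordinate-wise as below. The `HotDogState` container and `hot_dog_step` function are hypothetical names for this example and are not taken from the released code. The sketch uses $r$ as the step scale on the first optimization step ($c=1$), an assumption made so that the bias correction $1-\beta_{1}^{c-1}$ is never zero, and like the update line of Algorithm 2 it does not project onto $w\geq 0$.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class HotDogState:
    w0: List[float]  # initial weights
    w: List[float]   # current weights
    m: List[float]   # moving average of gradients
    v: List[float]   # moving average of squared gradients
    d: List[float]   # moving average of max distance from w0
    c: int = 0       # number of optimization steps taken so far


def hot_dog_step(
    state: HotDogState,
    grad: List[float],
    r: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> List[float]:
    """One coordinate-wise Hot DoG weight update (post hot-start)."""
    state.c += 1
    c = state.c
    new_w = []
    for j, g in enumerate(grad):
        state.v[j] = beta2 * state.v[j] + (1 - beta2) * g * g
        state.m[j] = beta1 * state.m[j] + (1 - beta1) * g
        dist = abs(state.w[j] - state.w0[j])
        state.d[j] = beta1 * state.d[j] + (1 - beta1) * max(dist, state.d[j])
        v_hat = state.v[j] / (1 - beta2**c)
        m_hat = state.m[j] / (1 - beta1**c)
        # First step uses the small constant r; later steps use the
        # bias-corrected distance average.
        d_hat = r if c == 1 else state.d[j] / (1 - beta1 ** (c - 1))
        new_w.append(state.w[j] - d_hat * m_hat / (math.sqrt(c * v_hat) + eps))
    state.w = new_w
    return new_w
```

With a constant gradient, the first step moves each coordinate by approximately $r$ and the second by approximately $r/\sqrt{2}$, mirroring the $1/\sqrt{t}$ decay of DoG while letting old gradient magnitudes fade from $\hat{v}_{t}$.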

5 EXPERIMENTS

In this section, we demonstrate the effectiveness of Hot DoG and compare our method against optimally-tuned ADAM from a log scale grid search, as well as against the learning-rate-free stochastic gradient methods prodigy ADAM (Mishchenko and Defazio, 2024), DoG (Ivgi et al., 2023), and DoWG (Khaled et al., 2023) over different initial parameters. We compare the quality of the posterior approximations over different coreset sizes $M$ and weight optimization procedures for each experiment. Following Chen and Campbell (2024), for all optimization methods, we set the number of Markov chains to $K=2$ and subsample size to $S=M$ in Eq. 7. We set the Markov kernel $\kappa_{w}$ to the hit-and-run slice sampler with doubling (Bélisle et al., 1993; Neal, 2003) for all real data experiments. For the Gaussian location model, we use a kernel that directly samples from $\pi_{w}$ (Chen and Campbell, 2024, Sec. 3.4); for the sparse regression example, we use Gibbs sampling (George and McCulloch, 1993).

We compare these algorithms using six different Bayesian models (two synthetic, four real): a synthetic Gaussian location, a synthetic sparse regression, a (non-conjugate) linear regression, a logistic regression, a Poisson regression, and a Bradley-Terry model. See Section A.1 for details. We use Stan (Carpenter et al., 2017) to obtain full data inference results for real data experiments, and Gibbs sampling (George and McCulloch, 1993) for the sparse regression model with discrete variables. For all experiments, we measure the posterior approximation quality using the average squared z-score, defined as $\frac{1}{D}\sum_{i=1}^{D}(\frac{\mu_{i}-\hat{\mu}_{i}}{\sigma_{i}})^{2}$ , where $\mu_{i}$ and $\sigma_{i}$ are, respectively, the coordinate-wise mean and standard deviation estimated using the full data posterior, and $\hat{\mu}_{i}$ is the coordinate-wise mean estimated using draws from Coreset MCMC. This estimate is computed in a streaming fashion using the second half of all draws simulated at the time; note that this includes draws from $\pi_{w_{0}}$ before the hot-start test passes.
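The average squared z-score is straightforward to compute from the three coordinate-wise summaries. The helper below is a minimal sketch with an assumed name (`avg_squared_zscore`); it takes the summaries as given and does not reproduce the streaming computation over the second half of the draws.

```python
from typing import Sequence


def avg_squared_zscore(
    mu: Sequence[float],
    sigma: Sequence[float],
    mu_hat: Sequence[float],
) -> float:
    """Average over coordinates of ((mu_i - mu_hat_i) / sigma_i)^2."""
    D = len(mu)
    return sum(((m - mh) / s) ** 2 for m, s, mh in zip(mu, sigma, mu_hat)) / D
```

For example, a value of 1 means that the coreset posterior means are, on average, about one full-posterior standard deviation away from the full-data posterior means.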

Each algorithm was run on 8 single-threaded cores of a 2.1GHz Intel Xeon Gold 6130 processor with 32GB memory. Code for these experiments is available at https://github.com/NaitongChen/automated-coreset-mcmc-experiments. More experimental details and additional plots are in Sections A.1 and A.2.

Effect of hot-start test. Fig. 4 compares Hot DoG with and without the hot-start test for $M=1000$ across all experiments; the same plots for other coreset sizes can be found in Section A.2. Without the hot-start test, the traces often hit a long plateau, before the effect of exponentially-weighted averaging is able to decay early large gradient norms. On the other hand, with burn-in, we begin by simulating from Markov chains targeting $\pi_{w_{0}}$ , and start optimizing the coreset weights only after the hot-start test has passed. In terms of the number of log potential evaluations, Hot DoG with burn-in leaves the plateau sooner than without burn-in.

Fig. 5 examines the behaviour of the hot-start test in more detail, showing the traces of the gradient estimate norms $\|\hat{g}_{t}\|$ and test statistics median $(u_{1},\dots,u_{K})$ across optimization iterations when using Hot DoG. Note that here we only show plots for $M=1000$ ; the same plots for other coreset sizes can be found in Section A.2. In some experiments, the Markov chains are initialized reasonably well where the gradient norms are already stabilized, and the test passes almost immediately. In others, the Markov chains are initialized poorly and the gradient norms are large, but nevertheless, the hot-start test passes shortly after they stabilize. Across all of these experiments, a test statistic threshold of 0.5 worked well.

Robustness to fixed parameter $r$ . Figure 6 provides an examination of the robustness of the proposed method to the fixed initial learning rate parameter $r$ . Across all experiments, different values of $r$ spanning multiple orders of magnitude result in similar posterior approximations across optimization iterations. Note that $M$ is $1000$ for all plots in Fig. 6. The same trends can be observed over different coreset sizes (see Section A.2).

Comparison with other related methods. Figure 7 shows a comparison between our method and DoG, DoWG, ADAM, as well as prodigy ADAM. We fix $r=0.001$ and $c=0.5$ for Hot DoG. Since the hot-start test itself can be applied to all methods, Hot DoG is compared against the others both with and without burn-in. The posterior approximation quality of Hot DoG is orders of magnitude better than that of all other methods in many settings tested, and remains competitive otherwise. In particular, Hot DoG is capable of matching the performance of optimally-tuned ADAM without tuning.

6 CONCLUSION

This paper introduced Hot DoG, a learning-rate-free stochastic gradient method designed for learning coreset weights within the framework of Coreset MCMC. Our method extends DoG, but includes adjustments tailored to the Markovian setting of Coreset MCMC. In particular, Hot DoG includes a hot-start test for detecting when to start training coreset weights, as well as acceleration techniques. The quality of the coresets constructed by Hot DoG, and of their corresponding posterior approximations, is robust to all of its input parameters. Empirically, Hot DoG produces better posterior approximations than other learning-rate-free stochastic gradient methods, and is competitive with optimally-tuned ADAM.

Acknowledgements

T.C. was supported by NSERC Discovery Grant RGPIN-2019-03962, and J.H.H. was partially supported by National Science Foundation CAREER award IIS-2340586. We acknowledge the use of the ARC Sockeye computing platform from the University of British Columbia.

References

Robert and Casella (2004) Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, $2^{\text{nd}}$ edition, 2004.
Robert and Casella (2011) Christian Robert and George Casella. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 26(1):102–115, 2011.
Gelman et al. (2013) Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian Data Analysis. CRC Press, $3^{\text{rd}}$ edition, 2013.
Huggins et al. (2016) Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.
Naik et al. (2022) Cian Naik, Judith Rousseau, and Trevor Campbell. Fast Bayesian coresets via subsampling and quasi-Newton refinement. In Advances in Neural Information Processing Systems, 2022.
Chen et al. (2022) Naitong Chen, Zuheng Xu, and Trevor Campbell. Bayesian inference via sparse Hamiltonian flows. In Advances in Neural Information Processing Systems, 2022.
Campbell (2024) Trevor Campbell. General bounds on the quality of Bayesian coresets. In Advances in Neural Information Processing Systems, 2024.
Chen and Campbell (2024) Naitong Chen and Trevor Campbell. Coreset Markov chain Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2024.
Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2014.
Ivgi et al. (2023) Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is SGD’s best friend: a parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
Khaled et al. (2023) Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Processing Systems, 2023.
Hinton et al. (2012) Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a: overview of mini-batch gradient descent, 2012.
Campbell and Beronov (2019) Trevor Campbell and Boyan Beronov. Sparse variational inference: Bayesian coresets from scratch. In Advances in Neural Information Processing Systems, 2019.
Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
Carmon and Hinder (2022) Yair Carmon and Oliver Hinder. Making SGD parameter-free. In Conference on Learning Theory, 2022.
Defazio and Mishchenko (2023) Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, 2023.
Mishchenko and Defazio (2024) Konstantin Mishchenko and Aaron Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In International Conference on Machine Learning, 2024.
Orabona and Cutkosky (2020) Francesco Orabona and Ashok Cutkosky. International Conference on Machine Learning tutorial on parameter-free stochastic optimization, 2020.
Vaswani et al. (2019) Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, 2019.
Paquette and Scheinberg (2020) Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. Society for Industrial and Applied Mathematics Journal on Optimization, 30(1):349–376, 2020.
Loizou et al. (2021) Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, 2021.
Vehtari et al. (2021) Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: an improved $\widehat{R}$ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2):667–718, 2021.
Bélisle et al. (1993) Claude Bélisle, Edwin Romeijn, and Robert Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255–266, 1993.
Neal (2003) Radford Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
George and McCulloch (1993) Edward George and Robert McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1):1–32, 2017.
Elo (1978) Arpad Elo. The Rating of Chessplayers, Past and Present. Arco, $1^{\text{st}}$ edition, 1978.

Appendix A APPENDIX

A.1 Details of Experiments

A.1.1 Model Specification

In this subsection, we describe the six examples (two synthetic and four real data) that we used for our experiments. Processed versions of all datasets used for the experiments are available at https://github.com/NaitongChen/automated-coreset-mcmc-experiments. For each of the regression models, we are given a set of points $(x_{n},y_{n})^{N}_{n=1}$ , each consisting of features $x_{n}\in\mathbb{R}^{p}$ and response $y_{n}$ .

Bayesian sparse linear regression: This is based on Example 4.1 from George and McCulloch (1993). We use the model

$\displaystyle\sigma^{2}$	$\displaystyle\sim{\mathrm{InvGam}}\left(\nu/2,\nu\lambda/2\right),$	(23)
$\displaystyle\forall i\in[p],\quad\gamma_{i}$	$\displaystyle\overset{\text{iid}}{\sim}{\mathrm{Bern}}(q),$	(24)
$\displaystyle\beta_{i}\mid\gamma_{i}$	$\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(0,\left(\mathds{1}(\gamma_{i}=0)\tau+\mathds{1}(\gamma_{i}=1)c\tau\right)^{2}\right),$	(25)
$\displaystyle\forall n\in[N],\quad y_{n}\mid x_{n},\beta,\sigma^{2}$	$\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(x_{n}^{\top}\beta,\sigma^{2}\right),$	(26)

where we set $\nu=0.1,\lambda=1,q=0.1,\tau=0.1$ , and $c=10$ . Here we model the variance $\sigma^{2}$ , the vector of regression coefficients $\beta=\begin{bmatrix}\beta_{1}&\dots&\beta_{p}\end{bmatrix}^{\top}\in\mathbb{R}^{p}$ , and the vector of binary variables $\gamma=\begin{bmatrix}\gamma_{1}&\dots&\gamma_{p}\end{bmatrix}^{\top}\in\left\{0,1\right\}^{p}$ , where $\gamma_{i}$ indicates the inclusion of the $i^{\text{th}}$ feature in the model. We set $N=50{,}000$ , $p=10$ , $\beta^{\star}=\begin{bmatrix}0&0&0&0&0&5&5&5&5&5\end{bmatrix}^{\top}$ , and generate a synthetic dataset by

$\displaystyle\forall n\in[N],\quad x_{n}$	$\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}\left(0,I\right),$	(27)
$\displaystyle\epsilon_{n}$	$\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}\left(0,25^{2}\right),$	(28)
$\displaystyle y_{n}$	$\displaystyle=x_{n}^{\top}\beta^{\star}+\epsilon_{n}.$	(29)
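
The data-generating process in Eqs. (27) to (29) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the experiments; the function name, its arguments, and the seed are hypothetical.

```python
import random


def generate_sparse_regression_data(n_obs=50_000, noise_sd=25.0, seed=0):
    """Draw (x_n, y_n) pairs following Eqs. (27)-(29).

    The name, signature, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    beta_star = [0.0] * 5 + [5.0] * 5  # beta* from the text, so p = 10
    xs, ys = [], []
    for _ in range(n_obs):
        x = [rng.gauss(0.0, 1.0) for _ in beta_star]  # x_n ~ N(0, I)
        eps = rng.gauss(0.0, noise_sd)                # eps_n ~ N(0, 25^2)
        y = sum(xi * bi for xi, bi in zip(x, beta_star)) + eps
        xs.append(x)
        ys.append(y)
    return xs, ys, beta_star
```

Only the last five features carry signal, so a well-behaved posterior should place most of its mass on $\gamma_{i}=1$ for those coordinates.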

Bayesian linear regression: We use the model

	$\displaystyle\begin{bmatrix}\beta&\log\sigma^{2}\end{bmatrix}^{\top}$	$\displaystyle\sim\mathcal{N}(0,I),$		(30)
	$\displaystyle\forall n\in[N],y_{n}\mid x_{n},\beta,\sigma^{2}$	$\displaystyle\overset{\text{ind}}{\sim}\mathcal{N}\left(\begin{bmatrix}1&x_{n}^{\top}\end{bmatrix}\beta,\sigma^{2}\right),$		(31)

where $\beta\in\mathbb{R}^{p+1}$ is a vector of regression coefficients and $\sigma^{2}\in\mathbb{R}_{+}$ is the noise variance. Note that the prior here is not conjugate for the likelihood. The dataset consists of flight delay information from $N=98{,}673$ observations and was constructed using flight delay data from https://www.transtats.bts.gov/Homepage.asp and historical weather information from https://www.wunderground.com/. We study the difference, in minutes, between the scheduled and actual departure times against $p=10$ features including flight-specific and meteorological information.

Bayesian logistic regression: We use the model

	$\displaystyle\forall i\in[p+1],\quad\beta_{i}$	$\displaystyle\overset{\text{iid}}{\sim}{\mathrm{Cauchy}}(0,1),$		(32)
	$\displaystyle\forall n\in[N],\quad y_{n}$	$\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Bern}}\left(\left(1+\exp\left(-\begin{bmatrix}1&x_{n}^{\top}\end{bmatrix}\beta\right)\right)^{-1}\right),$		(33)

where $\beta=\begin{bmatrix}\beta_{1}&\dots&\beta_{p+1}\end{bmatrix}^{\top}\in\mathbb{R}^{p+1}$ is a vector of regression coefficients. Here we use the same dataset as in linear regression, but instead model whether a flight is cancelled using the same set of features. Note that of all flights included, only $0.058\%$ were cancelled.
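
As an illustration of Eqs. (32) and (33), the sketch below evaluates the Cauchy log-prior and the per-observation Bernoulli log-likelihood. It is an assumption-laden example rather than the implementation used in the experiments: the function names are hypothetical, and the log-sigmoid is written in a numerically stable form that the text does not prescribe.

```python
import math


def log_sigmoid(z):
    """Stable evaluation of log(1 / (1 + exp(-z)))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def logistic_log_lik(y, x, beta):
    """Log-likelihood of one label y in {0, 1}; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    # P(y = 0) = 1 - sigmoid(z) = sigmoid(-z)
    return log_sigmoid(z) if y == 1 else log_sigmoid(-z)


def cauchy_log_prior(beta):
    """Sum of independent standard Cauchy(0, 1) log-densities."""
    return sum(-math.log(math.pi * (1.0 + b * b)) for b in beta)
```

With so few positive labels, most terms in the likelihood come from the negative class, which is why the coreset initialization for this model treats the two classes separately.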

Bayesian Poisson regression: We use the model

	$\displaystyle\beta$	$\displaystyle\sim\mathcal{N}(0,I),$		(34)
	$\displaystyle\forall n\in[N],y_{n}\mid x_{n},\beta$	$\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Poiss}}\left(\log\left(1+e^{\begin{bmatrix}1&x_{n}^{\top}\end{bmatrix}\beta}\right)\right),$		(35)

where $\beta\in\mathbb{R}^{p+1}$ is a vector of regression coefficients. The dataset consists of $N=15{,}641$ observations, and we model the hourly count of rental bikes against $p=8$ features (e.g., temperature, humidity at the time, and whether or not the day is a workday). The original bike share dataset is available at https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset.
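
The rate in Eq. (35) is the softplus of the linear predictor, which keeps the Poisson mean positive without the rapid growth of an exponential link. The following sketch, with hypothetical function names, evaluates the per-observation log-likelihood; it assumes the linear predictor is not so negative that the rate underflows to zero.

```python
import math


def softplus(z):
    """Stable evaluation of log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def poisson_log_lik(y, x, beta):
    """Log-likelihood of one count y under Eq. (35); beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    rate = softplus(z)  # assumed strictly positive (no underflow)
    return y * math.log(rate) - rate - math.lgamma(y + 1)
```

For large positive predictors the rate grows linearly in the predictor, which is the main practical difference from the log link.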

The remaining two non-regression models are specified as follows.

Gaussian location: We use the model

	$\displaystyle\theta$	$\displaystyle\sim\mathcal{N}(0,I),$		(36)
	$\displaystyle\forall n\in[N],X_{n}$	$\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}(\theta,I),$		(37)

where $\theta,X_{n}\in\mathbb{R}^{d}$ . Here we model the mean $\theta$ . We set $N=10{,}000,d=20$ and generate a synthetic dataset by

	$\displaystyle\forall n\in[N],x_{n}$	$\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}(0,I).$		(38)

Bradley-Terry model: We use the model

	$\displaystyle\theta$	$\displaystyle\overset{\text{iid}}{\sim}\mathcal{N}(0,I),$		(39)
	$\displaystyle\forall n\in[N],y_{n}\mid h_{n},v_{n},\theta$	$\displaystyle\overset{\text{ind}}{\sim}{\mathrm{Bern}}\left(\left(1+\exp\left((\theta_{v_{n}}-\theta_{h_{n}})/400\right)\right)^{-1}\right),$		(40)

where $\theta\in\mathbb{R}^{d}$ represents the Elo ratings, or relative skill levels (Elo, 1978, Ch. 1), of the $d=30$ teams. The dataset was constructed using game statistics from https://www.nba.com/stats and consists of $N=26{,}651$ NBA games between the 2004 and 2022 seasons. $h_{n}$ and $v_{n}$ are the home team and visitor team IDs for the $n^{\text{th}}$ game in the dataset, and $y_{n}$ denotes the outcome of the game ( $y_{n}=1$ if the home team won and $y_{n}=0$ if the visitor team won). We infer the Elo ratings from these pairwise game outcomes.
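
To make Eq. (40) concrete, the sketch below computes the home-win probability and the log-likelihood of a list of games. The function names and the `(h, v, y)` tuple layout are assumptions made for illustration; team IDs are taken to be zero-based indices into `theta`.

```python
import math


def home_win_prob(theta, h, v):
    """P(y = 1) from Eq. (40): logistic in the rating gap, scaled by 400."""
    return 1.0 / (1.0 + math.exp((theta[v] - theta[h]) / 400.0))


def bradley_terry_log_lik(theta, games):
    """Sum of Bernoulli log-likelihoods over (home, visitor, outcome) tuples."""
    total = 0.0
    for h, v, y in games:
        p = home_win_prob(theta, h, v)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total
```

Two teams with equal ratings give a home-win probability of one half, so the model as written has no separate home-advantage term.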

A.1.2 Parameter Settings

For the full-data inference results of all examples except the sparse linear regression model, we ran Stan (Carpenter et al., 2017) with 10 parallel chains, each taking $100{,}000$ steps with the first $50{,}000$ discarded, for a combined $500{,}000$ draws. For the full-data inference results of the sparse linear regression example, we use the Gibbs sampler developed by George and McCulloch (1993) to generate $200{,}000$ draws, with the first half discarded as burn-in.

To account for changes in $w$ , for all real data experiments, we use the hit-and-run slice sampler with doubling (Bélisle et al., 1993; Neal, 2003); for the Gaussian location model, we use a kernel that directly samples from $\pi_{w}$ (Chen and Campbell, 2024, Sec. 3.4); and for the sparse regression, we use the Gibbs sampler developed by George and McCulloch (1993).

We use Stan (Carpenter et al., 2017) to obtain full data inference results for real data experiments, and Gibbs sampling (George and McCulloch, 1993) for the sparse regression model with discrete variables. The true posterior distribution for the Gaussian location model is available in closed form.
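
For the Gaussian location model in Eqs. (36) and (37), conjugacy gives a posterior of the form $\mathcal{N}(\mu,sI)$ . The sketch below computes $\mu$ and $s$ ; the function name is hypothetical, and the weighted case assumes that the coreset posterior $\pi_{w}$ raises each selected likelihood term to the power of its weight, so that unit weights on all points recover the full-data posterior.

```python
def gaussian_location_posterior(points, weights=None):
    """Return (mean vector, scalar variance s) of the N(mean, s * I) posterior.

    Assumes a N(0, I) prior, a N(theta, I) likelihood, and a non-empty
    list of points; weights default to 1 (full-data posterior).
    """
    if weights is None:
        weights = [1.0] * len(points)
    dim = len(points[0])
    precision = 1.0 + sum(weights)  # prior precision 1 plus total weight
    weighted_sum = [
        sum(w * x[j] for w, x in zip(weights, points)) for j in range(dim)
    ]
    mean = [s / precision for s in weighted_sum]
    return mean, 1.0 / precision
```

With unit weights on $N$ points this reduces to mean $\sum_{n}x_{n}/(N+1)$ and covariance $I/(N+1)$ .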

For ADAM, we test multiple learning rates over a log-scale grid $\left(10^{k}\right)$ for $k=-3,-2,\dots,1$ . For each experiment under each coreset size, the optimally-tuned ADAM is the one that obtained the lowest average squared z-score after $200{,}000$ iterations of weight optimization. For all learning-rate-free methods, we test different initial parameters (the initial lower bound for Prodigy ADAM and $r_{0}$ for Hot DoG, DoG, and DoWG) over the same log-scale grid $\left(10^{k}\right)$ for $k=-3,-2,\dots,1$ .
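
The grid and the selection rule for the optimally-tuned ADAM can be sketched as below. Both function names are hypothetical, and `final_scores` stands for a mapping from each learning rate to its average squared z-score after the final iteration, which would come from actual runs and is not reproduced here.

```python
def log_grid(k_min=-3, k_max=1):
    """Return the grid 10^k for k = k_min, ..., k_max."""
    return [10.0 ** k for k in range(k_min, k_max + 1)]


def pick_optimally_tuned(final_scores):
    """Return the grid value with the lowest final average squared z-score.

    final_scores: dict mapping a learning rate to its final score.
    """
    return min(final_scores, key=final_scores.get)
```

Because the selection uses the final error itself, the optimally-tuned ADAM is an oracle baseline rather than a procedure available to a practitioner without access to the full-data posterior.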

For the logistic regression example, to account for the class imbalance problem, we include all observations from the rare positive class if the coreset size is more than twice as big as the total number of observations with positive labels. Otherwise we sample our coreset to have $50\%$ positive labels and $50\%$ negative labels. Coreset points are uniformly subsampled for all other models.
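
The class-aware subsampling rule for the logistic regression example can be sketched as follows. This is an illustrative reading of the rule above, not the experiment code: the function name and seed are hypothetical, odd coreset sizes are assumed to round the positive half down, and the negative class is assumed to be large enough to fill the remainder.

```python
import random


def sample_balanced_coreset(labels, coreset_size, seed=0):
    """Return sorted indices of coreset points for imbalanced binary labels."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    if coreset_size > 2 * len(pos):
        # Coreset is large relative to the rare class: keep every positive.
        chosen_pos = list(pos)
    else:
        # Otherwise aim for a 50/50 split (positive half rounded down).
        chosen_pos = rng.sample(pos, coreset_size // 2)
    # Fill the remainder uniformly from the negative class.
    chosen_neg = rng.sample(neg, coreset_size - len(chosen_pos))
    return sorted(chosen_pos + chosen_neg)
```

For example, with 3 positive labels among 100 observations, a coreset of size 10 keeps all 3 positives and draws 7 negatives, while a coreset of size 4 draws 2 of each.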

A.2 Additional Results

Figs. 4 to 6 in the main text show the traces of average squared coordinate-wise z-scores, as well as the gradient estimate norms and hot-start test statistics for Hot DoG when $M=1000$ . In this subsection, we show the same sets of plots for $M=100$ and $M=500$ . Similarly to Fig. 4, Figs. 8 and 9 compare Hot DoG with and without the hot-start test. Similarly to Fig. 5, Figs. 10 and 11 show the gradient estimate norms and hot-start test statistics during burn-in. Similarly to Fig. 6, Figs. 12 and 13 compare Hot DoG (with the hot-start test) and optimally-tuned ADAM. All plots show the same trends as those in Section 5, where $M=1000$ , and so lead to the same observations. In particular, Hot DoG with burn-in leaves the plateau sooner than without burn-in; the hot-start test passes, and thus burn-in terminates, shortly after the gradient norms stabilize; and Hot DoG is robust to the fixed parameter $r$ .