Efficient Proximal Gradient Algorithms for Joint Graphical Lasso
Abstract
1. Introduction
- Sequentially find, via regression, the variables on which each variable depends, so that the quasi-likelihood is maximized [2].
- We propose efficient algorithms based on the proximal gradient method to solve the JGL problem. The algorithms are simple first-order methods, and their subproblems can be solved efficiently in closed form. The numerical results indicate that the methods achieve high accuracy and precision, and their computation time is competitive with state-of-the-art algorithms (a minimal sketch of one such step appears after this list).
- We establish boundedness of the solution to the JGL problem and of the iterates of the algorithms, a property tied to the convergence rate. With this boundedness, we guarantee that the proposed methods converge linearly.
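To make the proposed iteration concrete, the following is a minimal sketch of a single proximal gradient step for the JGL objective. All names here (`prox_gradient_step`, `prox_penalty`) are illustrative rather than taken from the paper's implementation; `prox_penalty` stands in for the closed-form proximal operators of Sections 3.1.1 and 3.1.2.

```python
import numpy as np

def prox_gradient_step(Thetas, Ss, eta, prox_penalty):
    """One proximal gradient step for the JGL objective.

    Thetas: list of K current precision-matrix iterates (p x p, positive definite)
    Ss:     list of K sample covariance matrices
    eta:    step size
    prox_penalty: proximal operator of the (fused or group lasso) penalty,
                  called as prox_penalty(list_of_matrices, eta)
    """
    # Gradient of the smooth part: d/dTheta_k [tr(S_k Theta_k) - log det Theta_k]
    #                            = S_k - Theta_k^{-1}
    grads = [S - np.linalg.inv(T) for T, S in zip(Thetas, Ss)]
    # Gradient step followed by the penalty's proximal operator
    return prox_penalty([T - eta * G for T, G in zip(Thetas, grads)], eta)
```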
2. Preliminaries
2.1. Graphical Lasso
2.2. ISTA for Graphical Lasso
Algorithm 1 G-ISTA for problem (2).
Input: sample covariance S, regularization parameter λ, tolerance ε, backtracking constant c ∈ (0, 1), initial step size η_0, initial iterate Θ_0 ≻ 0.
While (until convergence) do
1. Compute the gradient of the smooth part, ∇f(Θ_t) = S − Θ_t^{-1}.
2. Backtrack: shrink η_t by the factor c until Θ_{t+1} = soft_{η_t λ}(Θ_t − η_t (S − Θ_t^{-1})) is positive definite and satisfies the sufficient-decrease condition.
end
Output: ε-optimal solution to problem (2), Θ̂.
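A minimal Python sketch of the Algorithm 1 iteration, assuming the standard G-ISTA ingredients [10]: entrywise soft-thresholding as the proximal operator of the ℓ1 penalty, and backtracking until the next iterate is positive definite. The diagonal initialization and the Frobenius-norm stopping rule are illustrative simplifications; G-ISTA proper uses a duality-gap criterion.

```python
import numpy as np

def soft_threshold(X, tau):
    # Entrywise soft-thresholding: the proximal operator of tau * ||X||_1.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def g_ista(S, lam, eta=1.0, c=0.5, tol=1e-5, max_iter=500):
    """Illustrative G-ISTA iteration for the graphical lasso problem (2):
    minimize tr(S Theta) - log det Theta + lam * ||Theta||_1 over Theta > 0."""
    Theta = np.diag(1.0 / np.diag(S))    # illustrative diagonal initialization
    for _ in range(max_iter):
        grad = S - np.linalg.inv(Theta)  # gradient of the smooth part
        step = eta
        while True:
            Theta_new = soft_threshold(Theta - step * grad, step * lam)
            # Backtrack (step <- c * step) until the iterate is positive
            # definite; G-ISTA also enforces a sufficient-decrease condition.
            if np.all(np.linalg.eigvalsh(Theta_new) > 0):
                break
            step *= c
        if np.linalg.norm(Theta_new - Theta, "fro") < tol:
            return Theta_new
        Theta = Theta_new
    return Theta
```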
2.3. Composite Self-Concordant Minimization
2.4. Joint Graphical Lasso
3. The Proposed Methods
3.1. ISTA for the JGL Problem
Algorithm 2 ISTA for problem (15).
Input: sample covariances S^{(1)}, …, S^{(K)}, tolerance ε, backtracking constant c ∈ (0, 1), initial step size η_0, initial iterate {Θ_0^{(k)}}.
For t = 0, 1, 2, … (until convergence) do
1. Compute the gradients ∇f(Θ_t^{(k)}) = S^{(k)} − (Θ_t^{(k)})^{-1}, k = 1, …, K.
2. Backtrack on η_t and set {Θ_{t+1}^{(k)}} = prox_{η_t P}({Θ_t^{(k)} − η_t ∇f(Θ_t^{(k)})}), where prox_{η_t P} is the closed-form proximal operator of the fused or group lasso penalty (Sections 3.1.1 and 3.1.2).
end
Output: optimal solution to problem (15), {Θ̂^{(k)}}.
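The outer loop of Algorithm 2 can be sketched the same way; `prox_penalty` is again a placeholder for the fused or group lasso proximal operator (with λ1, λ2 bound via a closure), and the stopping rule is an illustrative simplification.

```python
import numpy as np

def ista_jgl(Ss, prox_penalty, eta0=1.0, c=0.5, tol=1e-5, max_iter=1000):
    """Illustrative outer loop of Algorithm 2. Ss is a list of K sample
    covariance matrices; prox_penalty(mats, eta) applies the fused or group
    lasso proximal operator (lambda1, lambda2 bound via a closure)."""
    Thetas = [np.diag(1.0 / np.diag(S)) for S in Ss]
    for _ in range(max_iter):
        grads = [S - np.linalg.inv(T) for T, S in zip(Thetas, Ss)]
        eta = eta0
        while True:
            trial = prox_penalty([T - eta * G for T, G in zip(Thetas, grads)], eta)
            # Backtrack until every class stays positive definite.
            if all(np.all(np.linalg.eigvalsh(T) > 0) for T in trial):
                break
            eta *= c
        change = sum(np.linalg.norm(A - B, "fro") for A, B in zip(trial, Thetas))
        Thetas = trial
        if change < tol:
            break
    return Thetas
```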
3.1.1. Fused Lasso Penalty
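For K = 2 classes, the proximal operator of the fused lasso penalty is available in closed form, as in Danaher et al. [13]: entries are fused to their average when their gap is at most 2ηλ2, shifted toward each other by ηλ2 otherwise, and then soft-thresholded at ηλ1. A sketch of this K = 2 case (in practice the ℓ1 part is applied only to off-diagonal entries):

```python
import numpy as np

def prox_fused_pair(X1, X2, eta, lam1, lam2):
    """Closed-form prox of eta*(lam1*(|x1|+|x2|) + lam2*|x1-x2|), applied
    entrywise to a pair of matrices (the K = 2 fused lasso penalty)."""
    diff = X1 - X2
    fuse = np.abs(diff) <= 2 * eta * lam2          # fuse the pair to its mean
    avg = 0.5 * (X1 + X2)
    shift = eta * lam2 * np.sign(diff)             # otherwise move them closer
    Z1 = np.where(fuse, avg, X1 - shift)
    Z2 = np.where(fuse, avg, X2 + shift)
    # Entrywise soft-thresholding handles the lasso part after fusion.
    st = lambda X: np.sign(X) * np.maximum(np.abs(X) - eta * lam1, 0.0)
    return st(Z1), st(Z2)
```

For general K the fused prox along a chain no longer has this two-point form and can be computed, for example, by the dynamic programming algorithm of [35].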
3.1.2. Group Lasso Penalty
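The proximal operator of the sparse group lasso penalty factorizes over the entries (i, j): soft-threshold each of the K coordinates at ηλ1, then apply groupwise shrinkage at ηλ2 [30,31]. A minimal sketch of this entrywise rule, with the K matrices stacked in a single array:

```python
import numpy as np

def prox_group(Xs, eta, lam1, lam2):
    """Closed-form prox of eta*(lam1 * sum_k |x_k| + lam2 * ||x||_2),
    applied independently to each (i, j) entry across the K classes.

    Xs: array of shape (K, p, p) stacking the K matrices.
    """
    # Step 1: entrywise soft-thresholding (the lasso part).
    S = np.sign(Xs) * np.maximum(np.abs(Xs) - eta * lam1, 0.0)
    # Step 2: groupwise shrinkage of each K-vector (the group lasso part).
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    scale = np.maximum(1.0 - eta * lam2 / np.maximum(norms, 1e-12), 0.0)
    return S * scale
```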
3.2. Modified ISTA for JGL
Algorithm 3 Modified ISTA (M-ISTA).
Input: sample covariances S^{(1)}, …, S^{(K)}, tolerance ε, initial step size η_0, initial iterate {Θ_0^{(k)}}.
For t = 0, 1, 2, … (until convergence) do
update {Θ_{t+1}^{(k)}} by the proximal gradient step of Algorithm 2, with the step size η_t chosen without backtracking.
end
Output: optimal solution to problem (15), {Θ̂^{(k)}}.
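Algorithm 3 takes no backtracking constant. A plausible mechanism, sketched below purely as an assumption consistent with the self-concordant machinery of Section 2.3 [24] and not necessarily the paper's exact rule, is to damp the proximal gradient direction by 1/(1 + λ), where λ is the local norm of the direction; for the −log det barrier this keeps the next iterate positive definite without any line search.

```python
import numpy as np

def m_ista_step(Thetas, Ss, prox_penalty, eta=1.0):
    """HYPOTHETICAL M-ISTA-style step (an assumption, not the paper's exact
    rule): take a proximal gradient direction, then damp it by 1/(1 + lam),
    where lam is the self-concordant local norm of the direction. Damping
    keeps the next iterate in the positive definite cone, so no
    backtracking constant is needed."""
    grads = [S - np.linalg.inv(T) for T, S in zip(Thetas, Ss)]
    trial = prox_penalty([T - eta * G for T, G in zip(Thetas, grads)], eta)
    new = []
    for T, P in zip(Thetas, trial):
        D = P - T
        # Local norm ||T^{-1/2} D T^{-1/2}||_F via a Cholesky factor of T^{-1}.
        L = np.linalg.cholesky(np.linalg.inv(T))
        lam = np.linalg.norm(L.T @ D @ L, "fro")
        new.append(T + D / (1.0 + lam))  # damped step stays positive definite
    return new
```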
3.3. Theoretical Analysis
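The linear rate claimed in Section 1 follows the standard pattern once the iterates are confined to a compact set of positive definite matrices; it is sketched in LaTeX below under the assumption that all iterates satisfy aI ⪯ Θ ⪯ bI for some 0 < a ≤ b.

```latex
% On \{\Theta : aI \preceq \Theta \preceq bI\}, the Hessian of the smooth
% part f(\Theta) = \mathrm{tr}(S\Theta) - \log\det\Theta satisfies
% b^{-2} I \preceq \nabla^2 f(\Theta) \preceq a^{-2} I, so f is
% \mu-strongly convex and L-smooth there with \mu = b^{-2}, L = a^{-2}.
% The proximal gradient step with step size 1/L then contracts:
\[
\|\Theta_{t+1}-\Theta^{*}\|_F^{2}
  \;\le\; \Bigl(1-\tfrac{\mu}{L}\Bigr)\,\|\Theta_{t}-\Theta^{*}\|_F^{2}
  \;=\; \Bigl(1-\tfrac{a^{2}}{b^{2}}\Bigr)\,\|\Theta_{t}-\Theta^{*}\|_F^{2}.
\]
```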
4. Experiments
4.1. Stopping Criteria and Model Selection
4.2. Synthetic Data
4.2.1. Time Comparison Experiments
4.2.2. Algorithm Assessment
4.2.3. Convergence Rate
4.3. Real Data
5. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ADMM | alternating direction method of multipliers |
| FMGL | fused multiple graphical lasso algorithm |
| FP | false positive |
| G-ISTA | graphical iterative shrinkage-thresholding algorithm |
| GL | graphical lasso |
| ISTA | iterative shrinkage-thresholding algorithm |
| JGL | joint graphical lasso |
| M-ISTA | modified iterative shrinkage-thresholding algorithm |
| MSE | mean squared error |
| ROC | receiver operating characteristic |
| TP | true positive |
Appendix A. Proofs of Propositions
Appendix A.1. Proof of Proposition 1
Appendix A.2. Proof of Proposition 2
Appendix B. Data Generation
References
- Lauritzen, S.L. Graphical Models; Clarendon Press: Oxford, UK, 1996; Volume 17.
- Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 2006, 34, 1436–1462.
- Yuan, M.; Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007, 94, 19–35.
- Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441.
- Banerjee, O.; El Ghaoui, L.; d’Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 2008, 9, 485–516.
- Rothman, A.J.; Bickel, P.J.; Levina, E.; Zhu, J. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008, 2, 494–515.
- Banerjee, O.; Ghaoui, L.E.; d’Aspremont, A.; Natsoulis, G. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 89–96.
- Xue, L.; Ma, S.; Zou, H. Positive-definite ℓ1-penalized estimation of large covariance matrices. J. Am. Stat. Assoc. 2012, 107, 1480–1491.
- Mazumder, R.; Hastie, T. The graphical lasso: New insights and alternatives. Electron. J. Stat. 2012, 6, 2125.
- Guillot, D.; Rajaratnam, B.; Rolfs, B.T.; Maleki, A.; Wong, I. Iterative thresholding algorithm for sparse inverse covariance estimation. arXiv 2012, arXiv:1211.2532.
- d’Aspremont, A.; Banerjee, O.; El Ghaoui, L. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 2008, 30, 56–66.
- Hsieh, C.J.; Sustik, M.A.; Dhillon, I.S.; Ravikumar, P. QUIC: Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res. 2014, 15, 2911–2947.
- Danaher, P.; Wang, P.; Witten, D.M. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 373.
- Honorio, J.; Samaras, D. Multi-task learning of Gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010.
- Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Joint estimation of multiple graphical models. Biometrika 2011, 98, 1–15.
- Zhang, B.; Wang, Y. Learning structural changes of Gaussian graphical models in controlled experiments. arXiv 2012, arXiv:1203.3532.
- Hara, S.; Washio, T. Learning a common substructure of multiple graphical Gaussian models. Neural Netw. 2013, 38, 23–38.
- Glowinski, R.; Marroco, A. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. ESAIM Math. Model. Numer. Anal. 1975, 9, 41–76.
- Tang, Q.; Yang, C.; Peng, J.; Xu, J. Exact hybrid covariance thresholding for joint graphical lasso. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2015; pp. 593–607.
- Hallac, D.; Park, Y.; Boyd, S.; Leskovec, J. Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 205–213.
- Gibberd, A.J.; Nelson, J.D. Regularized estimation of piecewise constant Gaussian graphical models: The group-fused graphical lasso. J. Comput. Graph. Stat. 2017, 26, 623–634.
- Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers: Norwell, MA, USA, 2011.
- Yang, S.; Lu, Z.; Shen, X.; Wonka, P.; Ye, J. Fused multiple graphical lasso. SIAM J. Optim. 2015, 25, 916–943.
- Tran-Dinh, Q.; Kyrillidis, A.; Cevher, V. Composite self-concordant minimization. J. Mach. Learn. Res. 2015, 16, 371–416.
- Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.
- Nesterov, Y.; Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming; SIAM: Philadelphia, PA, USA, 1994.
- Renegar, J. A Mathematical View of Interior-Point Methods in Convex Optimization; SIAM: Philadelphia, PA, USA, 2001.
- Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer Science & Business Media: New York, NY, USA, 2003; Volume 87.
- Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B 2005, 67, 91–108.
- Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. A sparse-group lasso. J. Comput. Graph. Stat. 2013, 22, 231–245.
- Friedman, J.; Hastie, T.; Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv 2010, arXiv:1001.0736.
- Hoefling, H. A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Stat. 2010, 19, 984–1006.
- Friedman, J.; Hastie, T.; Höfling, H.; Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007, 1, 302–332.
- Tibshirani, R.J.; Taylor, J. The solution path of the generalized lasso. Ann. Stat. 2011, 39, 1335–1371.
- Johnson, N.A. A dynamic programming algorithm for the fused lasso and ℓ0-segmentation. J. Comput. Graph. Stat. 2013, 22, 246–260.
- Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67.
- Suzuki, J. Sparse Estimation with Math and R: 100 Exercises for Building Logic; Springer Nature: Berlin/Heidelberg, Germany, 2021.
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
- Barzilai, J.; Borwein, J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988, 8, 141–148.
- Nemirovski, A. Interior point polynomial time methods in convex programming. Lect. Notes 2004, 42, 3215–3224.
- Li, H.; Gui, J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 2006, 7, 302–317.
- Weylandt, M.; Nagorski, J.; Allen, G.I. Dynamic visualization and fast computation for convex clustering via algorithmic regularization. J. Comput. Graph. Stat. 2020, 29, 87–96.
- Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498–2504.
- Miller, L.D.; Smeds, J.; George, J.; Vega, V.B.; Vergara, L.; Ploner, A.; Pawitan, Y.; Hall, P.; Klaar, S.; Liu, E.T.; et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl. Acad. Sci. USA 2005, 102, 13550–13555.
| Model | ADMM | Proximal Newton | Proximal Gradient |
|---|---|---|---|
| GL [4] | [8] | [12] | [10] |
| JGL [13] | [13] | [23] (for fused penalty) | current paper (for fused and group penalties) |
| p | K | N | λ1 | λ2 | Precision | ADMM | FMGL | ISTA | M-ISTA |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 2 | 60 | 0.1 | 0.05 | 0.00001 | 10.506 s | 1.158 s | 2.174 s | 1.742 s |
| | 3 | | | | | 1.879 min | 4.267 s | 3.357 s | 3.668 s |
| | 5 | | 1 | 0.5 | | 1.123 min | 10.556 s | 4.216 s | 2.874 s |
| 30 | 2 | 120 | 0.1 | 0.05 | 0.0001 | 10.095 s | 5.259 s | 2.690 s | 4.857 s |
| | 3 | | | | | 2.014 min | 38.562 s | 14.722 s | 31.870 s |
| | 5 | | 1 | 0.5 | | 2.447 min | 15.819 s | 22.431 s | 12.113 s |
| 50 | 2 | 600 | 0.02 | 0.005 | 0.0001 | 6.427 s | 10.228 s | 7.213 s | 4.625 s |
| | | | 0.03 | | | 6.240 s | 8.925 s | 6.645 s | 4.023 s |
| | | | 0.04 | | | 7.025 s | 9.381 s | 6.144 s | 3.993 s |
| 200 | 2 | 400 | 0.09 | 0.05 | 0.0001 | 4.050 min | 1.874 min | 2.289 min | 35.038 s |
| | | | 0.1 | | | 4.569 min | 1.137 min | 1.340 min | 24.852 s |
| | | | 0.12 | | | 3.848 min | 1.881 min | 1.443 min | 18.367 s |
| Dataset | λ1 | λ2 | Precision | ADMM | FMGL | ISTA | M-ISTA |
|---|---|---|---|---|---|---|---|
| Speeches | 0.1 | 0.05 | 0.0001 | 19.969 s | 4.977 min | 11.829 s | 12.867 s |
| | 0.2 | 0.1 | | 4.661 min | 3.209 min | 11.560 s | 12.682 s |
| | 0.5 | 0.25 | | 5.669 min | 1.490 min | 11.043 s | 12.788 s |
| Breast cancer | 0.1 | 0.0166 | 0.0001 | 3.809 min | 7.937 min | 1.305 min | 1.158 min |
| | 0.2 | 0.02 | | 6.031 min | 5.198 min | 1.503 min | 1.230 min |
| | 0.3 | 0.03 | | 5.499 min | 2.265 min | 1.188 min | 1.061 min |