1 Introduction

Spatio-temporal count data relating to a set of n non-overlapping areal units for T consecutive time periods are prevalent in many fields, including epidemiology [1] and social science [2]. As geographical proximity can often indicate correlation, such data can be modelled as a graph, with vertices representing areas and edges joining pairs of areas that share a geographic boundary, whose data values are then assumed to be spatially correlated. This spatial correlation is then represented as a weight matrix arising from these adjacency relationships. However, such models are often not ideal representations, as geographical proximity does not always imply correlation [3]. Instead, Lee, Meeks and Pettersson [4] recently proposed a new method for addressing this issue by deriving a specific objective function (given in full in Sect. 2.2), and then searching for a spanning subgraph with no isolated vertices which maximises this function. Maximising this objective function corresponds to maximising the natural logarithm of the product of full conditional distributions over all vertices (corresponding to spatial units) in a conditional autoregressive model; further details are available in [4]. This objective function is highly non-linear: it rewards removing as few edges as possible, while applying a penalty that (non-linearly) increases as the difference between the weight of each vertex and the average weight over its neighbours increases. Lee, Meeks and Pettersson [4] studied an application of this problem where the geographical proximity data corresponds to a graph with 257 vertices; we note that an exhaustive search for an optimal subgraph would be intractable even with significantly fewer vertices. Instead, Lee, Meeks and Pettersson gave a heuristic for solving this problem, but pointed out that many standard techniques are not applicable to it, suggesting that it is challenging to efficiently find even one optimal subgraph.

In addition, the overarching approach when modelling spatial correlation in areal unit data is to fix the graph that corresponds to a spatial neighbourhood matrix in advance of undertaking the modelling. However, the neighbourhood matrix (based on the graph) determines the spatial correlation structure in the data, and hence should be estimated as a parameter in the model. Many approaches ignore this estimation problem by specifying a single value for the matrix, with one common specification being based on geographical contiguity (see e.g. [5,6,7]). This approach is naive in two respects: (i) it assumes that geographical contiguity is appropriate for the data without any checking of this assumption; and (ii) it takes no account of the uncertainty in the estimation of the matrix. This can be improved by using enumeration algorithms (i.e., algorithms that find all optimal subgraphs rather than just one) to model this uncertainty, which can be incorporated via a Bayesian modelling framework [8].

1.1 Our Contribution

We show that the problem of identifying a subgraph that maximises spatial correlation (formally defined in Sect. 2.2) is indeed NP-hard, even on planar graphs, and provide examples that illustrate two of the major challenges inherent in the problem: we cannot optimise independently on disjoint connected components, and we cannot iterate towards a solution. We also show that the decision variant of minimising the penalty portion of the objective function is NP-complete even when restricted to planar graphs with maximum degree at most five. We then investigate a simplification in which the goal is to find a subgraph with a penalty term of zero. We completely characterise all such subgraphs, and then show that the problem is solvable in linear time in the number of edges of the graph. However, we also show that finding a subgraph with a penalty term of zero on all vertices of degree two or more is NP-complete.

In the positive direction, we give two exact algorithms that are tractable in respective restricted settings, and which enumerate all optimal solutions. Both algorithms have connections to the maximum degree of the input graph: we note that, since a set of areal units is planar, most graphs that correspond to a given set of units will have a low average number of edges and a relatively low maximum degree. Clearly, the degree depends on the set of areal units, but as an example the set of 4835 lower super output areas in Greater London, England has a maximum degree of 15 and a mean degree of 6 [9].

Our first algorithm has a precalculation time of \(f(tw(G),\Delta (G)) \cdot n^{O(\Delta (G))}\), \(O(n^2)\) delay, and \(O(n^2)\) postcalculation time, where tw(G) and \(\Delta (G)\) are the treewidth and maximum degree of G respectively. Our second algorithm is only guaranteed to be correct if the underlying graph has maximum degree three: the consideration of such restrictions has been increasingly common since Lindgren et al. [10] developed their stochastic partial differential equations approach based on a triangulation of the spatial study region. Under these conditions, this second algorithm enumerates all optimal subgraphs with a precalculation time of \(f(k) \cdot n \log n\) and a linear delay between outputs, where k is the maximum number of edges that can be removed.

1.2 Paper Outline

Section 2 gives notation and definitions, the formal problem definition, and examples that illustrate two of the major challenges inherent in the problem. We then prove in Sect. 3 that, unless P = NP, there is no polynomial-time algorithm to solve the main optimisation problem, even when restricted to planar graphs. Section 4 then examines three simplifications of the problem. In Sect. 5 we introduce two algorithms to exactly solve the problem, and we finish with concluding thoughts and open problems in Sect. 6.

For readers who are familiar with the conference version of this paper [11], this paper extends said conference version by including complete proofs for all lemmas in Sect. 3, the complete reduction and all required prerequisite results for Theorem 4.6, the complete proof for Theorem 4.8, and the complete details for the two enumeration algorithms in Sect. 5, as well as their associated proofs and complexity calculations.

2 Background

In this section, we give the notation we need for this paper, define the problem, and then demonstrate why some common techniques from graph theory are not applicable to this problem.

2.1 Notation and Definitions

A graph is a pair \(G=(V,E)\), where the vertex set V is a finite set, and the edge set E is a set of unordered pairs of elements of V. Two vertices u and v are said to be adjacent if \(e = uv \in E\); u and v are said to be the endpoints of e. The neighbourhood of v in G is the set \(N_G(v) {:=}\{\,\,u \in V\mid uv \in E\,\}\), and the degree of v in G is \(d_G(v) {:=}|N_G(v)|\). An isolated vertex is a vertex of degree zero, and a leaf is a vertex of degree one. The maximum degree of a graph G is \(\Delta (G) {:=}\max _{v\in V} d_G(v)\). A graph \(H = ( V_H,E_H)\) is a subgraph of G if \(V_H \subseteq V\) and \(E_H \subseteq E\); H is a spanning subgraph of G if \(V_H = V\) so that H is obtained from G by deleting a (possibly empty) subset of edges. Given an edge e in E(G) (respectively a set \(E' \subseteq E(G)\)) we write \(G {\setminus } e\) (respectively \(G {\setminus } E'\)) for the subgraph of G obtained by deleting e (respectively deleting every element of \(E'\)).

A graph G is planar if it can be drawn in the plane (i.e., vertices can be mapped to points in the plane, and edges to curves in the plane whose extreme points are the images of their endpoints) in such a way that no two edges cross. Given any partition of a subset of the plane into regions, we can define a planar graph, commonly referred to as a dual graph, whose vertices are in bijection with the set of regions, and in which two vertices are adjacent if and only if the corresponding regions share a border of positive length. In particular, if each region (including the “outer” region) has three sides (i.e., the partition is a triangulation of the plane) then the resulting graph will have maximum degree three.

2.2 The Optimisation Problem

Following Lee, Meeks and Pettersson [4], we define the neighbourhood discrepancy of a vertex v in a graph H with weight function \(f:V(H) \rightarrow {\mathbb {Q}}\) (written \({{\,\textrm{ND}\,}}_H(v,f)\)) as

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v,f) {:=}\left( f(v) - \frac{\sum _{u \in N_H(v)} f(u)}{d_H(v)}\right) ^2. \end{aligned}$$

Intuitively, \({{\,\textrm{ND}\,}}_H(v,f)\) asks “how different is f(v) from the average of f(u) over all neighbours u of v?”; for a more complete background see [4]. We can now define the following optimisation problem, introduced by Lee et al. [4].

[Problem definition box: Correlation Subgraph Optimisation]

We will say that a subgraph H of G is valid if H is a spanning subgraph of G and \(\Delta (H) \ge 1\). Note that we do not require that H be a connected graph. Given a vertex v in the input graph G, we will sometimes refer to f(v) as the weight of v.
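To make the objective concrete, the following is a minimal sketch (our own illustration, not code from [4]) of how \({{\,\textrm{ND}\,}}_H(v,f)\) and the objective \({{\,\textrm{score}\,}}(H,f)\) could be computed for a valid subgraph H; the adjacency-dict representation and function names are our own assumptions.

```python
import math

def neighbourhood_discrepancy(H, f, v):
    """ND_H(v, f): squared difference between f(v) and the mean of f over N_H(v).
    H is an adjacency dict mapping each vertex to its (non-empty) list of neighbours,
    and f is a dict of vertex weights."""
    nbrs = H[v]
    return (f[v] - sum(f[u] for u in nbrs) / len(nbrs)) ** 2

def score(H, f):
    """score(H, f) = sum_v ln d_H(v) - n * ln( sum_v d_H(v) * ND_H(v, f) ), n = |V(H)|.
    Assumes H is valid (every degree at least one) and that the total weighted
    discrepancy is strictly positive, so that the logarithm is defined."""
    n = len(H)
    positive = sum(math.log(len(H[v])) for v in H)
    penalty = sum(len(H[v]) * neighbourhood_discrepancy(H, f, v) for v in H)
    return positive - n * math.log(penalty)
```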

2.3 Why Common Graph Algorithm Techniques Fail

This problem is particularly resistant to many approaches common in algorithmic graph theory; we describe two of these now. Firstly, on a disconnected graph G, combining optimal solutions on each connected component is not guaranteed to give an optimal solution on G. This is true even if there are only two connected components, one an isolated edge and the other a path, as illustrated in the following example.

Example 2.1

Consider the graph G consisting of a path on four vertices \((v_1,v_2,v_3,v_4)\) along with an isolated edge between vertices \(v_a\) and \(v_b\), as shown in Fig. 1, and let \(H = G{\setminus } \{\, v_2v_3 \, \}\). Note that H is the only valid subgraph of G other than G itself. Let f be defined as follows: \(f(v_1) = 0\), \(f(v_2) = 1\), \(f(v_3) = 10\), \(f(v_4) = 11\), \(f(v_a) = 0\), and \(f(v_b) = x\) for some real x. If \(x=1\) then \({{\,\textrm{score}\,}}(G,f) \simeq -23.93 < {{\,\textrm{score}\,}}(H,f) \simeq -10.75\), but if \(x=1000\) then \({{\,\textrm{score}\,}}(G,f) \simeq -85.67 > {{\,\textrm{score}\,}}(H,f) \simeq -87.05\).

Fig. 1

Graph for Example 2.1. The value of the function at each vertex is shown inside the respective vertex, where x can be either 1 (in which case \({{\,\textrm{score}\,}}(G,f) \simeq -23.93 < {{\,\textrm{score}\,}}(H,f) \simeq -10.75\)) or 1000 (in which case \({{\,\textrm{score}\,}}(G,f) \simeq -85.67 > {{\,\textrm{score}\,}}(H,f) \simeq -87.05\))

To understand why disconnected components can affect each other in such a manner, note that the negative term in the score function contains a logarithm of a sum of neighbourhood discrepancies. This means that the contribution of the neighbourhood discrepancies of any set of vertices to the total score depends on the total sum of the neighbourhood discrepancies across the whole graph. In other words, the presence of a large neighbourhood discrepancy elsewhere (even in a separate component) in the graph can reduce the impact of the neighbourhood discrepancy at a given vertex or set of vertices. However, the positive term in the score function is a sum of logarithms, so the contribution to the positive term from the degree of one vertex does not depend on any other part of the graph.
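As a quick numerical check of Example 2.1, the following snippet reuses the score sketch above (vertex names as in Fig. 1) and reproduces the reversal described in the caption of Fig. 1.

```python
# Example 2.1: a path v1-v2-v3-v4 together with the isolated edge va-vb.
def example_2_1(x):
    f = {'v1': 0, 'v2': 1, 'v3': 10, 'v4': 11, 'va': 0, 'vb': x}
    G = {'v1': ['v2'], 'v2': ['v1', 'v3'], 'v3': ['v2', 'v4'],
         'v4': ['v3'], 'va': ['vb'], 'vb': ['va']}
    H = dict(G, v2=['v1'], v3=['v4'])        # G with the edge v2v3 removed
    return score(G, f), score(H, f)

print(example_2_1(1))     # roughly (-23.93, -10.75): H beats G
print(example_2_1(1000))  # roughly (-85.67, -87.05): G beats H
```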

A reader might also be tempted to tackle this problem by identifying a “best” edge to remove and proceeding iteratively. The following example highlights that any algorithm using such a greedy approach may, in some cases, not find an optimal solution.

Example 2.2

Consider the graph G that is a path on six vertices labelled \(v_1, v_2, v_3,v_4, v_5\), and \(v_6\), with \(f(v_1) = 2019\), \(f(v_2) = 1999\), \(f(v_3) = 1000\), \(f(v_4) = 2000\), \(f(v_5) = 1001\), and \(f(v_6) = 981\), as shown in Fig. 2. Let \(H = G{\setminus } \{\, v_2v_3, v_4v_5\, \}\), and let \(H' = G{\setminus } \{\, v_3v_4\, \}\). The maximum score that can be achieved by removing only one edge is obtained by removing the edge \(v_3v_4\) to create \(H'\), where \({{\,\textrm{score}\,}}(H',f) \simeq -88.01\). However, the optimal solution to Correlation Subgraph Optimisation on G is H, obtained by removing the edges \(v_2v_3\) and \(v_4v_5\), which achieves \({{\,\textrm{score}\,}}(H,f) \simeq -87.06\).

Fig. 2

Graph for Example 2.2. The value of the function at each vertex is shown inside the respective vertex

3 Hardness on Planar Graphs

In this section, we prove NP-hardness of Correlation Subgraph Optimisation on planar graphs.

Theorem 3.1

Correlation Subgraph Optimisation on planar graphs is NP-hard.

We prove this result by means of a reduction from the following problem, shown to be NP-complete in [12]; the incidence graph \(G_{\Phi }\) of a CNF formula \(\Phi \) is a bipartite graph whose vertex sets correspond to the variables and clauses of \(\Phi \) respectively, and in which a variable x and clause C are connected by an edge if and only if x appears in C.

[Problem definition box: Cubic Planar Monotone 1-in-3 SAT]

We will prove NP-hardness by showing how, using a computable function, one can take an arbitrary instance I of Cubic Planar Monotone 1-in-3 SAT  and create in polynomial time an instance \(I'\) of Correlation Subgraph Optimisation such that I is a yes-instance of Cubic Planar Monotone 1-in-3 SAT if and only if \(I'\) is a yes-instance of Correlation Subgraph Optimisation.

We begin by describing the construction of a graph G and function \(f:V(G) \rightarrow {\mathbb {N}}\) corresponding to the formula \(\Phi \) in an instance of Cubic Planar Monotone 1-in-3 SAT; the construction will be defined in terms of an integer parameter \(t \ge 1\) whose value we will determine later. Note that G is not the incidence graph \(G_\Phi \) of \(\Phi \), but instead \(G_\Phi \) is a graph minor of G.

Suppose that \(\Phi \) has variables \(x_1,\ldots ,x_n\) and clauses \(C_1,\ldots ,C_m\). Since every variable appears in exactly three clauses and each clause contains exactly three variables, we must have \(m=n\).

For each variable \(x_i\), G contains a variable gadget on \(3t^2 + 6t + 8\) vertices. The non-leaf vertices of the gadget are:

  • \(u_i\), with \(f(u_i) = 7t\),

  • \(v_i\), with \(f(v_i) = 4t\),

  • \(z_i\) with \(f(z_i) = t\),

  • \(z_i'\) with \(f(z_i') = 4t\), and

  • \(w_{i,j}\) for each \(j \in \{\, 1,2,3\, \}\), with \(f(w_{i,j}) = 3t\).

The vertex \(v_i\) is adjacent to \(u_i\), \(z_i\) and each \(w_{i,j}\) with \(j \in \{\, 1,2,3\, \}\); \(z_i\) is adjacent to \(z_i'\). We add leaves to this gadget as follows:

  • \(u_i\) has 3t pendant leaves, each assigned value \(7t+1\) by f;

  • \(z_i\) has 3t pendant leaves, each assigned value \(t-1\) by f;

  • \(z_i'\) has \(3t^2\) pendant leaves, each assigned value 4t by f;

  • each vertex \(w_{i,j}\) has exactly one pendant leaf, assigned value 3t by f.

For each clause \(C_j\), G contains a clause gadget on \(t^2+2\) vertices: \(a_j\) and \(a_j'\), which are adjacent, and \(t^2\) pendant leaves adjacent to \(a_j'\). We set \(f(a_j)=2t\), and f takes value t on \(a_j'\) and all of its leaf neighbours.

We complete the definition of G by specifying the edges with one endpoint in a variable gadget and the other in a clause gadget: if the variable \(x_i\) appears in clauses \(C_{r_1}\), \(C_{r_2}\) and \(C_{r_3}\), with \(r_1< r_2 < r_3\), then we have edges \(w_{i,1}a_{r_1}\), \(w_{i,2}a_{r_2}\) and \(w_{i,3}a_{r_3}\). The construction of the variable and clause gadgets is illustrated in Fig. 3.

Fig. 3

Construction of the variable and clause gadgets. We later show that \(x_i\) being true is equivalent to \(N_H(v_i) = \{\, u_i, w_{i,1}, w_{i,2}, w_{i,3}\, \}\) (i.e., H contains the red edges but not the blue edges), and \(x_i\) being false is equivalent to \(N_H(v_i) = \{\, u_i, z_i\, \}\) (i.e., H contains the blue edges but not the red edges)
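For concreteness, the following sketch (our own, not the authors' code) assembles the pair (G, f) from \(\Phi \) as described above and illustrated in Fig. 3. It assumes \(\Phi \) is given as a list of clauses, each a triple of variable indices in \(\{1,\ldots ,n\}\); the adjacency-dict representation and tuple vertex names are our own.

```python
def build_section3_instance(clauses, n, t):
    """Sketch of the construction above. `clauses` is a list of m triples of
    variable indices in 1..n, with each variable appearing in exactly three
    clauses; `t` is the integer parameter. Vertices are named by tuples such as
    ('w', i, j); leaves are ('leaf', v, s)."""
    G, f = {}, {}

    def add_vertex(v, weight):
        G[v], f[v] = [], weight

    def add_edge(u, v):
        G[u].append(v)
        G[v].append(u)

    def add_leaves(v, count, weight):
        for s in range(count):
            leaf = ('leaf', v, s)
            add_vertex(leaf, weight)
            add_edge(v, leaf)

    for i in range(1, n + 1):                       # variable gadgets
        add_vertex(('u', i), 7 * t); add_leaves(('u', i), 3 * t, 7 * t + 1)
        add_vertex(('v', i), 4 * t)
        add_vertex(('z', i), t);      add_leaves(('z', i), 3 * t, t - 1)
        add_vertex(('zp', i), 4 * t); add_leaves(('zp', i), 3 * t * t, 4 * t)
        add_edge(('v', i), ('u', i))
        add_edge(('v', i), ('z', i))
        add_edge(('z', i), ('zp', i))
        for j in (1, 2, 3):
            add_vertex(('w', i, j), 3 * t)
            add_leaves(('w', i, j), 1, 3 * t)
            add_edge(('v', i), ('w', i, j))

    for r in range(1, len(clauses) + 1):            # clause gadgets
        add_vertex(('a', r), 2 * t)
        add_vertex(('ap', r), t); add_leaves(('ap', r), t * t, t)
        add_edge(('a', r), ('ap', r))

    seen = {i: 0 for i in range(1, n + 1)}          # edges between gadgets:
    for r, clause in enumerate(clauses, start=1):   # the j-th occurrence of x_i
        for i in clause:                            # (in clause-index order)
            seen[i] += 1                            # uses w_{i,j}
            add_edge(('w', i, seen[i]), ('a', r))

    return G, f
```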

Recall that a subgraph H of G is valid if H is a spanning subgraph of G and \(d_H(v) \ge 1\) for all \(v \in V\). Recall that the neighbourhood discrepancy of a vertex v with respect to f in a valid subgraph H, written \({{\,\textrm{ND}\,}}_H(v,f)\), is

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v,f) {:=}\left( f(v) - \frac{\sum _{u \in N_H(v)} f(u)}{d_H(v)}\right) ^2. \end{aligned}$$

The goal of Correlation Subgraph Optimisation is therefore to maximise

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f) {:=}\sum _{v \in V} \ln d_H(v) - n \ln \left[ \sum _{v \in V}d_H(v) {{\,\textrm{ND}\,}}_H(v,f)\right] , \end{aligned}$$

over all valid subgraphs H of G.

We now prove several properties of valid subgraphs of G.

Lemma 3.2

For any valid subgraph H,

$$\begin{aligned} \sum _{\text { a leaf } u \text { in } G} {{\,\textrm{ND}\,}}_H(u,f) = 6nt. \end{aligned}$$

Proof

Note that the neighbourhood of any leaf vertex must be the same in G and in any valid subgraph H; it therefore suffices to determine \(\sum _{\text { a leaf } u \text { in } G} {{\,\textrm{ND}\,}}_G(u,f)\). In each variable gadget, there are \(3t^2 + 3\) leaves whose neighbourhood discrepancy in G is zero, and 6t leaves for which the neighbourhood discrepancy is one. This gives a contribution to the sum of 6t for every variable gadget, a total of 6nt over all such gadgets. In each clause gadget there are \(t^2\) leaves, each with neighbourhood discrepancy zero. \(\square \)

Lemma 3.3

For any valid subgraph H, and every \(1 \le i \le n\),

$$\begin{aligned} 0 \le {{\,\textrm{ND}\,}}_H(z_i',f), {{\,\textrm{ND}\,}}_H(a_i',f) < 1/t^2. \end{aligned}$$

Proof

The first inequality is immediate from the definition of neighbourhood discrepancy. For the second inequality, observe that we only have \({{\,\textrm{ND}\,}}_H(z_i',f) > 0\) if the edge \(z_iz_i'\) belongs to H. By validity of H, note that \(z_i'\) must in this case have the same neighbourhood in G and in H. Therefore we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(z_i',f) = \left( 4t - \frac{3t^2 \cdot 4t + t}{3t^2 + 1}\right) ^2 =&\left( \frac{12t^3 + 4t - 12t^3 - t}{3t^2 + 1}\right) ^2 \\ =&\left( \frac{3t}{3t^2 + 1}\right) ^2 < 1/t^2. \end{aligned}$$

Similarly, we only have \({{\,\textrm{ND}\,}}_H(a_i',f) > 0\) if the edge \(a_ia_i'\) belongs to H. In this case,

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(a_i',f) = \left( t - \frac{2t + t^3}{t^2+1}\right) ^2 = \left( \frac{t^3 + t - 2t - t^3}{t^2+1}\right) ^2 < 1/t^2. \end{aligned}$$

\(\square \)

Lemma 3.4

For any valid subgraph H,

$$\begin{aligned} \sum _{v \in V} \ln d_H(v) \ge 6n \ln t + 2n. \end{aligned}$$

Proof

Consider first a single variable gadget, corresponding to the variable \(x_i\), and note that every edge incident with a leaf must be present in H. We therefore conclude that

  • \(d_H(z_i) \ge 3t\),

  • \(d_H(z_i') \ge 3t^2\), and

  • \(d_H(u_i) \ge 3t\).

It follows that the contribution to the sum from this gadget is at least

$$\begin{aligned} \ln 3t + \ln 3t^2 + \ln 3t&\ge \ln 3 + \ln t + 2 \ln t + \ln 3 + \ln t + \ln 3 \\&\ge 4 \ln t + 2. \end{aligned}$$

Now consider a single clause gadget, corresponding to the clause \(C_j\). Invoking the validity of H again, we observe that \(d_H(a_j') \ge t^2\), so the contribution from this gadget is at least \(2\ln t\).

Summing over all variables and clauses gives the result. \(\square \)

Lemma 3.5

Let H be any subgraph of G (not necessarily valid). Then

$$\begin{aligned} \sum _{v \in V} \ln d_H(v) \le 6n \ln t + 20n. \end{aligned}$$

Proof

Note that the sum is maximised when H is equal to G, so it suffices to show that the right-hand side gives an upper bound on \(\sum _{v \in V} \ln d_G(v)\).

Note that, for \(x \ge 3\), \(\ln (x+1)< \ln (x+2) < \ln (x) + 1\). For a single variable gadget the contribution to this sum is then:

$$\begin{aligned} \ln (3t^2 + 1) +&\ln (3t + 2) + \ln 5 + \ln (3t + 1) + 3 \ln 3 \\&\le 2 \ln t + \ln 3 + 1 + \ln t + \ln 3 + 1 + \ln 5 + \ln t + \ln 3 + 1 + 6 \\&\le 4 \ln t + 17. \end{aligned}$$

For each clause gadget, the contribution to this sum is

$$\begin{aligned} \ln (t^2 + 1) + \ln 4 \le 2 \ln t + 3. \end{aligned}$$

Summing over all variables and clauses gives

$$\begin{aligned} \sum _{v \in V} \ln d_G(v) \le 6n \ln t + 20n, \end{aligned}$$

as required. \(\square \)

Lemma 3.6

If \(\Phi \) is satisfiable, there is a valid subgraph H such that for all \(v \in V \setminus \{\, z_i', a_i'\mid 1 \le i \le n\, \}\) with \(d_G(v) > 1\) we have \({{\,\textrm{ND}\,}}_H(v,f) = 0\).

Proof

Let \(b:\{\, x_1,\ldots ,x_n\, \} \rightarrow \{\, \texttt {true},\texttt {false}\, \}\) be a truth assignment such that, for each clause \(C_j\) in \(\Phi \), exactly one variable in \(C_j\) evaluates to \(\texttt {true}\) under b (as a reminder, we are reducing from Cubic Planar Monotone 1-in-3 SAT so no variable appears negated in any clause). We define a valid subgraph H of G with reference to b. In the following, unless otherwise noted, i ranges between 1 and n inclusive.

The subgraph H contains all edges within each clause gadget (i.e., all edges incident with \(a_i'\) for each i). Let \(G_i\) be the variable gadget corresponding to \(x_i\), and \(H_i\) the subgraph of H induced by the same set of vertices. If \(b(x_i) = \texttt {true}\), we set \(H_i\) to be \(G_i \setminus z_iv_i\). If \(b(x_i) = \texttt {false}\), we set \(H_i\) to be \(G_i {\setminus } \{\, v_iw_{i,1},v_iw_{i,2},v_iw_{i,3},z_iz_i'\, \}\).

To complete the definition of H, we define the set of edges in H that have one endpoint in a variable gadget and one in a clause gadget: an edge \(w_{i,\ell }a_j\) of G belongs to H if and only if \(b(x_i) = \texttt {true}\). It is easy to verify that H is a valid subgraph; it remains to demonstrate that \({{\,\textrm{ND}\,}}_H(v,f) = 0\) for the required vertices.

First consider \(u_i\), for any i. The neighbourhood of \(u_i\) does not depend on the value of \(b(x_i)\), so in all cases we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(u_i,f) = \left( 7t - \frac{3t(7t+1) + 4t}{3t+1} \right) ^2 = \left( \frac{21t^2 + 7t - 21t^2 - 3t -4t}{3t + 1} \right) ^2 = 0. \end{aligned}$$

Now consider \(v_i\), for any i. If \(b(x_i) = \texttt {true}\), then \(N_H(v_i) = \{\, u_i,w_{i,1},w_{i,2},w_{i,3}\, \}\) and

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( 4t - \frac{7t + 3t + 3t + 3t}{4} \right) ^2 = 0. \end{aligned}$$

On the other hand, if \(b(x_i) = \texttt {false}\), then \(N_H(v_i) = \{\, u_i,z_i\, \}\) and

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( 4t - \frac{7t + t}{2} \right) ^2 = 0. \end{aligned}$$

Next consider \(z_i\), for any i. Note that \(|N_H(z_i) \cap \{\, z_i',v_i\, \} |= 1\) for either value of \(b(x_i)\), and moreover that \(z_i\) is always adjacent to its leaf neighbours, so in both cases we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(z_i,f) = \left( t - \frac{3t(t-1) + 4t}{3t + 1} \right) ^2 = \left( \frac{3t^2 + t - 3t^2 + 3t - 4t}{3t + 1} \right) ^2 = 0. \end{aligned}$$

Next consider \(w_{i,j}\), for any i and j. In all cases, \(w_{i,j}\) is adjacent to its leaf neighbour. If \(b(x_i) = \texttt {false}\), this is the only neighbour of \(w_{i,j}\), and it is clear that \({{\,\textrm{ND}\,}}_H(w_{i,j},f) = 0\). If, on the other hand, \(b(x_i) = \texttt {true}\) then \(w_{i,j}\) is additionally adjacent to both \(v_i\) and \(a_j\) for some clause \(C_j\). In this case we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(w_{i,j},f) = \left( 3t - \frac{3t + 4t + 2t}{3} \right) ^2 = 0. \end{aligned}$$

Finally, consider \(a_j\), for any j. By the definition of H, \(a_j\) is adjacent to \(a_j'\) and, since \(C_j\) contains exactly one variable that evaluates to true under b, to exactly one vertex \(w_{i,\ell }\) for some values of i and \(\ell \); we therefore have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(a_j,f) = \left( 2t - \frac{t + 3t}{2}\right) ^2 = 0. \end{aligned}$$

\(\square \)

Lemma 3.7

If \(\Phi \) is not satisfiable, then for any valid subgraph H, there exists a vertex \(v \in V {\setminus } \{\, z_i',a_i'\mid 1 \le i \le n \, \}\) with \(d_G(v) > 1\) such that

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v,f) \ge t^2/9. \end{aligned}$$

Proof

Suppose for a contradiction that there is a valid subgraph H so that for every such vertex v we have \({{\,\textrm{ND}\,}}_H(v,f) < t^2/9\).

We begin by arguing that \(N_H(v_i) \in \{\,\{u_i,z_i\}, \{u_i,w_{i,1},w_{i,2},w_{i,3}\}\, \}\); this will allow us to construct a truth assignment based on H. We first argue that we must have \(u_i \in N_H(v_i)\): to see this, observe that if this is not the case then \(\sum _{u \in N_H(v_i)} f(u)/d_H(v_i) \le 3t\) so we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( f(v_i) - \frac{\sum _{u\in N_H(v_i)} f(u)}{d_H(v_i)} \right) ^2 \ge \left( 4t - 3t \right) ^2 = t^2 > t^2/9. \end{aligned}$$

Now suppose \(z_i \in N_H(v_i)\). If we also have \(w_{i,j} \in N_H(v_i)\) for some j, then

$$\begin{aligned} \frac{\sum _{u \in N_H(v_i)}f(u)}{d_H(v_i)} \le \frac{7t + t + 3t}{3} = \frac{11t}{3}, \end{aligned}$$

so

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( f(v_i) -\frac{\sum _{u\in N_H(v_i)} f(u)}{d_H(v_i)} \right) ^2 \ge \left( 4t - 11t/3\right) ^2 = (t/3)^2 = t^2/9, \end{aligned}$$

a contradiction; it follows that if \(z_i \in N_H(v_i)\) then \(N_H(v_i) = \{\, u_i,z_i\, \}\). Now suppose that \(z_i \notin N_H(v_i)\). In this case, if \(N_H(v_i) \ne \{\, u_i,w_{i,1},w_{i,2},w_{i,3} \,\}\), we have \(N_H(v_i) = \{\, u_i\, \} \cup W\) where \(W \subsetneq \{\, w_{i,1}, w_{i,2},w_{i,3}\, \}\). Then as \(f(w_{i,1}) = f(w_{i,2}) = f(w_{i,3}) = 3t\) we get

$$\begin{aligned} \frac{\sum _{u \in N_H(v_i)}f(u)}{d_H(v_i)} \in \left\{ \,\frac{7t}{1}, \frac{7t+3t}{2}, \frac{7t+3t+3t}{3} \,\right\} , \end{aligned}$$

leading to

$$\begin{aligned} f(v_i) - \frac{\sum _{u \in N_H(v_i)}f(u)}{d_H(v_i)} \in \left\{ \,-3t, -t, \frac{-t}{3}\, \right\} , \end{aligned}$$

so

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( f(v_i) - \frac{\sum _{u \in N_H(v_i)}f(u)}{d_H(v_i)}\right) ^2 \ge \left( \frac{-t}{3}\right) ^2 = t^2/9, \end{aligned}$$

again giving a contradiction.

We therefore conclude that, for each i, \(N_H(v_i) \in \{\, \{u_i,z_i\}, \{u_i,w_{i,1},w_{i,2},w_{i,3}\}\, \}\). We now define a truth assignment \(b:\{\, x_1,\ldots ,x_n\, \} \rightarrow \{\, \texttt {true},\texttt {false}\, \}\) by setting

$$\begin{aligned} b(x_i) = {\left\{ \begin{array}{ll} \texttt {true}&{}{} \text{ if }\ N_H(v_i) = \{\, u_i,w_{i,1},w_{i,2},w_{i,3} \,\},\\ \texttt {false}&{}{} \text{ if }\ N_H(v_i) = \{\, u_i,z_i \,\}. \end{array}\right. } \end{aligned}$$

Since \(\Phi \) is not satisfiable, there is at least one clause \(C_j\) in \(\Phi \) such that, under b, the number of variables in \(C_j\) evaluating to \(\texttt {true}\) is either zero, two or three.

Suppose first that no variable in \(C_j\) evaluates to \(\texttt {true}\) under b. If \(a_j\) is not adjacent in H to any vertex \(w_{i,\ell }\), then \({{\,\textrm{ND}\,}}_H(a_j,f) = t^2\), a contradiction. Therefore \(a_j\) is adjacent to at least one vertex \(w_{i,\ell }\) where \(b(x_i) = \texttt {false}\). By definition of b, \(w_{i,\ell }\) is not adjacent to \(v_i\), so we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(w_{i,\ell },f) \ge \left( 3t - \frac{3t + 2t}{2}\right) ^2 = t^2/4, \end{aligned}$$

a contradiction.

We can therefore conclude that at least two variables in \(C_j\) evaluate to true under b. Suppose two true variables appearing in \(C_j\) are \(x_i\) and \(x_{\ell }\), and that \(w_{i,r}\) and \(w_{\ell ,s}\) are adjacent to \(a_j\) in G. We begin by arguing that both \(w_{i,r}\) and \(w_{\ell ,s}\) must be adjacent to \(a_j\) in H. Suppose for a contradiction (without loss of generality) that \(w_{i,r}a_j \notin E(H)\). In this case, since \(b(x_i) = \texttt {true}\), we know that \(v_iw_{i,r} \in E(H)\), so we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(w_{i,r},f) = \left( 3t - \frac{3t + 4t}{2}\right) ^2 = t^2/4, \end{aligned}$$

giving the required contradiction. We therefore conclude that \(w_{i,r}\) and \(w_{\ell ,s}\) are adjacent to \(a_j\) in H, but in this case we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(a_j,f) \ge \left( 2t - \frac{t + 3t + 3t}{3}\right) ^2 = t^2/9, \end{aligned}$$

again giving a contradiction and completing the proof. \(\square \)

We now give bounds on the possible values for \({{\,\textrm{score}\,}}(H,f)\) depending on whether or not \(\Phi \) is satisfiable.

Lemma 3.8

If \(\Phi \) is satisfiable, there is a valid subgraph H with

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f) \ge 6 n \ln t - n \ln (12nt). \end{aligned}$$

Proof

By Lemma 3.6, in this case there is a valid subgraph H so that \({{\,\textrm{ND}\,}}_H(v,f) = 0\) for all vertices v with degree greater than one in G, other than vertices \(z_i'\) for some i and \(a_j'\) for some j. It follows that for this choice of H

$$\begin{aligned} \sum _{v \in V}d_H(v) {{\,\textrm{ND}\,}}_H(v,f) =&\sum _{\text { a leaf }v\text { in }G} {{\,\textrm{ND}\,}}(v,f) + \sum _i d_H(z_i') {{\,\textrm{ND}\,}}_H(z_i',f) \\ {}&+ \sum _j d_H(a_j') {{\,\textrm{ND}\,}}_H(a_j',f). \end{aligned}$$

Applying Lemmas 3.2 and 3.3, we see that

$$\begin{aligned} \sum _{v \in V}d_H(v) {{\,\textrm{ND}\,}}_H(v,f)&\le 6nt + n \cdot (3t^2 + 1) \cdot 1/t^2 + n \cdot (t^2 + 1) \cdot 1/t^2 \\ {}&\le 6nt + 4n + 2n/t^2. \end{aligned}$$

By Lemma 3.4 and since \(t\ge 1\), we have

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f)&\ge 6 n \ln t + 2n - n \ln (6nt + 4n + 2n/t^2) \\&> 6 n \ln t - n \ln (12nt) \end{aligned}$$

as required.\(\square \)

Lemma 3.9

If \(\Phi \) is not satisfiable, then for every valid subgraph H we have

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f) \le 6n \ln t + 20n - n \ln (t^2/9). \end{aligned}$$

Proof

This follows immediately from Lemmas 3.5 and 3.7. \(\square \)

We are now ready to prove Theorem 3.1, which we restate here for convenience.

Theorem 3.1

Correlation Subgraph Optimisation on planar graphs is NP-hard.

Proof

We suppose for a contradiction that there is a polynomial-time algorithm \({\mathcal {A}}\) to solve Correlation Subgraph Optimisation on planar graphs, and show that this would allow us to solve Cubic Planar Monotone 1-in-3 SAT  in polynomial time.

Given an instance \(\Phi \) of Cubic Planar Monotone 1-in-3 SAT, where we will assume without loss of generality that \(\Phi \) has \(n > e^{47}\) variables, we proceed as follows. First construct the pair \((G,f)\) as defined above, taking \(t = n^2\); it is clear that this can be done in time polynomial in \(|\Phi |\). Note that G is planar: to see this, observe that repeatedly deleting vertices of degree one gives a subdivision of the incidence graph \(G_{\Phi }\), which is planar by assumption. We then run \({\mathcal {A}}\) on \((G,f)\) and return YES if the output is at least \(\frac{17}{2} n \ln n\), and NO otherwise.

It remains to demonstrate that this procedure gives the correct answer. Suppose first that \(\Phi \) is satisfiable. In this case, by Lemma 3.8, we know that there exists a subgraph H of G with

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f)&\ge 6n\ln t - n \ln (12nt)\\&= 6n \ln n^2 - n \ln (12n^3) \\&= 12 n \ln n - 3 n \ln n - n \ln 12 \\&\ge 9 n\ln n - 3n\\&> \frac{17}{2} n \ln n, \end{aligned}$$

since \(3 < \ln n / 2\), so our procedure returns YES.

Conversely, suppose that \(\Phi \) is not satisfiable. In this case, by Lemma 3.9 we know that, for every valid subgraph H we have

$$\begin{aligned} {{\,\textrm{score}\,}}(H,f)&\le 6n \ln t + 20n - n \ln (t^2/9)\\&= 6n \ln n^2 + 20n - n \ln (n^4/9)\\&= 12n \ln n + 20n - 4n \ln n + n \ln 9\\&\le 8 n \ln n + 23n\\&< \frac{17}{2}n \ln n, \end{aligned}$$

since \(23 < \ln n / 2\) as \(n > e^{47}\), so our procedure returns NO. \(\square \)

4 Simplifications of the Problem

One may wonder if the hardness of Correlation Subgraph Optimisation is due to the interplay between the two parts of the objective function. We show in Sect. 4.1 that just determining if there is a valid subgraph with total neighbourhood discrepancy below some given constant is NP-complete, even if the input graph is planar and has maximum degree at most five. In Sect. 4.2, we show that subgraphs that have zero neighbourhood discrepancy everywhere (if they exist) can be found in time linear in the number of edges; however, finding subgraphs that have zero neighbourhood discrepancy everywhere except at leaves is NP-complete.

4.1 Minimising Neighbourhood Discrepancy

Consider the following problem, which requires us to minimise only the neighbourhood discrepancy.

[Problem definition box: Average Value Neighbourhood Optimisation]

First observe that Average Value Neighbourhood Optimisation is clearly in NP. We will show the NP-hardness of Average Value Neighbourhood Optimisation by giving a reduction from Cubic Planar Monotone 1-in-3 SAT, which we used earlier in Sect. 3 and which was shown to be NP-complete in [12]. As a reminder, the incidence graph \(G_{\Phi }\) of a CNF formula \(\Phi \) is a bipartite graph whose vertex sets correspond to the variables and clauses of \(\Phi \) respectively, and in which a variable x and clause C are connected by an edge if and only if x appears in C.

Let \(\Phi = C_1 \wedge \cdots \wedge C_m\) be the input to an instance of Cubic Planar Monotone 1-in-3 SAT, where each clause \(C_j\) is of the form \((x_{j_1},x_{j_2},x_{j_3})\), and suppose that the variables appearing in \(\Phi \) are \(x_1,\ldots ,x_n\). We will construct an instance \((G,f,k)\) of Average Value Neighbourhood Optimisation which is a yes-instance if and only if \(\Phi \) is a yes-instance for Cubic Planar Monotone 1-in-3 SAT.

The graph G consists of variable gadgets and clause gadgets, with some edges between variable and clause gadgets. For each variable \(x_i\), we have a variable gadget \(G_i\), as illustrated in Fig. 4. \(G_i\) consists of 13 vertices:

  • \(v_i\), with \(f(v_i) = 4\);

  • \(u_i\), with \(f(u_i) = 7\);

  • \(z_i\), with \(f(z_i) = 1\), and \(z_i'\), with \(f(z_i')=4\);

  • \(w_{i,j}\) for \(j \in \{\, 1,2,3\, \}\), with \(f(w_{i,j}) = 3\);

  • the triangle vertices \(w_{i,j}[\ell ]\) for \(j \in \{\,1,2,3\,\}\) and \(\ell \in \{\,1,2\,\}\), with \(f(w_{i,j}[\ell ])=3\).

The gadget \(G_i\) also contains the following edges:

  • \(v_iu_i\), \(v_iz_i\) and \(z_iz_i'\);

  • \(v_iw_{i,j}\) for \(j \in \{\,1,2,3\,\}\);

  • \(w_{i,j}w_{i,j}[1]\), \(w_{i,j}w_{i,j}[2]\) and \(w_{i,j}[1]w_{i,j}[2]\) for \(j \in \{\,1,2,3\,\}\).

For each clause \(C_j\), we have a clause gadget \(H_j\), which consists of two vertices \(a_j\) and \(b_j\) joined by an edge; we set \(f(a_j) = 2\) and \(f(b_j)=1\).

Fig. 4

The construction of the variable gadget \(G_i\); the number in each vertex indicates the corresponding value of the function f. We will later show that the variable \(x_i\) being set true corresponds to the removal of the edge \(z_iv_i\), and the variable \(x_i\) being set false corresponds to the removal of the three edges of the form \(v_iw_{i,j}\) for \(j\in \{\,1,2,3\,\}\)

We complete the construction of G by describing the edges between clause and variable gadgets. For each variable \(x_i\), let the three distinct clauses that \(x_i\) appears in be \(C_{h_1}\), \(C_{h_2}\), and \(C_{h_3}\), with \(h_1< h_2 < h_3\). Then, for each \(j\in \{\,1,2,3\,\}\), we insert the edge \(w_{i,j}a_{h_j}\) into G. Finally, we set \(k = 27n + m\). It is clear that we can construct \((G,f,k)\) from \(\Phi \) in polynomial time.

It is straightforward to verify that the maximum degree of G is 5. To see that G is planar, first note that (repeatedly) adding or removing vertices of degree one does not change the planarity of a graph. Thus G is planar if and only if the graph \(G'\), obtained by deleting the vertices \(u_i\), \(z_i'\) and \(z_i\) from each variable gadget \(G_i\) and \(b_j\) from each clause gadget, is planar. Moreover, it is clear that \(G'\) is planar if and only if the graph \(G''\), obtained from \(G'\) by deleting all triangle vertices (i.e., vertices of the form \(w_{i,j}[\ell ]\)), is planar. But \(G''\) is a subdivision of the incidence graph \(G_{\Phi }\) which, as we are reducing from Cubic Planar Monotone 1-in-3 SAT, is planar. We can therefore conclude that G is indeed planar.

We now argue that \((G,f,k)\) is a yes-instance if and only if \(\Phi \) is a yes-instance. Recall that H is a valid subgraph of G if \(V(H) = V(G)\) and \(d_H(v) \ge 1\) for all \(v \in V(G)\). We now argue that the neighbourhood discrepancies of certain vertices in V are independent of our choice of valid subgraph H.

Lemma 4.1

Let H be any valid subgraph of G. Then, for all \(i \in \{\,1,\ldots ,n\,\}\),

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(u_i,f) = {{\,\textrm{ND}\,}}_H(z_i,f) = {{\,\textrm{ND}\,}}_H(z_i',f) = 9. \end{aligned}$$

Proof

First consider \(u_i\). Since \(d_G(u_i) = 1\), we know that \(N_H(u_i) = \{\,v_i\,\}\), so \({{\,\textrm{ND}\,}}_H(u_i,f) = (f(u_i) - f(v_i))^2 = (7-4)^2 = 9\). Similarly, we know that \(N_H(z_i') = \{\,z_i\,\}\), so \({{\,\textrm{ND}\,}}_H(z_i',f) = (f(z_i') - f(z_i))^2 = (4-1)^2 = 9\). Finally, we know that \(\emptyset \ne N_H(z_i) \subseteq \{\,z_i',v_i\,\}\), where \(f(z_i') = f(v_i) = 4\), so \(\sum _{y \in N_H(z_i)} f(y) / d_H(z_i) = 4\). Thus we have that \({{\,\textrm{ND}\,}}_H(z_i,f) = (1 - 4)^2 = 9\). \(\square \)

Lemma 4.2

Let H be any valid subgraph of G. Then, for all \(j \in \{\,1,\ldots ,m\,\}\), \({{\,\textrm{ND}\,}}_H(b_j,f) = 1\).

Proof

Since \(N_G(b_j) = \{\,a_j\,\}\), we must also have \(N_H(b_j) = \{\,a_j\,\}\). It follows that \({{\,\textrm{ND}\,}}_H(b_j,f) = (f(b_j) - f(a_j))^2 = (1 - 2)^2 = 1\). \(\square \)

Lemma 4.3

Let H be any valid subgraph of G. Then, for all \(i \in \{\,1,\ldots ,n\,\}\), \(j \in \{\,1,2,3\,\}\) and \(\ell \in \{\,1,2\,\}\), \({{\,\textrm{ND}\,}}_H(w_{i,j}[\ell ],f) = 0\).

Proof

The claim follows immediately from the observation that, for all \(y \in N_G(w_{i,j}[\ell ])\) (and hence in \(N_H(w_{i,j}[\ell ])\)), we have \(f(y) = f(w_{i,j}[\ell ])\). \(\square \)

By Lemmas 4.1, 4.2 and 4.3, we see that \(\sum _{v\in V(H)} {{\,\textrm{ND}\,}}_H(v,f) \ge 27n + m\) for any valid subgraph H, so \((G,f,k)\) is a yes-instance if and only if there is some valid subgraph H such that

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v,f) = 0 \text { for all } v \in \left\{ \,v_i, w_{i,1}, w_{i,2}, w_{i,3}, a_j\,\right\} , \end{aligned}$$

for all \(i \in \{\,1,\ldots ,n\,\}\) and \(j \in \{\,1,\ldots ,m\,\}\). We will say that such a subgraph H is good.

We first argue that, if \(\Phi \) is a yes-instance, there is a good subgraph H.

Lemma 4.4

Suppose that there is an assignment \(g:\{\,x_1,\ldots ,x_n\,\} \rightarrow \{\,\texttt {true},\texttt {false}\,\}\) such that every clause \(C_j\) contains precisely one true variable. Then there is a good subgraph H.

Proof

We construct H by deleting the following edges from G. In each variable gadget \(G_i\) such that \(g(x_i) = \texttt {true}\), we delete the edges \(v_iz_i\) and \(w_{i,j}w_{i,j}[\ell ]\) for each \(j \in \{\,1,2,3\,\}\) and \(\ell \in \{\,1,2\,\}\). In each variable gadget \(G_i\) such that \(g(x_i) = \texttt {false}\), we delete the edges \(v_iw_{i,j}\) for \(j \in \{\, 1,2,3\, \}\) and all edges from \(G_i\) to vertices in clause gadgets. It suffices to demonstrate that \({{\,\textrm{ND}\,}}_H(v,f) = 0\) for \(v \in \{\, v_1,\ldots ,v_n\,\} \cup \{\, w_{i,j}\mid 1 \le i \le n, 1 \le j \le 3\, \} \cup \{\,a_j\mid 1 \le j \le m\,\}\).

Suppose first that \(v = v_i\) for some i. If \(g(x_i) = \texttt {true}\) then

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f)&= \left( f(v_i) - \frac{f(u_i) + f(w_{i,1}) + f(w_{i,2}) + f(w_{i,3})}{4}\right) ^2 \\&= \left( 4 - \frac{7 + 3 + 3 + 3}{4}\right) ^2 = 0, \end{aligned}$$

and if \(g(x_i) = \texttt {false}\) then

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_i,f) = \left( f(v_i) - \frac{f(u_i) + f(z_i)}{2}\right) ^2 = \left( 4 - \frac{7 + 1}{2}\right) ^2 = 0, \end{aligned}$$

as required.

Now suppose that \(v = w_{i,\ell }\) for some \(1 \le i \le n\) and \(1 \le \ell \le 3\). If \(g(x_i) = \texttt {true}\) then \(w_{i,\ell }\) has precisely two neighbours in H, namely \(v_i\) and \(a_j\) for some \(j \in \{\, 1,\ldots ,m\, \}\). Thus we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(w_{i,\ell },f) = \left( f(w_{i,\ell }) - \frac{f(v_i) + f(a_j)}{2}\right) ^2 = \left( 3 - \frac{4 + 2}{2}\right) ^2 = 0. \end{aligned}$$

If, on the other hand, \(g(x_i) = \texttt {false}\), we have

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(w_{i,\ell },f) = \left( f(w_{i,\ell }) - \frac{f(w_{i,\ell }[1]) + f(w_{i,\ell }[2])}{2}\right) ^2 = \left( 3 - \frac{3+3}{2}\right) ^2 = 0. \end{aligned}$$

Finally, suppose that \(v = a_j\) for some \(j \in \{\, 1,\ldots ,m\, \}\). Since exactly one variable appearing in \(C_j\) evaluates to true under g, there is exactly one edge in H from \(a_j\) to a vertex belonging to a variable gadget; this unique neighbour of \(a_j\) in a variable gadget will be \(w_{i,\ell }\) for some \(i \in \{\,1,\ldots ,n\,\}\) and \(\ell \in \{\,1,2,3\,\}\). Thus

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(a_j,f) = \left( f(a_j) - \frac{f(b_j) + f(w_{i,\ell })}{2}\right) ^2 = \left( 2 - \frac{1 + 3}{2}\right) ^2 = 0, \end{aligned}$$

completing the proof that H is good. \(\square \)

Conversely, we now argue that the existence of a good subgraph H implies that \(\Phi \) is a yes-instance.

Lemma 4.5

Suppose that there is a good subgraph H. Then there is an assignment \(g:\{\,x_1,\ldots ,x_n\,\} \rightarrow \{\,\texttt {true},\texttt {false}\,\}\) such that every clause \(C_j\) contains precisely one true variable.

Proof

We begin by observing that, for any \(i \in \{\,1,\ldots ,n\,\}\), we have \({{\,\textrm{ND}\,}}_H(v_i,f) = 0\) if and only if either \(N_H(v_i) = \{\,u_i,z_i\,\}\) or \(N_H(v_i) = \{\,u_i,w_{i,1},w_{i,2},w_{i,3}\,\}\). We can therefore define an assignment \(g:\{\,x_1,\ldots ,x_n\,\}\rightarrow \{\,\texttt {true}, \texttt {false}\,\}\) by setting

$$\begin{aligned} g(x_i) = {\left\{ \begin{array}{ll} \texttt {true}&{} \text {if } N_H(v_i) = \{\,u_i,w_{i,1},w_{i,2},w_{i,3}\,\}\\ \texttt {false}&{} \text {if } N_H(v_i) = \{\, u_i,z_i\, \}. \end{array}\right. } \end{aligned}$$

We claim that, with this assignment, every clause \(C_j\) must contain exactly one true variable. To show that this is true, we suppose, for a contradiction, that the clause \(C_j = (x_{j_1},x_{j_2},x_{j_3})\) does not contain exactly one true literal. Note that, in order to have \({{\,\textrm{ND}\,}}_H(a_j,f) = 0\), as \(b_j \in N_H(a_j)\) for every good subgraph H, we must have \(N_H(a_j) = \{\,b_j, w_{i,\ell }\,\}\) for some \(i \in \{\,j_1,j_2,j_3\, \}\) and \(\ell \in \{\,1,2,3\, \}\).

Suppose first that \(g(x_{j_1}) = g(x_{j_2}) = g(x_{j_3}) = \texttt {false}\). Then, by definition of g, we see that \(w_{i,\ell }\) is not adjacent to \(v_i\) in H, so \(f(y) \le f(w_{i,\ell })\) for all \(y \in N_H(w_{i,\ell })\); it follows that, to achieve \({{\,\textrm{ND}\,}}_H(w_{i,\ell },f) = 0\), we must have \(f(y) = f(w_{i,\ell })\) for all \(y \in N_H(w_{i,\ell })\). Since \(f(a_j) = 2 < f(w_{i,\ell }) = 3\), it follows that \(a_j\) and \(w_{i,\ell }\) are not adjacent in H, giving the required contradiction.

Now suppose that at least two variables in \(C_j\) evaluate to true; it follows that there is some variable \(x_{j_{r}}\) in \(C_j\) such that \(g(x_{j_{r}}) = \texttt {true}\) but \(a_jw_{j_{r},\ell } \in E(G) {\setminus } E(H)\) for some \(\ell \in \{\, 1,2,3\, \}\). Thus \(N_H(w_{j_r,\ell }) \subseteq \{\,v_{j_r},w_{j_r,\ell }[1],w_{j_r,\ell }[2]\,\}\) and so \(f(y) \ge f(w_{j_r,\ell })\) for all \(y \in N_H(w_{j_r,\ell })\). Moreover, by definition of g, we know that \(v_{j_r} \in N_H(w_{j_r,\ell })\), where \(f(v_{j_r}) > f(w_{j_r,\ell })\); this gives \({{\,\textrm{ND}\,}}_H(w_{j_r,\ell },f) > 0\), a contradiction.

Thus, we can conclude that every clause \(C_j\) contains precisely one true variable. \(\square \)

Theorem 4.6

Average Value Neighbourhood Optimisation is NP-complete, even when restricted to input graphs G that are planar and have a maximum degree of five.

Proof

Membership in NP was observed above, and the instance \((G,f,k)\) can be constructed in polynomial time, is planar, and has maximum degree five; NP-hardness, and hence the result, then follows immediately from Lemmas 4.4 and 4.5. \(\square \)

4.2 Ideal and Near-Ideal Subgraphs

An obvious upper bound on \({{\,\textrm{score}\,}}(H, f)\) is given by \(\sum _{v\in V(H)} \ln d_H(v)\) (i.e., assume every vertex has zero neighbourhood discrepancy), so a natural question to ask is whether, for a given graph G and function f, a valid subgraph H of G can be found that achieves this bound. In such a graph, it must hold that \({{\,\textrm{ND}\,}}_H(v,f) = 0\) for every \(v\in V(H)\). We say such a graph H is f-ideal (or simply ideal, if f is clear from the context). We now show that a graph H is f-ideal if and only if the restriction of f to each connected component of H is a constant-valued function.

Observation 4.7

A graph H is f-ideal for a function \(f:V(H) \rightarrow {\mathbb {Q}}\) if and only if for each connected component \(C_i\) in H there exists some constant \(c_i\) such that \(f(v) = c_i\) for all \(v\in V(C_i)\).

Proof

Let P denote a path of maximal length in an f-ideal graph such that the weights of the vertices of P strictly increase along the path. If the graph contains an edge between two vertices of different weights, then P contains at least two distinct vertices; however, neither the first nor the last vertex of such a path can have zero neighbourhood discrepancy, since the average weight over its neighbours would then force a neighbour whose weight lies strictly outside the range of weights on P, contradicting the maximality of P. Thus, no such path on one or more edges can exist in an ideal graph, so a graph G is ideal if and only if for each connected component \(C_i\) in G there exists some constant \(c_i\) such that \(f(v) = c_i\) for all \(v\in V(C_i)\). \(\square \)

Thus, an ideal subgraph can be found by removing every edge uv with \(f(u) \ne f(v)\) (in \(O(|E|) = O(n^2)\) time), and if necessary we can then test in the same time whether the resulting graph has no isolated vertices (and thus is valid).
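A minimal sketch of this procedure, under the assumption that G is given as an adjacency dict and f as a dict of vertex weights (the names are our own):

```python
def ideal_subgraph(G, f):
    """Delete every edge whose endpoints receive different weights, then check
    that no vertex has been isolated (cf. Observation 4.7)."""
    H = {v: [u for u in G[v] if f[u] == f[v]] for v in G}
    if any(not H[v] for v in H):
        return None   # some vertex would be isolated: no ideal valid subgraph exists
    return H
```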

The proof of Observation 4.7 relies on maximal paths with increasing weights, so one might be tempted to relax the definition of ideal so that it only applies to vertices that are not leaves. We therefore say a graph H is f-near-ideal if \({{\,\textrm{ND}\,}}_H(v,f) = 0\) for every \(v\in V(H)\) with \(d_H(v) \ge 2\). In other words, we now allow non-zero neighbourhood discrepancy, but only at leaves. As a reminder, a subgraph H of G is valid if H is a spanning subgraph of G with \(d_H(v) \ge 1\) for all \(v\in V(H)\).

[Problem definition box: Near Ideal Subgraph]

While an ideal subgraph (if one exists) can be found quickly, it turns out that solving Near Ideal Subgraph is NP-complete, even on trees. We reduce from Subset Sum, which is NP-complete [13], and which we define as follows.

[Problem definition box: Subset Sum]
Fig. 5

Diagram of the gadget for the reduction from Subset Sum. The values inside the vertices are their associated weights

Given an instance \((S,k)\) of Subset Sum, we will construct a graph G and weight function \(f:V(G) \rightarrow {\mathbb {Q}}\) such that (G, f) has a near-ideal subgraph if and only if there is a solution to our instance of Subset Sum.

The graph G contains \(3n+3\) vertices labelled as follows:

  • \(v_t\) for the target value, \(v_s\) for a partial sum, and \(v_z\) for a pendant,

  • \(v_p^j\) for \(p\in \{\,1,\ldots ,n\,\}\) and \(j\in \{\,1,2,3\,\}\).

Vertex \(v_s\) is adjacent to vertices \(v_t\), \(v_z\), and \(v_p^1\) for \(p\in \{\, 1,\ldots ,n\, \}\). For each \(p\in \{\,1,\ldots ,n\,\}\), \(v_p^1\) is adjacent to \(v_p^2\), and \(v_p^2\) is adjacent to \(v_p^3\). This graph can be seen in Fig. 5. We define f as follows:

  • \(f(v_t) = -k\),

  • \(f(v_s) = f(v_z) = 0\), and

  • \(f(v^j_p) = S(p)\) for \(p\in \{\,1,\ldots ,n\,\}\), and for \(j\in \{\,1,2,3\,\}\).

Note that for the condition \(d_H(v) \ge 1\) to hold for our subgraph H, the only edges in G that might not be in H are of the form \(v_sv_p^1\) or \(v_p^1v_p^2\) for some \(p\in \{\,1,\ldots ,n\,\}\). Additionally, for any \(p\in \{\,1,\ldots ,n\,\}\), at most one of \(v_sv_p^1\) or \(v_p^1v_p^2\) can be removed.
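For illustration, the following sketch builds the pair (G, f) from a Subset Sum instance; the adjacency-dict representation and the exact vertex names are our own assumptions, with S given as the list of values \(S(1),\ldots ,S(n)\) and k the target (cf. Fig. 5).

```python
def build_near_ideal_instance(S, k):
    """Sketch of the construction above: vt is the target vertex, vs the partial
    sum, vz the pendant, and ('v', p, j) is the vertex v_p^j with weight S(p)."""
    G = {'vt': ['vs'], 'vz': ['vs'], 'vs': ['vt', 'vz']}
    f = {'vt': -k, 'vz': 0, 'vs': 0}
    for p, value in enumerate(S, start=1):
        G[('v', p, 1)] = ['vs', ('v', p, 2)]
        G[('v', p, 2)] = [('v', p, 1), ('v', p, 3)]
        G[('v', p, 3)] = [('v', p, 2)]
        G['vs'].append(('v', p, 1))
        f[('v', p, 1)] = f[('v', p, 2)] = f[('v', p, 3)] = value
    return G, f
```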

We are now ready to show that G has a near-ideal subgraph if and only if \((S,k)\) is a yes-instance for Subset Sum.

Theorem 4.8

Near Ideal Subgraph is NP-complete, even if the input graph G is a tree.

Proof

First, assume that G has an f-near-ideal subgraph H. Consider the vertex \(v_s\) and its neighbours in H. Since the leaf edges \(v_sv_t\) and \(v_sv_z\) must be present in H, the vertex \(v_s\) has degree at least 2 in H, so it must hold that \({{\,\textrm{ND}\,}}_H(v_s,f) = 0\), giving us

$$\begin{aligned} 0 =&\left( f(v_s) - \frac{\sum _{w\in N_H(v_s)}f(w)}{d_H(v_s)}\right) ^2. \end{aligned}$$

As \(f(v_s)=0\) and \(d_H(v_s) \ge 2\), this implies

$$\begin{aligned} 0 =&\sum _{w\in N_H(v_s)}f(w). \end{aligned}$$

The only possible neighbours of \(v_s\) are \(v_t\), \(v_z\), and vertices of the form \(v_u^1\); that is, \(N_H(v_s) = \{\,v_t,v_z\,\} \cup \{\,v_u^1\mid u\in U\,\}\) for some \(U\subseteq \{\,1,\ldots ,n\,\}\). This gives us

$$\begin{aligned} 0 = f(v_t) + 0 + \sum _{u\in U}f(v^1_u), \end{aligned}$$

and after rearranging and substituting in values of f we get

$$\begin{aligned} k = \sum _{u\in U} S(u), \end{aligned}$$

and thus U is a solution to our instance of Subset Sum.

Now, assume that U is a solution to Subset Sum (i.e., there exists a \(U\subseteq \{\,1,\ldots ,n\,\}\) such that \(\sum _{u\in U}S(u) = k\)). We will show that G contains an f-near-ideal subgraph. Let H be the subgraph containing the edges \(v_sv_t\), \(v_sv_z\), \(v_p^2v_p^3\) for \(p\in \{\,1,\ldots ,n\,\}\), \(v_p^1v_p^2\) for \(p\in \{\,1,\ldots ,n\,\}{\setminus } U\), and \(v_sv_u^1\) for \(u\in U\). Note that H has no isolated vertices.

The only vertices with degree at least two in H are \(v_s\), and vertices of the form \(v_p^2\) for \(p\in \{\,1,\ldots ,n\,\}{\setminus } U\). For any vertex of the form \(v_p^2\) for \(p\in \{\,1,\ldots ,n\,\}{\setminus } U\), \({{\,\textrm{ND}\,}}_H(v_p^2,f) = 0\), so the last vertex to examine is \(v_s\). As \(\sum _{u\in U} S(u) = k\), this implies

$$\begin{aligned} {{\,\textrm{ND}\,}}_H(v_s,f) =&\left( f(v_s) - \frac{\sum _{w\in N_H(v_s)}f(w)}{d_H(v_s)}\right) ^2 \\ =&\left( 0 - \frac{-k + f(v_z) + \sum _{u\in U}f(v^1_u)}{d_H(v_s)}\right) ^2 \\ =&\left( 0 - \frac{-k + \sum _{u\in U}S(u)}{d_H(v_s)}\right) ^2 \\ =&\left( - \frac{0}{d_H(v_s)}\right) ^2 \\ =&\; 0, \end{aligned}$$

and so H is f-near-ideal.

NP-hardness of Near Ideal Subgraph then follows from the NP-hardness of Subset Sum, and we obtain NP-completeness by using a near-ideal subgraph H as a certificate. \(\square \)

We note that while Subset Sum can be solved in pseudo-polynomial time with dynamic programming [14], this does not immediately give a pseudo-polynomial algorithm for Near Ideal Subgraph as there is no obvious bijection between an arbitrary instance of Near Ideal Subgraph and a graph of the form shown in Fig. 5.

5 Parameterised Results

In this section, we describe two parameterised algorithms for Correlation Subgraph Optimisation. We make use of the following definitions from parameterised complexity to describe these. A problem is in the fixed parameter tractable (or FPT) class with respect to some parameter k if the problem can be solved on inputs of size n in time \(f(k) \cdot n^{O(1)}\) for some computable function f. Note in particular that the exponent of n is constant and independent of k. Another class of parameterised problems is XP: a problem is in XP with respect to some parameter k if the problem can be solved on inputs of size n in time \(O(n^{f(k)})\). For problems in XP, the exponent of n may change for different values of k, but if an upper bound on k is given then there is also an upper bound for the exponent of n.

Both of the algorithms in this section solve the enumeration problem, rather than just the decision problem. This means that the algorithms output a complete list of all optimal solutions to the problem. For our problem, this will be a list of subgraphs that achieve the optimal score. As the number of solutions may be exponential in n, it is not possible to bound the running time of the whole algorithm by a polynomial in n. Instead, we follow Creignou et al. [15] and use precalculation, delay, and postcalculation times. The precalculation time is the time before the first result is output, the delay time is the time between any two successive outputs, and the postcalculation time is the time between the final output, and the termination of the algorithm. For further background on parameterised complexity, see [16], and for further background on parameterised enumeration, see [15, 17, 18].

Enumerating all subgraphs that optimally solve \(k\)-Correlation Subgraph Optimisation allows one to investigate the uncertainty of the spatial correlation, which can then be incorporated into modelling with a Bayesian modelling framework [8]. We define this enumeration problem as Enum-\(k\)-Correlation Subgraph Optimisation .

[Problem definition box: Enum-k-Correlation Subgraph Optimisation]

In Sect. 5.1, we show that all solutions to Enum-\(k\)-Correlation Subgraph Optimisation can be enumerated with XP precalculation time (where the parameter is the maximum degree of the graph plus the treewidth of the graph), linear delay, and linear postcalculation time. Then, in Sect. 5.2, we consider the more restricted case where G has maximum degree three, and show that with this restriction, all solutions to Enum-\(k\)-Correlation Subgraph Optimisation can be enumerated with FPT precalculation time, linear delay and linear postcalculation time, where the parameter is the maximum number of edges that are removed. We highlight that this restriction on the maximum degree occurs naturally in triangulations of surfaces, such as can occur when discretising spatial data [10].

5.1 An Exact XP Enumeration Algorithm Parameterised by Treewidth and Maximum Degree

In this section, we give an exact XP algorithm for solving Correlation Subgraph Optimisation on arbitrary graphs. To aid readability, we first show that the algorithm solves the decision problem, and then at the end of this section we explain how to extend it to solve the enumeration problem Enum-\(k\)-Correlation Subgraph Optimisation.

We use treewidth, first introduced in [19], as a parameter for this algorithm; we define it here for readability. To define treewidth, we first define a tree decomposition of a graph G. A tree decomposition of a graph G is a tuple \((T, {\mathcal {B}})\) where T is a tree and \({\mathcal {B}}\) is a function mapping vertices of T to subsets of the vertices of G such that

  1. for any \(v \in V(G)\), there is at least one vertex \(i\in V(T)\) such that \(v\in {\mathcal {B}}(i)\),

  2. for any edge \(uv\in E(G)\), there is at least one vertex \(i\in V(T)\) such that \(\{\,u,v\,\}\subseteq {\mathcal {B}}(i)\), and

  3. for any vertex \(v\in V(G)\), the subgraph of T induced on the vertex set \(\{\,i \mid i\in V(T) \wedge v \in {\mathcal {B}}(i)\,\}\) is connected.

For a given vertex \(i\in V(T)\), the set \({\mathcal {B}}(i)\) is often called the bag at i, and the set \(\{ \,{\mathcal {B}}(i) \mid i\in V(T)\,\}\) is collectively referred to as the bags of the tree decomposition. As is common when working with treewidth, we will slightly abuse notation and use the term bag to refer to both a vertex i of T, and also the set \({\mathcal {B}}(i)\); the distinction will always be clear from context. The width of a tree decomposition \((T, {\mathcal {B}})\) is given by \(\max \{\, |{\mathcal {B}}(i)| \mid i \in V(T)\,\} - 1\), and the treewidth of a graph G, written tw(G), is then the minimum width over all tree decompositions of G.
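As an illustration of these three conditions, the following sketch (our own helper, not part of the algorithm) checks whether a candidate pair (T, bags) is a tree decomposition of G and computes its width; G and T are adjacency dicts (T is assumed to already be a tree) and bags maps each vertex of T to a set of vertices of G.

```python
def is_tree_decomposition(G, T, bags):
    """Check the three tree-decomposition conditions above."""
    # Condition 1: every vertex of G appears in some bag.
    if any(all(v not in bags[i] for i in T) for v in G):
        return False
    # Condition 2: every edge of G is contained in some bag.
    for u in G:
        for v in G[u]:
            if not any(u in bags[i] and v in bags[i] for i in T):
                return False
    # Condition 3: for each vertex v of G, the bags containing v induce a connected subtree.
    for v in G:
        nodes = {i for i in T if v in bags[i]}
        stack, seen = [next(iter(nodes))], set()
        while stack:
            i = stack.pop()
            if i not in seen:
                seen.add(i)
                stack.extend(j for j in T[i] if j in nodes)
        if seen != nodes:
            return False
    return True

def width(bags):
    """Width of the decomposition: one less than the size of its largest bag."""
    return max(len(b) for b in bags.values()) - 1
```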

Our result uses the concept of a nice tree decomposition. A nice tree decomposition is a tree decomposition in which one leaf bag is selected as the root bag (so that the children of a bag \(\nu \) are the bags adjacent to \(\nu \) that are further from the root), each leaf bag is empty, and each non-leaf bag is an introduce bag, a forget bag, or a join bag, defined as follows. An introduce bag \(\nu \) has exactly one child \(\mu \), and \(\nu \) contains every element in \(\mu \) as well as precisely one more element. A forget bag \(\nu \) has exactly one child \(\mu \), and \(\nu \) contains every element in \(\mu \) except one. A join bag \(\lambda \) has exactly two children \(\mu \) and \(\nu \), and \(\lambda \), \(\mu \), and \(\nu \) all have precisely the same elements. In particular, we also draw attention to the fact that if G has treewidth t, then G also has a nice tree decomposition \((T,{\mathcal {B}})\) of width t such that T has \(O(t|V(G)|)\) vertices [16, Theorem 7.4]. See [16], in particular Chapter 7, for a more in-depth introduction to tree decompositions and treewidth, a formal definition of nice tree decompositions, and methods for constructing them efficiently.

Theorem 5.1

Correlation Subgraph Optimisation can be solved in time

$$\begin{aligned} O(2^{2\Delta (G)(tw(G)+1)}\cdot n^{2\Delta (G) + 1}). \end{aligned}$$

The algorithm follows fairly standard dynamic programming techniques on tree decompositions, and our enumeration result follows from tracking which states lead to which.

The algorithm starts by finding a nice tree decomposition T of G with treewidth tw(G) that is rooted at some arbitrary leaf bag. In this section, we will assume that both \(n \ge 2\) and \(\Delta (G) \ge 2\). If either of these conditions does not hold, then there is at most one valid subgraph to consider (G itself), so one can simply calculate and return the value of \({{\,\textrm{score}\,}}(G, f)\).

We first define some specific terminology that will be useful when describing the algorithm. Let \((T,{\mathcal {B}})\) be a tree decomposition (not necessarily nice) with an arbitrary bag labelled as the root. For each bag \(\nu \), denote by \(G_\nu \) the subgraph of G induced by the vertices that appear in bags below \(\nu \) but do not appear in \(\nu \). The set of edges between a vertex in \(\nu \) and a vertex in \(G_\nu \) will be important to our algorithm, so we write \(E_\nu = \{ \,uv \in E(G) \mid u\in \nu \wedge v \in G_\nu \,\}\) for the set of edges with one endpoint in \(G_\nu \) and the other in \(\nu \). An example of a graph, a tree decomposition, \(G_\nu \), and \(E_\nu \) is shown in Fig. 6.

Our algorithm will process each bag, from the leaves towards the root, determining a set of states for each bag such that we can guarantee that the optimal solution will correspond to a state in the root bag.

Fig. 6

From left to right, we have a graph G with a tree decomposition displayed by circling vertices, the tree indexing a tree decomposition of G drawn as a graph with the root and the bag \(\nu \) labelled, the graph \(G_\nu \) consisting of the induced subgraph on vertices D and E, and the set of edges \(E_\nu = \{\,CD, EF\,\}\) (i.e., the edges of G that are between a vertex in \(G_\nu \) and a vertex in \(\nu \))

Definition 5.2

Given a bag \(\nu \) of a tree decomposition of a graph G and a set of edges \(I \subseteq E_\nu \), define \({\mathcal {G}}'_{\nu ,I}\) to be the set of graphs \(G'\) with:

  • \(V(G') = V(G_\nu ) \cup \nu \),

  • \(E(G') \subseteq E(G)\), and

  • for any edge \(uv \in E_\nu \), \(uv \in E(G')\) if and only if \(uv \in I\).

Definition 5.3

Given a bag \(\nu \) of a nice tree decomposition of a graph G, the set of all valid states at \(\nu \) is

$$\begin{aligned} S_\nu = \{ \,(I, D) \mid&\, I \subseteq E_\nu \text {, } {\mathcal {G}}'_{\nu , I} \ne \emptyset \text {, and there exists a graph } H \in {\mathcal {G}}'_{\nu ,I} \\&\text { with } D = \sum _{v\in V(G_\nu )} \ln d_H(v) \text { and } d_H (v) \ge 1 \;\forall v \in V(G_\nu )\,\}.\end{aligned}$$

We often discuss the graph H that corresponds to a state, which we now define.

Definition 5.4

Given a bag \(\nu \) of a nice tree decomposition of a graph G, a graph H corresponds to a valid state (I, D) at \(\nu \) if \(H\in {\mathcal {G}}'_{\nu ,I}\), \(D = \sum _{v\in V(G_\nu )} \ln d_H (v)\), and \(d_H(v) \ge 1 \;\forall v\in V(G_\nu )\).

Note that a graph may correspond to at most one state, while there may be multiple graphs which all correspond to the same state.

In this section, we use G (respectively \(G_\nu \)) to refer to the original graph in the problem (respectively an induced subgraph thereof, as defined earlier), and we will use H, or \(H_\nu \) for a given bag \(\nu \) of a nice tree decomposition of G, to refer to a graph that is a subgraph of \(G_\nu \) corresponding to a state.

Given a graph G with a nice tree decomposition containing a bag \(\nu \), a function \(f:E(G) \rightarrow {\mathbb {Q}}\) on G, and a subgraph H of G, let \(H_\nu \) be the subgraph of H induced by \(V(G_\nu )\). Then we can write \({{\,\textrm{score}\,}}(H, f)\) as

$$\begin{aligned}&{{\,\textrm{score}\,}}(H, f) =\sum _{v\in V(G) \setminus V(G_\nu )} \ln d_H(v) + \sum _{v\in V(G_\nu )} \ln d_H(v) \\ {}&\quad - n \ln \left[ \sum _{v\in V(G) \setminus V(G_\nu )} d_H(v) {{\,\textrm{ND}\,}}_H(v,f)+ \sum _{v\in V(G_\nu )} d_H(v) {{\,\textrm{ND}\,}}_H(v,f) \right] . \end{aligned}$$

Using the above, for a given state (I, D) of the bag \(\nu \), if we know the minimum value of \(\sum _{v\in V(G_\nu )} d_{H'_\nu }(v) {{\,\textrm{ND}\,}}_H(v,f)\) over all subgraphs \(H'_\nu \) that correspond to (I, D), then we can calculate the maximum value that can be achieved by \({{\,\textrm{score}\,}}(H', f)\) over all graphs \(H'\) that correspond to (I, D): we iterate over all induced graphs \(H'\) on the vertex set \(V(G) {\setminus } V(G_\nu )\) that agree with I, calculate the value of \({{\,\textrm{score}\,}}(H',f)\) for each, and keep only the maximum.

Towards this end, our algorithm will store, for a given state (I, D), the minimum value of \(\sum _{v\in V(G_\nu )} d_{H'_\nu }(v) {{\,\textrm{ND}\,}}_H(v,f)\) over all subgraphs \(H'_\nu \) of \(G_\nu \) that correspond to (I, D). We will use \(N_\nu (I, D)\) to refer to this minimum value.

Definition 5.5

Given a graph G, a nice tree decomposition T of G, a bag \(\nu \) of T, and a state (I, D) at \(\nu \), \(N_\nu (I,D)\) is the minimum value of \(\sum _{v\in V(G_\nu )} d_{H'_\nu }(v) {{\,\textrm{ND}\,}}_H(v,f)\) taken over all subgraphs \(H'_\nu \) of \(G_\nu \) where \(H'_\nu \) corresponds to the state (I, D).

We now prove an upper bound on the number of valid states in any one bag.

Lemma 5.6

Given a graph G with n vertices and a nice tree decomposition T of G, every bag of T will have at most \(2^{\Delta (G)(tw(G)+1)} \cdot n^{\Delta (G)}\) states.

Proof

Let \(\nu \) be an arbitrary bag of T. Given an arbitrary state (I, D) of \(\nu \), we see that I may contain up to \(\Delta (G)\) edges per vertex \(v \in \nu \), and \(\nu \) contains at most \(tw(G)+1\) vertices, giving at most \(2^{\Delta (G)(tw(G)+1)}\) possibilities for I. Additionally, the value of D is determined by the number of vertices of a corresponding subgraph \(H_\nu \) that have each possible degree; each such degree lies between 1 and \(\Delta (G)\), and each of these \(\Delta (G)\) counts is at most n, giving at most \(n^{\Delta (G)}\) possibilities for D, and thus the result follows. \(\square \)

We now discuss the process of determining states at each type of bag in a nice tree decomposition (leaf, introduce, forget, and join). We begin by observing that there is only one state \((\emptyset , 0)\) at each leaf bag (apart from the root), and that the states at an introduce bag \(\nu \) are exactly the states of the child of \(\nu \). This leaves forget and join bags, which are slightly more involved.

We now give the algorithm for determining states at a forget bag. For a bag \(\nu \) that forgets a vertex v, this involves taking each state at the child of \(\nu \) together with each subset of the edges of G incident to v. If the state (which records information about edges between vertices already forgotten and vertices not yet forgotten) is consistent with the chosen subset (i.e., there is no edge incident to v that the state assumes to be removed but that is present in the subset, or vice versa), then a new state is created at the forget bag.

Algorithm 1

Process forget bags. The arguments for this procedure are a forget bag \(\nu \), the states \(S_\mu \) of its child \(\mu \), and a function \(f:E(G) \rightarrow \mathbb {Q}\)
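To illustrate the forget-bag update described above, the following sketch (ours, and only schematically related to Algorithm 1) combines each child state with each consistent set of kept edges at the forgotten vertex; the callback `contribution` stands in for the term \(d_H(v){{\,\textrm{ND}\,}}_H(v,f)\) that the forgotten vertex contributes to \(N_\nu \), and all names are illustrative.

```python
import math
from itertools import chain, combinations

def subsets(edges):
    edges = list(edges)
    return chain.from_iterable(combinations(edges, r) for r in range(len(edges) + 1))

def process_forget_bag(v, child_states, incident_edges, already_forgotten, contribution):
    """Illustrative sketch of the forget-bag update (not the paper's Algorithm 1).

    child_states      : dict mapping each child state (I, D) to its value N(I, D)
    incident_edges    : edges of G incident to the forgotten vertex v (frozensets)
    already_forgotten : the vertex set of G_mu (vertices forgotten strictly below)
    contribution      : callback returning d_H(v) * ND_H(v, f) once the edges
                        kept at v are fixed (the quantity defined in Sect. 2.2)
    """
    # edges from v to vertices forgotten strictly below this bag
    to_forgotten = {e for e in incident_edges if e & already_forgotten}
    new_states = {}
    for (I_mu, D_mu), N_mu in child_states.items():
        for kept in subsets(incident_edges):
            kept = frozenset(kept)
            if not kept:                     # the forgotten vertex may not be isolated
                continue
            # consistency: edges towards already-forgotten vertices must agree with I_mu
            if (kept & to_forgotten) != (I_mu & to_forgotten):
                continue
            # edges of I_mu at v now join two forgotten vertices and leave E_nu;
            # kept edges from v towards the bag enter E_nu
            I_nu = frozenset((I_mu - to_forgotten) | (kept - to_forgotten))
            D_nu = D_mu + math.log(len(kept))
            N_nu = N_mu + contribution(v, kept)
            key = (I_nu, D_nu)
            if key not in new_states or N_nu < new_states[key]:
                new_states[key] = N_nu       # keep the minimum N value per state
    return new_states
```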

Lemma 5.7

Given a graph G, a nice tree decomposition \((T,{\mathcal {B}})\) of G, a forget bag \(\nu \) of \((T,{\mathcal {B}})\) with child \(\mu \) with states \(S_\mu \), and a function \(f:E(G) \rightarrow {\mathbb {Q}}\), running Algorithm 1 with parameters \(\nu \), \(S_\mu \), and f correctly determines the possible states \(S_\nu \) as per Definition 5.3.

Proof

We will first show that any state in \(S_\nu \) (as per Definition 5.3) will be found by Algorithm 1, and then show that any state found by the algorithm is valid per Definition 5.3. Following the variable labelling of Algorithm 1, let v be the vertex forgotten and let \(\mu \) be the child of \(\nu \).

Let \(s_\nu =(I_\nu ,D_\nu ) \in S_\nu \) be a state at \(\nu \) as per Definition 5.3. We will show that Algorithm 1 will find \(s_\nu \). Let H be a subgraph corresponding to \(s_\nu \), and let \(H_\mu \) be the induced subgraph of H with \(V(H_\mu ) = V(H) \setminus \{\,v\,\}\). Letting \(I_\mu = E(H_\mu ) \cap E_\mu \), we see that \(H_\mu \in {\mathcal {G}}'_{\mu ,I_\mu }\), so there must be some state \(s_\mu = (I_\mu , D_\mu )\) in \(S_\mu \) corresponding to \(H_\mu \), and as Line 5 iterates over all states in \(S_\mu \), the state \(s_\mu \) must be considered by Algorithm 1. Let E be the set of edges in H incident with v. By the definition of H corresponding to \(s_\nu \), it must be that \(|E| \ge 1\). It follows that Line 6 must consider the subset E as a set of edges, and as both E and \(I_\mu \) were defined from E(H), this set must satisfy the condition on Line 7. Therefore it must be that Algorithm 1 calculates and adds \((I_\nu , D_\nu )\) to \(S_\nu \), and calculates and updates \(N_\nu (I_\nu , D_\nu )\).

To see that no extraneous states are found, we consider an arbitrary state \((I_\nu , D_\nu )\) generated by Algorithm 1. It must be added at Line 14, so there is some \((I_\mu , D_\mu )\in S_\mu \) and some set E of edges of G incident with v such that, for every edge \(uv \in E_\mu \) incident with v, \(uv \in E\) if and only if \(uv\in I_\mu \). Then there must be some graph \(H_\mu \in {\mathcal {G}}'_{\mu , I_\mu }\) that corresponds to \((I_\mu , D_\mu )\). Now consider the graph \(H_\nu \) defined by \(V(H_\nu ) = V(G_\nu )\) and \(E(H_\nu ) = E(H_\mu ) \cup E\). Any edge \(uw\in E(H_\nu )\) with \(u\in G_\nu \), \(w\not \in G_\nu \) and \(w\ne v\) must be in \(I_\mu \) as \((I_\mu , D_\mu ) \in S_\mu \), and such edges are then in \(I_\nu \). Any edge \(uv\in E(H_\nu )\) with \(u\in G_\nu \) is clearly not in \(I_\nu \), and any edge \(vw\in E(H_\nu )\) with \(w\not \in G_\nu \) is in \(I_\nu \). As no other edges are added to \(I_\nu \), we see that \(I_\nu \subseteq E_\nu \), so \(H_\nu \in {\mathcal {G}}'_{\nu , I_\nu }\). We now only have to check that \(\deg u \ge 1\) for all \(u\in V(H_\nu )\). For any \(u\in V(H_\mu )\), this holds as \((I_\mu , D_\mu )\in S_\mu \), and as \(V(H_\nu ) {\setminus } V(H_\mu ) = \{\,v\,\}\) and \(|E| \ge 1\) by Line 6, we see that \((I_\nu , D_\nu )\) is a state in \(S_\nu \) by Definition 5.3. \(\square \)

Lemma 5.8

Given a graph G on n vertices, a nice tree decomposition \((T,{\mathcal {B}})\) of G, a forget bag \(\nu \) of \((T,{\mathcal {B}})\) with child \(\mu \) with states \(S_\mu \), and a function \(f:E(G) \rightarrow {\mathbb {Q}}\), running Algorithm 1 with parameters \(\nu \), \(S_\mu \), and f takes \(O(2^{\Delta (G)(tw(G)+2)}\cdot n^{\Delta (G)})\) time.

Proof

The loop beginning on line 5 will run \(O(2^{\Delta (G)(tw(G)+1)}\cdot n^{\Delta (G)})\) times as, by Lemma 5.6, there are \(O(2^{\Delta (G)(tw(G)+1)}\cdot n^{\Delta (G)})\) states in any bag, and the loop beginning on line 6 will run \(O(2^{\Delta (G)})\) times as it ranges over subsets of the set of edges of G incident with v, which has size at most \(\Delta (G)\). Together, these give the result. \(\square \)

Algorithm 2

Process join bags. The parameters for this procedure are a join bag \(\lambda \) and the sets of states \(S_\nu , S_\mu \) of its two children \(\nu \) and \(\mu \)
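Since the children of a join bag cover vertex-disjoint forgotten parts of G, their states simply combine additively; the following sketch (ours, not the paper's Algorithm 2) records the combined states and keeps the minimum N value for each, with all names illustrative.

```python
def process_join_bag(states_left, states_right):
    """Illustrative sketch of the join-bag update (not the paper's Algorithm 2).

    Each argument maps a state (I, D) to its value N(I, D) at the corresponding
    child; since the children's forgotten parts of G are vertex-disjoint, edge
    sets union, degree sums add, and penalty contributions add."""
    new_states = {}
    for (I_left, D_left), N_left in states_left.items():
        for (I_right, D_right), N_right in states_right.items():
            key = (I_left | I_right, D_left + D_right)
            value = N_left + N_right
            if key not in new_states or value < new_states[key]:
                new_states[key] = value
    return new_states
```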

Lemma 5.9

Given a graph G, a nice tree decomposition \((T,{\mathcal {B}})\) of G, a join bag \(\lambda \) of \((T,{\mathcal {B}})\) with children \(\nu \) and \(\mu \) with corresponding states \(S_\nu \) and \(S_\mu \), and a function \(f:E(G) \rightarrow {\mathbb {Q}}\), running Algorithm 2 with parameters \(\lambda \), \(S_\nu \), and \(S_\mu \) correctly determines the possible states of \(\lambda \) as per Definition 5.3.

Proof

We again proceed by first showing that Algorithm 2 does correctly find every valid state, and then show that any state found by Algorithm 2 is valid. Let \(\nu \) and \(\mu \) be the two children of \(\lambda \).

Consider a state \(s_\lambda =(I_\lambda ,D_\lambda )\in S_\lambda \). We want to show that Algorithm 2 does indeed find \(s_\lambda \). Let \(H_\lambda \) be a subgraph of \(G_\lambda \) corresponding to \(s_\lambda \), let \(H_\nu \) be the induced subgraph of \(H_\lambda \) with \(V(H_\nu ) = V(G_\nu )\), and let \(H_\mu \) be the induced subgraph of \(H_\lambda \) with \(V(H_\mu ) = V(G_\mu )\).

For any edge \(uv \in I_\lambda \) (where \(v\in G_\lambda \) and \(u\in \lambda \)), it must be that \(v\not \in \lambda \). Then v must be forgotten by some bag below \(\lambda \), and as each vertex is forgotten exactly once in a nice tree decomposition, without loss of generality, we can say that v must have been forgotten in the subtree rooted at \(\nu \). From the definition of a nice tree decomposition, this means that \(v\not \in \nu \) as well while \(u\in \nu \), so \(uv \in I_\nu \). By considering all edges in \(I_\lambda \) in turn, we see that \(I_\lambda \) can be partitioned into \(I_\nu \) and \(I_\mu \), where \(I_\nu \) only contains edges of the form uv where v was forgotten in the subtree rooted at \(\nu \) and \(I_\mu \) only contains edges of the form uw where w was forgotten in the subtree rooted at \(\mu \). As Algorithm 2 considers each possible state in \(S_\nu \) (Line 3) and each possible state in \(S_\mu \) (Line 4), Algorithm 2 will find \(I_\lambda \). Let \(D_\nu = \sum _{v\in V(H_\nu )} \ln \deg _{H_\nu } v\), and let \(D_\mu = \sum _{v\in V(H_\mu )} \ln \deg _{H_\mu } v\). Then \((I_\nu , D_\nu )\) is a state of \(\nu \), and \((I_\mu , D_\mu )\) is a state of \(\mu \). As \(V(G_\lambda ) = V(G_\nu ) \cup V(G_\mu )\) and \(V(G_\nu ) \cap V(G_\mu ) = \emptyset \), we have \(D_\lambda = D_\nu + D_\mu \) and so Algorithm 2 does create and store \((I_\lambda , D_\lambda )\). Additionally, as each vertex forgotten below \(\lambda \) is either forgotten at a bag below \(\mu \) or forgotten at a bag below \(\nu \) (but not both), we see that the value of \(N_\lambda (I_\lambda , D_\lambda )\) is correctly calculated and stored.

We now show that any state found by Algorithm 2 is valid according to Definition 5.3. Let \(s_\lambda = (I_\lambda , D_\lambda )\) be an arbitrary state created by Algorithm 2. It must be added at Line 11, so the algorithm will consider two valid states \((I_\nu , D_\nu )\) and \((I_\mu , D_\mu )\) of \(S_\nu \) and \(S_\mu \) respectively, where \(I_\lambda = I_\nu \cup I_\mu \) and \(D_\lambda = D_\nu + D_\mu \). Consider the graphs \(H_\nu \) and \(H_\mu \) corresponding to \((I_\nu , D_\nu )\) and \((I_\mu , D_\mu )\), respectively. As we have a tree decomposition, we know that these two graphs are vertex-disjoint. Any edge \(uv \in I_\lambda \) must therefore, without loss of generality, satisfy \(u\in G_\nu \) and \(v\not \in G_\nu \). As \(\lambda \) is a join bag, we then have \(v\in \lambda \) and so \(v\not \in G_\lambda \), giving \(I_\lambda \subseteq E_\lambda \). Let \(H_\lambda \) be the union of \(H_\nu \) and \(H_\mu \); then, as \(H_\nu \) and \(H_\mu \) are vertex-disjoint and \(V(H_\lambda ) = V(H_\nu ) \cup V(H_\mu )\), we must have \(\deg _{H_\lambda } v \ge 1\) for any \(v\in V(H_\lambda )\). We also see that \(H_\lambda \in {\mathcal {G}}'_{\lambda , I_\lambda }\), and as \(H_\nu \) and \(H_\mu \) are vertex-disjoint we also know that \(D_\lambda = D_\nu + D_\mu = \sum _{v\in V(H_\lambda )} \ln \deg _{H_\lambda } v\), so \(s_\lambda \) corresponds to \(H_\lambda \) and \(s_\lambda \) is a valid state as per Definition 5.3. \(\square \)

Lemma 5.10

Given a graph G on n vertices, a nice tree decomposition \((T,{\mathcal {B}})\) of G, a join bag \(\lambda \) of \((T,{\mathcal {B}})\) with children \(\nu \) and \(\mu \) with corresponding states \(S_\nu \) and \(S_\mu \), and a function \(f:E(G) \rightarrow {\mathbb {Q}}\), running Algorithm 2 with parameters \(\lambda \), \(S_\nu \), and \(S_\mu \) takes \(O(2^{2\Delta (G)(tw(G)+1)}\cdot n^{2\Delta (G)+1})\) time.

Proof

Algorithm 2 considers all possible pairs of states from its two children, giving a runtime of \(O(2^{2\Delta (G)(tw(G)+1)}\cdot n^{2\Delta (G)})\) per join bag. As a nice tree decomposition of a graph with n vertices has O(n) bags [16], the total time spent processing join bags over the whole of Algorithm 4 is therefore \(O(2^{2\Delta (G)(tw(G)+1)} \cdot n^{2\Delta (G)+1})\). \(\square \)

Algorithm 3

Process a bag \(\lambda \).

Algorithm 4

Find the score of an optimal subgraph. The parameters to this procedure are a graph G with maximum degree \(\Delta \), a nice tree decomposition T of G with root bag r where the width of T is tw, and a function \(f:E(G) \rightarrow \mathbb {Q}\)

We are now ready to prove Theorem 5.1, which we restate here for convenience.

Theorem 5.1

Correlation Subgraph Optimisation can be solved in time

$$\begin{aligned} O(2^{2\Delta (G)(tw(G)+1)}\cdot n^{2\Delta (G) + 1}). \end{aligned}$$

Proof

We see that Algorithm 4 will calculate an optimal solution to Correlation Subgraph Optimisation. The correctness of this algorithm follows from Definition 5.3 and Lemmas 5.7 and 5.9, and the runtime complexity follows from Lemmas 5.8 and 5.10. \(\square \)

To be able to enumerate all solutions, we need to track which states lead to which other states, as well as all sets of edges E (as defined by the loop starting on line 6) which obtain the optimal value of \(N(I_\nu , D_\nu )\) when creating each new state, while processing bags. This adds only a constant factor to the running time of our algorithms as all edge sets are already considered. This can be used to create, for any state (I, D) corresponding to an optimal solution, a tree rooted at (I, D) such that the children of a state s correspond precisely to those states that can lead to s. The height of this tree is at most tw(G), but the reconstruction of each optimal subgraph takes \(O(n^2)\) time as G (and indeed H) may have \(O(n^2)\) edges.

As we store each valid state exactly once, and each optimal solution corresponds to exactly one state in the root node, we will not generate duplicates in this process.
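As an illustration of this bookkeeping, the following sketch (ours; the names and exact data layout are assumptions) backtracks over recorded predecessor links to recover every set of kept-edge choices behind an optimal state.

```python
from itertools import product

def reconstruct(state, predecessors):
    """Illustrative backtracking over predecessor links recorded while processing
    bags.  predecessors[s] is a list of alternatives; each alternative is a pair
    (children, kept_edges), where children are the child states that led to s
    (one for a forget bag, two for a join bag) and kept_edges are the edges kept
    at the forgotten vertex (empty for a join bag).  Returns all sets of kept
    edges that realise `state`."""
    alternatives = predecessors.get(state)
    if not alternatives:                      # leaf state: nothing chosen below
        return [frozenset()]
    results = []
    for children, kept_edges in alternatives:
        child_choices = [reconstruct(c, predecessors) for c in children]
        for combo in product(*child_choices):
            results.append(frozenset(kept_edges).union(*combo))
    return results
```

Under this (illustrative) representation, the union of the kept edge sets over all forget steps is exactly the edge set of the corresponding optimal subgraph.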

Theorem 5.11

For an integer \(k\ge 1\), all solutions to Enum-\(k\)-Correlation Subgraph Optimisation can be output on arbitrary graphs with a precalculation time of \(O(2^{2\Delta (G)(tw(G)+1)} \cdot n^{2\Delta (G)+1})\), with \(O(n^2)\) delay, and with \(O(n^2)\) postcalculation time.

5.2 Parameterisation by Maximum Number of Edges to Remove in Graphs with Maximum Degree Three

We now study the problem when G has maximum degree three, where we want to bound the maximum number of edges that can be removed. Restricting G to graphs of maximum degree three is of interest as the dual graph of any triangulation has maximum degree three and triangulations can be used to represent discretised surfaces [10, 20, 21]. For this setting, we make use of the degree sequence of a graph. Given a graph G, the degree sequence of G is the ordered list of the degree of each vertex of G in non-increasing order.

That is, given a graph G with vertices \(\{\,v_1,\ldots , v_n\,\}\) where \(d(v_i) \ge d(v_{i+1})\) for \(i\in \{\,1,\ldots , n-1\,\}\), the degree sequence of G is

$$\begin{aligned} (d(v_1), d(v_2), \ldots , d(v_n)). \end{aligned}$$

Given a graph G, and a valid subgraph H of G, we will use R to denote the graph \(G\setminus E(H)\).

Theorem 5.12

For an integer \(k\ge 1\), all solutions to Enum-\(k\)-Correlation Subgraph Optimisation can be output on graphs with maximum degree three with a precalculation time of \(2^{k(2 \log k + O(1))} n \log n\), with O(n) delay, and with O(n) postcalculation time.

We achieve this result by considering in turn each possibility for R, and for each such graph R we consider in turn the possibilities for the degree sequence of \(G\setminus E(R)\). The number of distinct graphs R that must be considered, and the number of degree sequences of \(G\setminus R\), both depend only on k and are independent of n. In this setting G has maximum degree three, and \(G\setminus R\) must have minimum degree at least one to be a valid subgraph, so R therefore has maximum degree two and consists only of paths and cycles. As a result, R has treewidth at most two. We can therefore adapt well-known colour-coding methods (see [22, Section 13.3] for more details) for finding subgraphs with bounded treewidth in FPT time so that we can identify a subgraph R in G whose removal gives the biggest improvement to the neighbourhood discrepancy term while still maintaining the correct degree sequence of \(G{\setminus } R\). By storing the appropriate data when finding these optimal subgraphs with bounded treewidth, and also tracking which subgraphs R and which degree sequences of R lead to optimal values of the score function, we are able to rebuild all embeddings that achieve this optimum.

We begin with some notation. Let t be the number of connected components of R. As R has maximum degree 2, we will write \(R = \bigcup _{a=1}^t R^a\) where each \(R^a\) is connected (i.e., each \(R^a\) is either a path or a cycle). For each a, let the vertices of \(R^a\) be \(\{\,r^a_1,\ldots ,r^a_\ell \,\}\) such that for \(j\in \{\,1,\ldots ,\ell -1\,\}\) there is an edge from \(r^a_j\) to \(r^a_{j+1}\) (if \(R^a\) is a cycle, there is also an edge from \(r^a_1\) to \(r^a_\ell \)). For an integer j, let \(R^a_j\) be the induced subgraph of \(R^a\) on vertices \(\{\,r^a_1,\ldots ,r^a_j\,\}\). We will use \(\psi \) to denote proper embeddings of \(R^a\) (or subgraphs thereof) into G (i.e., it must hold that for any \(u,v \in V(R)\), \(uv\in E(R) \implies \psi (u)\psi (v) \in E(G)\)). We will abuse notation slightly to say, for an arbitrary subgraph \(R'\) of R, that \(E(\psi (R')) = \{\, \psi (u)\psi (v) \mid uv \in E(R')\,\} \) is the set of edges of G onto which the edges of \(R'\) are mapped under \(\psi \).

Recalling that the neighbourhood discrepancy portion of the objective function has a negative coefficient (i.e., is a penalty), for a set of edges \(F \subset E(G)\) such that \(G{\setminus } F\) has minimum degree 1, let

$$\begin{aligned} I(v, F) {:=}d_G(v) {{\,\textrm{ND}\,}}_G (v,f) - d_{G\setminus F}(v) {{\,\textrm{ND}\,}}_{G\setminus F} (v,f). \end{aligned}$$

In other words, I(v, F) is the improvement to the neighbourhood discrepancy portion of our objective function from the vertex v that is obtained by removing from G the edges that are in the set F. Note that if F contains edges not incident to v, then those additional edges do not alter the value of I(v, F).
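As a concrete reading of this definition, the following sketch computes I(v, F) from an adjacency-set representation of G; the helper `nd` stands for the neighbourhood discrepancy \({{\,\textrm{ND}\,}}\) of Sect. 2.2 and is assumed to be supplied, so this is illustrative rather than a complete implementation.

```python
def improvement(v, F, G, f, nd):
    """I(v, F): the drop in the penalty contribution of v when the edges in F
    are removed from G (illustrative sketch; `nd(graph, v, f)` stands for the
    neighbourhood discrepancy ND defined in Sect. 2.2 and is assumed given).
    G maps each vertex to the set of its neighbours; edges are frozensets."""
    H = {u: {w for w in nbrs if frozenset((u, w)) not in F}
         for u, nbrs in G.items()}
    return len(G[v]) * nd(G, v, f) - len(H[v]) * nd(H, v, f)
```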

For this next result we need to use integer partitions. A multiset \(\{\, p_1,\ldots ,p_q\,\}\) is an integer partition of k if, for each \(i\in \{\,1,\ldots ,q\, \}\), \(p_i\) is a positive integer and \(\sum _{i=1}^{q} p_i = k\).

Our first result shows that the number of subgraphs R that must be considered is at worst exponential in k and independent of n. This bound is not tight, but sufficient for our purpose.

Lemma 5.13

Given an integer k, there are \(O(2^k e^{\sqrt{k}})\) non-isomorphic graphs R such that \(|E(R)| \le k\) and \(\Delta (R) \le 2\).

Proof

The number of integer partitions of k, and therefore an upper bound on the number of ways to group k edges into components, is \(O(e^{\sqrt{k}})\) by [23, Equation 5.1.2]. Any such grouping has at most k components, and each component must be either a cycle or a path, giving \(O(2^k e^{\sqrt{k}})\) as required. \(\square \)

We note that we can take an integer partition of k into parts \(\{\, p_1, \ldots , p_q\, \}\) for some integer q, and any non-negative integer \(c < 2^q\), and construct a corresponding graph R as follows. Letting \(c_1c_2\ldots c_{q}\) be the binary representation of c, for each \(i\in [q]\), if \(c_i = 1\) then add to R a path with \(p_i\) edges (on \(p_i + 1\) fresh vertices), and if \(c_i = 0\) then add to R a cycle with \(p_i\) edges (on \(p_i\) fresh vertices). This lets us generate non-isomorphic graphs R in time dependent on k but independent of n, which is sufficient for our purposes.
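The construction just described might be realised as follows; this is an illustrative sketch (it enumerates edge sets rather than abstract graphs, and may emit some isomorphic repeats when equal parts receive different path/cycle labels, which is harmless for our purposes).

```python
from itertools import product

def partitions(k, maximum=None):
    """Yield the integer partitions of k as non-increasing tuples."""
    maximum = k if maximum is None else maximum
    if k == 0:
        yield ()
        return
    for first in range(min(k, maximum), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def candidate_graphs(k):
    """Yield candidate edge sets for R with at most k edges and maximum degree
    two: every component is a path or a cycle (cycles need at least 3 edges)."""
    for total in range(1, k + 1):
        for parts in partitions(total):
            for bits in product((0, 1), repeat=len(parts)):
                edges, next_vertex, ok = [], 0, True
                for p, is_path in zip(parts, bits):
                    if is_path:
                        vs = list(range(next_vertex, next_vertex + p + 1))
                    else:
                        if p < 3:              # no cycles on fewer than three edges
                            ok = False
                            break
                        vs = list(range(next_vertex, next_vertex + p))
                    next_vertex += len(vs)
                    edges += [(vs[j], vs[j + 1]) for j in range(len(vs) - 1)]
                    if not is_path:
                        edges.append((vs[-1], vs[0]))   # close the cycle
                if ok:
                    yield edges
```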

We now bound the number of degree sequences that H can have, given a graph R with at most k edges, if \(H = G \setminus E(\psi (R))\) for some embedding \(\psi \) of R, and such that H has minimum degree at least one. Once we have such a degree sequence, we know exactly what the first term of the objective function will be, and can focus on minimising the penalty.

Lemma 5.14

Let G be a graph with \(\Delta (G) = 3\) and no isolated vertices, and let k be a positive integer. Let R be a graph with no isolated vertices that satisfies \(|E(R)| \le k\) and \(\Delta (R) \le 2\). Then, there are O(k) possible degree sequences for \(G{\setminus } E(\psi (R))\) such that \(G{\setminus } E(\psi (R))\) has no isolated vertices, where \(\psi \) is an embedding of R into G.

Proof

Clearly, \(G\setminus E(\psi (R))\) can only have vertices of degree one, two, or three, so for our given graphs G and R, we must determine how many vertices of degree one, two, or three can be in \(G\setminus E(\psi (R))\). To avoid isolated vertices, each degree-two vertex of R must be mapped to a degree-three vertex of G. The remaining vertices of R are O(k) vertices of degree one, and each of these may be mapped either to a vertex of degree two in G or to a vertex of degree three in G. Thus the only choice is the number of vertices of degree one in R that are mapped to vertices of degree three in G, and there are O(k) such choices, giving our result. \(\square \)

We note that all degree sequences of length k can be enumerated by enumerating all sequences of k integers where each integer \(d_i\) in the sequence satisfies \(1 \le d_i \le k-1\), and then testing, for each sequence, whether it is a valid degree sequence using either the Havel–Hakimi method [24, 25] or the Erdős–Gallai method [26]. This can be done in time dependent on k but not on n.
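For concreteness, the Erdős–Gallai characterisation gives the following simple test (an illustrative sketch; the Havel–Hakimi procedure would serve equally well).

```python
def is_graphical(seq):
    """Erdős–Gallai test: does a sequence of non-negative integers occur as the
    degree sequence of a simple graph?"""
    d = sorted(seq, reverse=True)
    n = len(d)
    if sum(d) % 2 != 0:
        return False
    for r in range(1, n + 1):
        lhs = sum(d[:r])
        rhs = r * (r - 1) + sum(min(x, r) for x in d[r:])
        if lhs > rhs:
            return False
    return True
```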

We now introduce colour-coding as a technique for finding substructures within graphs; further background on colour-coding can be found in [22, Chapter 13]. This involves assigning colours to both R and G, and then searching for embeddings of R into G such that the embedding preserves colours. Our result, as is common in colour-coding, relies on the existence of a suitable family of colourings; we obtain the required results from [27]. In particular, we colour each vertex in R with a unique colour, and we use the same set of |V(R)| colours to colour the vertices of G. We will say an embedding \(\psi \) of R into G is colour-preserving if, for each \(v\in V(R)\), the colours of v and \(\psi (v)\) are the same.

Recall that \(R^a\) denotes a component of R, and that \(R^a\) is on vertices \(\{\, r^a_1, \ldots , r^a_\ell \, \}\) such that there is an edge from \(r^a_j\) to \(r^a_{j+1}\) for \(j\in \{\, 1,\ldots ,\ell -1\, \}\), as well as the edge \(r^a_1r^a_\ell \) if \(R^a\) is a cycle. We search for each path and cycle independently, as a set of colour-preserving embeddings of individual paths and cycles can be combined in the following way to give an injective (colour-preserving) mapping from R into G. Let \(R^1,\ldots , R^t\) be the connected components of R for some positive integer t, and for \(i\in \{\, 1,\ldots , t\, \}\) let \(\psi _i\) be the colour-preserving embedding of \(R^i\) into G. Additionally, let g(v) be the unique integer in \(\{\, 1,\ldots ,t\, \}\) such that v is a vertex in \(R^{g(v)}\). Define \(\psi (v) {:=}\psi _{g(v)}(v)\) for all \(v\in V(R)\). As each vertex of R is assigned a unique colour, and each \(\psi _i\) is colour-preserving, if there are two vertices in R, say \(v_1\) and \(v_2\), such that \(\psi (v_1) = \psi (v_2)\), then it follows that \(v_1 = v_2\), and so \(\psi \) is an injective, colour-preserving mapping.

To find all such colour-preserving embeddings, we now define the problem of finding all embeddings of a vertex-coloured cycle into a vertex-coloured graph such that the improvement to the neighbourhood discrepancy is maximised. We later explain how to adjust it to find paths as well. Note that this process cannot necessarily determine how all vertices of degree one in R are mapped onto vertices in G to obtain our desired degree sequence of \(G {\setminus } E(\psi (R))\); this detail is taken care of in our final result.

Enumerate Embed Coloured Cycle (problem definition)

We solve Enumerate Embed Coloured Cycle in a 2-stage process. First, Algorithm 5 finds the optimal value of the improvement to the score using dynamic programming, and while doing so stores the states (defined below) which lead to these optimal embeddings. Then a simple backtracking algorithm can process these states and output the embeddings that achieve this optimal score. We keep the two algorithms separate, as when solving Enum-\(k\)-Correlation Subgraph Optimisation we check numerous subgraphs R and for each R numerous colourings of G. However, not all of these subgraphs R and colourings of G may lead to optimal score values, so in the cases where these choices do not lead to optimal solutions, enumerating embeddings is not necessary.

For \(i\ge 2\), at each vertex v of G with colour \(c_i\), Algorithm 5 stores states (u, v, x, y) such that there is an embedding \(\psi \) of \(R^a_i\) with \(\psi (r^a_1) = x\), \(\psi (r^a_{i-1}) = u\), \(\psi (r^a_i) = v\), and \(\psi (r^a_\ell ) = y\). We say that such an embedding \(\psi \) corresponds to the given state. In a state, the edge uv represents the “previous edge” used to reach the vertex v, as this is necessary (but not sufficient) to calculate any change to the neighbourhood discrepancy at v. Similarly, to calculate the improvement at x, to which \(r^a_1\) is mapped, we need to know how the edge \(r^a_1r^a_\ell \) is mapped, and this depends on the choice of \(\psi (r^a_\ell )=y\). This will then affect the improvement at vertex y, so it must be tracked.

Additionally, for each vertex v and each state (u, v, x, y) we also store \(P_v(u,v,x,y)\), the set of preceding states that lead to this state, as well as \(I_v'(u,v,x,y)\), the best value of the sum of the improvements to the neighbourhood discrepancies over the first \(i-1\) vertices of the embedding of \(R^a\). Letting \(\Psi \) be the set of all embeddings corresponding to (u, v, x, y), the best improvement can be expressed as

$$\begin{aligned} I_v'(u,v,x,y) = \max _{\psi \in \Psi } \sum _{j=1}^{i-1} I(\psi (r^a_j), E(\psi (R^a))). \end{aligned}$$

Algorithm 5 calculates \(I_v'(s)\) and \(P_v(s)\) for each vertex v and each relevant state s at v using common dynamic programming approaches. Then the final step looks at all vertices v of colour \(c_\ell \) (i.e., those vertices that are picked last by Algorithm 5) and calculates as \(I_b\) the best improvement over all states \(s\in S(v)\) over all such vertices v, after adding the improvement at vertex v, and also stores as \(P_b\) all states that attain \(I_b\).

Note that no states are stored at any vertex coloured \(c_1\), and the improvement stored at a state of v does not include the actual improvement of the vertex v, as this depends on how further edges of \(R^a\) are embedded.

Keeping track of the previous states that led to the optimal value at each state allows us to backtrack to enumerate all embeddings that achieve the optimal overall value.

We will see later that we need to try many colourings of G to find optimal colour-preserving embeddings of R into G. Thus, while it is possible to enumerate all embeddings of each cycle and path when (or just after) running Algorithm 5, doing so may be unnecessary and would instead introduce extra delay between the output of solutions to Enum-\(k\)-Correlation Subgraph Optimisation. Instead, Algorithm 5 returns a representation of the optimal embeddings. This representation contains two parts: firstly, the set \(P_b\) contains the states that correspond to colour-preserving embeddings of a cycle \(R^a\) that maximise the value of \(\sum _{j=1}^{\ell } I(\psi (r^a_j), E(\psi (R^a)))\). The second part is all sets \(P_v\) for \(v\in V(G)\); we explain later how these can be utilised in a backtracking algorithm to reconstruct all embeddings \(\psi \) that maximise \(\sum _{j=1}^{\ell } I(\psi (r^a_j), E(\psi (R^a)))\).

Algorithm 5

Embed cycle. This procedure takes as parameters a vertex-coloured cycle \(R^a\) on \(|V(R^a)|\) distinct colours such that \(r^a_i\) has colour \(c_i\), and a (not necessarily proper) vertex-coloured graph G with \(\Delta (G) = 3\) on the same set of \(|V(R^a)|\) colours.

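To make the dynamic programme concrete, the following sketch (ours, and only loosely modelled on the description of Algorithm 5 above) maintains, for each vertex of G, the states (u, v, x, y) with their best improvements and predecessor sets; the callback `imp` stands for the improvement I(v, F) defined earlier, the colours of \(R^a\) are identified with \(1,\ldots ,\ell \), and all names are illustrative.

```python
from collections import defaultdict

def edge(a, b):
    return frozenset((a, b))

def embed_coloured_cycle(ell, G, colour, imp):
    """Schematic DP for embedding a coloured cycle (ours, not the paper's
    Algorithm 5).  G maps each host vertex to its neighbour set; colour maps
    each vertex of G to a colour in 1..ell; imp(v, F) is assumed supplied.
    Returns the best total improvement, the states attaining it, and the
    predecessor sets needed for backtracking."""
    by_colour = defaultdict(list)
    for v, c in colour.items():
        by_colour[c].append(v)

    best = defaultdict(dict)                       # best[w][(u, w, x, y)] over positions 1..i-1
    preds = defaultdict(lambda: defaultdict(set))  # predecessor states, for backtracking

    # states at vertices of colour 2: x is both r_1's image and the previous vertex
    for x in by_colour[1]:
        for w in G[x]:
            if colour[w] != 2:
                continue
            for y in G[x]:
                if colour[y] == ell and y != w:
                    # improvement at r_1's image once both its cycle edges are fixed
                    best[w][(x, w, x, y)] = imp(x, {edge(x, w), edge(x, y)})

    # extend embeddings one cycle position at a time
    for i in range(2, ell):
        for v in by_colour[i]:
            for (u, _, x, y), value in list(best[v].items()):
                for w in G[v]:
                    if colour[w] != i + 1:
                        continue
                    state = (v, w, x, y)
                    new_value = value + imp(v, {edge(u, v), edge(v, w)})
                    if new_value > best[w].get(state, float("-inf")):
                        best[w][state] = new_value
                        preds[w][state] = {(u, v, x, y)}
                    elif new_value == best[w][state]:
                        preds[w][state].add((u, v, x, y))

    # close the cycle: at colour ell the committed final vertex y must be v itself
    best_total, best_states = float("-inf"), []
    for v in by_colour[ell]:
        for (u, _, x, y), value in best[v].items():
            if y != v:
                continue
            total = value + imp(v, {edge(u, v), edge(x, v)})
            if total > best_total:
                best_total, best_states = total, [(u, v, x, y)]
            elif total == best_total:
                best_states.append((u, v, x, y))
    return best_total, best_states, preds
```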

We now bound the number of states, and show the correctness and running time of Algorithm 5.

Lemma 5.15

For any vertex v, Algorithm 5 stores at most \(3^k\) states at S(v).

Proof

For a given vertex v there are three possible choices for u (as G has maximum degree at most three), for each u there are \(3^{k-2}\) choices for x as x must be at distance at most \(k-2\) from u in a graph of maximum degree at most three, and for each x there are at most three choices for y as y is a neighbour of x. \(\square \)

Recall that we will represent a set of optimal embeddings by storing two pieces of data. The first is \(P_b\), the set of states that actually correspond to the optimal embeddings (recall that these states do not, by themselves, store enough to recreate a complete embedding). Secondly, we store the sets \(P_v\) for \(v\in V(G)\), which we can use in a backtracking algorithm in combination with \(P_b\) to reconstruct the optimal embeddings.

We will prove the correctness of Algorithm 5 next. This proof is split into the following four ideas, which are then combined to prove the overall correctness of Algorithm 5.

  • The algorithm correctly identifies states at any vertex not coloured with \(c_l\).

  • The algorithm correctly calculates \(I'_w(v,w,x,y)\) for each state.

  • The algorithm correctly identifies states at any vertex coloured with \(c_l\).

  • The algorithm correctly returns a representation of all embeddings that achieve a maximum improvement to the neighbourhood discrepancy.

Lemma 5.16

For a given coloured cycle \(R^a\) on \(\ell {:=}|V(R^a)|\) distinct colours such that \(r^a_i\) has colour \(c_i\), and a (not necessarily proper) vertex-coloured graph G with \(\Delta (G) = 3\) on the same set of \(\ell \) colours, Algorithm 5 correctly calculates the states for each vertex \(v\in V(G)\) where v does not have colour \(c_\ell \).

Proof

We proceed inductively on i, iterating through the colours \(c_i\) and beginning with \(i=2\), as no states are stored, or need to be stored, at any vertex of colour \(c_1\). Let w be an arbitrary vertex in V(G) with colour \(c_2\). Recall that \(R^a_i\) denotes the subgraph of \(R^a\) induced by the first i vertices. Thus \(R^a_2\) has only two vertices, and so for a state (v, w, v, y) there must be an embedding \(\psi \) that corresponds to (v, w, v, y) such that \(\psi (r^a_1)\psi (r^a_2) = vw\) and y is a neighbour of v. We see that the algorithm considers exactly all vertices of colour \(c_1\) on Line 5, and considers all pairs of edges vw and vy on Lines 6 and 7, and that no other edges are considered when calculating states, so the correct states are determined for vertices of colour \(c_2\). Next, consider \(I'_w(v,w,x,y)\) for an arbitrary state (v, w, x, y) at w. As w has colour \(c_2\), by definition we must have \(v=x\). We also know that y is a neighbour of v and y has colour \(c_\ell \). Let \(\psi \) be an embedding such that \(I(\psi (r^a_1), \{\,vw,vy\,\}) = I'_w(v,w,x,y)\) (i.e., \(\psi \) is an embedding that achieves the best improvement over all embeddings that correspond to (v, w, x, y)). Note that \(\psi \) maps \(r^a_1\) to v, \(r^a_2\) to w, and \(r^a_\ell \) to y, and those vertices are the only vertices involved when calculating \(I(v, \{\,vw, vy\,\}) = I'_w(v,w,x,y)\). Therefore, when the algorithm considers all pairs of edges vw and vy on Lines 6 and 7, it must also find an embedding that maps \(r^a_1\) to v, \(r^a_2\) to w, and \(r^a_\ell \) to y, and no other edges are considered when calculating states, so the values of \(I'_w(v,w,x,y)\) are correctly determined for vertices of colour \(c_2\).

We now proceed inductively. For \(i\in \{\,2,\ldots ,\ell -1\,\}\), for any state s at a vertex w of colour \(c_{i+1}\) there must be an embedding \(\psi \) of \(R^a_{i+1}\) that corresponds to s. We can then consider the states at vertex \(\psi (r^a_i) = v\), where v must have colour \(c_i\) and be a neighbour of w. In particular, as \(\psi \) is an embedding of \(R^a_{i+1}\), there must be a valid state \(s'\) at v such that \(\psi \) corresponds to \(s'\). As Line 20 considers all such vertices v, Line 21 considers all states at v, and Line 22 considers all neighbours of v, it must be that the state corresponding to the embedding \(\psi \) is added to S(w). Additionally, no other states are added to S(w), and so the correct states are determined. \(\square \)

Lemma 5.17

For a given coloured cycle \(R^a\) on \(\ell {:=}|V(R^a)|\) distinct colours such that \(r^a_i\) has colour \(c_i\), a (not necessarily proper) vertex-coloured graph G with \(\Delta (G) = 3\) on the same set of \(\ell \) colours, and a state (u, v, x, y) at a vertex v as calculated by Algorithm 5, Algorithm 5 correctly calculates the best improvement \(I'_v(u,v,x,y)\).

Proof

We show that, given a state (v, w, x, y) at a vertex w with colour \(c_i\), there is no embedding that achieves a better improvement than \(I'_w(v,w,x,y)\) as determined by Algorithm 5. To aid readability, in this paragraph we outline how this is shown, while the following paragraph gives the exact mathematical proof. Recall that if w has colour \(c_i\), then \(I'_w(v,w,x,y)\) is the best improvement on the first \(i-1\) vertices, as the improvement at vertex w depends on how the next vertex is embedded. We will take an embedding \(\psi \) that corresponds to (v, w, x, y) and achieves an improvement of \(I'_w(v,w,x,y)\) (i.e., \(\psi \) achieves the best improvement over all embeddings considered by Algorithm 5), and then, towards a contradiction, assume that there exists some other embedding \(\psi '\) that also corresponds to (v, w, x, y) but achieves a strictly better improvement on the first \(i-1\) vertices (which implies that \(\psi '\) is not considered by Algorithm 5). We create the embedding \(\psi ''\) to be equal to \(\psi '\) on the first \(i-2\) vertices, and equal to \(\psi \) on the remaining vertices. Then by its construction \(\psi ''\) must be considered by Algorithm 5, but attains a strictly better improvement than \(\psi \), a contradiction.

We now give the technical details for the previous paragraph. Let \(\psi \) be an embedding such that \(I'_w(v,w,x,y) = \sum _{j=1}^{i-1} I(\psi (r^a_j),E(\psi (R^a)))\) (i.e., \(\psi \) gives a maximal improvement on the first \(i-1\) vertices for the state (v, w, x, y) over all embeddings considered by Algorithm 5), and let u be the vertex in V(G) such that \(\psi (r^a_{i-2}) = u\). Then \(\psi \) must correspond to a state (u, v, x, y) at v. We claim that \(\psi \) must also obtain the maximum improvement over the first \(i-2\) vertices over all embeddings that correspond to (u, v, x, y). Towards a contradiction, assume there is some \(\psi '\ne \psi \) that also corresponds to (u, v, x, y) such that

$$\begin{aligned} \sum _{j=1}^{i-2} I(\psi (r^a_j),E(\psi (R^a))) < \sum _{j=1}^{i-2} I(\psi '(r^a_j),E(\psi '(R^a))) \end{aligned}$$

(i.e., \(\psi \) does not give a maximal improvement over the first \(i-2\) vertices for the state (u, v, x, y)). Consider next the embedding \(\psi ''\) where \(\psi ''(r^a_j) = \psi '(r^a_j)\) for \(j\in \{\,1,\ldots , i-2\,\}\) and \(\psi ''(r^a_j) = \psi (r^a_j)\) for \(j \ge i-1\). We note that \(\psi ''(r^a_{i-2}) = u\) and \(\psi ''(r^a_{i-1}) = v\), and as (u, v, x, y) is a state at v it must be that uv is an edge of G. This, combined with both \(\psi \) and \(\psi '\) being embeddings, ensures that \(\psi ''\) is also an embedding. Moreover, \(\psi \) and \(\psi ''\) agree on \(r^a_j\) for every \(j\ge i-1\) and both map \(r^a_{i-2}\) to u, so the embedded edges incident to \(\psi (r^a_{i-1}) = \psi ''(r^a_{i-1})\) coincide under the two embeddings and hence

$$\begin{aligned} I(\psi (r^a_{i-1}),E(\psi (R^a))) = I(\psi ''(r^a_{i-1}),E(\psi ''(R^a))). \end{aligned}$$

Similarly, for each \(j \le i-2\) the improvement at \(\psi ''(r^a_j) = \psi '(r^a_j)\) under \(\psi ''\) equals that under \(\psi '\). Combining these equalities with the assumed inequality, we get

$$\begin{aligned} \sum _{j=1}^{i-1} I(\psi (r^a_j),E(\psi (R^a))) < \sum _{j=1}^{i-1} I(\psi ''(r^a_j),E(\psi ''(R^a))), \end{aligned}$$

which contradicts the choice of \(\psi \), since \(\psi ''\) also corresponds to (v, w, x, y) and is considered by Algorithm 5. Thus, there is no embedding corresponding to (u, v, x, y) that achieves a better improvement to the neighbourhood discrepancy than \(\psi \) over the first \(i-2\) vertices. Therefore, \(I'_w(v,w,x,y) = I'_v(u,v,x,y) + I(v, \{\,uv,vw\,\})\) is the maximum improvement attained by any embedding of the first \(i-1\) vertices. By induction we therefore know that Algorithm 5 correctly calculates the maximum improvement attainable at a given state over all embeddings. \(\square \)

Lemma 5.18

For a given coloured cycle \(R^a\) on \(\ell {:=}|V(R^a)|\) distinct colours such that \(r^a_i\) has colour \(c_i\), and a (not necessarily proper) vertex-coloured graph G with \(\Delta (G) = 3\) on the same set of \(\ell \) colours, Algorithm 5 correctly calculates the best improvement \(I_b\) to neighbourhood discrepancy achievable by embedding \(R^a\) into G such that colours agree, as well as all states that achieve this improvement.

Proof

By Lemma 5.16, the states calculated at each vertex that is not coloured with \(c_\ell \) are correct, and by Lemma 5.17 Algorithm 5 calculates the best improvement for any such state. Lines 41 through 45 then check each vertex v of colour \(c_\ell \), and then check each state (u, v, x, y) at such a vertex to see if \(y = v\). As we have been preserving colours when mapping \(R^a\) into G, this ensures that this is a colour-preserving embedding of the cycle \(R^a\). The maximum value of \(I'_v(u,v,x,y) + I(v, \{\,uv, xy\,\})\), as calculated on Line 45, is thus tracked as \(I_b\), and this is used to store the states that achieve this optimal value. \(\square \)

Lemma 5.19

For a given coloured cycle \(R^a\), Algorithm 5 correctly returns a representation of all embeddings that achieve the best improvement to neighbourhood discrepancy over all embeddings of \(R^a\) onto G such that the colours of the vertices of \(R^a\) and G agree.

Proof

We claim that we return a representation of all embeddings that achieve this maximum; these are returned in \(P_b\) and the sets \(P_v\) for \(v\in V(G)\). By Lemma 5.18, for any \(v \in V(G)\) with colour \(c_\ell \) and for any embedding that achieves the optimal score and maps \(r^a_\ell \) to v, we store in \(P_b\) all of the states of v that represent this embedding.

For any such state \((u,v,x,y)\in P_b\), we can rebuild all embeddings \(\psi \) that achieve this state as follows. We first see that \(\psi (r^a_\ell ) = y\), \(\psi (r^a_1) = x\), and \(\psi (r^a_{\ell -1}) = u\). Then we can consider all elements of the set \(P_v(u,v,x,y)\). Each element of this set leads to at least one embedding. Starting with \(i=\ell -1\), if (u, v, x, y) is our current state then for each element \((t,u,x,y) \in P_v(u,v,x,y)\), we know there is at least one optimal, colour-preserving embedding with \(\psi (r^a_{i-1}) = t\). By repeating this process with decreasing values of i, letting (t, u, x, y) be the new current state, until \(i=2\), we can rebuild any embedding that achieves this optimum. Recall that, at this point, we have a set of states for each vertex, and each such state contains the previous vertex, so the next state to consider and the previous vertex can both be accessed in constant time. We note that this rebuilding process takes \(O(\ell )\) time per embedding found. \(\square \)

We note that Algorithm 5 can easily be extended to also find optimal embeddings of paths by using sentinel values (like \(\emptyset \)) instead of the first and last vertices of the current cycle in the definition of states (i.e., each state will “look like” \((u,v,\emptyset ,\emptyset )\) where \(u,v \in V(G)\)), and excluding the check on Line 43. This will not increase the number of states stored at any vertex.

Lemma 5.20

Algorithm 5 can be implemented to run in \(O(3^kn)\) time.

Proof

A vertex is considered in one of the three loops on Lines 5, 20, or 41. Additionally, if a vertex is considered on Line 20, it is considered in one iteration of the loop begun on Line 19, as the vertex has one colour. For the loop on Line 5, we iterate through all pairs of neighbours. As there are at most three neighbours, there are at most six such pairs. For any other of the loops, we need to iterate through states at a vertex (of which there are at most \(3^k\) by Lemma 5.15), and neighbours of the correct colour (again, there may be at most three neighbours), giving the result.

Lastly, we highlight that for any individual vertex v and set F, the value of I(v, F) can be calculated in \(O(\max \{\, d(v), |F|\,\})\) time, which, in Algorithm 5, is constant time. \(\square \)

Corollary 5.21

Enumerate Embed Coloured Cycle can be solved with \(f(k) \cdot n^{O(1)}\) precalculation time and O(k) delay time when parameterised by k.

Proof

The embeddings can be enumerated by first running Algorithm 5 and then backtracking through the vertices of \(R^a\) to build the actual embeddings. By Lemma 5.20, Algorithm 5 runs in \(O(3^kn)\) time, and rebuilding an embedding from \(P_b\) and the sets \(P_w(u,v,x,y)\) for vertices w and states (u, v, x, y) takes O(k) time, giving the desired result. \(\square \)

We now describe how to combine all of these results.

Theorem 5.12

For an integer \(k\ge 1\), all solutions to Enum-\(k\)-Correlation Subgraph Optimisation can be output on graphs with maximum degree three with a precalculation time of \(2^{k(2 \log k + O(1))} n \log n\), with O(n) delay, and with O(n) postcalculation time.

Proof

In the precalculation, we determine the maximum value of the score function and build the data required to output all solutions. We do this by first considering all potential options for the graph \(R = G {\setminus } H\). For each such R we consider all potential degree sequences of \(H = G \setminus \psi (R)\) for some embedding \(\psi \), and then for each pair of R and degree sequence, we determine the maximum value the score function can take, and store sufficient data to recreate all embeddings that achieve said maximum. We iterate this over such graphs R and degree sequences. In doing so, we determine the maximum score function, and also have created all information required to recreate all embeddings that achieve said optimal score. Once this optimal score is found, we then simply iterate through all graphs R that achieve this optimal solution, all degree sequences of \(G \setminus \psi (R)\) for an embedding \(\psi \) that achieve this optimal solution, and all embeddings \(\psi \) that achieve this optimal solution.

By Lemma 5.13 there are \(O(2^k e^{\sqrt{k}})\) choices for R, and we will consider each in turn, so for now consider one choice of such a graph. By Lemma 5.14, for a given R there are O(k) choices for the degree sequence of \(H = G{\setminus } \psi (R)\) for some embedding \(\psi \). Again we will deal with each in turn, so for now consider one choice of a degree sequence. In particular, this fixes h, the number of vertices of degree one in R that are mapped to vertices of degree three in G. As R can have at most k components, and thus at most k paths and at most 2k vertices of degree one, there are \(\left( {\begin{array}{c}2k\\ h\end{array}}\right) \) (or \(O((2k)^{2k})\)) ways to choose h vertices of degree one in R that will be mapped to vertices of degree three in G. Let g be a function mapping the vertices of R to \(\{\,2,3\,\}\) such that if \(g(v^r_i) = b\) then \(v^r_i\) must be mapped to a vertex in G of degree b.

Note that when calculating

$$\begin{aligned} {{\,\textrm{score}\,}}(H, f) = \sum _{v\in V(G)} \ln \deg _H(v) - n \ln \left[ \sum _{v\in V(G)}\deg _H(v) {{\,\textrm{ND}\,}}_H(v,f) \right] , \end{aligned}$$

knowing the degree sequence of H exactly determines the value of

$$\begin{aligned} \sum _{v\in V(G)} \ln \deg _H(v). \end{aligned}$$

It therefore remains to minimise the value of

$$\begin{aligned} n \ln \left[ \sum _{v\in V(G)} \deg _H(v) {{\,\textrm{ND}\,}}_H(v,f)\right] . \end{aligned}$$

For this, we use colour coding to embed the paths and cycles of R iteratively. Colour coding involves colouring the vertices of G with each colouring from a family of colourings in turn, and solving the problem for each such colouring. We later show the existence of a suitable family \({\mathcal {C}}\) of colourings, where \(|{\mathcal {C}}|\) is logarithmic in n but exponential in k. To begin with, however, assume we have coloured the vertices of R with |V(R)| colours such that each vertex has a different colour. We also colour the vertices of G with the same |V(R)| colours, noting that this colouring of G need not be a proper colouring.

However, we must also ensure that any embedding we find also maps the correct number of vertices of degree one in R to vertices of degree three in G. Recall that \(R=\bigcup _{a=1}^{t} R^a\) where each \(R^a\) is connected (i.e., each \(R^a\) is either a cycle or a path). Then, for a given component \(R^a\) of R, we take a copy \(G'\) of G, and for each colour \(c_i\), and for each vertex v of \(G'\) with colour \(c_i\), if \(d_G(v) \ne g(v^r_i)\) (i.e., v does not have our desired degree to match our degree sequence), then delete v from \(G'\). This modification means that any colour-preserving embedding of \(R^a\) into \(G'\) that we find must map each vertex \(v^r_i\) of R (and in particular each vertex of degree one) to a vertex of degree \(g(v^r_i)\) in G, and therefore we can guarantee that we obtain the desired degree sequence. Thus, for each component \(R^a\), create the corresponding graph \(G'\) and then run Algorithm 5 on \((R^a, G')\). As each vertex in R has a different colour, we know that any two distinct components \(R^a\) and \(R^b\) have no vertices in common, so the best improvement to the neighbourhood discrepancy found for any embedding of R can be calculated by summing together the best improvement to the neighbourhood discrepancy found for embedding \(R^a\) over all connected components \(R^a\) of R.

It then remains to determine how many different colourings of G must be processed to guarantee that an optimal solution is found. This guarantee is given if, for any subset of vertices \(S\subseteq V(G)\) with \(|S| \le |V(R)| \le 2k\), there is some colouring that is processed such that all vertices in S have different colours. Such a set of colourings is equivalent to a 2k-perfect family of hash functions. From [27, Theorem 3, (iii)], there exists a set \({\mathcal {C}}\) of colourings of G with \(|{\mathcal {C}}| = e^{2k} (2k)^{O(\log 2k)} \log n\) such that, for any set S of 2k vertices in G, there is a colouring \(C\in {\mathcal {C}}\) under which the 2k vertices of S receive pairwise distinct colours (and by the same paper, such a set can also be constructed in time linear in the size of the set \({\mathcal {C}}\)).
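For intuition only, a randomised stand-in for this derandomised family (not the construction used above) colours G uniformly at random: a fixed set of 2k vertices receives pairwise distinct colours with probability at least \((2k)!/(2k)^{2k} \ge e^{-2k}\), so repeating \(O(e^{2k}\log (1/\delta ))\) times succeeds with probability at least \(1-\delta \). The following sketch is illustrative and its names are ours.

```python
import math
import random

def random_colourings(vertices, q, delta=0.01, rng=random):
    """Randomised stand-in for the derandomised family of [27] (illustrative
    only): yields enough uniformly random colourings with q colours that any
    fixed set of q vertices receives pairwise distinct colours in at least one
    of them with probability at least 1 - delta."""
    success = math.exp(-q)                 # lower bound on q!/q^q
    repetitions = math.ceil(math.log(1 / delta) / success)
    for _ in range(repetitions):
        yield {v: rng.randrange(q) for v in vertices}
```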

We therefore have \(O(2^k e^{\sqrt{k}})\) choices for R, O(k) choices for the degree sequence of \(G{\setminus } E(\psi (R))\), \(O((2k)^{2k})\) choices for the h vertices of degree one in R that are mapped to vertices of degree three in G, and \(O(2^k)\) choices for our function g. For each pair (R, g) we need to use \(e^{2k} (2k)^{O(\log 2k)} \log n\) colourings, and the best improvement for each colouring can be found in \(O(k3^kn)\) time by Lemma 5.20 and the fact that R has at most k components. This gives a running time of \(2^{k(2 \log k + O(1))} \cdot n \log n\).

All of the above choices are checked in the precalculation step, and we keep a list of all choices of colourings, degree sequences and graphs R that achieve the optimal value in the obvious manner.

Then, once the precalculation has been completed, we need only output the subgraphs that achieve this optimum. Each embedding can be generated in O(k) time (as all the requisite data has already been produced). As G has maximum degree three, it also has at most 3n/2 edges and so we can reconstruct each optimal subgraph H in O(n) time. Thus, there is a precalculation time of \(2^{k(2\log k + O(1))} n \log n\) before the first subgraph is output, and further subgraphs have a delay of O(n).

Lastly, the postcalculation time must include checks that there are no more choices of embeddings, colourings, degree sequences, or graphs R to make. As choices of colourings, degree sequences or graphs R are each completely calculated in the precalculation step, the lack of further choices for these can be determined in constant time by checking whether we have tried all choices that lead to optimal embeddings. However, choices of embeddings are calculated during the output step from the \(P_w(u,v,x,y)\) sets, and so the post-calculation step to know that no more embeddings are available may take O(n) time, as required. \(\square \)

If k is fixed to some constant integer, then as G has at most 3n/2 edges, the number of subgraphs of G that have at least \(|E(G)|-k\) edges is \(n^{O(k)}\), which is polynomial in n. Therefore the number of solutions is, at worst, polynomial in n, meaning our algorithm is guaranteed to run in polynomial time overall.

We note that the variant of Enum-\(k\)-Correlation Subgraph Optimisation in which we only want to find one optimal subgraph, or even just the optimal value of the score function, is fixed-parameter tractable on graphs with maximum degree three when parameterised by k, the maximum number of edges that can be removed: we simply solve Enum-\(k\)-Correlation Subgraph Optimisation and return the appropriate value when the first embedding is output.

Theorem 5.22

For an integer \(k\ge 1\), \(k\)-Correlation Subgraph Optimisation can be solved on graphs with maximum degree three in time \(2^{k(2 \log k + O(1))} n \log n\).

6 Discussion and Conclusions

Correlation Subgraph Optimisation is a graph optimisation problem arising from spatial statistics, with direct applications to epidemiology and social science, that we show is intractable unless P = NP. We also show that it is resistant to common techniques in graph algorithms, but is in XP when parameterised by both treewidth and maximum degree, and is fixed-parameter tractable when parameterised by the number of edges that can be removed, provided the maximum degree is at most three. Both of our algorithms are extended to not just solve the decision problem but also output all optimal subgraphs. This results in two enumeration algorithms which respectively have XP (in treewidth and maximum degree) and FPT (in the maximum number of edges that can be removed) precalculation times, linear delays, and linear postcalculation times. However, the question remains whether Correlation Subgraph Optimisation itself is hard when the maximum degree of the input graph is bounded. We also note as an interesting open problem whether Correlation Subgraph Optimisation admits efficient parameterised algorithms with respect to (combinations of) parameters other than the maximum degree. Additionally, the original paper that introduced Correlation Subgraph Optimisation gives one heuristic for solving the problem, but leaves open any guarantee on the performance of this heuristic. The investigation of the performance of this heuristic, and the design of approximation algorithms with provable guarantees, thus form two further significant open problems for Correlation Subgraph Optimisation.