1 Introduction

Recently, there has been growing interest in creating synergies between Combinatorial Optimization (CO) and Machine Learning (ML), in both directions. This is a natural connection, since ML algorithms can be seen, in essence, as optimization algorithms that try to minimize prediction error. In this paper, we focus on how CO techniques can be applied to improve the interpretability and accuracy of decision tree classifiers in ML.

A Decision Tree (DT) classifier is a supervised ML technique that builds a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome. In essence, every path from the root to a leaf is a classification rule that determines the class to which the input query belongs.

ML research has been traditionally concerned with developing accurate classifiers, i.e., classifiers whose prediction error is as low as possible. Lately, however, the research community has also been looking at the trade-off between accuracy and interpretability, where the latter refers to the human ability to understand how the classifier works. Interpretability is fundamental in many respects: to explain ML models and understand their value and accuracy, to debug machine learning algorithms, to make more informed decisions, etc.

Among supervised learning techniques, decision tree classifiers are interpretable in themselves, as long as they are small, i.e., they have as few rules as possible and these rules are as short as possible. As stated in Bessiere et al. (2009), informally, this follows from the principle of Occam’s Razor, which states that one should prefer simpler hypotheses. However, notice that this is just one of the several aspects related to interpretability, at least from a human perspective.

CO approaches have been applied to compute small trees in the sense of building decision trees of minimum size (number of nodes) or minimum depth. Minimizing the size indirectly corresponds to minimizing the number of rules associated with the DT, and minimizing the depth corresponds to minimizing the length of the rules. Among these approaches we find several contributions coming from Operations Research (OR) (Aglin et al. 2020; Verwer and Zhang 2019) and from Constraint Programming (CP) (Hu et al. 2019; Verhaeghe et al. 2019).

Another question is whether these DTs are pure (i.e., whether they correctly classify the training dataset). To avoid overfitting the training set, which leads to lower accuracy on the test set, a common rule of thumb is to prefer some impurity in order to gain generalization power. However, it is difficult to anticipate how much impurity we should allow.

In this paper, we first address the question of how to efficiently compute minimum pure DTs and discuss later how accurate they are. We focus on CP techniques, where SATisfiability (SAT)-based approaches (Biere et al. 2009) have been shown to be very effective. In particular, we work on the Maximum Satisfiability (MaxSAT) problem, which is the optimization version of the SAT problem.

We revisit the SAT approach in Narodytska et al. (2018), which encodes into SAT the decision query of whether there is a pure DT of exactly n nodes. Then, by performing a sequence of such SAT queries, the optimum can be located. We extend this SAT approach to the MaxSAT case, i.e., we encode into MaxSAT the query asking for the minimum number of nodes n for which such a pure DT exists.

Equipped with a MaxSAT encoding we can now leverage the power of current (and future) MaxSAT solvers. Notice that the MaxSAT formalism allows us to state explicitly that we are dealing with an optimization problem, while the original incremental SAT solving approach in Narodytska et al. (2018) is agnostic in this sense and the objective has to be managed by an ad hoc piece of code. Our experimental results show that, in practice, our approach is able to obtain minimum pure DTs on a wide variety of benchmarks faster than the original incremental SAT counterpart in Narodytska et al. (2018), even when using the same underlying SAT solver. We also show that our approach provides comparable performance to other approaches (Janota and Morgado 2020) that explicitly exploit structural properties of the DT.

Secondly, and as the main contribution of the paper, we show that by exploring the space of optimal solutions (minimum pure DTs) we can indeed be competitive with DT classifiers generated with sklearn, producing smaller trees that are still accurate enough.

In order to achieve the main goal of this paper, we solve yet another optimization problem that extracts multiple optimal solutions, maximizing their diversity, from the MaxSAT encoding that allows us to compute minimum pure DTs for the training set. Then, thanks to a validation process, we select among these multiple diverse solutions one that is expected to achieve better accuracy on the test set.

This is an interesting result since all the optimal solutions we consider are minimum pure DTs for the training set, yet they do not seem to suffer substantially from the overfitting phenomenon. Notice that what makes them special is that they are all of minimum size, a feature that cannot be exploited by current DT ML approaches.

Finding Minimum Pure DTs (MPDTs) is an NP-hard problem, and therefore a time-consuming process. Our goal is to provide a solution for applications where high interpretability and accuracy of DTs prevail over training speed, such as the AI systems considered as “high-risk” in the proposed regulation (European Commission 2021). In particular, this solution is suitable for systems where a certain stability of the models (i.e. not re-training the model frequently) is desired. This would be the case of models used in administration selection processes, financial risk ratings, or patient screening. Also, having interpretable models, in contrast to black-box models (such as neural networks), allows us to extract knowledge from them and to analyze the model to detect potential biases.

2 Preliminaries

Definition 1

An example e is a set of pairs \(\langle f_r,v \rangle \) plus the class label c, where \(f_r\) is a feature and v is the value assigned to the feature. A Decision Tree (DT) is a set of rules (namely paths) that are constructed by traversing a tree-like structure composed of questions (decision nodes) and answers until a terminal node with a label (leaf) is found. The depth of a node is the length of the path (i.e. the count of nodes traversed) from the root to this node. The depth of the DT is the length of the longest path from the root to any leaf. A Complete Decision Tree (CDT) is a tree where all the leaves are located at the same depth. A Pure Decision Tree (PDT) for a set of examples \(\varepsilon \), \(PDT(\varepsilon )\), is a DT that correctly classifies all the examples (Hautaniemi et al. 2005). By definition, its accuracy on this set is \(100\%\). A Minimum Pure Decision Tree (MPDT) for a set of examples \(\varepsilon \), \(MPDT(\varepsilon )\), is a DT that is proven to be the smallest PDT for \(\varepsilon \) in terms of size (i.e. the number of nodes in the DT). We denote the minimum size as \(|MPDT(\varepsilon ) |\).

Example 1

Table 1 shows the set of examples \(\varepsilon \) for the meteo dataset. Column # shows the index of each example e and column classes their class label c.

In Fig. 1 we can see a Minimum Pure Decision Tree (MPDT) for the set of examples in Table 1, where the decision nodes are drawn as ellipses and the leaves as rectangles. It has 5 paths and a depth of 3. In particular, the leaves of the MPDT define the following partition of the examples, from top to bottom and left to right: \(\langle \{1,4,8,9\},\{11,12,14\},\{5,7\},\{2,3,6\},\{10,13\} \rangle \). As we can see, all the examples classified at a leaf belong to the same class (pure) and there is no other pure tree with fewer nodes (minimum).

Table 1 The meteo dataset
Fig. 1 MPDT for the meteo dataset

Definition 2

A literal is a propositional variable x or a negated propositional variable \(\lnot x\). A clause is a disjunction of literals. A Conjunctive Normal Form (CNF) is a conjunction of clauses. A weighted clause is a pair \((c, w)\), where c is a clause and w, its weight, is a natural number or infinity. A clause is hard if its weight is infinity (or no weight is given); otherwise, it is soft. A Weighted Partial MaxSAT instance is a multiset of weighted clauses.

Definition 3

A truth assignment for an instance \(\phi \) is a mapping that assigns to each propositional variable in \(\phi \) either 0 (False) or 1 (True). A truth assignment is partial if the mapping is not defined for all the propositional variables in \(\phi \). A truth assignment I satisfies a literal x \((\lnot x)\) if I maps x to 1 (0); otherwise, it is falsified. A truth assignment I satisfies a clause if I satisfies at least one of its literals; otherwise, it is violated or falsified. The cost of a clause \((c, w)\) under I is 0 if I satisfies the clause; otherwise, it is w. Given a partial truth assignment I, a literal or a clause is undefined if it is neither satisfied nor falsified. A clause c is a unit clause under I if c is not satisfied by I and contains exactly one undefined literal. The cost of a formula \(\phi \) under a truth assignment I, \(cost(I, \phi )\), is the aggregated cost of all its clauses under I.

Definition 4

The Weighted Partial MaxSAT problem for an instance \(\phi \) is to find an assignment that satisfies the hard clauses and in which the sum of the weights of the falsified soft clauses, \(cost(\phi )\), is minimal. The Partial MaxSAT problem is the Weighted Partial MaxSAT problem where all weights of soft clauses are equal. The SAT problem is the Partial MaxSAT problem when there are no soft clauses. An instance of Weighted Partial MaxSAT, or any of its variants, is unsatisfiable if its optimal cost is \(\infty \). A SAT instance \(\phi \) is satisfiable if there is a truth assignment I, called model, such that \(cost(I, \phi ) = 0\).
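As a small, self-contained illustration of Definitions 2–4 (not part of the tooling used in this paper), the following snippet builds a Weighted Partial MaxSAT instance with one hard clause and two weighted soft clauses and solves it with the RC2 solver shipped with the PySAT library; its optimal cost is 2.

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

wcnf = WCNF()
wcnf.append([1, 2])              # hard clause: x1 v x2 (no weight given)
wcnf.append([-1], weight=2)      # soft clause: (not x1, 2)
wcnf.append([-2], weight=3)      # soft clause: (not x2, 3)

with RC2(wcnf) as solver:
    model = solver.compute()     # an optimal model, e.g. x1 = True, x2 = False
    print(model, solver.cost)    # it falsifies only (not x1, 2), so the cost is 2
```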

Definition 5

A pseudo-Boolean constraint (PB) is a Boolean function of the form \( \sum _{i=1}^{n} q_il_i \diamond k \), where k and the \(q_i\) are integer constants, \(l_i\) are literals, and \(\diamond \in \{<, \le , =, \ge , >\}\) is a comparison operator. An At Least K (AtLeastK) constraint is a PB of the form \( \sum _{i=1}^{n} l_i \ge k \). An At Most K (AtMostK) constraint is a PB of the form \( \sum _{i=1}^{n} l_i \le k \).
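For the special case in which all coefficients \(q_i\) are 1 (AtLeastK/AtMostK), off-the-shelf cardinality encoders can translate the constraint into clauses. The paper uses OptiLog for this purpose (see Sect. 4); purely as an illustration, the same constraints can be produced with PySAT's CardEnc:

```python
from pysat.card import CardEnc, EncType

lits = [1, 2, 3, 4]     # the literals x1..x4

# AtMostK: x1 + x2 + x3 + x4 <= 2, encoded with a sequential counter
atmost2 = CardEnc.atmost(lits=lits, bound=2, top_id=4, encoding=EncType.seqcounter)

# AtLeastK: x1 + x2 + x3 + x4 >= 1 (auxiliary variables start after those of atmost2)
atleast1 = CardEnc.atleast(lits=lits, bound=1, top_id=atmost2.nv)

print(atmost2.clauses)  # clauses over x1..x4 plus auxiliary counter variables
print(atleast1.clauses) # clauses enforcing that at least one of x1..x4 is true
```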

3 Related work

In a pioneering work (Bessiere et al. 2009), the first SAT approach to compute MPDTs was presented. They encode into SAT the decision query of whether there exists a PDT of size n. They use as upper bound the solution from a CP model. Then, they perform a sequence of queries to a SAT solver (in the SAT4J package (Le Berre and Parrain 2010)), where n is iteratively bounded by adding an AtLeastK constraint on the number of useless nodes that must be in the solution.

In Narodytska et al. (2018) a more efficient SAT encoding was provided for the decision query of whether there exists a PDT of size n. The authors also solve the problem through a sequence of SAT queries. In this case, the upper bound is computed with the ITI algorithm (Utgoff 1989). Then, at every iteration, they re-encode the problem with the next possible value of n. We refer to this approach as MinDT.

In Avellaneda (2020) two methods are proposed. The Infer-depth approach iteratively decreases the depth of the tree until it locates the minimum depth at which a complete PDT can be built. The Infer-size approach computes the smallest PDT of the depth found by Infer-depth. While both approaches compute small PDTs, neither of them is guaranteed to find MPDTs.

In Janota and Morgado (2020), one of the approaches, dtfinder, splits the search space into the possible topologies (tree layouts) of the DT up to a certain depth. At least one of these topologies is guaranteed to belong to an MPDT.

In Hu et al. (2020), DT(max), the approach answers which is the most accurate tree on the training set that can be obtained given an upper bound on the depth. Note that if the upper bound is smaller than the minimum depth of an MPDT, their approach reports a tree with less than 100% accuracy on the training set (i.e. an impure tree). Additionally, an AdaBoost (Freund and Schapire 1997) approach based on MaxSAT is presented. We will not explore AdaBoost strategies in this paper, since they produce ensembles, which are composed of several DTs. Having more than one DT weakens the interpretability of the classifier.

Table 2 shows a summary, presenting the objective function, whether the approach returns a PDT and if it is minimum (MPDT). The only algorithms that are complete in the sense of certifying the minimum size of the PDTs are (Bessiere et al. 2009), MinDT (Narodytska et al. 2018) and dtfinder (Janota and Morgado 2020).

We will focus on comparing our approach MPDT (see Sect. 4) with MinDT in terms of efficiently computing MPDTs, since the other approaches build on MinDT. In terms of computing rather small trees with good accuracy, we will compare with DT(max) (Hu et al. 2020), which, in contrast to the other approaches, also focuses on maximizing accuracy.

Table 2 State-of-the-art: SAT-based approaches for computing DTs

4 Computing MPDTs with MaxSAT

We extend the SAT encoding in Narodytska et al. (2018), MinDT, to a Partial MaxSAT (and its weighted variant) encoding that queries for an MPDT given an upper bound n on the size of the tree. Our approach is inspired by the MaxSAT encoding in Ansótegui et al. (2013) for constructing minimum covering arrays. Recent work in Hu et al. (2020) follows a similar approach with the \(m_i\) variable, which is used to relax all the constraints of the encoding. In contrast, we selectively use the relevant variable only in the constraints that require it. For completeness and for the sake of reproducibility, we describe in full detail the original SAT encoding and the extensions to the MaxSAT case, highlighting the changes from the original encoding.

We assume that: (1) the features are binary; (2) the root node is labeled as node 1 (the rest of the nodes are labeled in breadth-first order); and (3) positive (negative) answers to decision nodes are assigned to the right (left) child. In Tables 3 and 4 we show the symbols, functions, and propositional variables of the encoding.

Table 3 Symbols and functions used in the MaxSAT encoding
Table 4 Variables used in the MaxSAT encoding

We add a new variable \(\eta _i\) which represents whether node i is used. The encoding works on an upper-bounded number of nodes. If the upper bound is not optimal, some nodes will not be used. Since nodes are numbered in breadth-first order, two consecutive nodes with the same parent will either both be in the solution or both be excluded. Therefore, we define the function \(\eta (i)\) as:

$$\begin{aligned} {} \eta (i) = {\left\{ \begin{array}{ll} \eta _{i+1} &{} \textrm{if}\; i \in N^e; \\ \eta _{i} &{} \textrm{if}\; i \in N^o \end{array}\right. } \end{aligned}$$

The soft clauses in the Partial MaxSAT encoding that describe the objective function of the MPDT problem are:

$$\begin{aligned} {} \bigwedge _{i \in N^{o}}(\lnot \eta _i, 1)\qquad \qquad \qquad \qquad ({ SoftN}) \end{aligned}$$
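The following sketch illustrates how \(\eta (i)\) and the SoftN clauses could be emitted. It is not the paper's OptiLog-based implementation: it assumes a hypothetical helper eta_var(i) that returns the propositional variable index of \(\eta _i\) and uses a PySAT WCNF container.

```python
from pysat.formula import WCNF

def eta(i, eta_var):
    """eta(i): an even node i shares the usage variable of its odd sibling i + 1."""
    return eta_var(i + 1) if i % 2 == 0 else eta_var(i)

def add_soft_n(wcnf: WCNF, n, eta_var):
    """(SoftN): one unit soft clause (~eta_i, 1) per odd node i.

    The number of falsified soft clauses equals the number of used odd nodes,
    i.e. ceil(size / 2), matching the optimal cost stated in Proposition 1.
    """
    for i in range(1, n + 1, 2):
        wcnf.append([-eta_var(i)], weight=1)
```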

Now, we describe the hard constraints. To encode PB constraints we use OptiLog (Ansótegui et al. 2021).

Node usage hard clauses:

• If an odd node is not used, then neither the node itself nor the previous one is a leaf.

$$\begin{aligned} { \bigwedge _{i \in N^o\ |\ i\ > 1} \lnot \eta _i \rightarrow \lnot v_i \wedge \lnot v_{i-1} } \end{aligned}$$
(1)

• If an odd node is used, then the previous odd node is also used:

$$\begin{aligned} { \bigwedge _{i \in N^o\ |\ i\ > 1} \eta _i \rightarrow \eta _{i-2} } \end{aligned}$$
(2)

• To enforce a lower bound lb (3 by default) we add the following clause (a clause-level sketch of constraints (1)–(3) is given after this group).

$$\begin{aligned} {\eta (lb)} \end{aligned}$$
(3)
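Under the same assumptions as the previous sketch (hypothetical helpers eta_var(i) and v_var(i) returning the variable indices of \(\eta _i\) and \(v_i\), and the eta() helper defined above), constraints (1)–(3) translate into the following clauses:

```python
def add_node_usage(wcnf, n, eta_var, v_var, lb=3):
    for i in range(3, n + 1, 2):                  # odd nodes i > 1
        # (1): ~eta_i -> (~v_i and ~v_{i-1})
        wcnf.append([eta_var(i), -v_var(i)])
        wcnf.append([eta_var(i), -v_var(i - 1)])
        # (2): eta_i -> eta_{i-2}
        wcnf.append([-eta_var(i), eta_var(i - 2)])
    # (3): the node realizing the lower bound lb must be used
    wcnf.append([eta(lb, eta_var)])
```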

Tree layout hard clauses:

  • The root node is not a leaf.

    $$\begin{aligned} \lnot v_1 \end{aligned}$$
    (4)
  • (i) Leaves do not have a left child, and (ii) if node i is used in the DT, and has no left child candidates, then it is a leaf.

    $$\begin{aligned} \bigwedge _{i \in N} {\left\{ \begin{array}{ll} \bigwedge _{j \in lr(i)} (v_i \rightarrow \lnot l_{i,j}) &{} {\textrm{if}\ \vert lr(i)\vert \ne 0}\\ {\eta (i) \rightarrow v_i} &{} {\textrm{if}\ \vert lr(i)\vert = 0} \end{array}\right. } \end{aligned}$$
    (5)
  • Node j is the left child of node i iff node \(j+1\) is the right child of node i.

    $$\begin{aligned} \bigwedge _{i \in N} \bigwedge _{j \in lr(i)} (l_{i,j} \leftrightarrow r_{i,j+1}) \end{aligned}$$
    (6)
  • If a used node i is not a leaf, then it has exactly one left child.

    $$\begin{aligned} \bigwedge _{i \in N} \left( (\lnot v_i {\wedge \eta (i)}{}) \rightarrow \left( \sum _{j \in lr(i)} l_{i,j} = 1 \right) \right) \end{aligned}$$
    (7)
  • The symmetry equations on \(p_{j,i}\), \(l_{i,j}\) and \(r_{i,j}\).

    $$\begin{aligned} \begin{aligned} \bigwedge _{i \in N} \bigwedge _{j \in lr(i)} (p_{j,i} \leftrightarrow l_{i,j}) \\ \bigwedge _{i \in N} \bigwedge _{j \in rr(i)} (p_{j,i} \leftrightarrow r_{i,j}) \end{aligned} \end{aligned}$$
    (8)

    Given a pair of nodes i, j, we have that \(j \in lr(i) \cup rr(i)\), and \(lr(i) \cap rr(i) = \emptyset \). This, combined with Constraint 8, sets an equivalence between the variable \(p_{j,i}\) and one of \(l_{i,j}\) or \(r_{i,j}\), depending on whether j is even or odd. Thus, in practice, we can perform a direct substitution of the variable \(p_{j,i}\) with the corresponding one of those two variables in the encoding, and remove the \(p_{j,i}\) variable and Constraint 8 altogether. In this description, for clarity purposes, we keep the \(p_{j,i}\) variable in the encoding.

  • If node j is used in the DT (and it is not the root), it has exactly one parent (see the implied exactly-one sketch after this group of constraints).

    $$\begin{aligned} \bigwedge _{j \in N\ |\ j > 1} \left( {\eta (j) \rightarrow }{} \sum _{i={\lfloor }\frac{j}{2}{\rfloor }}^{j-1} p_{j,i} = 1 \right) \end{aligned}$$
    (9)
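Constraints (7), (9) and (15) share the pattern 'a conjunction of literals implies an exactly-one constraint'. The paper encodes the pseudo-Boolean parts with OptiLog; as a simple illustration, such an implied exactly-one can be obtained by relaxing a naive pairwise encoding with the negated antecedent literals:

```python
import itertools

def add_implied_exactly_one(wcnf, antecedent, lits):
    """(a_1 & ... & a_m) -> (exactly one of lits), with a pairwise at-most-one part."""
    relax = [-a for a in antecedent]
    wcnf.append(relax + lits)                      # at least one of lits
    for x, y in itertools.combinations(lits, 2):   # at most one of lits
        wcnf.append(relax + [-x, -y])
```

For instance, for Constraint (7) the antecedent would be \([\lnot v_i, \eta (i)]\) and lits would be the \(l_{i,j}\) variables for \(j \in lr(i)\).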

Feature assignment hard clauses:

• Feature \(f_r\) with value 0 (1) is already discriminated by node j iff feature \(f_r\) is assigned to the parent of j and node j is the right (left) child, or feature \(f_r\) with value 0 (1) is already discriminated by the parent of j.

$$\begin{aligned}&\bigwedge _{j \in N, r \in F} \left( d_{r,j}^{0} \leftrightarrow \bigvee \limits _{i={\lfloor }\frac{j}{2}{\rfloor }}^{j-1} \left( (a_{r,i} \wedge r_{i,j}) \vee (p_{j,i} \wedge d_{r,i}^{0}) \right) \right) \end{aligned}$$
(10)
$$\begin{aligned}&\bigwedge _{j \in N, r \in F} \left( d_{r,j}^{1} \leftrightarrow \bigvee \limits _{i={\lfloor }\frac{j}{2}{\rfloor }}^{j-1} \left( (a_{r,i} \wedge l_{i,j}) \vee (p_{j,i} \wedge d_{r,i}^{1}) \right) \right) \end{aligned}$$
(11)

• No feature is discriminated at the root node.

$$\begin{aligned} \bigwedge _{r \in F} \left( \lnot d_{r,1}^{0} \wedge \lnot d_{r,1}^{1} \right) \end{aligned}$$
(12)

• If some feature \(f_r\) is already discriminated by node i, then the feature cannot be assigned to any of its children.

$$\begin{aligned} \bigwedge _{j \in N, r \in F} \bigwedge \limits _{i={\lfloor }\frac{j}{2}{\rfloor }}^{j-1} ((u_{r,i} \wedge p_{j,i}) \rightarrow \lnot a_{r,j}) \end{aligned}$$
(13)

• Feature \(f_r\) is discriminated at node j iff \(f_r\) has been assigned to node j or it is already discriminated at the parent of j.

$$\begin{aligned} \bigwedge _{j \in N, r \in F} \left( u_{r,j} \leftrightarrow \left( a_{r,j} \vee \bigvee \limits _{i={\lfloor }\frac{j}{2}{\rfloor }}^{j-1} (u_{r,i} \wedge p_{j,i})\right) \right) \end{aligned}$$
(14)

• If node j is not a leaf and it is in the DT, then it has exactly one feature assigned to it.

$$\begin{aligned} \bigwedge _{j \in N} \left( (\lnot v_j {\wedge \eta (j)}{}) \rightarrow \sum _{r \in F} a_{r,j} = 1 \right) \end{aligned}$$
(15)

• If node j is a leaf, then it has no feature assigned to it.

$$\begin{aligned} \bigwedge _{j \in N} \left( v_j \rightarrow \sum _{r \in F} a_{r,j} = 0 \right) \end{aligned}$$
(16)

• For every positive (negative) example, if node j is a leaf and it is assigned to class 0 (1), then at least one feature is discriminated with the value in the example at node j (a clause-level sketch of these two constraints is given below).

$$\begin{aligned} \bigwedge _{j \in N, e_q \in \varepsilon ^{+}} \left( (v_j \wedge \lnot c_j) \rightarrow \bigvee \limits _{r \in F} d_{r,j}^{\sigma (r,q)} \right) \end{aligned}$$
(17)
$$\begin{aligned} \bigwedge _{j \in N, e_q \in \varepsilon ^{-}} \left( (v_j \wedge c_j) \rightarrow \bigvee \limits _{r \in F} d_{r,j}^{\sigma (r,q)} \right) \end{aligned}$$
(18)
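Constraints (17) and (18) become plain clauses once the implications are rewritten. A sketch, again with hypothetical variable-index helpers (v_var(j) for \(v_j\), c_var(j) for \(c_j\), d_var(r, j, b) for \(d_{r,j}^{b}\)) and sigma(r, q) returning the value of feature \(f_r\) in example \(e_q\):

```python
def add_leaf_purity(wcnf, nodes, pos_examples, neg_examples, features,
                    v_var, c_var, d_var, sigma):
    for j in nodes:
        for q in pos_examples:    # (17): a leaf labeled 0 must discriminate e_q
            wcnf.append([-v_var(j), c_var(j)] +
                        [d_var(r, j, sigma(r, q)) for r in features])
        for q in neg_examples:    # (18): a leaf labeled 1 must discriminate e_q
            wcnf.append([-v_var(j), -c_var(j)] +
                        [d_var(r, j, sigma(r, q)) for r in features])
```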

Additional Pruning hard clauses: Depending on which nodes are decided to be leaves or not, the possible list of children for a given node may change. These constraints enforce this simplification; we refer the reader to the original description (Narodytska et al. 2018). As in the previous constraints, we incorporate the \(\eta (i)\) variable so that the counters \(\tau _{t,i}\) and \(\lambda _{t,i}\) take into account only the nodes that are used in the solution.

$$\begin{aligned}&\bigwedge _{i \in N} \lambda _{0,i} \wedge \tau _{0,i} \end{aligned}$$
(19)
$$\begin{aligned}&\bigwedge _{i \in N, t \in [1, {{\lceil }\frac{i}{2}{\rceil }}{}]} ( \lambda _{t,i} \leftrightarrow ( \lambda _{t,i-1} \vee \lambda _{t-1,i-1} \wedge v_i {\wedge \eta (i)}{} ) ) \end{aligned}$$
(20)
$$\begin{aligned}&\bigwedge _{i \in N, t \in [1, i]} ( \tau _{t,i} \leftrightarrow ( \tau _{t,i-1} \vee \tau _{t-1,i-1} \wedge \lnot v_i {\wedge \eta (i)}{} ) ) \end{aligned}$$
(21)
$$\begin{aligned}&\bigwedge _{i \in N, t \in [1, {{\lceil }\frac{i}{2}{\rceil }}{}]} \lambda _{t,i} \rightarrow \lnot l_{i, 2(i-t+1)} \wedge \lnot r_{i, 2(i-t+1) + 1} \end{aligned}$$
(22)
$$\begin{aligned}&\bigwedge _{i \in N, t \in [{\lceil }\frac{i}{2}{\rceil }, i]} \tau _{t,i} \rightarrow \lnot l_{i, 2(t - 1)} \wedge \lnot r_{i, 2t - 1} \end{aligned}$$
(23)

4.1 Multilabel encoding

We extend MinDT to support multiple class labels, where the \(c_{k,j}\) variables now represent that leaf j belongs to class \(k \in Cls\). Constraint (24) generalizes constraints (17) and (18): for every example of class k, if node j is a leaf that is not labeled with class k, then at least one feature is discriminated with the value in the example at node j.

$$\begin{aligned} {} \bigwedge _{k \in Cls} \bigwedge _{j \in N, e_q \in \varepsilon ^{k}} \left( (v_j \wedge \lnot c_{k,j}) \rightarrow \bigvee \limits _{r \in F} d_{r,j}^{\sigma (r,q)} \right) \end{aligned}$$
(24)

• A decision node has no class assigned:

$$\begin{aligned} {} \bigwedge _{j \in N, k \in Cls} (\lnot v_j \rightarrow \lnot c_{k,j}) \end{aligned}$$
(25)

• A leaf has only one label:

$$\begin{aligned} {} \bigwedge _{j \in N}\left( v_j \rightarrow \sum _{k \in Cls} c_{k,j} = 1\right) \end{aligned}$$
(26)

4.2 Partial MaxSAT encoding

We assume a dataset \(\varepsilon \) and upper (ub) and lower (lb) bounds on the size of the tree. Then:

Proposition 1

Let \(PMSat_{PDT}^{\varepsilon ,n=ub,lb}\) be \(SoftN \wedge (eq.1) \wedge \ldots \wedge (eq.26)\). The optimal cost of the Partial MaxSAT instance \(PMSat_{PDT}^{\varepsilon ,n=ub,lb}\) is \({\lceil }\frac{|MPDT(\varepsilon )|}{2}{\rceil }\).

5 The Linear MaxSAT algorithm

We present a slight variation of the Linear SAT-based MaxSAT algorithm (Eén and Sörensson 2006; Le Berre and Parrain 2010) that we will use to solve the Partial MaxSAT encoding presented in Sect. 4.

Essentially, Algorithm 1 solves a (Weighted) Partial MaxSAT optimization problem through a sequence of SAT queries. It receives as input parameters a (Weighted) Partial MaxSAT formula \(\phi \) (a set of soft \(\phi _s\) and hard \(\phi _h\) clauses), an optional timeout t, and an optional upper bound. First, we create an incremental SAT solver s (line 1). Then, we add all the hard clauses \(\phi _h\) and a copy of the soft clauses, each extended with a new blocking variable \(b_i\) (line 2). Then, the upper bound ub of \(\phi \) is computed as the sum of the weights \(w_i\) in \(\phi _s\) plus one (line 3). An incremental PB encoder (an encoder whose bound k can be updated) is initialized in line 4. This PB encoder initially restricts the sum of the \(b_i\) variables and their associated weights \(w_i\) to be \(\le ub\). In line 5, we use the PB encoder to translate the initial PB constraint into CNF. While this PB constraint in combination with the hard constraints is satisfiable, we extract the model, update ub to the cost of this model, and refine the PB constraint (lines 7–12). Finally, we return the upper bound ub, the model m, and a reference to the incremental SAT solver s.

Algorithm 1 The Linear MaxSAT algorithm (pseudocode)
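A simplified Python sketch of the Linear algorithm for the unweighted (Partial MaxSAT) case is shown below. It uses PySAT's incremental SAT interface and an incremental totalizer in place of the incremental PB encoder of line 4, and it refines the bound via assumptions on the totalizer outputs; it is an illustration under those assumptions, not the implementation used in the paper.

```python
from pysat.formula import WCNF
from pysat.solvers import Glucose3
from pysat.card import ITotalizer

def linear_maxsat(wcnf: WCNF):
    """Linear SAT-UNSAT MaxSAT search for unit-weight soft clauses (cf. Alg. 1)."""
    top = wcnf.nv
    solver = Glucose3(bootstrap_with=wcnf.hard)           # lines 1-2: hard clauses
    blocks = []
    for soft in wcnf.soft:                                # line 2: soft clause + blocking var
        top += 1
        blocks.append(top)
        solver.add_clause(soft + [top])

    tot = ITotalizer(lits=blocks, ubound=len(blocks), top_id=top)   # line 4
    solver.append_formula(tot.cnf.clauses)                # line 5

    ub, model, assumptions = len(blocks) + 1, None, []    # line 3 (unit weights)
    while solver.solve(assumptions=assumptions):          # lines 7-12
        model = solver.get_model()
        ub = sum(1 for b in blocks if model[b - 1] > 0)   # cost of the current model
        if ub == 0:
            break
        assumptions = [-tot.rhs[ub - 1]]                  # refine: sum(b_i) <= ub - 1
    return ub, model, solver                              # optimum, model and SAT oracle
```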

6 Extracting multiple diverse solutions

We can obtain multiple solutions of a given optimal cost c by generating the SAT instance representing the query of whether there is a solution of cost c. Then, we execute a SAT solver in incremental mode on the SAT instance, and whenever we get a solution, we add its negation to the SAT solver and ask the solver to solve the augmented instance.

In our approach, we take advantage of the fact that the SAT instance representing the optimal solutions is already created during the execution of the Linear MaxSAT algorithm, and it may also be augmented by useful learned clauses added in previous calls to the SAT solver. In particular, it is the last satisfiable SAT instance seen by Linear MaxSAT.

To be able to access the clauses of this last satisfiable instance, we modify the Linear MaxSAT algorithm described in Algorithm 1 as follows. First, we move the statement of line 8 (\(s.add\_clauses(pb.update(ub-1))\)) inside the conditional block (\(is\_sat = True\)) of line 10, and we add as assumptions to the solver in line 9 the auxiliary variables that reify the clauses in \(pb.update(ub-1)\) (Footnote 1). At this point, the formula containing all the possible optimal solutions corresponds to s.clauses().

However, these solutions may be too similar. In our case, we expect that diverse solutions (diverse MPDTs) may help to get a more robust approach in terms of accuracy.

To enforce this diversity we will solve another MaxSAT problem. Algorithm MDSOL (Alg. 2) shows the pseudocode to extract multiple diverse solutions. As input we have the SAT formula that compiles the solutions, the number of solutions we want to extract, and the target vars on which we will enforce our diversity criterion.

Algorithm 2 MDSOL (pseudocode)

The set of hard clauses of the MaxSAT formula we iteratively solve contains the clauses of the input SAT instance and the clauses discarding the solutions extracted so far (line 8) restricted to the target variables.

The soft clauses contain, as unit clauses of cost 1, the negation of the literals (restricted to the target variables) appearing in the solutions retrieved so far (line 10). This way, we prefer solutions in which the target variables take polarities that appear less often in the previously retrieved solutions. Notice that, eventually, this approach can enumerate all the optimal solutions.
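A compact sketch of MDSOL under these definitions, using RC2 as the MaxSAT oracle (the paper solves each iteration with its own Linear MaxSAT algorithm; this is just an illustration):

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

def mdsol(clauses, k, target_vars):
    """Extract up to k diverse models of the SAT instance `clauses` (cf. Alg. 2)."""
    target_vars = set(target_vars)
    solutions, soft = [], []
    for _ in range(k):
        wcnf = WCNF()
        for cl in clauses:                       # hard: the input SAT instance
            wcnf.append(cl)
        for sol in solutions:                    # hard: discard previous solutions
            wcnf.append([-lit for lit in sol if abs(lit) in target_vars])
        for lit in soft:                         # soft: prefer flipping target literals
            wcnf.append([lit], weight=1)
        with RC2(wcnf) as solver:
            model = solver.compute()
        if model is None:                        # no further solutions exist
            break
        solutions.append(model)
        soft.extend(-lit for lit in model if abs(lit) in target_vars)
    return solutions
```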

7 Computing minimum pure decision trees

Finally, algorithm MPDT (Alg. 3) describes how we compute MPDTs for a given dataset. We randomly split the input dataset into training and selection datasets, where p is the percentage of examples assigned to the training set (line 1).

In line 3 we perform a local search on the training dataset to find an upper bound on the tree size. In particular, we use ITI (Utgoff 1989) (see Sect. 9) to find a pure decision tree (that is not guaranteed to be minimum).

Then, we create the MaxSAT encoding for computing MPDTs, described in Sect. 4 (line 4), on the training set \(\varepsilon _{tr}\) and the lower and upper bounds.

In line 5 we call the MaxSAT solver. We assume that as in the Linear MaxSAT pseudocode the MaxSAT solver returns a container for the SAT clauses compiling all the optimal solutions.

In line 6 we call MDSOL (Alg. 2), which extracts k multiple diverse MPDTs. We add as target vars the \(a_{r,j}\) vars, which represent that feature r is assigned to node j. Preliminary experiments showed that using these variables is central and that adding other variables such as \(u_{r,j}\) or \(v_i\) does not lead to better results.

We also consider a variation, MPDT_lay, inspired by Janota and Morgado (2020). In short, we identify the \(k'\) tree layouts up to a certain depth that can lead to an MPDT. Then, for each of these \(k'\) layouts, we compute \(k/k'\) solutions through the MDSOL algorithm.

From lines 8 to 10, we add to \(best_{dts}\) the MPDTs whose accuracy, evaluated on the selection set \(\varepsilon _{sel}\), is greater than or equal to the best accuracy in \(best_{dts}\) minus a deviation percentage \(\delta \) (line 10). Finally, we return the set \(best_{dts}\). In this paper, we simply take one of the MPDTs from \(best_{dts}\) at random, but approaches combining more than one MPDT could also be applied, as in other well-known machine learning techniques (e.g. random forests).

Algorithm 3 MPDT (pseudocode)
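The selection step of lines 8–10 is simple enough to be sketched directly; the minimal helper below assumes the k diverse MPDTs have already been decoded and evaluated on the selection set (accuracies given as percentages):

```python
import random

def select_best_dts(dts, sel_accuracies, delta):
    """Keep the MPDTs whose selection-set accuracy is within `delta` percentage
    points of the best one, then pick one of them at random (cf. Alg. 3)."""
    best = max(sel_accuracies)
    best_dts = [dt for dt, acc in zip(dts, sel_accuracies) if acc >= best - delta]
    return best_dts, random.choice(best_dts)
```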

8 Preprocessing datasets

We use the KBinsDiscretizer transformer from sklearn to convert the continuous variables into discrete values using 8 bins (intervals of values) with the uniform strategy (all bins have the same width). Then, we convert all the non-binary features to binary using our custom binary encoding, which applies a mapping from the original domain to a numeric domain. If the feature is ordinal, the order is kept. Then, those numeric values are represented in binary, and a new feature is created for each bit required to encode the entire domain of the original feature. Thus, for a feature with a domain D we replace it with \(\lceil \log _2(|D |) \rceil \) new features. As described in Narodytska et al. (2018), the size of the encoding grows in \({\mathcal {O}}(kn^2+mnk)\), with k being the number of features. Note that the binary encoding should have a natural advantage in encoding size compared to the typical one-hot encoding (OHE) (Footnote 2) for non-binary datasets (see Table 5).
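A sketch of the continuous-feature path of this preprocessing (the ordinal/categorical mapping is analogous); it is an illustration, not the exact code used in the paper:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def binarize_continuous(X):
    """Discretize into 8 uniform-width bins, then encode each bin index in binary,
    creating one new 0/1 feature per bit (3 bits for 8 bins)."""
    disc = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')
    X_ord = disc.fit_transform(X).astype(int)            # bin indices in 0..7
    n_bits = int(np.ceil(np.log2(8)))                    # ceil(log2(|D|)) = 3
    cols = [(X_ord[:, j] >> b) & 1
            for j in range(X_ord.shape[1]) for b in range(n_bits)]
    return np.column_stack(cols)
```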

Finally, for each dataset benchmark, we split (following the stratification strategy) the dataset into a 64% training set, 16% selection set, and 20% test set.

9 Experimental results

As benchmarks we use datasets (binary and non-binary) from the CP4IM repository (De Raedt et al. 2008), the UCI repository (Dua and Graff 2017), and the PennML repository (Romano et al. 2021; Olson et al. 2017). Table 5 shows their statistics. Our execution environment consists of a computer cluster with nodes equipped with two AMD 7403 processors (24 cores at 2.8 GHz) and 21 GB of RAM per core.

Table 5 Statistics from the datasets used in the experimental results

9.1 On MPDT runtime

The first question we address is whether the MPDT MaxSAT encoding described in Sect. 4, in combination with a MaxSAT solver, allows us to compute MPDTs faster than the MinDT approach described in Narodytska et al. (2018).

For this, we adapted MinDT and the Linear algorithm (Algorithm 1) used to solve the MaxSAT encoding to report the intermediate results found during the search process. Both of them are set to use the SAT solver CaDiCaL (Biere 2019) as the SAT oracle. Then, we compare these two approaches with each other and against the best incomplete solvers from the unweighted track of the 2021 MaxSAT Evaluation (Bacchus et al. 2021). In particular, we use the incomplete solvers SATLike-c (Lei et al. 2021), SATLike-ck (Lei et al. 2021), and TT-Open-WBO-Inc-21 (Nadel 2021).

We set a time limit of 1h and a memory limit of 20GB. We focused on large datasets (rows/features) and took 20 samples at each percentage \(S_{pctgs}=\{40\%, 60\%, 80\%\}\), for a total of 580 instances for each size. This is a relevant experiment since, if one wants to scale to large datasets as in Schidler and Szeider (2021) using a kind of Large Neighborhood Search approach, one needs to efficiently solve samples of the original dataset with a SAT-based approach. Given these limits, we are interested in which approach is able to report the best upper bound (i.e. the smallest pure tree) on the given instances. For completeness, we also ran the same experiment on the \(80\%\) samples with a timeout of 10h.

The ub for each instance is computed using the same approach as Narodytska et al. (2018), i.e. the upper bound is the size of the \(PDT(\varepsilon _{spl})\) computed with the ITI algorithm (Utgoff 1989). The lb is trivially set to 3.

Table 6 reports, for each sample size (column “Sample size”), the number of instances (column “# Best UB”) where each approach (column “Solver”) is able to report the best upper bound on the instance. Column “Encoding” indicates which encoding is used.

Table 6 Number of instances where the best upper bound was reported by each incomplete solver

As we can see, the satlike-ck solver is the best solver, while MinDT performs the worst in this experiment. This indicates that using the MaxSAT encoding allows for a faster reduction of the bounds in the MPDT problem.

Notice that MaxSAT algorithms share learnt clauses during the entire optimization process, do not require generating the SAT query at each step, and can skip suboptimal bounds. This is not the case for incremental SAT approaches such as MinDT, which can explain the differences that we observe in the results.

This experiment confirms that it is useful to solve optimization problems as MaxSAT instances rather than with an agnostic incremental SAT approach, provided the memory requirements hold.

Notice anyway that the main goal of our paper is addressed in the next experiments, where most of the runtime is consumed on extracting multiple solutions rather than on computing the first MPDT.

9.2 On MPDT accuracy

The second question we want to address is how accurate the MPDT approach described in Sect. 6 can be. We compare with the DecisionTreeClassifier model from sklearn (Scikit-learn developers 2020). For these experiments, we set a time limit of 1 day and a memory limit of 100GB.

We recall that datasets are split into 64% training set, 16% selection set, and 20% test set. We try exactly 50 different random splits of the selection and test sets, select the best DT (from those returned by the particular approach) on the selection split and report the average accuracy on the test set.

For MPDT we evaluate mpdt-1, which only generates one solution, and the variations that extract multiple solutions with no diversity enforcement (mpdt-nd) and multiple diverse solutions (mpdt). Following Janota and Morgado (2020), we also adapted our algorithm to consider the explicit layouts for the decision trees, having mpdt-lay-nd for the version that does not enforce diversity, and mpdt-lay for the diverse solutions algorithm. This experiment considers the complete version of the MPDT algorithm.

For sklearn, we evaluate three variations on the training set with as many seeds as DTs returned by the MPDT version with the most solutions. In pure sklearn (psk) we run sklearn until it achieves 100% accuracy. In limited sklearn (lsk) we restrict the DT size to the one reported by MPDT. Finally, in all sklearn (ask) we explore each possible DT size from 3 to the maximum size reached by the psk version. Notice that in ask we evaluate many more DTs than in the previous approaches. Notice also that lsk and ask may not compute pure DTs, in contrast to MPDT and psk.
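The paper does not detail how the size restriction is imposed on sklearn. One plausible sketch of the three variations, using max_leaf_nodes to cap the tree size (our assumption: a binary DT with n nodes has (n + 1) / 2 leaves), could look as follows:

```python
from sklearn.tree import DecisionTreeClassifier

def psk(X, y, seed):
    # pure sklearn: grow the tree until it fits the training set perfectly
    return DecisionTreeClassifier(random_state=seed).fit(X, y)

def lsk(X, y, seed, mpdt_size):
    # limited sklearn: cap the size to the one reported by MPDT (our assumption:
    # via the number of leaves)
    return DecisionTreeClassifier(max_leaf_nodes=(mpdt_size + 1) // 2,
                                  random_state=seed).fit(X, y)

def ask(X, y, seed, max_size):
    # all sklearn: one DT per candidate size from 3 up to the size reached by psk
    return [DecisionTreeClassifier(max_leaf_nodes=(n + 1) // 2,
                                   random_state=seed).fit(X, y)
            for n in range(3, max_size + 1, 2)]
```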

Table 7 Test accuracy score

Table 7 shows the average accuracy for each benchmark. For psk and ask we show the size of the DT (which may not be optimal). Additionally, for each dataset, we highlight in bold the approach with the best accuracy (ties are broken by DT size).

As we can see, extracting multiple diverse solutions (mpdt) helps to improve the average accuracy of the MPDT approach. Moreover, with respect to sklearn, mpdt obtains an average improvement in accuracy of 1.75%, 4.7% and 2.2% with respect to psk, lsk and ask, respectively.

Therefore, for the same size, mpdt shows a much better accuracy-interpretability trade-off than sklearn. Moreover, it is remarkable that mpdt, despite evaluating far fewer DTs than ask and using only pure DTs, is still able to obtain a better average result. It seems that minimizing the size of PDTs helps to mitigate the overfitting phenomenon. In this sense, it is interesting to explore the interactions with other properties such as the robustness of DTs (Moshkovitz et al. 2021).

Additionally, it is clear that mpdt misses some good solutions, in particular the first one used by mpdt-1. Therefore, we also studied whether any approach was able to find a DT with better accuracy on the test set that was not chosen with the selection set. In particular, mpdt obtains an average of 91.31%, psk 89.31%, lsk 86.65% and ask 90.51%.

We also compare with DT(max) (Hu et al. 2020) (Footnote 3), which uses the incomplete MaxSAT solver Loandra (Berg et al. 2020). Although Loandra is (according to the results of the MaxSAT Evaluation (Bacchus et al. 2021)) superior to the Linear MaxSAT solver we use, our approach mpdt still reports better results than DT(max) using, as in the original paper, depth limits of 2, 3 and 4 (see Sect. 3). In particular, we find improvements of 7.86%, 4.37%, and 1.69% with respect to DT(max) limited to depths 2, 3, and 4, respectively.

Finally, although in this paper we focus on the trade-off between interpretability and accuracy, we may wonder how sklearn performs in terms of accuracy on the original datasets, prior to discretizing the continuous features in the preprocessing step described in Sect. 8. We conducted this experiment, as reported in Table 8, and observed that although ask and psk (Footnote 4) improve their mean (88.10% and 88.4%, respectively) and median (92.56% and 93.31%) performance, they do not beat the mpdt approach, which additionally guarantees the minimum size of the DT. Certainly, as future work, a promising research line is to study how to better discretize the continuous features so as to improve both the size of the DT and its accuracy through the mpdt approaches.

Table 8 Test accuracy score compared to SKlearn using the binary encoding and no encoding at all vs. complete approaches with the binary encoding

10 Conclusions and future work

We have presented an efficient approach for computing MPDTs, providing new insights into the accuracy-interpretability trade-off. We have shown that by computing MPDTs for the training set, we can also get, in practice, accurate DTs on the test set compared to DT heuristic approaches in ML. This is a curious result since, technically, we overfit the training set. As future work, we will further explore the relation between the size of the tree and its purity on the training set in terms of prediction accuracy on the test set. We will also explore how to produce better multiple optimal solutions from our MaxSAT approach and combine them effectively to construct other predictors.