All Different Variants

In the earlier posts I’ve described some of the basic ideas behind CSPs:

  • how the problems are represented using variables and constraints,
  • how the solver searches for a solution by building up the search tree,
  • how heuristics guide the search, and
  • how constraint propagation helps eliminate parts of the search space that will not contain solutions.

In this post I will look at how modeling the problem itself can have a significant influence on how quickly the solver is able to find a solution. In particular, I’ll talk about different ways of implementing the all-different constraint, and look at the various implementations’ performance characteristics.

An interesting thing about CSPs is that for a given problem type there is often more than one way to model it. For example, when I spoke about modeling the N-queens problem I said that one way to prevent queens from attacking each other diagonally is to ensure that if any pair of queens are connected by a line the slope of that line isn’t 1 or -1. Another way to achieve the same result is to enumerate diagonals of the NxN chessboard, and make sure that all queens are assigned to different diagonals (each queen will claim two diagonals - one from bottom left to top right and another from top left to bottom right). Both representations are totally valid, but the performance of search will vary depending on how the problem is modeled, and how the underlying constraints are implemented.
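
To make the diagonal-enumeration idea concrete, the two diagonals a cell lies on can be computed from its coordinates with simple arithmetic. Here is a small sketch; the function is mine, purely for illustration, and not code from the N-queens post:

// diagonals returns the indices of the two diagonals that square (row, col)
// of an NxN board lies on, each normalized to the range 0 .. 2n-2. Two queens
// attack each other diagonally exactly when they share either index.
func diagonals(row, col, n int) (int, int) {
	return row + col, row - col + n - 1
}

With this indexing, requiring all queens to sit on different diagonals within each of the two families is an all-different constraint over these indices.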

In fact, the “all diagonals must be different” constraint can be implemented by having binary constraints, one for each pair of queens, that ensure the queens don’t attack each other diagonally, which is exactly how I described the problem originally. Importantly though, the all-different constraint could be implemented very differently.

Before getting too far ahead, let’s formally define what an all-different constraint means. Recall that a CSP itself is composed of variables, each with a domain of possible values, and constraints on those variables. An all-different constraint may be defined on any subset $C_V$ of the problem’s variables, and it is satisfied if the following condition holds: $$ \forall u, v \in C_V, u \ne v: value(u) \ne value(v) $$

From this representation, it’s easy to see why it can be tempting, and totally valid, to model this constraint just as a set of binary not-equals constraints.

In my solver all constraints implement the following interface:

type Constraint interface {
  Variables() []*Variable
  Supported(v *Variable, val int) bool
}

When the solver applies constraint propagation it checks each constraint by iterating over its Variables() and checking whether the remaining values in each variable’s live domain are still Supported() by the constraint. When they are not, they are removed, since no solution can exist with that variable-value assignment. You can see how constraint propagation works in the earlier post.
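
As a rough sketch of that propagation loop, written against the interface above (ForEachValue mirrors the iteration used later in this post, while Remove is an assumed stand-in for however the solver actually shrinks a live domain):

// pruneUnsupported removes values a constraint no longer supports from the
// live domains of the constraint's variables.
func pruneUnsupported(c Constraint) {
	for _, v := range c.Variables() {
		var unsupported []int
		v.LiveDomain().ForEachValue(func(val int) bool {
			if !c.Supported(v, val) {
				unsupported = append(unsupported, val)
			}
			return true // keep iterating over the live domain
		})
		for _, val := range unsupported {
			v.LiveDomain().Remove(val) // assumed removal method, not the real API
		}
	}
}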

There are many ways to enforce this constraint. Let’s look at a bunch.

We’ve already covered the simplest, which boils down to taking each pair of variables and adding binary not-equal constraints. This is exactly what we did with N-queens, and results in a quadratic number of binary constraints. One big benefit of this approach is just how easy it is to understand and implement.

func buildBinaryAllDiff(vs []*model.Variable) []model.Constraint {
  cs := []model.Constraint{}
  for i := 0; i < len(vs); i++ {
    for j := i + 1; j < len(vs); j++ {
      cs = append(cs, binary.NewNotEquals(vs[i], vs[j]))
    }
  }
  return cs
}
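
binary.NewNotEquals itself isn’t shown in this post; against the interface above, a minimal version could look roughly like this (the struct name and method bodies are my sketch, not the actual package code):

type notEquals struct{ a, b *model.Variable }

func NewNotEquals(a, b *model.Variable) model.Constraint {
	return &notEquals{a: a, b: b}
}

func (c *notEquals) Variables() []*model.Variable {
	return []*model.Variable{c.a, c.b}
}

// A value is unsupported only when the other variable is already pinned to
// that exact value, i.e. it has a single live value left and it matches.
func (c *notEquals) Supported(v *model.Variable, val int) bool {
	other := c.a
	if v == c.a {
		other = c.b
	}
	return !(other.LiveLeft() == 1 && other.LiveDomain().LastLiveValue() == val)
}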

Rather than defining many binary constraints, we can do a bit better by defining a single constraint that tracks variables with one live value left and disallows assigning that value to any of the other variables.

type allDiffBasic struct { vs []*model.Variable }

func (a *allDiffBasic) Variables() []*model.Variable { return a.vs }

func (a *allDiffBasic) Supported(supV *model.Variable, supVal int) bool {
  for _, other := range a.vs {
    if supV == other {
      continue
    }
    if other.LiveLeft() == 1 && other.LiveDomain().LastLiveValue() == supVal {
      return false
    }
  }
  return true
}

This one is a little bit more clever by being incremental and tracking state changes - whenever a variable is assigned, we track which value is “claimed” and disallow other variables from taking that value, and when a variable is unassigned, we allow other variables to be assigned that value again. You can see that happening in the Set() and Cleared() methods below, which are part of the incremental interface (not shown).

This is very similar in spirit to “Basic”, but doesn’t require explicitly checking every variable whenever we’re testing whether the constraint is violated. The reason I call this type of implementation sparse is that it works well when we have many values to choose from. This is the implementation I use for the crossword builder, for ensuring words are not repeated - there are very many words in the dictionary to choose from… but more on this another time.

type allDiffSparse struct {
  vs []*model.Variable
  claimedValues map[int]bool
}

func (a *allDiffSparse) Variables() []*model.Variable { return a.vs }

func (a *allDiffSparse) Supported(supV *model.Variable, supVal int) bool {
  return !a.claimedValues[supVal]
}

func (a *allDiffSparse) Set(v *model.Variable, value int) {
  a.claimedValues[value] = true
}

func (a *allDiffSparse) Cleared(v *model.Variable, prevValue int) {
  delete(a.claimedValues, prevValue)
}

This implementation is based on the Pigeonhole principle, which, to quote Wikipedia, states that if n items are put into m containers, with n > m, then at least one container must contain more than one item. In other words, if we ever get to a state where we have fewer values left to choose from than unassigned variables, we know that the all-different constraint cannot be met. For example, this implementation can detect that if we have three variables, each with a domain of {1, 2}, no solution is possible. The more naive implementations would not. So while they all will enforce the constraint given a full assignment, this one can detect infeasibility earlier, saving us from pointlessly trying to find a solution when none exists.

type allDiffPigeonHole struct { vs []*model.Variable }

func (a *allDiffPigeonHole) Variables() []*model.Variable { return a.vs }

func (a *allDiffPigeonHole) Supported(supV *model.Variable, supVal int) bool {
  otherVals := map[int]bool{}
  for _, other := range a.vs {
    if supV == other {
      continue
    }
    atLeastOneSupport := false
    other.LiveDomain().ForEachValue(func(value int) bool {
      if value != supVal {
        otherVals[value] = true
        atLeastOneSupport = true
      }
      return true
    })
    if !atLeastOneSupport {
      return false
    }
  }
  return len(otherVals)+1 >= len(a.vs)
}

This implementation is the most thorough, because it performs a bipartite matching from variables on one side to possible values on the other, and only succeeds when a matching that covers every variable exists. In other words, given a partial assignment of values to variables, this constraint will only remain feasible when a possible all-different assignment still exists. An overly simple example to demonstrate that this implementation is more powerful than the pigeonhole one is this:

  • A problem with 3 variables: X, Y and Z.
  • X must be 1, Y must be 1, but Z can be either 2 or 3.
  • The pigeonhole implementation will say there are 3 values and 3 variables, so a solution might exist, but clearly there is no bipartite matching that covers all three variables, as X and Y would have to be equal, so the more sophisticated bipartite implementation will flag this as an infeasible state.
  • As a side-note, while pigeonhole wouldn’t detect infeasibility in this example, all of the simpler constraints would. So it’s not true that pigeonhole is strictly stronger than the simpler implementations, but it is true that the bipartite one is.

There are ~250 lines of code for this one, so I won’t include them verbatim, but in short this implementation uses the Hopcroft-Karp maximum matching algorithm (porting an existing implementation to Go) and adds incremental state tracking, e.g. by removing edges from the matching as variables are assigned. When the constraint is re-evaluated via Supported(), this allows the check to start from a partial, and often mostly complete, matching rather than from scratch.
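
To give a flavor of the matching idea without the full ~250 lines, here is a stripped-down, non-incremental feasibility check. It uses a simple augmenting-path matching rather than Hopcroft-Karp and plain slices rather than the solver’s types, so it only illustrates the principle, not the actual constraint:

// feasible reports whether every variable can be matched to a distinct live
// value, i.e. whether an all-different assignment is still possible.
// domains[i] is the set of live values of variable i.
func feasible(domains [][]int) bool {
	matchOf := map[int]int{} // value -> index of the variable currently matched to it

	var tryMatch func(i int, seen map[int]bool) bool
	tryMatch = func(i int, seen map[int]bool) bool {
		for _, val := range domains[i] {
			if seen[val] {
				continue
			}
			seen[val] = true
			owner, taken := matchOf[val]
			// Take the value if it is free, or if its current owner can be re-matched elsewhere.
			if !taken || tryMatch(owner, seen) {
				matchOf[val] = i
				return true
			}
		}
		return false
	}

	for i := range domains {
		if !tryMatch(i, map[int]bool{}) {
			return false // no augmenting path: some variable cannot claim a distinct value
		}
	}
	return true
}

On the X/Y/Z example above, feasible([][]int{{1}, {1}, {2, 3}}) returns false: the value 1 cannot be matched to both X and Y, so no all-different assignment exists.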

Cool, so we’ve got the implementations. We know that ultimately they’ll all get us to a valid sudoku, but what is their performance like? Let’s look!

We’ll be looking at generalized NxN sudokus, as they are effectively one big glob of variables, one per sudoku cell, with all-different constraints enforcing horizontal, vertical and block uniqueness. We’ll search for the first randomly encountered solution, and to do so we’ll be using a solver with randomized variable and value ordering heuristics, applying forward checking after every assignment. We’ll be looking at sudokus of sizes 16x16, 25x25 and 36x36. You can also try most of these online.

We’ll cover two basic metrics - the time taken by the solver and the number of steps it took to reach the first solution. End users will typically only care about the time taken, but the number of steps can give us some insights into how our constraints work.

We’ll be using Go’s benchmarking tests to evaluate the performance of these implementations. Here’s what this benchmark/test looks like:

const timeout = 10 * time.Minute

type experiment struct {
  name string
  // For my sudokus "size" is the size of a sub-square, so 
  // "size = 3" results in a regular 9x9 sudoku, and so on.
  size int
  kind alldiff.AlgoKind
}

var experiments = []experiment{
  {name: "basic", size: 4, kind: alldiff.Basic},
  {name: "binary", size: 4, kind: alldiff.Binary},
  {name: "sparse", size: 4, kind: alldiff.Sparse},
  {name: "pigeonhole", size: 4, kind: alldiff.Pigeonhole},
  {name: "bipartite", size: 4, kind: alldiff.Bipartite},

  {name: "basic-lg", size: 5, kind: alldiff.Basic},
  {name: "binary-lg", size: 5, kind: alldiff.Binary},
  {name: "sparse-lg", size: 5, kind: alldiff.Sparse},
  {name: "pigeonhole-lg", size: 5, kind: alldiff.Pigeonhole},
  {name: "bipartite-lg", size: 5, kind: alldiff.Bipartite},

  {name: "basic-vlg", size: 6, kind: alldiff.Basic},
  {name: "binary-vlg", size: 6, kind: alldiff.Binary},
  {name: "sparse-vlg", size: 6, kind: alldiff.Sparse},
  {name: "pigeonhole-vlg", size: 6, kind: alldiff.Pigeonhole},
  {name: "bipartite-vlg", size: 6, kind: alldiff.Bipartite},
}

func solveSudoku(e experiment) (bool, int) {
  sudoku := GeneralizedSudoku(e.size, map[string]int{}, e.kind)
  config := search.Config{
    VariableHeuristic: search.NewDomWDegVariableHeuristic(sudoku),
    ValueHeuristic:    search.NewRandomValueHeuristic(time.Now().UTC().UnixNano()),
  }
  s, _, _ := search.NewSearch(sudoku, config)
  ctx, cancel := context.WithTimeout(context.Background(), timeout)
  defer cancel()
  var solved = s.Run(ctx) != nil
  return solved, s.Steps()
}

func BenchmarkSudoku(b *testing.B) {
  for _, e := range experiments {
    f, _ := os.OpenFile("benchmarks/"+e.name+".dat", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
    w := bufio.NewWriter(f)
    b.Run(e.name, func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        start := time.Now()
        solved, steps := solveSudoku(e)
        timeTaken := time.Since(start).Microseconds()
        w.WriteString(fmt.Sprintf("%v %v %s\n", timeTaken, steps, strconv.FormatBool(solved)))
      }
    })
    w.Flush()
    f.Close()
  }
}

The performance data is then generated via:

go test -bench=BenchmarkSudoku -benchtime=500x -run=^# -timeout=24h

and plotted with gnuplot. I generated 1000 instances for 16x16, 500 for 25x25, and 200 for 36x36 graphs.
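
I won’t reproduce the gnuplot script, but as an illustration of how the raw data turns into a CDF (this is a hypothetical standalone helper, not the actual plotting setup), something like the following converts one of the .dat files into sorted (time, cumulative fraction) pairs:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

// Reads "timeMicros steps solved" lines from one of the benchmark .dat
// files and prints "timeMicros cumulativeFraction" pairs for a CDF plot.
func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var times []int64
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var t int64
		var steps int
		var solved bool
		if _, err := fmt.Sscan(scanner.Text(), &t, &steps, &solved); err == nil {
			times = append(times, t)
		}
	}

	sort.Slice(times, func(i, j int) bool { return times[i] < times[j] })
	for i, t := range times {
		fmt.Printf("%d %.4f\n", t, float64(i+1)/float64(len(times)))
	}
}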

We’ll look at steps taken first, because there is a clear winner - as one might expect, the most sophisticated bipartite implementation wins hands-down. In fact, if you try it online with lexicographic ordering you’ll notice that it rarely needs to backtrack. Even with randomized ordering it backtracks far less than the alternatives.

Here is the performance of the five implementations for 16x16 sudokus (all graphs in this post are cumulative distribution functions):

Bipartite takes the fewest steps, followed closely by pigeonhole. Basic is in third, followed by binary, with sparse coming in last. These last three don’t perform particularly great in terms of steps taken. One interesting bit is that basic takes slightly fewer steps than sparse - I think this is because it is slightly more aggressive at removing values, as it considers all variables with only one live value left, rather than just assigned variables like sparse does (so it can cover cases where a variable is not assigned yet, but its live domain only has one value left).

The time-taken graph looks pretty different from steps taken, eh? Let’s see what’s going on here:

  • Sparse, the slowest in terms of number of steps, performs really well when it comes to time. Why? There are a couple of reasons: the speed of evaluating a given constraint, and the fact that we only have one constraint for each set of 16 cells. It’s fast because it incrementally maintains state, and when asked if a value is still viable it just checks whether that value has already been taken. And, unlike binary constraints, where we’d need 120 binary constraints to represent all-difference of 16 cells (15 * 16 / 2 = 120), there’s only one constraint to check. 16x16 sudokus also appear to be under-constrained, so this fast, simple implementation does well.
  • Basic comes in next. This isn’t too surprising given it is similar to sparse, but not incremental, and hence a bit slower, despite taking fewer steps.
  • Bipartite follows. It’s sophisticated, but slower than the simpler implementations above. It is incremental though, which is important in that when the state changes, the re-evaluations are generally quick.
  • Binary is second to last for most percentiles (and in fact it is the worst at high percentiles). Here, we have the benefits of “simple and fast” evaluation, but we have more constraints to evaluate, and this ultimately hurts the runtime, despite the simplicity.
  • Pigeonhole is the slowest of the bunch, most likely because it is not incremental, so every time it is evaluated, the evaluation is somewhat expensive.

Well, our sophisticated implementation didn’t exactly kick butt in terms of time on the smaller instances. Let’s look at 25x25 sudokus next. Here, the less sophisticated variants start getting rather slow, so to keep things simple I’m capping their runtime at 10 minutes. The timeouts aren’t too frequent yet:

  • Basic: 6/500 runs timed out.
  • Binary: 19/500 runs timed out.
  • Sparse: 4/500 runs timed out.

Looking at steps taken, as before, we can see that the bipartite implementation is far smarter and takes way fewer steps than the others. Pigeonhole is second best, as before. This isn’t too surprising, but perhaps what is interesting is how quickly the simpler implementations start to degrade.

Looking at the runtime reinforces that story - the less sophisticated implementations complete about an order of magnitude faster when they’re “lucky”, but many orders of magnitude slower when they are not (say, at the 95th percentile). So, while on average a couple of these implementations are competitive, on occasion we can hit particularly bad runs that take many minutes to complete… and binary, the simplest of them all, is clearly struggling to keep up. In contrast, the bipartite implementation is very steady, with basically all of the runs completing in 1.5 - 2 seconds. Pigeonhole isn’t quite as fast as bipartite, but still tends to hold up well against the simpler implementations. The runtime of its top few percentiles does get significantly worse though.

The even larger, 36x36 set of sudokus is where we can see that the simpler approaches start consistently timing out after 10 minutes:

  • Basic: 25/30 runs timed out.
  • Binary: 28/30 runs timed out.
  • Sparse: 24/30 runs timed out.

These results make sense - from the smaller instances we already noticed that all three of these tend to require many more steps; binary suffers from having too many fine-grained constraints to evaluate, basic is non-incremental, and sparse is only a slight improvement over basic. So none of them do well: binary is worse than the other two, while the incrementality of sparse over basic may be helping slightly.

Only bipartite and pigeonhole implementations remain viable, with bipartite continuing to be in the lead:

In short, bipartite is clearly starting to win, and the trend continues for larger instances, based on a handful of ad hoc experiments.

As with many things, it comes down to trade-offs. In this case, choosing the right implementation of an all-different constraint comes down to choosing between simplicity, speed, and consistent performance. Determining which of these is more important to you can help you choose the right implementation:

  • Simpler implementations are easy to understand and cheap to evaluate, and can be “good enough” for smaller problems, but they have unstable run times and don’t scale well.
  • More sophisticated implementations are the opposite: they’re harder to understand and implement, and don’t provide a ton of value on smaller problems, but they truly shine on large problems, and have consistent performance characteristics to boot.

Lastly, it’s worth noting that more efficient algorithms for all-different constraints and sudokus are out there, and I’ve yet to experiment with them. Regin’s all-different algorithm and Knuth’s Dancing links algorithm are two examples of these, respectively.