Rule 110 and Karnaugh Maps
2025-02-03 18:51:54 PST (last update 2025-02-03 18:52:04 PST)
Bart Massey 2025
I've given an assignment in a couple of Rust classes I've taught recently involving Cellular Automaton (CA) "Rule 110". The latest version of the assignment is at the end of this post — please take a second to read it, noting especially the transition table there.
Implementing Rule 110
The obvious way to generate a next row using Rule 110 is to walk over the current row one bit at a time, take the "left" (L), "center" (C) and "right" (R) bit relative to the current position, and use Rule 110 to compute an output bit. Note that the output bit must be placed in a new row, to avoid corrupting the row computation.
This algorithm allows various implementations of the row
datatype. A good starting place in a language with a Boolean
type and sized arrays — Rust — is to use an array of
Booleans as the row type. In Rust this looks like [bool;8]
and occupies one byte per boolean. The result is pretty
inefficient but easy to work with: for example, unlike in C,
Rust array assignment and parameter passing are by value
(copied) instead of by reference.
Once you've fixed a representation of the row, and of the
LCR values — in Rust, just [bool; 3]
— you can write a
function that, given an LCR, returns the correct Boolean;
that function finishes the algorithm. There are many ways to
accomplish this: a simple match statement (like a C switch)
is probably as easy in Rust as anything. The Rust
compiler will even notice and report errors like incomplete
or overlapping matches, which is nice for catching
copy-paste bugs.
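Concretely, such a function might look like the following sketch (the function name rule110 is mine, not from the published solution):

```rust
// Rule 110 transition: given the left, center and right cells,
// produce the center cell of the next row.
fn rule110(lcr: [bool; 3]) -> bool {
    match lcr {
        [true, true, true] => false,
        [true, true, false] => true,
        [true, false, true] => true,
        [true, false, false] => false,
        [false, true, true] => true,
        [false, true, false] => true,
        [false, false, true] => true,
        [false, false, false] => false,
    }
}
```

Because the match is exhaustive over all eight patterns, the compiler verifies that no case is missing or repeated.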
Going Bitwise
An obvious next step is to improve the performance of
everything by changing the data representation. Rows of
length up to 64 can conveniently be represented as a simple
unsigned integer: u64
in Rust, uint64_t
in modern C.
(Rust also supports u128, but since that is not a native
type on most CPUs, more than 64 bits per row might indicate
it's time to just switch to a full-on bit-vector package.)
Now some shifting and masking can get the LCR bits at each
position into the low-order position, at which point the
Rule 110 calculation can just take an integer in the range
[0..7]
to a Boolean. Indeed, this works for any CA rule:
my code contains this function
fn by_rule(rule: u8, bits: u64) -> u64 {
((rule >> (bits & 0x7) as u32) & 1) as u64
}
(Excuse the gross "as" casts in the Rust code. Rust does no
automatic numeric promotion, so getting the types right is
down to casting without screwing up: "as" casts cannot
panic, so they will do weird (deterministic) things if you get
them wrong. Not a great language feature, honestly.)
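Putting those pieces together, a bit-at-a-time bitwise implementation might look like the sketch below. It reuses by_rule from above; the helper next_row_slow, its name, and its argument order are my own invention, and it wraps around at the row ends as the assignment requires.

```rust
// Extract the rule's output bit for a 3-bit LCR neighborhood.
fn by_rule(rule: u8, bits: u64) -> u64 {
    ((rule >> (bits & 0x7) as u32) & 1) as u64
}

// Compute the next row one bit at a time, wrapping at the ends.
// `nbits` is the row width (1..=64); rows live in the low bits
// of a u64. (Helper name and shape are mine, for illustration.)
fn next_row_slow(rule: u8, cur: u64, nbits: u32) -> u64 {
    let mut next = 0;
    for i in 0..nbits {
        let l = (cur >> ((i + 1) % nbits)) & 1;
        let c = (cur >> i) & 1;
        let r = (cur >> ((i + nbits - 1) % nbits)) & 1;
        next |= by_rule(rule, (l << 2) | (c << 1) | r) << i;
    }
    next
}
```

With bit 0 as the rightmost cell, the assignment's starting row *.*..*.. is 0b10100100, and one step should give ***.**.* = 0b11101101.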
This optimization provides a substantial speedup in my application, but it still leaves a bunch on the table. For "short" rows like this, we should be able to get bitwise parallelism if we are careful. However, the only way I see to do this is rule-specific: we can hard-wire a Boolean computation using the structure of (for example) Rule 110 that produces the right answer. Let's do that.
Optimizing Boolean Functions
Rule 110 is defined by a Boolean truth table. These are familiar objects to Computer Scientists and Electrical Engineers. Let's write down the actual table, but upside down.
LCR Q
111 0
110 1
101 1
100 0
011 1
010 1
001 1
000 0
The reason for the name "Rule 110" is thus made apparent:
reading Q
from top to bottom as a binary number yields
decimal 110. (You can tell that rule name inventor Stephen
Wolfram is not a "real" computer person. The rule numbering
should have been hex — 6E here — for ease of use.)
Those who recall days gone by or work with this stuff
regularly may see that there's an obvious Boolean expression
here that will produce the right answer. Let's let the
negation of a positive literal be its lowercase: for
example, "not L
" is "l
". Then the "sum of
products" (SOP) expression for Q
is
LCr + LcR + lCR + lCr + lcR
where plus is bitwise "or" and adjacency is bitwise "and".
This is a fairly expensive expression to compute, so we'd like to see if we can simplify it to something cheaper that gives the same answer.
Boolean Algebra
One way is just using Boolean algebra: we can easily factor
L
and l
to get
L(Cr + cR) + l(CR + Cr + cR)
The extremely clever will note that we can then use Boolean "xor", which we will denote with "^" as per horrible C (and Rust) convention. (Why horrible? Because people confuse it with exponentiation all the time, and then the typechecker passes it, and then the bad thing happens.)
L(C ^ R) + l(CR + (C ^ R))
The next steps go as follows
L(C ^ R) + lCR + l(C ^ R)
(L + l)(C ^ R) + lCR
(C ^ R) + lCR
This is a boolean expression we can compute fairly cheaply: it requires an "xor", an "or", two "and"s and a "not" for a total of 5 operations.
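A derivation like this is easy to machine-check by brute force over all eight inputs. A sketch (the helper name is mine):

```rust
// Verify that the simplified expression (C ^ R) + lCR matches the
// original sum-of-products form of Rule 110 for all eight inputs.
fn simplification_ok() -> bool {
    (0..8u8).all(|i| {
        let l = i & 4 != 0;
        let c = i & 2 != 0;
        let r = i & 1 != 0;
        // LCr + LcR + lCR + lCr + lcR
        let sop = (l && c && !r) || (l && !c && r) || (!l && c && r)
            || (!l && c && !r) || (!l && !c && r);
        // (C ^ R) + lCR
        let simplified = (c ^ r) || (!l && c && r);
        sop == simplified
    })
}
```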
Karnaugh Maps
By using Boolean algebra we got from
LCr + LcR + lCR + lCr + lcR
to
(C ^ R) + lCR
which is quite an improvement. But it was also a fairly involved process with many fiddly steps to get wrong.
The Karnaugh Map is a well-known device in Electrical Engineering, but maybe less well-known in Computer Science: a graphical method of quickly simplifying a Boolean expression.
(Here we will work with our specific 3-literal SOP expression. Karnaugh Maps can be made for any number of literals, but they get hard to reason with as the number of literals becomes large. They also can be used to simplify product-of-sums (POS) expressions.)
The Karnaugh Map for our initial expression might look like this:
CR Cr cR cr
L 0 1 1 0
l 1 1 1 0
We have done the ordering "backward" because the original truth table is backward: this is not normal, but it doesn't matter to the method. We have also chosen to group C and R: any grouping will work, although changing the representation may make salient features more or less graphically obvious.
Now, the heart of the Karnaugh Map method is to try to create large rectangles that cover all-ones areas. Each rectangle corresponds to a min-term in the simplified SOP formula.
Let's start with the large rectangle in the center, starring each element.
CR Cr cR cr
L 0 *1 *1 0
l 1 *1 *1 0
This could give us two min-terms either vertically or
horizontally. However, the graphical view gives us an easy
way to spot what we spotted before: we will add the combined
term C ^ R
. Now we have covered all the one-bits but the
lower-left one: we will cover that with a second 1×1 rectangle,
getting
CR Cr cR cr
L 0 1 1 0
l *1 1 1 0
This gives us a second min-term lCR
, so our combined
expression is as before.
(C ^ R) + lCR
That was way easier. Our expression still has a cost of 5 operations.
But wait — there's more! In that last step we violated our rule of making big rectangles. Nothing says the rectangles can't overlap. Making the 1×3 rectangle
CR Cr cR cr
L 0 1 1 0
l *1 *1 *1 0
gives us l(C + cR), which isn't as cheap
as before. (In general, rectangles whose sides are not
powers of two are expensive: they can't be read off as a
single product term.) But making the 1×2 rectangle
CR Cr cR cr
L 0 1 1 0
l *1 *1 1 0
gives us lC
which is a 20% savings! Our expression
costs 4 now:
(C ^ R) + lC
This seems counterintuitive, so we should probably check it with a truth table.
LCR C^R lC Q
111 0 0 0
110 1 0 1
101 1 0 1
100 0 0 0
011 0 1 1
010 1 1 1
001 1 0 1
000 0 0 0
Huh. Works perfectly.
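We can also check it executably against the generic by_rule function from earlier, over 0/1-valued integers just as the bitwise row code will eventually use. A sketch; the helper name is mine:

```rust
// The generic rule evaluator from earlier in the post.
fn by_rule(rule: u8, bits: u64) -> u64 {
    ((rule >> (bits & 0x7) as u32) & 1) as u64
}

// Check the four-operation formula (C ^ R) + lC against Rule 110
// for every LCR value. Note that !l flips all 64 bits, but since
// we AND the result with a 0/1 value, only the low bit matters.
fn formula_matches_rule110() -> bool {
    (0..8u64).all(|bits| {
        let l = (bits >> 2) & 1;
        let c = (bits >> 1) & 1;
        let r = bits & 1;
        ((c ^ r) | (!l & c)) == by_rule(110, bits)
    })
}
```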
Could we do better? Notice that Q
has more 1 bits than 0 bits.
Maybe we should pay for a negation and work with the map
for q
?
CR Cr cR cr
L 1 0 0 1
l 0 0 0 1
This leads to q
as
cr + LCR
This has 7 operations counting the invisible negation, so not as good.
Since we need Q
we could try to deMorgan-ize and factor to
get
!(cr) !(LCR)
(C + R) (l + c + r)
Cl + Cr + lR + cR
l(C + R) + (C ^ R)
5 operations, but we ended up doing some fiddly Boolean algebra again. Not an improvement.
Bitwise Parallelism
Now we have a really nice Boolean formula for Rule 110. Yay. Remember that we started with a really nice formula generalized over all rule numbers. What does the specialization buy us?
Well, now we can use "bitwise parallelism" to generate all
the bits of the next row at once. Notice that R[i]
is just
(C << 1)[i]
and L[i]
is (C >> 1)[i]
. Soo…
let left = cur >> 1;
let right = cur << 1;
let next = (cur ^ right) | (!left & cur);
Note that Rust uses !
for bitwise negation of integer
types. Wouldn't have been my choice, but it's fine.
We've computed a whole next row of up to 64 bits at once. This is nice. Really nice.
There's only one remaining problem. The end positions will
not be computed correctly. Right now we're shifting 0 into
the right
value, and whatever was lying around in the
next-upper bit into the left
. Then we're negating left
without consideration of how many bits actually need to
flip, essentially trashing the left bit for the next
iteration.
Wrapping Around
Recall that we are to "wrap around" at the ends. Let's add a
little masking and rotate action. Let nbits
be the number
of least-significant bits that are supposed to be part of
the row. We can rotate right by shifting right and moving
the left bit back around. Similarly, we can rotate left by
shifting left and moving the right bit back around. The mask
at the end ensures that there's a 0 bit left of the
significant bits on the next cycle.
let left = (cur >> 1) | (cur << (nbits - 1));
let right = (cur << 1) | (cur >> (nbits - 1));
let mask = u64::MAX >> (64 - nbits);
let next = ((cur ^ right) | (!left & cur)) & mask;
The Rust compiler, when run in the default debug mode,
inserts integer overflow assertions in the code. (Honestly,
these should be on by default in optimized "release"
compiles as well. Sigh.) That is also why the mask is
computed with a right shift rather than as the tempting
(1u64 << nbits) - 1: when nbits == 64, the left shift itself
overflows, panicking in debug mode and silently producing
the wrong mask in release mode.
With the code above we can apparently produce a new row of up to 64 bits in 15 cycles. Of course, that's a really gross approximation to real performance. The calculation assumes that everything can be in 64-bit registers all the time — not unreasonable on a modern "big" CPU. The calculation also doesn't take into account cycle times: the integer operations may issue at two or more per cycle depending on… things. But still…
Testing
We have done a lot to get to this point. We should probably start by comparing with a known-good implementation to make sure that our fiddly bit-wiggling works. I'll check the first 10 rows of the 8-cell table from our assignment — that should be good enough. (Remember the assignment? Seems like a long time since we started there.) One moment, please.
Ok, full disclosure. The version of the row operation given above is not what I started with. I made one masking bug and missed one optimization in the first version. This is why we test: the core idea was sound, but my code was not. Fixed.
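Here is the kind of cross-check I mean, sketched: a naive per-bit reference implementation run side by side with the bitwise-parallel version. The function names and the particular widths are mine, not from the published solution.

```rust
// Naive reference: next row computed one bit at a time from the
// Rule 110 truth table, wrapping at the ends. `nbits` is 1..=64.
fn next_row_naive(cur: u64, nbits: u32) -> u64 {
    let mut next = 0u64;
    for i in 0..nbits {
        let l = (cur >> ((i + 1) % nbits)) & 1;
        let c = (cur >> i) & 1;
        let r = (cur >> ((i + nbits - 1) % nbits)) & 1;
        // Rule 110: 0 for LCR = 111, 100, 000; 1 otherwise.
        let q = match (l, c, r) {
            (1, 1, 1) | (1, 0, 0) | (0, 0, 0) => 0,
            _ => 1,
        };
        next |= q << i;
    }
    next
}

// Bitwise-parallel version, as derived in the post.
fn next_row_fast(cur: u64, nbits: u32) -> u64 {
    let left = (cur >> 1) | (cur << (nbits - 1));
    let right = (cur << 1) | (cur >> (nbits - 1));
    let mask = u64::MAX >> (64 - nbits);
    ((cur ^ right) | (!left & cur)) & mask
}

// Run both versions side by side; true if they always agree.
fn versions_agree(nbits: u32, start: u64, steps: u32) -> bool {
    let mask = u64::MAX >> (64 - nbits);
    let mut cur = start & mask;
    for _ in 0..steps {
        let slow = next_row_naive(cur, nbits);
        let fast = next_row_fast(cur, nbits);
        if slow != fast {
            return false;
        }
        cur = fast;
    }
    true
}
```

Checking a few widths, including the degenerate width 1 and the full width 64, catches exactly the sort of end-of-row masking bugs described above.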
Benchmarking and Performance
The whole point of all this fiddliness was, in some sense, to get a really fast implementation of Rule 110. Let's see how we did.
I have two branches in my current implementation. One uses bit representation but does the per-bit looping thing. The other uses our new technique. (I haven't published the solution source code yet. It will soon be available along with the full assignment as https://github.com/pdx-cs-rust/hw-rule110.)
The results? The old method (already faster) makes 100M rows in about 3.36s. The new method makes 100M rows in 0.15s. Roughly a 22× speedup. 1.46s for 1B rows, about 685M rows per second on a single core of my 2.2GHz machine.
But wait — there's one more set of tricks to apply. I can
set up rustc
to compile for my real "modern" CPU (AMD
Ryzen 9 3900X) instead of some least-common-denominator
x86_64
. I can also turn all the rustc
optimizations I
know about up to 11. The result? Got 1B rows in 1.42s! Yeah,
compiler don't care.
After applying the same optimizations to the previous version, I found 3.07s and 0.14s for 100M rows, respectively — our speedup is about the same. The compiler can do a lot more with the "slow" version, apparently, because there's more code there to work with.
One final idea: this benchmark is somewhat specific. The row width shouldn't matter much to the fast version, but it matters a lot to the old version. I chose 8 bits just because that is what we started with. Let's try a couple more widths on 100M rows just to verify a bunch of assumptions. With width 1, I get 0.75s and 0.14s respectively. Still a respectable 5× speedup in the worst ("real") case. With width 64, I get 21.9s and 0.14s. That's kind of a big deal.
We have won.
Future Work
Battles don't stay won. There are things we could do to be stronger, faster, better than before.
- Bit parallelism is the best parallelism. If you want to go wider than 64-bit rows, though, you'll probably want to investigate the other kind. The trick here, of course, is the boundaries: some communication between cores is needed on each cycle. This can slow you down massively. However, because end corruption can only travel inward at one bit per row, you can do some batching and fix things up later. It all gets fiddly and really hard to make fast.
- The manual process we went through to get the Rule 110 map seems automatable. Some kind of nice Boolean formula optimizer seems in order. Since there are only 256 rules (some of which are utterly boring) we could pre-compute a Boolean formula for each and hard-code it into a table. Alternatively, we could find some reasonably fast optimizer and do this on the fly. We really should handle arbitrary rules.
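As a tiny first step toward that, emitting the unsimplified SOP formula for any rule is mechanical; an optimizer would then chew on the resulting terms (with, say, Quine–McCluskey). A sketch, with my own naming and output format:

```rust
// Produce the unsimplified sum-of-products formula for a given
// elementary CA rule, using the post's convention that a lowercase
// letter is a negated literal. An always-zero rule yields "0".
fn sop_formula(rule: u8) -> String {
    let mut terms = Vec::new();
    for bits in 0..8u8 {
        if (rule >> bits) & 1 == 1 {
            let l = if bits & 4 != 0 { 'L' } else { 'l' };
            let c = if bits & 2 != 0 { 'C' } else { 'c' };
            let r = if bits & 1 != 0 { 'R' } else { 'r' };
            terms.push(format!("{}{}{}", l, c, r));
        }
    }
    if terms.is_empty() {
        "0".to_string()
    } else {
        terms.join(" + ")
    }
}
```

For rule 110 this produces the same five terms we started from (in ascending LCR order rather than the table's descending order).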
Conclusions
I set out to write a short little thing to explain to those interested what I was doing today. I would estimate that I ended up spending about four hours writing and about two hours actually doing the thing. This is not an unusual ratio for me, and I think for others.
The good news is that I ended up with some interesting insights for myself (even on this well-traveled road). Hopefully you learned a thing or two as well.
Acknowledgments
CA Rule 110 was first introduced to me in Fall 2008 when I was teaching a Multicore Computing class. (I ended up losing to only a couple of students from the class of 20 or so, pitting some forgotten variant of my single-threaded implementation described in this article against their 16-core things.) Thanks much to the student who brought this problem to the class.
Keith Packard and others taught me the art of bit-banging, long ago. Thanks for that! Also thanks to Hank Warren's Hacker's Delight, the one true bit-parallel book.
Appendix: Rule 110 Assignment
Bart Massey 2025
Background
In this exercise you will create an ASCII rendering of the Rule 110 Elementary Cellular Automaton (CA). Not as scary as it sounds, and a nice simple Rust thing to play with.
A CA starts with a random row of bits. We will represent 1
bits with *
and 0 bits with .
. Our starting row will
specifically be this 8-bit pattern for now.
*.*..*..
To produce the next row, take bits three at a time, "wrapping around" if a boundary is hit (position 7 is next to position 0 and so forth). A group of three bits in row n will determine the center bit position in row n + 1 according to Rule 110:
111 → 0
110 → 1
101 → 1
100 → 0
011 → 1
010 → 1
001 → 1
000 → 0
So for our start, the first two rows will be
*.*..*..
***.**.*
(Notice the wraparound at the beginning and end.)
The Program
Write a program that prints ten rows starting with the two given.
Challenges
(Challenges are not required, but give a chance to do something extra. Turn them in and write them up in the README.)
- Make your program take a command-line argument for the starting row.
- Time your program. See how fast you can make it go. Compare performance with a C or C++ implementation. You will probably have to modify the program to print a lot more rows: maybe some number given by a command-line argument.