Rule 110 and Karnaugh Maps
2025-02-03 18:51:54 PST (last update 2025-02-03 18:52:04 PST)
Bart Massey 2025
I've given an assignment in a couple of Rust classes I've taught recently involving Cellular Automaton (CA) "Rule 110". The latest version of the assignment is at the end of this post — please take a second to read it, noting especially the transition table there.
Implementing Rule 110
The obvious way to generate a next row using Rule 110 is to walk over the current row one bit at a time, take the "left" (L), "center" (C) and "right" (R) bit relative to the current position, and use Rule 110 to compute an output bit. Note that the output bit must be placed in a new row, to avoid corrupting the row computation.
This algorithm allows various implementations of the row
datatype. A good starting place in a language with a Boolean
type and sized arrays — Rust — is to use an array of
Booleans as the row type. In Rust this looks like [bool;8]
and occupies one byte per boolean. The result is pretty
inefficient but easy to work with: for example, unlike in C,
Rust array assignment and parameter passing are by value
(copied) instead of by reference.
Once you've fixed a representation of the row, and of the
LCR values — in Rust, just [bool; 3]
— you can write a
function that, given an LCR, returns the correct Boolean;
that function finishes the algorithm. There are many ways to
accomplish this: a simple match statement (like a C switch)
is probably as easy in Rust as anything. The Rust
compiler will even notice and report errors like incomplete
or overlapping matches, which is nice for catching
copy-paste bugs.
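Concretely, such a function might look like the following sketch (the function name rule110 is mine, not from the published solution):

```rust
// Rule 110 transition: given the left, center and right cells,
// produce the center cell of the next row.
fn rule110(lcr: [bool; 3]) -> bool {
    match lcr {
        [true, true, true] => false,
        [true, true, false] => true,
        [true, false, true] => true,
        [true, false, false] => false,
        [false, true, true] => true,
        [false, true, false] => true,
        [false, false, true] => true,
        [false, false, false] => false,
    }
}
```

Because the match is exhaustive over all eight patterns, the compiler verifies that no case is missing or repeated.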
Going Bitwise
An obvious next step is to improve the performance of
everything by changing the data representation. Rows of
length up to 64 can conveniently be represented as a simple
unsigned integer: u64
in Rust, uint64_t
in modern C.
(Rust also supports u128, but since that is not a native
type on most CPUs, more than 64 bits per row might indicate
it's time to just switch to a full-on bit-vector package.)
Now some shifting and masking can get the LCR bits at each
position into the low-order position, at which point the
Rule 110 calculation can just take an integer in the range
[0..7]
to a Boolean. Indeed, this works for any CA rule:
my code contains this function
fn by_rule(rule: u8, bits: u64) -> u64 {
((rule >> (bits & 0x7) as u32) & 1) as u64
}
(Excuse the gross "as" casts in the Rust code. Rust does no
automatic numeric promotion, so getting the types right is
down to casting without screwing up: "as" casts cannot
panic, so they will do weird (deterministic) things if you get
them wrong. Not a great language feature, honestly.)
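Putting those pieces together, a bit-at-a-time bitwise implementation might look like the sketch below. It reuses by_rule from above; the helper next_row_slow, its name, and its argument order are my own invention, and it wraps around at the row ends as the assignment requires.

```rust
// Extract the rule's output bit for a 3-bit LCR neighborhood.
fn by_rule(rule: u8, bits: u64) -> u64 {
    ((rule >> (bits & 0x7) as u32) & 1) as u64
}

// Compute the next row one bit at a time, wrapping at the ends.
// `nbits` is the row width (1..=64); rows live in the low bits
// of a u64. (Helper name and shape are mine, for illustration.)
fn next_row_slow(rule: u8, cur: u64, nbits: u32) -> u64 {
    let mut next = 0;
    for i in 0..nbits {
        let l = (cur >> ((i + 1) % nbits)) & 1;
        let c = (cur >> i) & 1;
        let r = (cur >> ((i + nbits - 1) % nbits)) & 1;
        next |= by_rule(rule, (l << 2) | (c << 1) | r) << i;
    }
    next
}
```

With bit 0 as the rightmost cell, the assignment's starting row *.*..*.. is 0b10100100, and one step should give ***.**.* = 0b11101101.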
This optimization provides a substantial speedup in my application, but it still leaves a bunch on the table. For "short" rows like this, we should be able to get bitwise parallelism if we are careful. However, the only way I see to do this is rule-specific: we can hard-wire a Boolean computation using the structure of (for example) Rule 110 that produces the right answer. Let's do that.
Optimizing Boolean Functions
Rule 110 is defined by a Boolean truth table. These are familiar objects to Computer Scientists and Electrical Engineers. Let's write down the actual table, but upside down.
LCR Q
111 0
110 1
101 1
100 0
011 1
010 1
001 1
000 0
The reason for the name "Rule 110" is thus made apparent:
reading Q
from top to bottom as a binary number yields
decimal 110. (You can tell that rule name inventor Stephen
Wolfram is not a "real" computer person. The rule numbering
should have been hex — 6E here — for ease of use.)
Those who recall days gone by or work with this stuff
regularly may see that there's an obvious Boolean expression
here that will produce the right answer. Let's let the
negation of a positive literal be its lowercase: for
example, "not L
" is "l
". Then the "sum of
products" (SOP) expression for Q
is
LCr + LcR + lCR + lCr + lcR
where plus is bitwise "or" and adjacency is bitwise "and".
This is a fairly expensive expression to compute, so we'd like to see if we can simplify it to something cheaper that gives the same answer.
Boolean Algebra
One way is just using Boolean algebra: we can easily factor
L
and l
to get
L(Cr + cR) + l(CR + Cr + cR)
The extremely clever will note that we can then use Boolean "xor", which we will denote with "^" as per horrible C (and Rust) convention. (Why horrible? Because people confuse it with exponentiation all the time, and then the typechecker passes it, and then the bad thing happens.)
L(C ^ R) + l(CR + (C ^ R))
The next steps go as follows
L(C ^ R) + lCR + l(C ^ R)
(L + l)(C ^ R) + lCR
(C ^ R) + lCR
This is a boolean expression we can compute fairly cheaply: it requires an "xor", an "or", two "and"s and a "not" for a total of 5 operations.
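A derivation like this is easy to machine-check by brute force over all eight inputs. A sketch (the helper name is mine):

```rust
// Verify that the simplified expression (C ^ R) + lCR matches the
// original sum-of-products form of Rule 110 for all eight inputs.
fn simplification_ok() -> bool {
    (0..8u8).all(|i| {
        let l = i & 4 != 0;
        let c = i & 2 != 0;
        let r = i & 1 != 0;
        // LCr + LcR + lCR + lCr + lcR
        let sop = (l && c && !r) || (l && !c && r) || (!l && c && r)
            || (!l && c && !r) || (!l && !c && r);
        // (C ^ R) + lCR
        let simplified = (c ^ r) || (!l && c && r);
        sop == simplified
    })
}
```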
Karnaugh Maps
By using Boolean algebra we got from
LCr + LcR + lCR + lCr + lcR
to
(C ^ R) + lCR
which is quite an improvement. But it was also a fairly involved process with many fiddly steps to get wrong.
The Karnaugh Map is a well-known device in Electrical Engineering, but maybe less well-known in Computer Science: a graphical method of quickly simplifying a Boolean expression.
(Here we will work with our specific 3-literal SOP expression. Karnaugh Maps can be made for any number of literals, but they get hard to reason with as the number of literals becomes large. They also can be used to simplify product-of-sums (POS) expressions.)
The Karnaugh Map for our initial expression might look like this:
CR Cr cR cr
L 0 1 1 0
l 1 1 1 0
We have done the ordering "backward" because the original truth table is backward: this is not normal, but it doesn't matter to the method. We have also chosen to group C and R: any grouping will work, although changing the representation may make salient features more or less graphically obvious.
Now, the heart of the Karnaugh Map method is to try to create large rectangles that cover all-ones areas. Each rectangle corresponds to a min-term in the simplified SOP formula.
Let's start with the large rectangle in the center, starring each element.
CR Cr cR cr
L 0 *1 *1 0
l 1 *1 *1 0
This could give us two min-terms either vertically or
horizontally. However, the graphical view gives us an easy
way to spot what we spotted before: we will add the combined
term C ^ R
. Now we have covered all the one-bits but the
lower-left one: we will cover that with a second 1×1 rectangle,
getting
CR Cr cR cr
L 0 1 1 0
l *1 1 1 0
This gives us a second min-term lCR
, so our combined
expression is as before.
(C ^ R) + lCR
That was way easier. Our expression still has a cost of 5 operations.
But wait — there's more! In that last step we violated our rule of making big rectangles. Nothing says the rectangles can't overlap. Making the 1×3 rectangle
CR Cr cR cr
L 0 1 1 0
l *1 *1 *1 0
gives us l(C + cR), which isn't as cheap
as before. (In general, rectangles whose sides are not
powers of two are expensive: they can't be read off as a
single product term.) But making the 1×2 rectangle
CR Cr cR cr
L 0 1 1 0
l *1 *1 1 0
gives us lC
which is a 20% savings! Our expression
costs 4 now:
(C ^ R) + lC
This seems counterintuitive, so we should probably check it with a truth table.
LCR C^R lC Q
111 0 0 0
110 1 0 1
101 1 0 1
100 0 0 0
011 0 1 1
010 1 1 1
001 1 0 1
000 0 0 0
Huh. Works perfectly.
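We can also check it executably against the generic by_rule function from earlier, over 0/1-valued integers just as the bitwise row code will eventually use. A sketch; the helper name is mine:

```rust
// The generic rule evaluator from earlier in the post.
fn by_rule(rule: u8, bits: u64) -> u64 {
    ((rule >> (bits & 0x7) as u32) & 1) as u64
}

// Check the four-operation formula (C ^ R) + lC against Rule 110
// for every LCR value. Note that !l flips all 64 bits, but since
// we AND the result with a 0/1 value, only the low bit matters.
fn formula_matches_rule110() -> bool {
    (0..8u64).all(|bits| {
        let l = (bits >> 2) & 1;
        let c = (bits >> 1) & 1;
        let r = bits & 1;
        ((c ^ r) | (!l & c)) == by_rule(110, bits)
    })
}
```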
Could we do better? Notice that Q
has more 1 bits than 0 bits.
Maybe we should pay for a negation and work with the map
for q
?
CR Cr cR cr
L 1 0 0 1
l 0 0 0 1
This leads to q
as
cr + LCR
This has 7 operations counting the invisible negation, so not as good.
Since we need Q
we could try to deMorgan-ize and factor to
get
!(cr) !(LCR)
(C + R) (l + c + r)
Cl + Cr + lR + cR
l(C + R) + (C ^ R)
5 operations, but we ended up doing some fiddly Boolean algebra again. Not an improvement.
Bitwise Parallelism
Now we have a really nice Boolean formula for Rule 110. Yay. Remember that we started with a really nice formula generalized over all rule numbers. What does the specialization buy us?
Well, now we can use "bitwise parallelism" to generate all
the bits of the next row at once. Notice that R[i]
is just
(C << 1)[i]
and L[i]
is (C >> 1)[i]
. Soo…
let left = cur >> 1;
let right = cur << 1;
let next = (cur ^ right) | (!left & cur);
Note that Rust uses !
for bitwise negation of integer
types. Wouldn't have been my choice, but it's fine.
We've computed a whole next row of up to 64 bits at once. This is nice. Really nice.
There's only one remaining problem. The end positions will
not be computed correctly. Right now we're shifting 0 into
the right
value, and whatever was lying around in the
next-upper bit into the left
. Then we're negating left
without consideration of how many bits actually need to
flip, essentially trashing the left bit for the next
iteration.
Wrapping Around
Recall that we are to "wrap around" at the ends. Let's add a
little masking and rotate action. Let nbits
be the number
of least-significant bits that are supposed to be part of
the row. We can rotate right by shifting right and moving
the left bit back around. Similarly, we can rotate left by
shifting left and moving the right bit back around. The mask
at the end ensures that there's a 0 bit left of the
significant bits on the next cycle.
let left = (cur >> 1) | (cur << (nbits - 1));
let right = (cur << 1) | (cur >> (nbits - 1));
let mask = u64::MAX >> (64 - nbits);
let next = ((cur ^ right) | (!left & cur)) & mask;
The Rust compiler, when run in the default debug mode,
inserts integer overflow assertions in the code. (Honestly,
these should be on by default in optimized "release"
compiles as well. Sigh.) That is also why the mask is
computed with a right shift rather than as the tempting
(1u64 << nbits) - 1: when nbits == 64, the left shift itself
overflows, panicking in debug mode and silently producing
the wrong mask in release mode.
With the code above we can apparently produce a new row of up to 64 bits in 15 cycles. Of course, that's a really gross approximation to real performance. The calculation assumes that everything can be in 64-bit registers all the time — not unreasonable on a modern "big" CPU. The calculation also doesn't take into account cycle times: the integer operations may issue at two or more per cycle depending on… things. But still…
Testing
We have done a lot to get to this point. We should probably start by comparing with a known-good implementation to make sure that our fiddly bit-wiggling works. I'll check the first 10 rows of the 8-cell table from our assignment — that should be good enough. (Remember the assignment? Seems like a long time since we started there.) One moment, please.
Ok, full disclosure. The version of the row operation given above is not what I started with. I made one masking bug and missed one optimization in the first version. This is why we test: the core idea was sound, but my code was not. Fixed.
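Here is the kind of cross-check I mean, sketched: a naive per-bit reference implementation run side by side with the bitwise-parallel version. The function names and the particular widths are mine, not from the published solution.

```rust
// Naive reference: next row computed one bit at a time from the
// Rule 110 truth table, wrapping at the ends. `nbits` is 1..=64.
fn next_row_naive(cur: u64, nbits: u32) -> u64 {
    let mut next = 0u64;
    for i in 0..nbits {
        let l = (cur >> ((i + 1) % nbits)) & 1;
        let c = (cur >> i) & 1;
        let r = (cur >> ((i + nbits - 1) % nbits)) & 1;
        // Rule 110: 0 for LCR = 111, 100, 000; 1 otherwise.
        let q = match (l, c, r) {
            (1, 1, 1) | (1, 0, 0) | (0, 0, 0) => 0,
            _ => 1,
        };
        next |= q << i;
    }
    next
}

// Bitwise-parallel version, as derived in the post.
fn next_row_fast(cur: u64, nbits: u32) -> u64 {
    let left = (cur >> 1) | (cur << (nbits - 1));
    let right = (cur << 1) | (cur >> (nbits - 1));
    let mask = u64::MAX >> (64 - nbits);
    ((cur ^ right) | (!left & cur)) & mask
}

// Run both versions side by side; true if they always agree.
fn versions_agree(nbits: u32, start: u64, steps: u32) -> bool {
    let mask = u64::MAX >> (64 - nbits);
    let mut cur = start & mask;
    for _ in 0..steps {
        let slow = next_row_naive(cur, nbits);
        let fast = next_row_fast(cur, nbits);
        if slow != fast {
            return false;
        }
        cur = fast;
    }
    true
}
```

Checking a few widths, including the degenerate width 1 and the full width 64, catches exactly the sort of end-of-row masking bugs described above.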
Benchmarking and Performance
The whole point of all this fiddliness was, in some sense, to get a really fast implementation of Rule 110. Let's see how we did.
I have two branches in my current implementation. One uses bit representation but does the per-bit looping thing. The other uses our new technique. (I haven't published the solution source code yet. It will soon be available along with the full assignment as https://github.com/pdx-cs-rust/hw-rule110.)
The results? The old method (already faster) makes 100M rows in about 3.36s. The new method makes 100M rows in 0.15s. Roughly a 22× speedup. 1.46s for 1B rows, about 685M rows per second on a single core of my 2.2GHz machine.
But wait — there's one more set of tricks to apply. I can
set up rustc
to compile for my real "modern" CPU (AMD
Ryzen 9 3900X) instead of some least-common-denominator
x86_64
. I can also turn all the rustc
optimizations I
know about up to 11. The result? Got 1B rows in 1.42s! Yeah,
compiler don't care.
After applying the same optimizations to the previous version, I found 3.07s and 0.14s for 100M rows, respectively — our speedup is about the same. The compiler can do a lot more with the "slow" version, apparently, because there's more code there to work with.
One final idea: this benchmark is somewhat specific. The row width shouldn't matter much to the fast version, but it matters a lot to the old version. I chose 8 bits just because that is what we started with. Let's try a couple more widths on 100M rows just to verify a bunch of assumptions. With width 1, I get 0.75s and 0.14s respectively. Still a respectable 5× speedup in the worst ("real") case. With width 64, I get 21.9s and 0.14s. That's kind of a big deal.
We have won.
Future Work
Battles don't stay won. There are things we could do to be stronger, faster, better than before.
- Bit parallelism is the best parallelism. If you want to go wider than 64-bit rows, though, you'll probably want to investigate the other kind. The trick here, of course, is the boundaries: some communication between cores is needed on each cycle. This can slow you down massively. However, because end corruption can only travel inward at one bit per row, you can do some batching and fix things up later. It all gets fiddly and really hard to make fast.
- The manual process we went through to get the Rule 110 map seems automatable. Some kind of nice Boolean formula optimizer seems in order. Since there are only 256 rules (some of which are utterly boring) we could pre-compute a Boolean formula for each and hard-code it into a table. Alternatively, we could find some reasonably fast optimizer and do this on the fly. We really should handle arbitrary rules.
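As a tiny first step toward that, emitting the unsimplified SOP formula for any rule is mechanical; an optimizer would then chew on the resulting terms (with, say, Quine–McCluskey). A sketch, with my own naming and output format:

```rust
// Produce the unsimplified sum-of-products formula for a given
// elementary CA rule, using the post's convention that a lowercase
// letter is a negated literal. An always-zero rule yields "0".
fn sop_formula(rule: u8) -> String {
    let mut terms = Vec::new();
    for bits in 0..8u8 {
        if (rule >> bits) & 1 == 1 {
            let l = if bits & 4 != 0 { 'L' } else { 'l' };
            let c = if bits & 2 != 0 { 'C' } else { 'c' };
            let r = if bits & 1 != 0 { 'R' } else { 'r' };
            terms.push(format!("{}{}{}", l, c, r));
        }
    }
    if terms.is_empty() {
        "0".to_string()
    } else {
        terms.join(" + ")
    }
}
```

For rule 110 this produces the same five terms we started from (in ascending LCR order rather than the table's descending order).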
Conclusions
I set out to write a short little thing to explain to those interested what I was doing today. I would estimate that I ended up spending about four hours writing and about two hours actually doing the thing. This is not an unusual ratio for me, and I think for others.
The good news is that I ended up with some interesting insights for myself (even on this well-traveled road). Hopefully you learned a thing or two as well.
Acknowledgments
CA Rule 110 was first introduced to me in Fall 2008 when I was teaching a Multicore Computing class. (I ended up losing to only a couple of students from the class of 20 or so, pitting some forgotten variant of my single-threaded implementation described in this article against their 16-core things.) Thanks much to the student who brought this problem to the class.
Keith Packard and others taught me the art of bit-banging, long ago. Thanks for that! Also thanks to Hank Warren's Hacker's Delight, the one true bit-parallel book.
Appendix: Rule 110 Assignment
Bart Massey 2025
Background
In this exercise you will create an ASCII rendering of the Rule 110 Elementary Cellular Automaton (CA). Not as scary as it sounds, and a nice simple Rust thing to play with.
A CA starts with a random row of bits. We will represent 1
bits with *
and 0 bits with .
. Our starting row will
specifically be this 8-bit pattern for now.
*.*..*..
To produce the next row, take bits three at a time, "wrapping around" if a boundary is hit (position 7 is next to position 0 and so forth). A group of three bits in row n will determine the center bit position in row n + 1 according to Rule 110:
111 → 0
110 → 1
101 → 1
100 → 0
011 → 1
010 → 1
001 → 1
000 → 0
So for our start, the first two rows will be
*.*..*..
***.**.*
(Notice the wraparound at the beginning and end.)
The Program
Write a program that prints ten rows starting with the two given.
Challenges
(Challenges are not required, but give a chance to do something extra. Turn them in and write them up in the README.)
- Make your program take a command-line argument for the starting row.
- Time your program. See how fast you can make it go. Compare performance with a C or C++ implementation. You will probably have to modify the program to print a lot more rows: maybe some number given by a command-line argument.