# User:Vasiln/Goblin Logic 2

The key to that is to have a very small memory space that we can perform operations on, and a much larger memory space with very limited functionality.

The least functionality we need from memory is the ability to read and write. If we break it down further, we need the ability to read a true, read a false, write a true, and write a false.

```######
rddc
######
rdhc
######
#<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
######
rhdc
######
rhhc
######
```

r is reset, c is completion, < is stairway pathing to exit

This is the start of that functionality. We see 5 elements here. From top to bottom: read true from memory; write true to memory; memory; read false from memory; write false to memory.

We've talked about memory before, and you should recognize it. As for the other four pathways, we're going to call them 'addressing paths.'

The left door or hatch in each pathway is set by the truth of either our memory cell (for the read paths) or the truth of the cell we're reading from (for the write paths). The second is determined by the truth of a read bit-- a bit that keeps track of whether we're currently reading or writing. If we're reading, the doors and hatches open; if we're writing, they stay closed.

The completion will send a signal: either set memory to true or false, in the case of writes, or set a distant memory bit to true or false, in the case of reads.

So only one of these four pathways is going to be open at any given time. That's nice-- it means we can just connect all the entries to a single reset position, fed from each completion point.

There's one other thing we can do here.

```######
rdddc
######
rdhdc
######
#<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
######
rhddc
######
rhhdc
######
```

I've added an extra door before completion. This is going to be set by the state of our addressing bit. Imagine a second structure like this, a few z-levels below, fed from the same reset position-- the only difference being that we have hatches linked to our addressing bit instead of doors. We still only ever have one pathway-- but we can switch which memory cell we're going to read from or write to by changing the position of our addressing bit.

One addressing bit will let us address 2 bits separately. 4 will let us address 16 bits separately. Imagine 16 of the following, forming a tower of addressable memory:

```########
rddddddc
########
rdhddddc
########
#<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
########
rhdddddc
########
rhhddddc
########
```

Different levels would see different combinations of hatches and doors at the end of each completion.

If you do make a tower out of these, don't forget practical considerations. Either each cell needs to be offset, or there need to be two z-levels between each cell so that we can pit goblins without being interrupted by goblins from above. (We could get away with 1 z-level between cells if we only ever pitted from bottom to top, but then if there's a problem with a lower cell, we'd have to kill every goblin above that cell before we could pit a new goblin.)

We're using 2-bits for demonstration purposes though. We can address our first bit and our second bit independently, and read each to a buffer-- the first to the buffer's first bit, the second to our buffer's second bit. So really, we need two of these towers.

Remember our toggle memory? It had no way to set true or false, only a way of toggling it. That doesn't mean we can't address it though-- we just a few more pathways.

```#########
rdddddd c
#########
rdhdddddc
#########
#########
rdhddddhc
#########
#<#
#h#
#^#
#h#
#h#
#^#
#h#
#<#
#########
rhddddd c
#########
rhhddddhc
#########
rhhdddddc
#########
```

Whereas before we had pathways for read true from memory, write true to memory, read false from memory, write false from memory--

We've changed two of those pathways, and added two more. We still only have two read functions, but now we have four write functions: Write true to true memory, write true to false memory, write false to true memory, and write false to false memory. The extra hatches and doors at the ends of these write functions are linked to the status of our memory. Writing true to true memory doesn't trigger anything, but writing true to false memory triggers a toggle. The extra complexity of writing to toggle memory means that we're not going to use it. Settable memory is preferable.

If we operate on all of our bits at the same time, we'll need to know when our operations have completed. Managing that task is tricky. We can't know when to evaluate for completion, because we have to know that we've completed to know that none of our completion flags are going to undergo any change! Luckily, we CAN know that the flags aren't going to change until we tell them to, because we can use steady-state signalling. So we can get away with something that I said not to do. Here we have a circuit where the goblin has the potential to be without path, yet with multiple tiles to stand on.

```####    ####
#<h######h<#
###^dd^h^###
##h######h#
##     d##
########
```

We can only do this because we know that:

1. All doors will open before any close (because it's only movement through the trip that will lead to their close)
2. Addressing manager reset is completely infrastructural, not linked to any outside operations

## Buffers and Registers: Full-featured memory

So remember: we've built minimal functionality into our basic memory, but we're going to build maximum functionality into a few pieces of memory. Let's ignore maximal functionality for right now and just make one bit of our buffer with basic functionality: read true from memory, write true to memory, read false from memory, write false from memory.

``` #<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
```

Woah. That's just memory. Where's the read and write?

We don't need it. We built that functionality into our basic memory, and it's shared functionality. Every read true from memory trips the southern door; every read false from memory trips the northern door. Meanwhile, our northern and southern pressure plates are linked to the doors and hatches in our memory addresses, permitting only write true to memory, for instance, while our buffer bit is true.

Couldn't we build that functionality into our buffer, rather than our memory? Really, memory read/writes with addressing isn't built into either the memory or the buffer. It's a function that works between the two of them. The physical location of our memory addressing circuit doesn't matter-- we can build it next to the buffer, or we can build it next to the memory, or we can build it someplace else entirely. But we can't make it any simpler by building it one place than the other: we still need four paths per bit. Placing it near the memory lets us keep an eye on it at the same time that we keep on eye on memory, and since both are proportional in size to the same value (number of bits), that makes the most sense-- at least to me. Really, it doesn't matter where you put it.

So are we going to build in functionality, for instance, for our ordinal compare? Not yet. Remember, ordinal compares require two values. To do an ordinal compare with every piece of memory would take 16 compare circuits, if we had 16 words of memory! So really, what we need to do is give functionality to our buffer to write to at least one other new piece of memory-- a register. That way, we can read a value, write it to the register, read a new value, and run an ordinal compare on any two words.

But wait-- we might need to output something from our compare to a piece of memory, and the only way we have of writing to memory is through our buffer. So instead of reading from and writing to a single bit, we need to build our buffer such that it's capable of reading from at least 1 bit, and writing to at least 2 bits.

```#####
rdddc      write true from buffer to register 1
#####
rddhc      write true from buffer to register 2
#####
rhddc      read true to buffer from register 1
#####
#<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
#####
rdddc     write false from buffer to register 1
#####
rddhc     write false from buffer to register 2
#####
rhhdc     read false to buffer from register 1
#####
```

You can see this is pretty similar to our previous addressing system. The only differences are that we have only a 1 bit address (2 values, register 1 or register 2) and that we have extra pathways, allowing us to write to multiple locations. In reality, we probably want to be able to write to more than two locations from our buffer, but we can get away with reading from only one extra location. We could string these buffers along: we could write to buffer 1 instead of register 1, for instance, and from there, write to register 1 or 2, and write to buffer 2 instead of register 2, from there writing to registers 3 and 4, but that's unnecessary. To write to four different locations, we just need 2 new paths for every extra location, for a total of 8 write pathways, and we need an extra door, to allow addressing of 4 different locations. 4 locations will be enough. (Two are going to be data registers, and one is going to be an instruction register, that keeps track of what our instruction is while we execute it, but we'll get to that later. The extra location is just that-- extra. We can do whatever we want with it later. It's never a bad idea to build in reserve like this when you can.)

```######
rdddhc      write true from buffer to instruction register
######
rddhhc      reserved
######
rddddc      write true from buffer to register 1
######
rddhdc      write true from buffer to register 2
######
rhdddc      read true to buffer from register 1
######
#<#
#h#
#^##
##hd#
#dh##
##^#
#h#
#<#
######
rdhdhc     write false from buffer to instruction register
######
rdhhhc     reserved
######
rdhddc     write false from buffer to register 1
######
rddhdc     write false from buffer to register 2
######
rhhddc     read false to buffer from register 1
######
```

Our complete buffer. Next, of course, we'll need a few registers. But before that, we have to know exactly what kinds of operations we want our registers to do: for instance, we've shown how the capability to add or subtract can be built into memory. Before I recreate that, though, I want to try and figure out if I can make a look-ahead adder, and I need to spell out exactly which operations our registers are capable of.

If you remember, I demonstrated two different ways to add earlier in the thread.

Two? Yes. First I demonstrated an adder that adds a bit, sending a percolating effect through the word, waits 200 ticks, then adds the next bit. After that, I demonstrated a way to add multiple words, by keeping track of a carry bit between adds. If you think about it, the second method would work fine for single bit words as well as it does for two-bit words or eight-bit words, although it should be clear that this method leads to much slower computation. Either method is very space efficient; we need slightly modified memory, a delay circuit, a bit of memory for our carry, and a disgustingly simple operation pipeline.

That's still slow. It's still going to take over a day to process the addition of two 6-bit words. There's another way to do it. What we're going to do is include processing our carry with our add.

Adding our first bit is easy. There's no carry, so we only need three pathways: adding a 1 to a 0, adding a 1 to a 1, or adding a 0 to a 0.

``` #######
##hh   1
# ######
r#dd   2
# ######
## #hd##
## ## 3
##dh##
#####
```

Here we have 3 continuation points, labelled 1 to 3. You'll notice that two paths lead to continuation 3, but only one is ever going to be open. You might also notice that this is extraordinarily similar to the begin of our ordinal compare. We have an XNOR for the top two paths and an XOR for the bottom two paths.

Let's consider what each of continuation points mean. 1 means neither bit was true. 0+0 is 0, so we write 0 to the output, and we also write 0 to our carry. 2 means both were true. 1+1 is 2, so we write 0 to our output, and 1 to our carry. 3 means one or the other were true, but not both. 1+0 is 1, so we write 1 to the output, and we write a 0 to our carry.

Adding the next bit is where it gets ugly. This is an octopus function.

```      ######
##ddd 1
## #####
## #hhd 2
## #######
## #  hdd 3
## #########
# #    dhd 4
r###########
# #    hdh 5
## #########
## #  dhh 6
## #######
## #ddh 7
## #####
##hhh 8
######
```

The leftmost door or hatch corresponds to the second bit of the first word; the next door, the second bit of the second word. The third corresponds to the state of our carry bit. Essentially, what we're going to do is to get around rippling our carry by including it in our add so that we aren't limited by the latency of our memory.

I don't know what to call this. I'm not sure it exists in an electronic version.

Let's make sure we know what the 8 completion points mean:

```# Both bits true, carry true: output 1 to bit and carry
# Both bits false, carry true: output 1 to bit, 0 to carry
# First bit false, second bit true, carry true: output 0 to bit, 1 to carry
# First bit true, second bit false, carry true: Ditto
# First bit false, second bit true, carry false: output 1 to bit, 0 to carry
# First bit true, second bit false, carry false: Ditto
# Both bits true, carry false: output 0 to bit, 1 to carry
# Both bits false, carry false: output 0 to both bit and carry
```

We could write the carry to memory (ideally, more than one bit, maybe alternating, maybe 8 bits, to get around latency)-- unless--

Let's rearrange this and expand it.

```            ####                   ####
##  ##                 ##  ##
######d## ##           ######d## ##
##hdhh5#### ##         ##hdhh5#### ##
## #####h<### ##       ## #####h<### ##
## #hhdh2###### ##     ## #hhdh2###### ##
## #######d    ## ##   ## #######d    ## ##
## #  dddh1##### ## ## ## #  dddh1##### ## ##
## #########h<#### ## ### #########h<#### ## ##
# #    hddh3####### ## # #    hddh3####### ## ##
r###########d         #r###########d         #
# #    dhdh4####### ## # #    dhdh4####### ## ##
## #########h<#### ## ### #########h<#### ## ##
## #  ddhh7##### ## ## ## #  ddhh7##### ## ##
## #######d    ## ##   ## #######d    ## ##
## #dhhh6###### ##     ## #dhhh6###### ##
## #####h<# # ##       ## #####h<# # ##
##hhhh8#### ##         ##hhhh8#### ##
######d## ##           ######d## ##
##  ##                 ##  ##
####                   ####
```

Nope. Doesn't work.

What I tried to do was see if I could design it in such a way that tripping a carry would last until the goblin next reached a carry=true completion point. In general, you've got 10 moves before the close part of a pressure plate signal comes; but even with redesign, I've got 23 moves, many of them diagonals. Not going to work.

However, this is still a decent design for memory based carry bits-- because of the latency of memory, we have to make sure that we're not going to write to it particularly frequently! In this case, I see that our shortest path is 23 straight steps long, no problem for using a single bit to store carry. Enough time to propagate the close from carry no longer being true, enough time by far for a goblin to move from false position to true position.

But our longest carrying is still 23 steps! That's far longer than our path with our parallel-ripple-add. This design is going to function more slowly than our parallel ripple add. 200+ ticks per bit is good enough, I guess.

It was a good design exercise, in any case.

##### Improved XOR

I've been using a fully expanded (two path) XOR, but I just realized this isn't necessary.

``` ###
##d##
dhh
#####
```

This is a single path XOR that will work as well as our two-path XOR for everything but our compare function. With it, our octopus can become a sextopus. That's not enough to save it though.

There's one more design to consider. Can we add in parallel, design a two-bit carry, then add our carry in parallel? It might be problematic-- what if our carry generates its own carry? Of course, it couldn't with two-bit word size, like we've been demonstrating; we have to consider it in something bigger, like four bits. It would look quite a bit like our previous carry check pathway: slower than it takes to propagate ripple carries.

What if we evaluated each bit in parallel (at the same time that we added each bit in parallel) for its ability to create, propagate, or end a carry? We'd know where carries began, we'd know where carries ended, and we could send a minimum of carries, all at the same time, to propagate in parallel. Each carry evaluation would be the same size (about 20 tiles across), so we'd have our addition and our look-ahead completed after about 200 ticks, regardless of our word size. Then we'd propagate the carries, at the precise locations called for.

In a best case scenario (no carries), we'd add two words of any size in 200 ticks. In a worst-case scenario, we'd have to wait an extra 800 ticks or so for propagation across an 8-bit word. 1600 ticks, down to 200-1000 ticks? Less than half the speed.

I didn't really want to design it, but I guess I have to now. We'll do it in three bits.

Assume memory a1-a3 (operand 1), b1-c3 (operand 2), c1-c3 (output)

``` ####
##dh#
# dh1
r#####
# dd2
# ###
##hh3
####
```

This is our basic add. Each operand is linked to two doors and two hatches. Top is an XOR, completion at 1 outputs a 1. Bottom is an XNOR, completion at 2 or 3 outputs a 0. We have one of these for each bit-- so 3 of them. Each can be run in, at most, 5 steps (2 diagonal)-- that's about 65 ticks to add all of our bits with no carry.

Why separate completion points? We're going to keep track of carry state. Completion 1 propagates a carry. Completion 2 creates a carry. Completion 3 ends a carry.

In an electronic circuit, carry determination would normally be done in parallel with adding, I believe, but in our circuit, our carry determination is our add. So it's sort of like running in parallel.

Notice that we need a few more bits of memory. We need to keep track of each bit's carry status: propagate, generate, or end. We could design special 3-state memory for this purpose, but for now, let's just imagine that we have 3 carry state bits, each keeping track of propagate T/F, generate T/F, end T/F. Memory latency adds another 105 ticks.

Now, we need to create a carry word out of our carry states. This can turn into another octopus. It's simpler in the case of 3-bit words. Let's spell out what our LCU(2) (look-ahead carry unit, bit 2) does in pseudocode before we design it.

If generate(1) OR (generate(0) AND propagate(1)) then carry(2):=1

We can't just do an else with creature logic. We have to specify the else.

If end(1) OR (propagate(1) AND (end(0) OR propagate(0))) then carry(2):=0

Let's simplify that last one.

If end(1) OR (propagate(1) AND NOT generate(0)) then carry(2):=0

Pay careful attention to the delimiters. End(x) means that at bit location x, carry state is end. Carry(x) refers to the carry bit (not the carry state!) at bit location x.

What about bits 0 and 1? Bit 0 can't be written to by a carry. We don't need an LCU(0). Bit 1 gets a carry IFF generate(0).

We need one more bit, so that we can use multiple words to add big numbers-- a carry(3) and LCU(3)

If generate(2) OR (propagate(2) AND (generate(1) OR (propagate(1) AND generate(0)))) then carry(3):=1

If (end(2) OR (propagate(2) AND NOT (generate(1) OR (propagate(1) AND generate(0))) then carry(3):=0

Let's test our logic:

``` a       011
b       010
XOR->a  001
carry   egp
LCU->b 0100
a       001
XOR     101
(carry   pep)
(LCU    0000)
```

Perfect.

Just for fun, let's spell out the carry bit for carry(9)-- 8-bit words.

If generate(8) OR (propagate(8) AND (generate(7) OR (propagate(7) AND (generate(6) OR (propagate(6) AND (generate(5) OR (propagate(5) AND (generate(4) OR (propagate(4) AND (generate(3) OR (propagate(3) AND (generate(2) OR (propagate(2) AND (generate(1) OR (propagate(1) AND generate (0) (with a whole mess of closing delimiters here) then carry(9):=1

If end(8) OR (propagate(8) AND NOT (generate(7) OR (propagate(7) AND NOT (generate(6) OR (propagate(6) AND NOT (generate(5) OR (propagate(5) AND NOT (generate(4) OR (propagate(4) AND NOT (generate(3) OR (propagate(3) AND NOT (generate(2) OR (propagate(2) AND NOT (generate(1) OR (propagate(1) AND NOT generate (0) (with a whole mess of closing delimiters here) then carry(9):=0

What an ugly mess. I think you can see why I didn't want to have to design this. However, I think it's clear how to extend the technique to any bit length.

Time to design our LCU(2).

```    ####
##dd##                 AND propagate(1) AND generate(0)
## ## #
##d#d ##    propagate(2) AND generate(1)
## ### ##
# #d  #1#    generate(2)                                            carry(2):=1
r########
# #d  #2#    end(2)                                                 carry(2):=0
## ### ##
##d#h# #    propagate(2) AND NOT generate(1)
## ## #
##hh##                 AND NOT (propagate(1) AND generate(0))
####
```

Looks like a 10-long path at worst. That means with 3 bits adds, we can perform at 600 ticks with parallel ripple, or ~250 ticks with XOR(a,b), LCU(a,b), XOR(a, LCU). We don't really have to worry about the timing of LCU(3)-- it won't be accessed this instruction, giving us plenty of time.

Crazy time: LCU(7). Carry(7):=1 arm displayed. This is much more than an octopus.

```            ####
##dd##                                                                             AND propagate(1) AND generate(0)
## ## #
##d#d ##                                                            AND propagate(2) AND generate(1)
## ## ##
##d#d ##                                             AND propagate(3) AND generate(2)
## ## ##
##d#d ##                              AND propagate(4) AND generate(3)
## ## ##
##d#d ##               AND propagate(5) AND generate(4)
## ## ##
##d#d ##    propagate(6) AND generate(5)
## ### ##    OR
# #d  #1#    generate(6)                                            carry(7):=1
r########
```

Max length, 26 steps. LCU(8) would for 30 steps. ~365 ticks. XOR, LCU, XOR for 8-bit add in less than 600 ticks, whereas the parallel ripple add would take us 1600. Total size, 29x10x1, plus infrastructure.

Let's do an 8-bit test:

``` a        10110010
b       011011011
XOR->a   01101001
carry    gppgpegp
LCU->b  111100100
a        01101001
XOR      10001101
(carry    pggeppep)
(LCU     111000000)
```

What do you know? It works!

### Operation 0101: XOR with LCU

So if we're going to do adds via XOR, with write to LCU. That's really cool. That's because we can manage with very few instructions. We were going to need an XOR anyways, to write to output.

From here on out, we're talking about using 8 bit words, containing bits 0-7. Operation XOR will take operands a and b, write XOR to a, and write an LCU-generated word to b. Operand b will have to be 9 bits long; XOR will write a zero to bit 0 and output to bits 1-8, whereas add will operate on bits 0-7.

Of course, we need to get our inputs in, and outputs out.

• Operation 0001: Write from memory to register a, takes memory address as argument
• Operation 0010: Write from memory to register b, takes memory address as argument

Remember, register b is a 9 bit value-- so our writes to memory will only write the first 8 bits of b.

• Operation 0011: Write from register a to memory, takes memory address as argument

In reality, these operations will consist of an intermediary step-- writing to the buffer. Since we're now working with 8-bit words and 4-bit addresses, and these are 4-bit instructions, we can now specify a memory address to read from or write to. Notice that our XOR doesn't take any argument at all-- it always works on the same two registers. That's fine for now. In the future, we can expand the operation to single-instruction add, add with carry, AND, OR, NOR-- anything our heart desires.

Here's a sequence of instructions to add two numbers and output the result to a memory cell.

``` 0000  00010101  Load value in 0101 into register a
0001  00100110  Load value in 0110 into register b
0010  01010000  XOR a and b->a, carry->b
0011  01010000  XOR a and b->a, (carry->b)
0100  00110111  Write value in register a into 0111
0101  xxxxxxxx  Operand 1
0110  xxxxxxxx  Operand 2
0111  xxxxxxxx  Output
```

Addresses 0101, 0110, and 0111 are assumed to be data cells. This program will add the value in 0101 to the value in 0110 and output the result to 0111. Notice that we called XOR twice, and that the "0000" argument for it is both arbitrary and meaningless. In fact, because XOR is working only on predefined registers, we could use that argument to space to make 16 different commands, all with the same instruction set.

Note that in reality, what will happen is that our program will add the two values-- and then try to evaluate 0101 as an instruction. We'll get to that later. Also note that 16 words are not very many! Just adding two numbers required 8 words. One more thing to pay attention to: our create carry circuit will always function AFTER our XOR, so we don't need to worry about writing to register b from that circuit at the same time that we're reading from that circuit to do an XOR. That's very important!

But wait a minute. What if we want to add larger numbers? Numbers too large to be expressed in 8 bits? We've broken our carry functionality by our redesign. We need a way to access LCU(8).

### Operation 0110: Arithmetic right shift on register b, takes number of shifts as argument

What we can do is shift LCU right 8 times! We'll move the value of each bit from x to x-8, and fill in the blank spaces with zeroes. So LCU=100110000 -> LCU=000000001, and we can XOR it with the next value to carry the one from an earlier addition.

But arithmetic right shift is a really useful function for other stuff too. For instance, right shifting 1 space is equivalent to dividing a number by two and dropping the remainder. We'll make it so that arithmetic right shift operates on register b, and takes an argument in the last four bits specifying the number of bits to shift. Since 0000=0 shifts, which is totally useless, we'll make an argument of 0000 shift it 8 times, rather than zero times; and since we might want to use this for other functions, we'll make shifts of fewer than 8 drop the 9th bit (read in 0s instead). That way, we can also use this to create an XOR word for writing to specific bits.

Of course, we'll need to be able to save the LCU value to be able to do this.

• Operation 0100: Write from register b to memory, takes memory address as argument

Again, remember that register b is 9 bits big, so when we write to register b, we'll only write the first 8 bits, and fill in the 9th with a zero.

We'll want a left shift as well, and we might want to do a bit rotate too. A left shift is just like a right shift but in reverse. So a left shift by one multiplies a number by 2 (so long as we don't overflow our bit space!) So we can do this by adding a number to itself. A right-hand bit rotate is just like a right shift, except it doesn't write in zeroes on the right hand side-- for instance, 101, rotated once to the right, stays 101, and 001 becomes 100. Since there's no possible reason to want to do this with our carry bit, we'll ignore the 9th bit of b for our rotate.

Operation 0111: Right-hand bit rotate, takes number of shifts as argument

Again, notice that 0111000-01111111 are going to be a useless instructions-- here's 8 more places we can cram in more instructions if we need to.

Between right hand bit rotate and right-hand shift, we can do any left hand operations. To left-hand rotate, simply right hand rotate in 8's complement-- that is, rotate it 7 times to shift it once to the left, rotate it 6 times to shift it twice to the left. To do a left hand arithmetic shift of n units, first rotate it right 8-n units, rightshift it n units, then right rotate it 8-n units again.

We don't actually need bit shifts and rotates to be turing complete (we can reach them via additions), but we do need add with carry, and bitshift is how our architecture is going to get there. Here's a program that adds two two-word (16 bit) values and outputs the result to a third, 2-word, space.

``` 0001  xxxxxxxx  Operand 1 (*256)
0010  xxxxxxxx  Operand 1 (*1)
0011  xxxxxxxx  Operand 2 (*256)
0100  xxxxxxxx  Operand 2 (*1)
0101  xxxxxxxx  Carry variable
0110  xxxxxxxx  Output(*256)
0111  xxxxxxxx  Output(*1)
1000  00010010  Load 0010 into a
1001  00100100  Load 0100 into b
1010  01010000  XOR->a, carry->b
1011  00110111  Write a into 0111
1100  01000101  Write b into 0101
1101  01100000  Right shift b 8 times
1110  00010001  Load 0001 into a
1111  01010000  XOR->a, carry->b
0000  01010000  XOR->a, (carry->b) Carry will be ignored
0001  00100011  Load 0011 into b
0010  01010000  XOR->a, carry->b
0011  01010000  XOR->a, (carry->b)  Carry will be ignored
0100  00110110  Write a to 0110
0101  00010111  Load 0111 into a
0110  00100101  Load 0101 into b
0111  01010000  XOR->a, (carry->b)  Carry will be ignored
1000  00110111  Write a into 0111
```

You can see that we've already exceeded our 16 bits-- our program and data wraps around. You can also see that this is an inefficient way to do an add with carry-- it involves a lot of writes to memory. Adding more instructions, or parameters to our existing XOR, could reduce the number of times we need to swap memory around. But it works, and we're not anticipating doing any adds with carry. We just want the capability. In this program, I've located the data earlier in memory than our instructions. It really doesn't matter where stuff goes, but our instructions are going to be executed sequentially, so we want all of our instructions in one block separate from our data-- at least, for now.

Bit shifting and bit rotation is simple in our heads, but in our fortresses, it's going to be complex. For each bit (and we have 9), we need 15 paths: write true and false to any of the other 7 bit locations, as well as a null path for writing to self. Additionally, since our shifts and rotates write to the same register they read from, we need intermediate memory-- we need to write to buffer, then, from there, write to register b. It's going to take up a lot of space.

Here's the circuit for ARS(8) (arithmetic right shift, position 8-- position 8 is actually our ninth bit of register b):

```      ####
######h1#                                              and b(8) is 0, write 0 to b(0)
#hhhhd###    If argument is 0000 and mode is bit shift
r#####d2#                                              and b(8) is 1, write 1 to b(0)
# d#3####
# d ##
# d #        If argument is not 0000 or mode is not bit shift, don't write anything anyplace.
# d #
##h##
###
```

Here's part of the circuit for RS(0) (arithmetic right shift or right rotate, position 0)

```  ###  ####
## ####d1#                                          and mode is bit shift, write 0 to buffer(7)
#h#hhhd####     Argument is 0001 (shift 1 place to b(7))
r######hh2#                                         and mode is bit rotate  and b(0) is false  write 0 to buffer(7)
#d###                                                                 and b(0) is true
#3#                                                                                      write 1 to buffer(7)
###
```

But bit shifts are going to write on higher values of RS. We take 8 arguments (0000, or >0111, return null); each non-null argument differentiates between mode; each mode writes a true or a false. 29 arms. For christ's sake.

``` #########
######d^#                                                                                                                        write 1 to buffer(x-3) for all circuits where x>=3
##### ###
####d#h^#
#### #####
#### ##d^#
##### # ###
###dd#h#h^#                                          And argument(1) is 1 and argument(0) is 1 and mode is bit rotate and b(x)=0, write 0 to buffer(x-3)
## ########
# ######d^#                                                                                                                       write 1 to buffer(x-2) for all circuits where x>=2
# ##### ###
# ####d#h^#
# ### #####
# ### ##d^#
# ### # ###
# #dh#h#h^#                                          And argument(1) is 1 and argument(0) is 0 and mode is bit rotate and b(x)=0, write 0 to buffer(x-2)
# #########
# ######d^#                                                                                                           and b(x)=1, write 1 to buffer(x-1) for all circuits where x>=1
# ##### ###
# ####d#h^#                                                                                     and mode is bit shift and b(x)=0, write 0 to buffer(x-1)
# ### #####
# ### ##d^#                                                                                                           and b(x)=1, write 1 to buffer(x-1)
## ## # ###
#h#hd#h#h^#                     And argument(2) is 0 and argument(1) is 0 and argument(0) is 1 and mode is bit rotate and b(x)=0, write 0 to buffer(x-1)
#h#########   Argument(3) is 0
# hhhh##      Argument is 0000
r#####^#                        No write
# d   ##      Argument(3) is 1 (argument>0111)
#h#########   Argument(3) is 0
#d#hh#h#h^#                     And argument(2) is 1 and argument(1) is 0 and argument(0) is 0 and mode is bit rotate and b(x)=0, write 0 to buffer(x-4)
## ## # ###
# ### #d^##                                                                                                           and b(x)=1, write 1 to buffer(x-4)
# ### #####
# ####d#h^#                                                                                    and mode is bit shift  and b(x)=0, write 0 to buffer(x-4)
# ##### ###
# ######d^#                                                                                                           and b(x)=1, write 1 to buffer(x-4) for all circuits where x>=4
# #########
```

Etcetera, etcetera, etcetera.

I told you it was going to be a monster. Although it's narrow enough, a full bitshift circuit would be 55 tiles long. Notice that we don't write 1 to buffer(x) when in shift mode and when argument>x. That means that shift(7) is going to be our largest circuit, and shift(0) can be made smaller-- although not significantly smaller.

Of course, following our writing of a shifted/rotated b to buffer, we'll have to write buffer back to b. We know how to do that.

Before we go any further, I want to show you something really cool we can do with an arithmetic shift. Let's take one of the cells from our last program:

``` 1000  00010010  Load 0010 into a
```

What happens if we shift it right 4 times, then left 4 times? We have 00010000-- which can still be read as an instruction. Now, the instruction is just "Load 0000 into a." What if we shift it left 4 times, then right 4 times? We have 00000010. It can still be read as an instruction, although we haven't defined 0000 as an instruction. Now what happens if we take those two values and XOR them? We have 00010010: exactly what we started out with. Except we can use that process to assign any instruction to some address, or to assign any address to some instruction. It's hard to imagine what we might do with that-- it kind of has the whiff of black magic to it. Very powerful; very dangerous.

There's one thing you might be worried about. We totally forgot about subtraction! Before, we had this beautiful memory design that can be used to add or subtract, but now, our system is so complex that nobody knows if we can subtract with it. Don't worry-- there's another trick we can do to make everything work.

### Subtraction: Twos complement

By changing a number to its two's complement value, we can just add it instead of subtracting it.

What's a two's complement (TC)? Basically, it's the number you get when you subtract a number from 0. So in a way, we're generating a negative number. For instance, with three bit words, the TC of 111 is 001-- summed, they make 000. So instead of subtracting 1 from a 3 bit word, we can just add 111. Don't believe me? Try it out:

``` 010
-001
001
```
```  010
+111
```

(1)001

See that carry? We'll need to invert it for subtracting multiple bits. That's easy-- we can just XOR it with a 1.

TC of 010 is 110.

```   001
-010
(1)111
```
```   001
+110
(0)111
```

For numbers that span multiple words (subtracting number cd from ab), we add a to TCc, XOR the carry with 1 (or with 00000001, for 8 bit words), TC the XORed carry, add the TCed XORed carry to b, then add b to TCd.

But how do we make the TC? We invert each bit, then add 1 to the final value. I just told you how to invert it-- XOR with 1 (or with 11111111). We know how to add, right?

So here's our program to subtract:

``` 0000 00000001 (constant value)
0001 11111111 (constant value)
0010 xxxxxxxx operand 1 (minuend)
0011 xxxxxxxx operand 2(subtrahend)
0100 xxxxxxxx output (difference)
0101 00010011 Load 0011 into a
0111 00100001 Load 0001 into b
1000 01010000 XOR a and b (carry->b)      we drop the carry here, don't care about it
1001 00100000 Load 0000 into b
1010 01010000 XOR a and b (carry->b)
1011 01010000 XOR a and b
1100 00100010 Load 0010 into b
1101 01010000 XOR a and b (carry->b)
1110 01010000 XOR a and b
1111 00110100 Write a to 0100

```

Look at that: it just fits exactly in 16 bytes. Similar to how every add consists of two XORs, every TC consists of a total of 3 XORs-- plus another 2, to add it.

### Multiplication: Compare and jump if

There's more than one way to do multiplication. We're going to do it like we did back in second grade.

What is 3*3? It's just 3, plus 3, plus 3. If we can find a way to keep track of how many times we've added two values, we can multiply.

So far, we've just been going sequentially through our instructions. But we need to do something different now. We're going to add a number of times-- we don't even know how many times-- so we need to jump back to our add.

I mentioned earlier that we were going to need a space to keep track of where we are in our program-- so that we can figure out which instruction to execute next. Jump is extraordinarily simple to implement, so simple that I'm not going to make a diagram for it. When we jump, all that we do is replace the values in instruction counter-- that is, our memory address-- with the argument we send with the jump instruction.

There's one problem-- eventually, we're going to want to stop adding. We only want to do it a certain number of times. So for one thing, we only want our jump to occur if a register is at a certain value. For another, we need a way to keep track of how many times we've added, and to make sure it's the right number.

We've already designed a compare function. Currently, that compare function returns greater than, less than, or equal to, but we didn't specify exactly what that meant. So let's say that if a is greater than b, we write 00000001 into b, and if a is equal to b, we write 00000000 into b, and if a is less than b, we write 11111111 into b. This is actually how many computers handle this exact situation, by the way.

Now, we need to pick a time to jump. Let's say that we'll jump if and only if register b is 00000000. In other words, we jump if not b. Now we can make our multiplication program: