Monday 29 August 2016

A little more on the ALU

The ALU I discussed in the last episode works nicely as an ALU but does have a couple or three practical issues.

Multiplexors

The first is that pair of multiplexors in the middle of the arithmetic circuitry. In practical terms, they're easy to implement but bulky in terms of board area - each multiplexor requires two 8-wide tri-state buffers - twenty-pin wide packages. At least it is not necessary to implement a separate 8-wide inverter; an inverting buffer is available. But we can do a little better... consider an XOR gate's logic:

A B Q
0 0 0
0 1 1
1 0 1
1 1 0

The Q output is high only if both the A and B inputs are different. Looked at another way, though, it can be thought of as a controllable inverter. The output is B if the A input is low, and ~B if the A input is high (and of course the same using the A input to control B - it's symmetrical) which is exactly what we want. Using standard 74HC parts, that still needs two chips, but they're narrow and only fourteen pins each - much less board real estate is required.

That sorts out one of the multiplexors; what about the other? The first one that the B signal meets is doing nothing more than switching between the B input and zero. Here's an AND gate:

A B Q
0 0 0
0 1 0
1 0 0
1 1 1

Here the output is high if and only if both inputs are high. We can use the A input to control things here; if it's high, then the output is the B input. Thus we generate either our B input or zeros, in two 14-pin chips. But we  can do better...

We will need to put the B register through a buffer anyway - actually, the buffer will be the 'enable' output of the register latch. When the buffer is not enabled, the output is high impedance; it's pulling neither to the high or low level. If we apply some pull-down resistors on the output, then the input to the next stage of logic will be zero if the buffer is not enabled, or the contents of the register if it is.


A note about timing and clock signals

Throughout this discussion I have glossed over just how this processor will be clocked. Here's the basic idea:
  • a system clock called phase0 is fed with a square wave - an equal mark-space ratio.
  • When phase0 is low, a selected register's output is placed on an internal data bus. That register might be a simple register, or something more complex such as the ALU, or a value from external memory.
  • As phase0 rises, the internal databus contents are latched into a selected register.

The B input

The second point concerns the B input.

The accumulator needs two inputs and provides a single output, plus the flags. I'll implement this such that one of the inputs is internally latched - call it the B register - and the other is the internal data bus. Any operation takes two clock cycles: on the first, the B register is loaded and on the second the desired operation takes place using the internal bus as the A input; at the rising edge of the clock on the second operation the result and flags are latched.

The increment and decrement operations, which do not use the B register, take just a single cycle.

The Flags

We have only two flags on the ALU - zero and carry. We never need to read the flags directly (I'm not planning on implementing an interrupt function, where they would need to be stored unchanged across the interrupt) so they don't need multiplexing to the internal data bus; instead they'll have dedicated lines back to the control logic so that conditional jumps can be made.

The zero flag is set on every ALU operation; if the result of the operation is zero, the flag is set. The carry flag, on the other hand, is set only for the the arithmetic and shift operations - the logic operations don't affect it. A little glue logic enables it at the correct times. 

However, the carry flag must be set appropriately before addition or subtraction. This is done by two dedicated lines from the control logic. Since the carry flag is held in the ALU, the carry input can be connected directly to the carry output without a need for a direct input.

Finally, since the A and B inputs are valid at different times, we don't need two inputs and can join them together.

Here's the completed ALU. Note that on Logisim diagrams green lines are single bits - light for high and dark for low - and black lines are multiple bit buses.




Sunday 7 August 2016

The ALU

Arithmetic Logic Unit

The clue's in the name - this block handles both arithmetic and logical operations. We want to implement a total of ten operations:
  • Addition with carry
  • Subtraction with borrow
  • Increment
  • Decrement
  • AND
  • OR
  • XOR
  • Shift right from and into the carry
  • Shift left from and into the carry
  • Compare
Though with the aim of simplifying the hardware, we're only going to implement the first eight of these directly; we'll look at the last two towards the end of this article.

Let's start with the arithmetic, since the logical instructions are trivial. If we use the conventional two's complement notation, we can do it all with adders. The 74hc283 (datasheet here) can add two four bit numbers and a carry input to give a four bit plus carry result; two can be used for eight bit data simply by taking the carry out of the low nibble adder and using it as the carry in for the high nibble.

So far, so good (though we have not yet addressed the carry, for reasons which will become obvious). But what about subtraction?

Subtraction is mathematically equivalent to adding a negative number. Using two's complement notation, a number has its sign changed (i.e. it is multiplied by minus one) by first inverting all the bits and then adding one, so -1 in decimal becomes 0b11111111 in binary. Let's add an inverter:

We use the multiplexor to select either the inverted or non-inverted B, and then add it to A and the carry. For the subtraction to be correct, we have to set the carry; if the carry is clear, the result will be one less than expected - the carry is acting as an inverted borrow. Also, the carry out also behaves as an inverted borrow - it is cleared if the calculation overflows and needs to borrow from a further stage, otherwise it is set. So when we add, we have a carry, and when we subtract, we have an inverted borrow - but this is internally consistent and works properly for multi-byte arithmetic:

clc
lda param1_lo
adc param2_lo    ; carry is clear if no overflow
sta result_lo
lda param1_hi
adc param2_hi
sta result_hi

and

sec
lda param1_lo
sbc param2_lo    ; ~borrow is set if no underflow
sta result_lo
lda param1_hi
sbc param2_hi
sta result_hi

But what about increment and decrement operations? They're extremely common, and while we could of course just add or subtract immediate values that takes an extra byte in the instruction and an extra clock cycle to get that data. We'd also have to preset the carry/~borrow - another instruction. It's faster to have this internally arranged.

I've added a further multiplexor. This one allows us to replace the B input with a fixed zero. There are also now a couple of control lines which control the multiplexors - so we can select either the normal or inverted version of the B input or zero.

To increment, we select the zero, don't invert it, and add it to A - making sure that we set the carry. So the result is A+1. To decrement, we invert the zero to get minus one, and add that but with the carry cleared.

A little combinational logic ensures that the carry is arranged correctly: it's passed in unchanged for addition and subtraction, or set or cleared as required for the increment and decrement. The mode bits - three of them - tell the ALU which operation it should be performing.

There's one extra function that's sneaked in there: a compare input. To perform a compare, you need to subtract the operand from the accumulator, noting the flags but discarding the result. To do that, it's convenient not to need to worry about the carry flag, which must be set. The compare input looks after that for us.

Now we can look at the logic functions, which are indeed trivial (though note that black lines in the drawing are all eight bits wide, so there's a lot of wiring to do!).

The logic routes are obvious; the mess of wiring at the bottom left is a shift right unit. Each bit of the input is wired to the next bit to the right of the output, with the carry being used to provide the top bit. The lowest bit of the input will be moved to the carry output. A multiplexor controlled by the mode bits selects the desired logical output.

The only thing left to do then is a mechanism to select the logic or arithmetic output and to route the flags from the appropriate place, and a way to detect a zero condition.

Two more multiplexors are controlled by the high bit of mode to select the appropriate outputs, and a comparator against zero (in practice, an eight-input nor gate) gives a high output if the output is zero.

And that's the ALU complete. At an estimate, this will need four chips for the adder, eight chips for the eight-bit multiplexors and another for the carry multiplexor, two each for the three logic channels, one for the zero detect, and a handful of smaller single or double gate chips. Twenty or so chips. That should fit on our target half-eurocard size without too much trouble.

But can we do it another way? Of course we can... the complete ALU is nothing more than combinational logic with sixteen input bits of data, four bits to control the function (including the compare input), and one bit for carry in. It has ten bits of output: eight bits of output, one carry, and one zero flag. We can put this in a programmable device - a PROM of some kind.

A glance through the catalogue will reveal a startling lack of 2M x 10 bit parts - but one might consider a 4*4 ALU - in that case, each prom requires four bits each for A and B, four bits to select the function, and the carry - and produces four bits of output plus carry and zero. This is a much more manageable 8k x 6 bits, so two of these, with the carry passed through. An extra bit doubles the number of functions you can provide, should you wish (though a *left* shift is tricky). One issue is that PROM access is not as fast as discrete logic, and you need to pass through two PROMs in series. This may be a speed limiting factor at a lower speed than might be wanted.


Here's the logisim circuit for the complete ALU.

There's just one thing I haven't mentioned - the missing instruction SHL - shift left through the carry. It turns out it doesn't need any extra hardware... we simply add the accumulator to itself using the normal ADC logic. It may take an extra clock cycle to get the accumulator into the B input, but I can live with that for a simpler ALU.

Friday 5 August 2016

Meanwhile, back in the hardware

The addressing unit

So, with all the bits for the various address parts in place, what does it look like?

 It's a bit tricky to have the picture large enough to see, but you can download the circuit here. It's not yet in a position to be used as a component, but we'll address <ahem> that later...

Each of the blocks we've discussed is present: the programme counter (PC), the stack pointer (SP), the address register (AR), the X and Y registers, and the adder that adds the AR to X and Y.

The two multiplexers on the right hand side select one of the four blocks and supply it directly to the external address bus: whichever is selected will point to a memory address. But there is another multiplexer at the top...

That one is more subtle; it selects an eight-bit output from any one of the registers (and indeed, eight further registers which haven't yet been discussed) and passes it back to a common bus - bus_A. The register is selected by the four-bit signal sel_source. Bus_A is applied to the inputs of all the registers in the system...

Why? Because that's a way to move the contents of any register to any other register. We use the clock signal to keep things happening at the right time: when the clock is low, the output of any of the registers is stable, and one of them is routed through the multiplexer. When the clock rises, the register selected by the decoder - fed by sel_destination - latches the data from bus_A.

This works for any register, though we haven't discussed half of them yet. It's just a question of making everything look like a register.

Thursday 4 August 2016

A diversion into programming

A diversion into programming

Before we get too deep into the design, it's perhaps worth looking at how we might expect to use the code.

Many languages use something similar to the way C works to encapsulate data locally and to pass and return items on the stack. I'm not proposing (yet) any languages beyond an assembler but there's no reason not to, and plenty of reasons to write assembler in a similar way. Here's how it might work (and remember that the software has to agree internally how it's going to implement this):

A (hypothetical) routine is passed a couple of sixteen-bit parameters and passes one sixteen bit parameter back. It uses three sixteen-bit variables for local storage. Before calling the routine, the caller has to save those variables on the stack:

[param1]
[param2]
[      ] <-- stack pointer points here

Once the routine is called, that changes to include the return address:

[param1]
[param2]
[return]
[      ] <-- stack pointer points here

So if we want to access param1, we need to add four to the stack pointer's value; to access param2 we need to add two. So we might envisage a stack-pointer plus constant addressing mode.

It works to access local variables, too, provided we don't move the stack:

[param1]
[param2]
[return]
[local1] <-- stack pointer points here
[local2]
[local3]

And now we can access the parameters by *adding* to the stack pointer, and access the local variables by subtracting from it. Which is fine, but there's a problem: what if we want to push some temporary value on the stack? Two things happen - the first being that the stack pointer moves one byte lower - so the offsets to the parameters and local variables all need adjusting by one - and worse, the byte we just saved has been thumped. Not helpful...

It's worse if we do something really radical, like calling a subroutine. In that case, the first thing we do is save any parameters, and then call the routine, leaving a return address. What the routine does we don't know but whatever happens, our local variables have been zapped.

So how do we resolve that? Let's have a look at a slightly more sophisticated option that addresses both of these issues. We call the routine as before, pushing the parameters on the stack:

[param1]
[param2]
[return]
[      ] <-- stack pointer points here

Now we save the XY register.

[param1]
[param2]
[return]
[XY    ]
[      ] <-- stack pointer points here

Now we copy the stack pointer to the XY pair, and subtract from the stack pointer so it moves past our local variables

[param1]
[param2]
[return]
[XY    ]
[local1] <-- XY points here
[local2]
[local3]
[      ] <-- stack pointer points here

Now it's clear that whatever we do with the stack can't affect our local variables. We can push new data onto the stack if we need to, pop it off again, and call subroutines. Here are the addresses, relative to the XY pointer:

[param1] <-- XY+8
[param2] <-- XY+6
[return] <-- XY+4
[XY    ] <-- XY+2
[local1] <-- XY
[local2] <-- XY-2
[local3] <-- XY-4
[      ] <-- stack pointer points here

So rather than a SP + constant addressing mode, it's more useful to have an XY + constant addressing mode. Why save XY? If we adopt this programming model we need to make sure we don't thump XY which might have been set by the calling routine - and if we call a subroutine ourself, we expect XY to remain constant afterwards. So, we save it.

But what happens when we need to return to the caller? At the moment, our stack pointer isn't pointing at the return address so we have to add to it so it's pointing to the right place. We also need to restore XY. But there's one other thing: we need to sort out our return parameter, and we need to do that before we restore XY since that's the only reference we have.

There are two choices which can be made here: either the caller or the callee cleans up the stack. It's *cleaner* if the callee cleans things up, but it needs a lot of juggling to get things in the right place;
  • the return data is written over the first parameter at XY+8
  • the return address is copied from XY+4 and stored over the second parameter at XY+6
  • XY is restored from XY-2
  • 10 is added to the SP so it points at the return address
  • and finally, we return
On the other hand, if we make the caller responsible, we need only:
  • the return data is written over the first parameter at XY+8
  • SP has 6 added to it so it points at XY
  • XY restored by popping the stack; now the SP points to the return address
  • and return
The difference is that in the first case, after the return the stack pointer is pointing immediately after the return parameter, which can be popped straight off the stack. In the second case, it's pointing after the second parameter, which must be popped off and discarded before the return parameter is available.

We don't have to decide which way to go at the moment, so we won't.

[XY]+k

All this is much simplified if an [XY]+k addressing mode exists. So we implement it... what we will do is use the address register to hold the constant which we add in all cases when using this instruction. We need to add sixteen bits to XY to ensure that the rollover happens properly; we use a sixteen-bit two's-complement constant as part of the instruction. Alternatively, we might use an eight-bit constant - most accesses will surely be close to the stack pointer - but (a) this is more complex in the adder since the sign extension must be implemented across the top eight bits, and (b) a sixteen bit constant allows more scope for complex data structures.

Wednesday 3 August 2016

The stack pointer, the address register, and the X and Y registers

Program counter - update

Nothing significant - but one of the rather fine features of logisim is that it can handle datapaths of various widths. That means you can arrange the inputs and outputs to a circuit to be, say, 8 bits wide but showing a single connection and wire, which keeps things simple.

So here is a version of the 8-bit up counter modified to have an eight bit input and output, instead of eight individual bits. Right click and save as always...

Stack pointer

At first glance, the stack pointer is pretty similar to the program counter, but there are two significant differences: it doesn't need have a reset, but it needs to count in both directions.

We will build a stack which follows convention and (a) builds downwards as new items are added, and (b) always points at an empty location. To use it, early in the initialise sequence of the processor the stack pointer is set to a convenient point in ram - often the top of memory on a small system such as ours.

When we want to write to the stack, we just write at the stack pointer's current  position, and then let it auto-decrement so it points now at the next free location. To read from the stack, we auto-increment to point at the last used memory, and then read from it.

So we never need to clear the stack pointer (as we do the program counter) but we do need to count in both directions. Once again, we reach for the 74hc catalogue and out pops the 74hc191. (There's an excellent discussion of various counters here and I have cheerfully stolen their logisim implementation of the 191 rather than taking the time to do it myself - thanks guys).

The 191 has exactly the features we need - you'd almost think the designers had computers in mind - with the sole exception of being only four bits wide. Not a problem; we'll bolt a couple together in a similar manner to the 161 we used in the program counter.

One other minor difference - the 161, and so the counter we have built from it, counts on the falling edge of the clock pulse, while the 191 counts on the rising edge. Since we will eventually want to clock everything from a single clock, we might need to invert the clock feed to one or the other of these counters. We can decide which once we decide what else needs to happen on these edges - of which more later.

Logisim model: 74191
Logisim model: 8-bit up down counter with load

The address register

Here's one I haven't mentioned, but which has an essential function: when an address is provided as the second and third bytes of an instruction - say a load, store, or jump - then the address must somehow find its way onto the external address bus. (Note that we haven't considered yet just *how* the external address bus is driven by the program counter or the stack pointer - just take my word for it for now.)

We can't use the program counter; it has to remain pointing at the next byte of the instruction, or at the succeeding instruction. We can't use the stack pointer since it needs to remember the position of the stack. But what we can do is transfer the values as we read them to the address register, and then send the address register to the address bus.

It doesn't need to do anything other than this; it's just a latching register. Well, two latching registers, since it has to be sixteen bits wide and we can only access eight at a time, but you get the idea. We might use any number of 74hc series chips, but the 74hc574 is one possibility; on the rising edge of the clock it will latch the data presented at its inputs and if the output enable is active (low) then it will present the latched data at its outputs.

The X and Y registers

In a similar fashion, we can implement the X and Y registers using single 574s for each.

In total, we have eight actual or virtual 'registers' that talk to the address bus.

With this in place, we have most of the parts we need to put the addressing section together. What we haven't discussed, though, is getting the address out from the appropriate registers to the address bus. I'll cover this in the next section, along with the problem of getting the data in the registers to where we might need it.

Tuesday 2 August 2016

The program counter

The program counter 

Let's look at a short program in a hypothetical assembly language:

0000    ldx #7
0002    call 0x100
0005    dex
0006    jnz 0x0002

What this program does is simple; it calls a subroutine at address 0x100 seven times (it doesn't matter what happens at 0x100; sufficient for the example to know that it does something and then returns to the instruction after it was called).

The numbers on the left are the addresses of the first byte of the instruction - not all the instructions take the same amount of space since some of them carry data as well: the immediate value 7 for the ldx instruction, so it takes two bytes, and a two-byte address for the call and jump instructions, so they each take three bytes. The dex has no parameters so takes only one byte.

From this it should be clear that there are a number of things the program counter has to do.
  • it must increment between each memory access as the instruction is executed.
  • it must be possible to set a non-sequential value in the case of a jump or call.
  • and not illustrated but essential - it has to be possible to preset a value from which everything must start when the system is reset.
It's possible to do all this by running everything in a loop through the arithmetic logic unit (ALU), but I think there's a more elegant way: the 74HC161 pre-settable, resettable, binary counter datasheet (pdf).

Here's the internal connections of the HC161, courtesy the NXP datasheet:


(I chose the NXP datasheet since it can be simulated directly in Logisim; the Texas version is functionally identical but implemented with a slightly more complex flip-flop).

The Logisim circuit is here: right-click to download and save it. With it saved, you can either open it directly in Logisim, or you can import it and then use it as a component. Place it and double click to see the interior view.

If you replace the CK input with a clock component (from the wires section) you can let Logisim drive this counter on your behalf. You'll find that you need both enable lines (CEP and CET) high, and also the load input ~PE and the reset input ~MR.

Taking either enable line high will make the thing stop counting, remembering the count until they're both high again. Taking the ~MR line low at any time will force the output to all zeros, and it will remain at zero until the ~MR line is raised again. But the ~PE input is a bit different; if it is taken low, the values on the input will be transferred to the output - but only when the clock next ticks. This is exactly the behaviour we need for our program counter.

One small problem though: this counter is only four bits wide, and we need sixteen to cover our full address range. We need to tie four of these in series, but we'll do it in two groups of eight since we also need to access these in eight-bit bytes.

That's easy enough to implement:

Once again, it's implemented as a single component you can download here.