Wang 2200VP Microarchitecture Description

The same design team that produced the first generation 2200 also produced the second generation CPU, internally known as the 2600. The microarchitecture document for the 2600 was substantially in place by October 1974. The architecture document, 2600 Calculator Structure, was authored by Norman Lourie, Bob Kolk, and Bruce Patterson, so they were probably the chief architects. The 2600 CPU was a complete redesign, incorporating the latest technology and a much more efficient microarchitecture. The 2200 MVP architecture document was very well done, leaving little to the imagination.

In the VP microarchitecture, microinstructions could operate on 8b and 16b operands in just 600 ns, whereas the first generation CPU only operated on 4b quantities in 1600 ns. The revised microarchitecture also had a larger AUX register file and a larger subroutine stack. Finally, even though there were only 3 bits more per microword (23b vs 20b), the instruction set was far richer in the new microarchitecture. As an extreme example, loading the PC register (this is the memory pointer, not the instruction pointer) took one instruction (600 ns) in the new microarchitecture versus four instructions (6400 ns) in the old.

Although the microarchitecture retained some of the flavor of the first generation, its differences were great enough that the BASIC interpreter had to be completely rewritten from scratch. Wang BASIC also got a major overhaul with many new features and was dubbed BASIC-2.

Bruce Patterson and Dave Angel wrote almost all the microcode for BASIC-2. Despite the complete rewrite and all the new features, BASIC-2 was 99% upwardly compatible with the original Wang BASIC. A BASIC program running on a 2600 CPU is about 8x faster than the exact same program running on a 2200T CPU; a factor of 2.5 of that was due to the faster cycle time of the machine, and the other factor of three came from the more powerful microarchitecture instruction set combined with more efficient algorithms.

Page 4 of Wang Systems Newsletter #4 has this comparison:

Q. How much faster is the "VP" than the "T" CPU?

A. That's a good question. In general, one can safely state that the VP is 6-8 times faster overall. To help compare the two CPU's, here are some timings against specific functions.

Function	2200VP	2200T
X+Y	0.11 ms	0.8 ms
X*Y	0.38 ms	3.9 ms
X/Y	0.76 ms	7.4 ms
X^Y	6.2 ms	45.4 ms
LOG	3.2 ms	23.2 ms
SQR	1.7 ms	46.4 ms
TAN	7.7 ms	78.5 ms
RND	0.27 ms	24.0 ms
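
The per-function ratios implied by the newsletter table can be worked out directly. The sketch below is only arithmetic on the figures quoted above; the names `TIMINGS_MS` and `speedup` are made up for this illustration and do not come from the newsletter.

```python
# Timings in milliseconds, copied from the newsletter table: (2200VP, 2200T).
TIMINGS_MS = {
    "X+Y": (0.11, 0.8),
    "X*Y": (0.38, 3.9),
    "X/Y": (0.76, 7.4),
    "X^Y": (6.2, 45.4),
    "LOG": (3.2, 23.2),
    "SQR": (1.7, 46.4),
    "TAN": (7.7, 78.5),
    "RND": (0.27, 24.0),
}


def speedup(function_name):
    """Return how many times faster the VP is than the T for one function."""
    vp_ms, t_ms = TIMINGS_MS[function_name]
    return t_ms / vp_ms


if __name__ == "__main__":
    for name in TIMINGS_MS:
        print(f"{name:4s} {speedup(name):6.1f}x")
```

On these figures the ratio is a little over 7x for X+Y, X^Y, and LOG, about 10x for X*Y, X/Y, and TAN, and considerably higher for SQR and RND, so "6-8 times" sits at the low end of these particular measurements.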

One great improvement in the 2600 CPU was that the microcode was no longer stored in ROMs -- it was downloaded from disk on start up, making it much easier to fix bugs in the field. This feature also made it possible to run diagnostics on the machine periodically to make sure the hardware was operating correctly.

Although the CPU microarchitecture was entirely incompatible, the I/O structure was kept from the first generation 2200, allowing people to upgrade to the VP without having to throw away all of their I/O cards and peripherals.

Microarchitecture Details

The following information is intended to give the flavor of the microarchitecture, but doesn't cover everything. The view of the CPU presented to the microprogrammer is as follows.

Table 1: Wang VP CPU Register Resources
Register name [array size]	Register width	Function
IC	16b	microcode instruction counter
ICSTACK[96]	16b	microcode return stack
PH, PL	16b (8b, 8b)	memory address pointer; scratch register
AUX[32]	16b	auxiliary PC file
F[8]	8b	scratch data registers
CH, CL	16b (8b,8b)	memory read data
K	8b	8b data to/from the I/O bus
SH	8b	high status register
SL	8b	low status register

IC points at the current microinstruction being executed. Each microinstruction is 24b wide, of which one is parity. Most microinstructions take six 10 MHz clock cycles, although a few take eight, eleven, or sixteen clocks. The IC can be loaded with a 16b immediate value (i.e., JUMP or CALL); its value can be saved on the next location in the ICSTACK or its value restored from the same.

ICSTACK holds return addresses from the microcode subroutine calls, and it can also be used to push the current PC (with a -3 to +3 offset) or to pop the newest value into the PC. The stack is 96 deep; if the call nesting gets deeper than 96 levels, the ICSTACK pointer just wraps around and overwrites the oldest entry.
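
The wraparound behavior can be pictured with a small circular-buffer model. This is a sketch only: the class name `IcStack` is hypothetical, and the pointer convention (advance before a push, step back after a pop) is an assumption, since the description above says only that the pointer wraps.

```python
STACK_DEPTH = 96  # from Table 1: ICSTACK[96]


class IcStack:
    """Toy model of a 96-entry return stack whose pointer wraps around.

    Assumption: the pointer advances before a push and steps back after a
    pop. The real hardware convention is not stated in the text.
    """

    def __init__(self):
        self.slots = [0] * STACK_DEPTH
        self.pointer = 0

    def push(self, value):
        self.pointer = (self.pointer + 1) % STACK_DEPTH
        self.slots[self.pointer] = value & 0xFFFF  # entries are 16b wide

    def pop(self):
        value = self.slots[self.pointer]
        self.pointer = (self.pointer - 1) % STACK_DEPTH
        return value


if __name__ == "__main__":
    stack = IcStack()
    for return_address in range(STACK_DEPTH + 1):  # one push too many
        stack.push(return_address)
    popped = [stack.pop() for _ in range(STACK_DEPTH)]
    print(popped[0], popped[-1])  # newest entry, then the oldest survivor
```

In this model the 97th push silently overwrites the slot that held the first return address, so that address can no longer be recovered by popping.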

PH, PL are respectively the high and low bytes of the 16b PC register. PC supplies the memory address when an instruction contains a memory access operation. The address is a byte address, which is what limits the architecture to accessing at most 64 KB of RAM. Later versions of the CPU added bank address bits (provided from SL) allowing more RAM to be addressed, although a single process never saw more than 64 KB. The register is often used like an accumulator to generate addresses that get stored elsewhere.

AUX[32] is a file of thirty two 16b registers. These are used for holding and supplying 16b values to the PC. They are required because saving/restoring the PC value to memory takes many microinstructions. When a value is transferred from the PC to an AUX register, the value can be adjusted by -3 to +3. This makes advancing a pointer through memory efficient.

F[8] is a file of eight 8b values. These are used as a scratch pad for holding the results of calculations from the ALU.

CH,CL are a pair of 8b registers that work together. Every memory read gets two bytes and the data is saved in CH,CL. Because PC is byte addressed, PC may be even or odd. The byte addressed by PC is saved in CH; the byte addressed by (PC^0x0001) is saved in CL.
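
As a minimal sketch of that read rule, the function below returns the pair (CH, CL) for a given PC. The name `read_pair` and the use of a Python `bytearray` as the RAM image are illustrative choices, not anything from the Wang documentation.

```python
def read_pair(memory, pc):
    """Model a memory read: return (CH, CL) for a 16b byte address pc.

    CH receives the byte at PC; CL receives the byte at PC with bit 0
    flipped, i.e. the other byte of the same aligned 16b word.
    """
    pc &= 0xFFFF
    ch = memory[pc]
    cl = memory[pc ^ 0x0001]
    return ch, cl


if __name__ == "__main__":
    ram = bytearray(64 * 1024)     # hypothetical 64 KB RAM image
    ram[0x1000] = 0xAA
    ram[0x1001] = 0xBB
    print(read_pair(ram, 0x1000))  # even address: CH comes from 0x1000
    print(read_pair(ram, 0x1001))  # odd address: CH comes from 0x1001
```

Whether PC is even or odd, CH always holds the byte that PC actually addresses, and CL holds its neighbor within the same word.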

K is another 8b register. It is used to send 8b values over the I/O bus or to capture 8b values read from the I/O bus.

Finally, there are two 8b status registers. SH contains a collection of ad hoc status/control bits that do things like hold the carry flag and detect when I/O operations have completed. SL is just an 8b read/write register that the microcode uses for various state control so it doesn't have to go to memory for this state.

You can see a very simple block diagram of the microarchitecture.

Microinstruction Encoding

There are a few different formats for microcode instructions. The 2200 MVP architecture document contains a wealth of information, including everything required to write the VP CPU emulation code. Because it is so well written, if you really want the details, see the source document. Below are some of the most important details, enough to provide an overview of what the microarchitecture was all about.

The software development manual contains a very helpful table of microword encodings. It has been recreated as an HTML table below.

Table 2: Wang VP Microinstruction Encoding
		22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	3	1
I. REGISTER INSTRUCTIONS		OPCODE					X		Carry		DD		C-BUS				A-BUS	B-BUS
OR	Or	0	0	0	0	0	X	0	CaCa		DD		CCCC				AAAA	BBBB
XOR	Exclusive Or	0	0	0	0	1	X	0	CaCa		DD		CCCC				AAAA	BBBB
AND	And	0	0	0	1	0	X	0	CaCa		DD		CCCC				AAAA	BBBB
SC	Binary Subtract with Carry	0	0	0	1	1	X	0	CaCa		DD		CCCC				AAAA	BBBB
DAC	Decimal Add with Carry	0	0	1	0	0	X	0	CaCa		DD		CCCC				AAAA	BBBB
DSC	Decimal Subtract with Carry	0	0	1	0	1	X	0	CaCa		DD		CCCC				AAAA	BBBB
AC	Binary Add with Carry	0	0	1	1	0	X	0	CaCa		DD		CCCC				AAAA	BBBB
M	Binary Multiply	0	0	1	1	1	X	0	HbHa		DD		CCCC				AAAA	BBBB
SHFT	Shift	0	0	0	HbHa		X	0	0	1	DD		CCCC				AAAA	BBBB
II. IMMEDIATE REGISTER INSTRUCTIONS		OPCODE					IMMEDIATE (HIGH)				DD		C-BUS				IMMEDIATE (LOW)	B-BUS
ORI	Or Immediate	0	1	0	0	0	IIII				DD		CCCC				IIII	BBBB
XORI	Exclusive Or Immediate	0	1	0	0	1	IIII				DD		CCCC				IIII	BBBB
ANDI	And Immediate	0	1	0	1	0	IIII				DD		CCCC				IIII	BBBB
AI	Binary Add Immediate	0	1	0	1	1	IIII				DD		CCCC				IIII	BBBB
DACI	Decimal Add with Carry Immediate	0	1	1	0	0	IIII				DD		CCCC				IIII	BBBB
DSCI	Decimal Subtract with Carry Immediate	0	1	1	0	1	IIII				DD		CCCC				IIII	BBBB
ACI	Binary Add with Carry Immediate	0	1	1	1	0	IIII				DD		CCCC				IIII	BBBB
MI	Binary Multiply Immediate	0	1	1	1	1	0	-	Hb	-	DD		CCCC				IIII	BBBB
III. MINI INSTRUCTIONS		OPCODE									DD							B-BUS
TAP	Transfer Aux to PC's	0	0	0	1	0	1	1	1	-	DD		0	- -		AxAxAxAxAx		BBBB
TPA	Transfer PC's to Aux	0	0	0	0	0	0	1	1	+/-	DD		0	InIn		AxAxAxAxAx		BBBB
XPA	Exchange PC's to Aux	0	0	0	0	0	1	1	1	+/-	DD		0	InIn		AxAxAxAxAx		BBBB
TPS	Transfer PC's to Stack	0	0	0	0	1	0	1	1	+/-	DD		0	InIn		- - - - -		BBBB
TSP	Transfer Stack to PC's	0	0	0	1	1	0	1	1	-	DD		- - - - - - - -					BBBB
SR,RCM	Read Control Memory + SR	0	0	0	0	1	1	1	1	-	- -		0	1	1	- - - - -		- - - -
SR,WCM	Write Control Memory + SR	0	0	0	0	1	1	1	1	-	- -		0	1	0	- - - - -		- - - -
SR	Subroutine Return	0	0	0	0	1	1	1	1	-	DD		0	0	- - - - - -			BBBB
CIO	Control Input/Output	0	0	1	0	1	1	1	1	-	0	0	S	TTT TTTT				- - - -
LPI	Load PC's Immediate	0	0	1	1	II		1	II		DD		IIII IIII IIII
IV. MASK BRANCH INSTRUCTIONS		OPCODE					BRANCH FIELD (LOW 10-Bits)										MASK	B-BUS
BT	Branch if True	1	1	0	0	Hb	RRRRRRRRRR										MMMM	BBBB
BF	Branch if False	1	1	0	1	Hb	RRRRRRRRRR										MMMM	BBBB
BEQ	Branch if = Mask	1	1	1	0	Hb	RRRRRRRRRR										MMMM	BBBB
BNE	Branch if != Mask	1	1	1	1	Hb	RRRRRRRRRR										MMMM	BBBB
V. REGISTER BRANCH INSTRUCTIONS		OPCODE					BRANCH FIELD (LOW 10-Bits)										A-BUS	B-BUS
BLR	Branch if < Register	1	0	0	0	X	RRRRRRRRRR										AAAA	BBBB
BLER	Branch if <= Register	1	0	0	1	X	RRRRRRRRRR										AAAA	BBBB
BER	Branch if = Register	1	0	1	0	0	RRRRRRRRRR										AAAA	BBBB
BNR	Branch if != Register	1	0	1	1	0	RRRRRRRRRR										AAAA	BBBB
VI. BRANCH INSTRUCTIONS		OPCODE					BRANCH FIELD (LOW 10-Bits)										BRANCH FIELD (HIGH 6-Bits)
SB	Subroutine Branch	1	0	1	0	1	RRRRRRRRRR										RRRRRR		- -
B	Unconditional Branch	1	0	1	1	1	RRRRRRRRRR										RRRRRR		- -
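
To make the layout of Table 2 concrete, here is a hedged sketch of field extraction from a 23b microword. The bit positions are my reading of the table's columns (opcode in bits 22-18, X in bit 17, carry in 15-14, DD in 13-12, C-BUS in 11-8, A-BUS in 7-4, B-BUS in 3-0); in particular, placing the high six branch bits in bits 7-2 is an inference from the two trailing ignored bits of SB and B. All function names are hypothetical.

```python
def field(word, high_bit, low_bit):
    """Extract bits high_bit..low_bit (inclusive) from a microword."""
    width = high_bit - low_bit + 1
    return (word >> low_bit) & ((1 << width) - 1)


def decode_register_op(word):
    """Split a register-format microword into its named fields."""
    return {
        "opcode": field(word, 22, 18),
        "x": field(word, 17, 17),
        "carry": field(word, 15, 14),
        "dd": field(word, 13, 12),
        "c_bus": field(word, 11, 8),
        "a_bus": field(word, 7, 4),
        "b_bus": field(word, 3, 0),
    }


def branch_target(word):
    """Rebuild the 16b target of SB/B from its low-10 and high-6 fields.

    Assumption: the high 6 bits sit in bits 7..2 of the microword.
    """
    low10 = field(word, 17, 8)
    high6 = field(word, 7, 2)
    return (high6 << 10) | low10


if __name__ == "__main__":
    # AC (opcode 00110), carry field 10, C-BUS = F2, A-BUS = F0, B-BUS = F1
    ac_word = (0b00110 << 18) | (0b10 << 14) | (0b0010 << 8) | (0b0000 << 4) | 0b0001
    print(decode_register_op(ac_word))
```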

Table 3: Microinstruction Encoding Key
AAAA	A-BUS Register Address
BBBB	B-BUS Register Address
CCCC	C-BUS Register Address
DD	Read/Write Specification 00 = no read/write 01 = read (CH<=MEM[PC]; CL<=MEM[PC^1]) 10 = write 1 (MEM[PC] <= C-BUS result) 11 = write 2 (MEM[PC^1] <= C-BUS result)
Hb, Ha	High/Low 4-bits of register Ha = 0: select low 4-bits of A-Bus register Ha = 1: select high 4-bits of A-Bus register Hb = 0: select low 4-bits of B-Bus register Hb = 1: select high 4-bits of B-Bus register
II...I	Immediate Operand
MMMM	Immediate Mask
AxAxAxAxAx	Address of auxiliary register
+/- In In	Increment/decrement specification 000 = PC's 001 = PC's + 1 010 = PC's + 2 011 = PC's + 3 100 = PC's 101 = PC's - 1 110 = PC's - 2 111 = PC's - 3
CaCa	Set carry (SH₀) specification 00 = do not set carry 10 = set carry to 0 before ALU operation 11 = set carry to 1 before ALU operation
X	Extended operation if X = 1
RR...R	Branch address
S	Set IOB flip-flops if S = 1
TTTTTTT	Strobe specification
-	Bit ignored (0 or 1 legal)
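
Two of the fields in the key lend themselves to a small worked decode: the DD read/write field and the three-bit "+/- In In" adjustment. The sketch below follows the encodings listed in the key; the function names and the 16b wraparound in `tpa` are assumptions made for illustration.

```python
DD_MEANING = {
    0b00: "no read/write",
    0b01: "read (CH<=MEM[PC]; CL<=MEM[PC^1])",
    0b10: "write 1 (MEM[PC] <= C-BUS result)",
    0b11: "write 2 (MEM[PC^1] <= C-BUS result)",
}


def pc_adjustment(sign_and_count):
    """Decode the 3b '+/- In In' field into a signed offset of -3..+3.

    Bit 2 is the sign (0 = add, 1 = subtract); bits 1..0 are the magnitude.
    """
    magnitude = sign_and_count & 0b011
    return -magnitude if sign_and_count & 0b100 else magnitude


def tpa(pc, sign_and_count):
    """Value stored into an AUX register by TPA: PC plus the adjustment.

    Assumption: the sum wraps at 16 bits.
    """
    return (pc + pc_adjustment(sign_and_count)) & 0xFFFF


if __name__ == "__main__":
    for code in range(8):
        print(f"{code:03b} -> {pc_adjustment(code):+d}")
```

Note that 000 and 100 both decode to no adjustment, which matches the two "PC's" rows in the key.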

Table 4: A-, B-, C-Bus Register Addressing
Binary Encoding	A-BUS	B-BUS	C-BUS
0000-0111	File registers (F0-F7)	F0-F7	F0-F7
1000	CL with PC's = PC's - 1	PL	PL
1001	CH with PC's = PC's - 1	PH	PH
1010	CL	CL	illegal
1011	CH	CH	illegal
1100	CL with PC's = PC's + 1	SL	SL
1101	CH with PC's = PC's + 1	SH	SH
1110	Dummy with PC's = PC's + 1	K	K
1111	Dummy with PC's = PC's - 1	Dummy	Dummy

When the A-BUS or B-BUS is specified as Dummy, a constant zero is supplied. When the C-BUS is specified as Dummy, it means the ALU result won't be stored to a register (although the result can still be stored to memory with a ",W1" or ",W2" specifier, if the microinstruction format has the DD field).
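
The Dummy rules can be sketched as a pair of helper functions over a dictionary of 8b registers. This is an illustration only: the register dictionary, the function names, and the choice to raise an exception on the encodings marked illegal are all assumptions (the text does not say what the hardware does with an illegal C-BUS encoding).

```python
B_BUS_NAMES = ["F0", "F1", "F2", "F3", "F4", "F5", "F6", "F7",
               "PL", "PH", "CL", "CH", "SL", "SH", "K", "Dummy"]

# None marks the two encodings that Table 4 lists as illegal for the C-BUS.
C_BUS_NAMES = ["F0", "F1", "F2", "F3", "F4", "F5", "F6", "F7",
               "PL", "PH", None, None, "SL", "SH", "K", "Dummy"]


def read_b_bus(registers, encoding):
    """Return the 8b B-BUS operand; Dummy supplies a constant zero."""
    name = B_BUS_NAMES[encoding & 0xF]
    if name == "Dummy":
        return 0
    return registers[name]


def write_c_bus(registers, encoding, value):
    """Store an ALU result; Dummy discards it, illegal encodings raise."""
    name = C_BUS_NAMES[encoding & 0xF]
    if name is None:
        raise ValueError("illegal C-BUS encoding")  # modeling choice only
    if name != "Dummy":
        registers[name] = value & 0xFF


if __name__ == "__main__":
    regs = {name: 0 for name in B_BUS_NAMES if name != "Dummy"}
    regs["K"] = 0x34
    write_c_bus(regs, 0b0000, read_b_bus(regs, 0b1110))  # copy K into F0
    print(hex(regs["F0"]))
```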

Table 5: Extended Operation Register Pairs
Binary Encoding	A-BUS	B-BUS	C-BUS
0000	F1, F0	F1, F0	F1, F0
0001	F2, F1	F2, F1	F2, F1
0010	F3, F2	F3, F2	F3, F2
0011	F4, F3	F4, F3	F4, F3
0100	F5, F4	F5, F4	F5, F4
0101	F6, F5	F6, F5	F6, F5
0110	F7, F6	F7, F6	F7, F6
0111	CL, F7	PL, F7	PL, F7
1000	CH, CL	PH, PL	PH, PL
1001	CL, CH	CL, PH	illegal
1010	CH, CL	CH, CL	illegal
1011	CL, CH	SL, CH	illegal
1100	CH, CL	SH, SL	SH, SL
1101	Dummy, CH	K, SH	K, SH
1110	Dummy, Dummy	Dummy, K	Dummy, K
1111	F0, Dummy	F0, Dummy	F0, Dummy

When a microinstruction has an X bit, X=0 means that an 8b operation is to be performed. When X=1, the instruction is converted into a 16b operation, where the first 8b acts on the registers as specified in the encoding, and the second half acts on the 8b operands selected by the register encoding + 1. Table 5 specifies the possible combinations. Note that the operation is a true 16b operation, not two 8b operations in a row: if the CaCa field indicates that carry is to be set or cleared, that happens before the first byte operation but not before the second, and for the 16b versions of BLR and BLER the comparison is a full 16b comparison, not just a comparison of the top bytes. When an extended microinstruction executes, the increment or decrement of the PC's that would occur for the 8b version is suppressed and the PC value is unaffected. When an extended mode instruction specifies a write to memory, only the high order byte of the result is written. Extended mode instructions take the same amount of time as normal mode instructions.
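
The "register encoding + 1" rule is enough to regenerate the B-BUS column of Table 5 from Table 4, which the sketch below does. It also models a 16b add in which the carry-in applies once, at the low byte. The function names are hypothetical, and the wrap from encoding 1111 back to 0000 is taken from the last row of Table 5.

```python
# B-BUS names from Table 4, indexed by the 4b register encoding.
B_BUS = ["F0", "F1", "F2", "F3", "F4", "F5", "F6", "F7",
         "PL", "PH", "CL", "CH", "SL", "SH", "K", "Dummy"]


def extended_pair(encoding, names=B_BUS):
    """Return the (high, low) register names used when X = 1."""
    low = names[encoding & 0xF]
    high = names[(encoding + 1) & 0xF]  # "register encoding + 1", wrapping
    return high, low


def add_with_carry_16(a, b, carry_in):
    """16b add as a single operation: carry_in is applied once, at the low byte.

    Returns (16b result, carry out).
    """
    total = (a & 0xFFFF) + (b & 0xFFFF) + (carry_in & 1)
    return total & 0xFFFF, total >> 16


if __name__ == "__main__":
    for encoding in range(16):
        high, low = extended_pair(encoding)
        print(f"{encoding:04b}  {high}, {low}")
```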

Finally, there are some pseudo-operations that the assembler supported. There is more than one way to achieve the same purpose, but the ones chosen by the assembler are as follows:

Table 6: Standard Pseudo Operations
Mnemonic	Actual Code	Meaning
NOP	ORI 0,,	Don't do anything (C-BUS gets zero)
MVI imm, dst	ORI imm,,dst	Move 8b immediate to register
MV src, dst	ORI 0,src,dst	8b register to register move
MVX src, dst	ORX 00,src,dst	16b register to register move

Microarchitecture Example Code

The above description gives many details, but they are best understood by looking at real code to see how they work together. In order to compare the VP microarchitecture to that of the 2200T CPU, I've attempted to re-write the code examples from the 2200 microarchitecture page (which was real microcode from a shipping CPU). Because I haven't tried to find the exact same code buried somewhere in BASIC-2, I've just written it myself; perhaps a more experienced VP microcoder could do a better job.

uCode Example #1A: 2200T
IC	Mnemonic	Behavior
02A1	TA 4	transfer the contents of AUX[4] to the PC register, wiping out the previous contents of PC
02A2	TP+2,R 4	transfer PC+2 back to AUX[4]; read the byte at RAM[PC], storing it in C. We increment by two because PC is a nibble address, and we are advancing to the next byte.
02A3	BNE 2,CL,02A5	jump to return if low nibble isn't 2
02A4	BEQ 0,CH,02A1	loop back if high nibble is 0; (note the nibble swap: this is seeking HEX(20), which is space)
02A5	SR	return to caller

uCode Example #1B: 2200VP
IC	Mnemonic	Behavior
0100	MVI 20,F0	space character
0101	TAP 4	transfer the contents of AUX[4] to the PC register, wiping out the previous contents of PC
0102	OR,R +,,	read RAM[PC] and save it in CH; increment PC
0103	BER CH,F0,0102	if the character is a space, get the next character
0104	TPA 4	CH still holds the first non-space character; AUX[4] points to the following byte
0105	SR	return to caller

uCode fragment #1 scans a line of code, skipping ahead until a non-space is found. On entry, AUX[4] contains the 16b pointer to the current byte being scanned; the routine returns with C containing the first non-space and AUX[4] pointing to the byte after it. Undoubtedly in the original source code the constant "4" would have been represented by a symbolic name.

The 2200T code takes four instructions (6.4 uS) per byte processed; the 2200VP code takes two instructions per byte (1.2 uS), which is about a five times speed difference. To be fair, the 2200VP code is one instruction longer and uses F0 as a scratch register.
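
For readers who find the microcode hard to follow, here is a rough Python rendering of what Example #1B does, together with the per-byte timing arithmetic. It is a behavioral sketch, not a cycle-accurate model: the function name, the constant names, and the use of `bytes` for RAM are all illustrative assumptions.

```python
SPACE = 0x20
VP_CYCLE_US = 0.6  # typical 2200VP microinstruction time
T_CYCLE_US = 1.6   # 2200T microinstruction time


def skip_spaces(memory, pointer):
    """Mirror Example #1B: return (first non-space byte, pointer past it)."""
    pc = pointer                # TAP 4
    while True:
        ch = memory[pc]         # OR,R +,,  reads MEM[PC] into CH ...
        pc = (pc + 1) & 0xFFFF  # ... and increments PC
        if ch != SPACE:         # BER CH,F0,0102 loops only on a space
            return ch, pc       # TPA 4 / SR


if __name__ == "__main__":
    line = b"   LET A=1"
    ch, next_pointer = skip_spaces(line, 0)
    print(chr(ch), next_pointer)
    per_byte_t = 4 * T_CYCLE_US    # four 2200T instructions per byte
    per_byte_vp = 2 * VP_CYCLE_US  # two 2200VP instructions per byte
    print(per_byte_t, per_byte_vp, per_byte_t / per_byte_vp)
```

The ratio works out to a little over five, in line with the figure quoted above.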

uCode Example #2A: 2200T
IC	Mnemonic	Behavior
03B9	ANDI 0E,ST1,ST1	clear bit 0 of ST1; this is the carry bit
03BA	ACI 0E,F0,F0	subtract two from the 16b quantity stored in {F3,F2,F1,F0}
03BB	ACI 0F,F1,F1
03BC	ACI 0F,F2,F2
03BD	ACI 0F,F3,F3
03BE	BF 1,ST1,03C4	test bit 0 of ST1 (carry); if there is no carry, we are done
03BF	XP-2 1	this and the next instruction simply decrement PC by 2 using AUX[1] as a temporary register
03C0	XP 1
03C1	AI,W1 0,F5,	store {F4,F5} in memory at the byte pointed at by PC
03C2	AI,W2 0,F4,
03C3	B 03B9	loop back to the start of the routine
03C4	SR	return from subroutine

This routine uses {F3,F2,F1,F0} as a 16b count of the number of nibbles to fill with a constant byte. The byte is supplied by {F5,F4}. The fill proceeds backwards, that is, PC initially points to one byte past where the fill should begin. This code takes 11 instructions (17.6 usec) per byte filled.

uCode Example #2B: 2200VP
IC	Mnemonic	Behavior
0100	SCX,0 F3F2,F3F2,F3F2	subtract {F3,F2} from itself with borrow, so that {F3,F2} = -1
0101	ANDI 0FE,SH,SH	clear the carry bit
0102	ACX F1F0,F3F2,F1F0	{F1,F0} = {F1,F0} + {F3,F2}
0103	BFL 1,SH,0107	test carry bit; if there is no carry, we are done
0104	OR -,,	decrement PC by 1
0105	ORI,W1 0,F4,	store F4 in memory at the byte pointed at by PC
0106	B 0101	loop back to the start of the routine
0107	SR	return from subroutine

In the VP version, things are changed a bit. Because the registers are 8b wide, let's assume {F1,F0} contains a byte count, and that F4 contains the fill byte. This code takes six instructions (3.6 usec) per byte filled, about five times faster. Allowing a couple more instructions, the VP code could be brought down to five instructions per byte. Allowing more extensive rearrangement, the inner loop could be brought down to two instructions:

uCode Example #2C: 2200VP
IC	Mnemonic	Behavior
0100	OR,W1 -,F4,	write F4 to MEM[PC]; PC=PC-1
0101	BLERX F1F0,PHPL,*-1	keep going while {F1,F0} <= PC
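
The two-instruction loop can be paraphrased in Python as follows. This is a sketch of my reading of Example #2C, in which {F1,F0} holds the lowest address to fill rather than a count; the function and parameter names are hypothetical.

```python
def fill_backwards(memory, pc, lowest_address, fill_byte):
    """Mirror Example #2C: fill from pc down to lowest_address inclusive.

    lowest_address plays the role of {F1,F0} and fill_byte the role of F4.
    Returns the final PC value.
    """
    while True:
        memory[pc] = fill_byte          # write F4 to MEM[PC] ...
        pc = (pc - 1) & 0xFFFF          # ... then PC = PC - 1
        if not (lowest_address <= pc):  # BLERX: loop while {F1,F0} <= PC
            return pc


if __name__ == "__main__":
    ram = bytearray(16)
    final_pc = fill_backwards(ram, 10, 5, 0xEE)
    print(ram.hex(), final_pc)
```

One caveat of this sketch: with a lowest address of zero, the 16b wraparound of PC means the loop condition never becomes false, so the model does not terminate in that case.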