CDC 6600
The CDC 6600, a member of the CDC 6000 series, was a mainframe computer from Control Data Corporation, first manufactured in 1964. It is generally considered to be the first successful supercomputer, outperforming the fastest machines of its era by a factor of about three. It remained the world's fastest computer from 1964 to 1969, when it relinquished that status to its successor, the CDC 7600.
The basic organization of the 6600 was also used to develop the simpler (and slower) CDC 6400, and later a version containing two 6400 processors, known as the CDC 6500. These machines were instruction-compatible with the 6600 but ran more slowly due to a different processor design. The 7600 was originally intended to be compatible as well, starting its life as the 6800, but compatibility was dropped during the design in favour of outright performance.
History and impact
CDC's first products were based on the machines designed at ERA, which Seymour Cray had been asked to update after moving to CDC. After an experimental machine known as the Little Character, they delivered the CDC 1604, one of the first commercial transistor-based computers and one of the fastest machines on the market. Management was delighted and made plans for a new series of machines more tailored to business use; for instance, they would include instructions for character handling and record keeping. Cray was not interested in such a project and set himself the goal of producing a new machine that would be 50 times faster than the 1604. When asked to complete a detailed report on plans one and five years into the future, he wrote back that his five-year goal was "to produce the largest computer in the world", and his one-year plan "to be one-fifth of the way".
Cray took his core team to new offices near the original CDC headquarters, where they started to experiment with higher-quality versions of the "cheap" transistors he had used in the 1604. After much experimentation they found that there was simply no way the germanium-based transistors could be run much faster than they had been in the 1604. In fact, the "business machine" that management had originally wanted, now taking shape as the CDC 3600, pushed them about as far as they could go. Cray then decided the solution was to work with the then-new silicon-based transistors from Fairchild Semiconductor, which were just coming onto the market and offered dramatically improved switching performance.
During this period CDC grew from a startup into a large company, and Cray became increasingly frustrated with what he saw as ridiculous management requirements. Things became considerably more tense in 1962, when the new 3600 neared production quality and appeared to be exactly what management wanted, when they wanted it. Cray eventually told CDC's CEO, William Norris, that something had to change or he would leave the company. Norris felt he was too important to lose and gave Cray the green light to set up a new lab wherever he wanted.
After a short search, Cray decided to return to his home town of Chippewa Falls, WI, where he purchased a block of land and started up a new lab. Although this move introduced a fairly lengthy delay in the design of his new machine, things started to progress quickly once the new lab was running. By this time the new transistors were becoming quite reliable, and modules built with them tended to work properly on the first try. Working with Jim Thornton, the system architect and the "hidden genius" behind the 6600, Cray soon had the machine taking shape.
About 50 CDC 6600s were sold over the machine's lifetime. Most of them went to various nuclear weapons laboratories, although some found their way into university computing labs as well. Cray immediately turned his attention to its replacement, this time setting a goal of ten times the performance of the 6600; the result was delivered as the CDC 7600. The later CDC Cyber 70 and 170 computers were very similar to the CDC 6600.
Description
Typical machines of the era used a single complex CPU to drive the entire system. A typical program would first load data into memory (often using pre-rolled library code), process it, and then write it back out. This required the CPUs to be fairly complex in order to handle the complete set of instructions they needed to run. A complex CPU implied a large CPU, introducing signalling delays while information flowed between the individual modules making it up. These delays set an upper limit on performance: the machine could only operate at a cycle speed that allowed the signals time to arrive at the next module.
Cray took another approach. At the time, CPUs generally ran slower than the main memory they were attached to. For instance, a processor might take 15 cycles to multiply two numbers, while each memory access took only one or two. This meant there was a significant amount of time during which the main memory sat idle. It was this idle time that the 6600's design exploited.
Instead of trying to make the CPU handle every task, the 6600's CPU handled arithmetic and logic only. This resulted in a much smaller CPU, which in turn allowed it to operate at a higher clock speed. Combined with the faster switching speed of the silicon transistors, the new CPU design easily outperformed anything then available. It ran with a 100 ns clock cycle (10 MHz), about ten times the rate of other machines on the market. The simpler processor also made individual operations faster; for instance, the CPU could complete a multiplication in only three cycles.
Of course, being simple, it could not do much on its own. A typical CPU of the era was also expected to handle all of the normal "housekeeping" tasks, such as memory access and input/output. Cray removed these instructions from the main CPU and instead implemented them in separate hardware. By allowing the CPU and I/O to operate in parallel, the design effectively doubled the performance of the machine.
Of course, this extra hardware could also have made the machine dramatically more expensive. Key to the 6600's design was therefore to make the I/O processors, known as Peripheral Processors or PPs, as simple as possible. The PPs were based on the simple 12-bit CDC 160A and ran much slower than the CPU, gathering up data and "squirting" it into main memory at high speed via dedicated hardware. To make up for their slow speed, the 6600 included ten PPs in total.
The PPs operated in a fashion known as "barrel and slot": the "barrel" was the set of ten PPs, and the "slot" the shared execution hardware that serviced them. In any given time slice, one PP occupied the slot and executed a single step of its program (if it had work to do) before control was handed to the next PP around the barrel. Programs were written, with some difficulty, to take advantage of the exact timing of the machine and avoid any "dead time" on the CPU. For instance, a program might use one PP to load data from a tape drive into an array stored in memory, another to copy an element of that array into a CPU register, another to have the register multiplied by a constant, and two more to copy the result back out to memory and then to tape. Because the CPU ran much faster than the PPs, each memory access required ten of these faster cycles to complete, so with ten PPs cycling through the slot, each PP was guaranteed one memory access per trip around the barrel.
The basis for the 6600 CPU is what we would today refer to as a RISC system, one in which the processor is tuned to execute comparatively simple instructions with limited and well-defined access to memory. The philosophy of many other machines favoured complicated instructions, for instance a single instruction which would fetch an operand from memory and add it to a value in a register. In the 6600, loading the value from memory required one instruction, and adding it required a second. While slower in theory due to the additional memory accesses, the separate load/store hardware allowed those accesses to overlap with other work, offsetting much of the expense. This simplification also forced programmers to be very aware of their memory accesses, and therefore to code deliberately to reduce them as much as possible.
The Central Processor (CP)
The Central Processor, or CP, has eight general-purpose 60-bit registers X0 through X7, eight 18-bit address registers A0 through A7, and eight 18-bit scratchpad registers B0 through B7 (typically used for array indexing, with B0 permanently set to zero). Additional registers used for bookkeeping (such as the scoreboard register) are not accessible to the programmer, and others (such as RA and FL) can only be loaded through the operating system. The CP has no instructions for input and output, which is accomplished through the Peripheral Processors (below).

In keeping with the RISC "load/store" philosophy, there are no instructions that read or write core memory directly. All memory accesses are performed by loading an address into one of the A registers: loading A1 through A5 with an address causes the data word at that location to be read into the corresponding X register (X1 through X5), while loading an address into A6 or A7 causes register X6 or X7 to be written out to memory at that address. (Registers X0 and A0 are not involved in load/store operations in this way.) A separate hardware load/store unit handles the actual data movement independently of the instruction stream, allowing other operations to complete while memory is being accessed, which takes (best case) eight cycles. In modern designs this sort of operation is normally supported by explicit load/store instructions that name a memory location directly, rather than the address registers used in the 6600. Floating-point operations were given pride of place in this architecture: the CDC 6600 (and its kin) stand virtually alone in being able to execute a 60-bit floating-point multiplication faster than a program branch.
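To make the A/X load/store convention concrete, here is a minimal Python sketch of the behaviour described above. It is illustrative only: register widths, masking, and timing are simplified, and the class and method names are invented for the example.

    # Minimal sketch of the 6600's A/X register convention: writing an address
    # into A1..A5 loads the word at that address into the matching X register,
    # while writing into A6..A7 stores the matching X register to memory.

    class CP6600Sketch:
        def __init__(self, memory_words=131072):
            self.memory = [0] * memory_words   # 60-bit words (held as Python ints)
            self.X = [0] * 8                   # X0..X7, 60-bit operand registers
            self.A = [0] * 8                   # A0..A7, 18-bit address registers

        def set_A(self, i, address):
            self.A[i] = address & 0o777777     # keep to 18 bits
            if 1 <= i <= 5:                    # A1..A5: implicit load into X1..X5
                self.X[i] = self.memory[self.A[i]]
            elif i in (6, 7):                  # A6..A7: implicit store from X6..X7
                self.memory[self.A[i]] = self.X[i] & (2**60 - 1)
            # A0 (and X0) take no part in load/store traffic

    cp = CP6600Sketch()
    cp.memory[100] = 42
    cp.set_A(1, 100)                 # loads memory[100] into X1
    cp.X[6] = cp.X[1] * 2
    cp.set_A(6, 200)                 # stores X6 into memory[200]
    print(cp.X[1], cp.memory[200])   # 42 84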
The CP included several parallel functional units, allowing multiple instructions to be worked on at the same time. Today this is known as a superscalar design; at the time it was simply "unique". The system read and decoded instructions from memory as fast as possible, generally faster than they could be completed, and handed them off to the functional units for processing. The units included two floating-point multipliers, a divider, an adder and a "long" adder, two incrementers, a shifter, a boolean logic unit, and a branch unit.
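The effect of having parallel units can be sketched as follows. This is a highly simplified model, not the 6600's actual scoreboard logic: the unit names and latencies are hypothetical, only one unit of each type is modelled, and issue is strictly in order.

    # Toy model of issuing instructions to parallel functional units.
    UNIT_LATENCY = {"multiply": 10, "add": 4, "shift": 3, "boolean": 3}

    def schedule(instructions):
        """instructions: list of (unit, dest_reg, src_regs); returns start/finish times."""
        unit_free_at = {u: 0 for u in UNIT_LATENCY}   # when each unit is next free
        reg_ready_at = {}                             # when each result register is ready
        timing = []
        for unit, dest, srcs in instructions:
            # an instruction waits for its unit and for its source operands
            start = max([unit_free_at[unit]] + [reg_ready_at.get(r, 0) for r in srcs])
            done = start + UNIT_LATENCY[unit]
            unit_free_at[unit] = done
            reg_ready_at[dest] = done
            timing.append((unit, dest, start, done))
        return timing

    # The add and boolean operations depend on no earlier result,
    # so they overlap with the slower multiply.
    for entry in schedule([("multiply", "X6", ["X1", "X2"]),
                           ("add",      "X7", ["X3", "X4"]),
                           ("boolean",  "X5", ["X0", "X3"])]):
        print(entry)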
Previously executed instructions were kept in an eight-word pipeline (officially called a "stack") held in onboard CP registers. Since the 15-bit instructions were packed four to a word, the system could pick any one of up to 32 previous instructions to run, depending on which units were free. The pipeline was always flushed by an unconditional jump, so a conditional jump could sometimes be faster (and was never slower). The system had a 10 megahertz clock, but used a four-phase signal to match the four-wide instruction words, so it could at times effectively operate at 40 MHz. A floating-point multiply took about three cycles and a divide about ten; overall performance, taking memory delays and other issues into account, was about 1 MFLOPS. Using the best available compilers, late in the machine's history, FORTRAN programs could expect to sustain about 0.5 MFLOPS.
Memory organization
User programs are restricted to a contiguous portion of core memory. The portion a program can access is controlled by the RA (Relative Address) and FL (Field Length) registers. When a user program tries to read or write a word of central memory at address a, the processor first checks that a is between 0 and FL-1; if the check passes, it accesses the word in central memory at address RA+a. This process is known as logical address translation: each user program sees core memory as a contiguous block of FL words starting at address 0, while in fact the program may be anywhere in physical memory. Using this technique, the operating system can move each user program around in core memory, as long as the RA register is updated to reflect its position. A user program trying to access memory outside the allowed range triggers an error and is terminated by the operating system; when this happens, a core dump is written to a file, giving the developer a way to see what happened. In contrast to virtual memory systems, however, the entirety of a process's addressable space must be in core memory. Support for virtual memory came much later, with the CDC Cyber 180 models.
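A minimal sketch of this check-and-offset translation, in Python for illustration only (the function and exception names are invented; on the real machine the out-of-range condition is handled by the operating system):

    # Sketch of RA/FL logical address translation: a user address `a` is valid
    # only when 0 <= a < FL, and maps to the physical address RA + a.

    class AddressRangeError(Exception):
        """Reference outside the program's field length (program would be aborted)."""

    def translate(a, RA, FL):
        if not (0 <= a < FL):
            raise AddressRangeError(f"address {a} outside field length {FL}")
        return RA + a

    # A program loaded at RA=40000 with FL=1000 sees addresses 0..999 only.
    print(translate(0, 40000, 1000))      # -> 40000
    print(translate(999, 40000, 1000))    # -> 40999
    # translate(1000, 40000, 1000) would raise AddressRangeError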
Peripheral Processors (PPs)
To handle the "housekeeping" tasks which other designs put in the CPU, Cray included ten other processors, based partly on his earlier computer, the CDC 160A. These machines, called Peripheral Processors, or PPs, were full computers in their own right, but were tuned for performing I/O tasks and running the operating system. One of the PPs was in overall control of the machine, including control of the program running on the main CPU, while the others were dedicated to various I/O tasks. When a program needed to perform some sort of I/O, it instead loaded a small program into one of these other machines and let it do the work. The PP would then inform the CPU with an interrupt when the task was complete.
Each PP included its own memory (up to 4096 12-bit words), used both for I/O buffering and for program storage, but the execution units were shared by the ten PPs in a configuration called the "barrel and slot". The execution units (the "slot") would execute one instruction cycle from the first PP, then one instruction cycle from the second PP, and so on in round-robin fashion. This was done both to reduce costs and because access to CP memory required 10 PP clock cycles: when a PP accesses CP memory, the data becomes available the next time the PP receives its slot time.
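The round-robin sharing can be sketched with a toy model like the following. It is purely illustrative: each PP is reduced to a program counter and a memory, and the "slot" is simply a loop that advances one PP per step.

    # Toy model of the barrel-and-slot arrangement: ten PP contexts share one
    # execution "slot", which performs one instruction cycle per PP in rotation.

    class PeripheralProcessor:
        def __init__(self, number):
            self.number = number
            self.pc = 0                      # program counter
            self.memory = [0] * 4096         # 4096 twelve-bit words

    def run_barrel(pps, rotations):
        trace = []
        for _ in range(rotations):
            for pp in pps:                   # the "barrel": PP0, PP1, ..., PP9
                # the "slot": one instruction cycle for this PP, then move on,
                # so every PP advances exactly once per rotation of the barrel
                pp.pc = (pp.pc + 1) % 4096
                trace.append((pp.number, pp.pc))
        return trace

    barrel = [PeripheralProcessor(n) for n in range(10)]
    print(run_barrel(barrel, 2)[:12])        # PPs are serviced strictly in turn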
Wordlengths, characters
The central processor had 60-bit words, whilst the peripheral processors had 12-bit words. CDC used the term "byte" to refer to 12-bit entities used by peripheral processors; characters were 6-bit, and central processor instructions were either 15 bits, or 30 bits with a signed 18-bit address field, the latter allowing for a directly addressable memory space of 128K words (converted to modern terms, with 8-bit bytes, this is 0.94 megabytes). The signed nature of the address registers limited an individual program to 128K. The actual CPU could have 256K words of memory (budget permitting). Central processor instructions started on a word boundary when they were the target of a jump statement or subroutine return jump instruction, so no-operations were sometimes required to fill out the last 15, 30 or 45 bits of a word.
The 6-bit characters, called display code, allowed up to 10 characters to be stored in a word. They permitted a character set of 64 characters, which is enough for all upper-case letters, digits, and some punctuation; certainly enough to write FORTRAN or print financial or scientific reports. There were actually two variations of the display code character set in use, 64-character and 63-character. The 64-character set had the disadvantage that two consecutive ":" (colon) characters might be interpreted as an end-of-line if they fell at the end of a 10-character word. A later variant, called 6/12 display code, was also used in the KRONOS and NOS timesharing systems to allow full use of the ASCII character set in a manner somewhat compatible with older software.
With no byte addressing instructions at all, code had to be written to pack and shift characters into words. The very large words, and comparatively small amount of memory, meant that programmers would frequently economise on memory by packing data into words at the bit level.
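As an illustration of that packing, here is a short Python sketch of storing ten 6-bit display-code characters in one 60-bit word. Only a handful of display-code values are included, and the function names are invented for the example.

    # Sketch of packing 6-bit display-code characters ten to a 60-bit word.
    # Only a few entries of the 64-character table are included here.
    DISPLAY_CODE = {'A': 0o01, 'B': 0o02, 'C': 0o03, '0': 0o33, '1': 0o34, ' ': 0o55}

    def pack_word(text):
        """Pack up to ten characters, left-justified and blank-filled, into one word."""
        word = 0
        for i in range(10):
            ch = text[i] if i < len(text) else ' '        # blank-fill short strings
            word |= DISPLAY_CODE[ch] << (6 * (9 - i))     # leftmost character in high bits
        return word

    def unpack_word(word):
        reverse = {v: k for k, v in DISPLAY_CODE.items()}
        codes = [(word >> (6 * (9 - i))) & 0o77 for i in range(10)]
        return ''.join(reverse.get(c, '?') for c in codes)

    w = pack_word('ABC01')
    print(oct(w), repr(unpack_word(w)))     # ten characters packed in a single integer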
Physical design
The machine was built in a plus-sign-shaped cabinet with a pump and heat exchanger in the outermost 18 inches of each of the four arms. Cooling was done with Freon circulating within the machine and exchanging heat to an external chilled-water supply. Each arm could hold four chassis, each about 8 inches thick, hinged near the center, and opening a bit like a book. The intersection of the "plus" was filled with cables which interconnected the chassis. The chassis were numbered from 1 (containing all 10 PPUs and their memories, as well as the rather minimal I/O channels) to 16. The main memory for the CPU was spread over many of the chassis.
The logic of the machine was packaged into modules about 2.5 inches square and about an inch thick. Each module had a connector (roughly 20 pins in each of 2 vertical rows) on one edge, and 6 test points on the opposite edge. The module was placed between two aluminum cold plates to remove heat. The module itself consisted of two parallel printed circuit boards, with components mounted either on one of the boards or between the two boards. This provided a very dense, if somewhat difficult to repair, package with good heat removal that was known as cordwood packaging.
Operating system and programming
If there was a sore point with the 6600, it was the operating system support, which took entirely too long to work out.
The machines originally ran a very simple job-control system known as CHOPS, the CHippewa OPerating System, which was quickly "thrown together" on the basis of the earlier CDC 3600 operating system in order to have something running to test the systems for delivery. However, the machines were intended to be delivered with a much more powerful system known as SIPROS (SImultaneous PRocessing Operating System), which was being developed at the company's System Sciences Division in Los Angeles. Customers were impressed with SIPROS's feature list, and many had SIPROS written into their delivery contracts.
SIPROS turned out to be a major fiasco. Development timelines continued to slip, costing CDC significant profit in the form of delivery-delay penalties. After several months of waiting with the machines ready to be shipped, the project was eventually cancelled. Luckily, the programmers who had worked on CHOPS had little faith in SIPROS (likely due largely to not-invented-here syndrome) and had continued improving CHOPS. Many customers eventually took delivery of their systems with this software instead, by then known as SCOPE (Supervisory Control Of Program Execution).
However, it was a third system, MACE, which allowed the machine to reach its potential. Written largely by a single programmer in the off-hours when machines were available, MACE squeezed every possible cycle out of the design for maximum performance. While its feature set was similar to that of the simple CHOPS/SCOPE, it ran many times faster. MACE was never an official product, although many customers were able to wrangle a copy from the company.
MACE was later used as the basis of KRONOS, which originated with a customer who wanted to use their 6400 as the basis of a time-sharing system. SCOPE was considered too slow to work well in this fashion, so MACE was adapted instead, becoming completely "official" when it was released in 1967. A further development added the missing SCOPE features to KRONOS to produce NOS, the Network Operating System. NOS was promoted heavily as the operating system for all CDC machines, so when a few SCOPE customers refused to switch to NOS, CDC simply renamed SCOPE to NOS/BE and was thus able to claim that everyone was running NOS.
External links
- Parallel operation in the Control Data 6600, James Thornton
- Presentation of the CDC 6600 and other machines designed by Seymour Cray – by C. Gordon Bell of Microsoft Research (formerly of DEC)