1.2  CENTRAL PROCESSING UNIT

A processor is a functional unit that interprets and carries out instructions. Every processor comes with a unique set of operations, such as ADD, STORE, or LOAD, that represents the processor's instruction set. Computer designers are fond of calling their computers machines, so the instruction set is sometimes referred to as machine instructions, and the binary language in which they are written is called machine language. You shouldn't confuse the processor's instruction set with the instructions found in high-level programming languages, such as BASIC or Pascal.

An instruction is made up of operations that specify the function to be performed and operands that represent the data to be operated on. For example, if an instruction is to perform the operation of adding two numbers, it must know what the two numbers are and where they are. When the numbers are stored in the computer's memory, each has an address that indicates where it is stored. So if an operand refers to data in the computer's memory, it is called an address. The processor's job is to retrieve instructions and operands from memory and to perform each operation. Having done that, it signals memory to send it the next instruction.
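As an illustration, here is a minimal C sketch of how a processor might split one instruction word into its operation and operand-address fields. The 16-bit format and the field widths are invented for the example; real instruction sets differ.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical 16-bit instruction word: the top 4 bits select the
   operation; the low 12 bits hold the operand's memory address. */
#define OPCODE(word)  ((uint16_t)(word) >> 12)
#define ADDRESS(word) ((uint16_t)(word) & 0x0FFF)

int main(void) {
    uint16_t instruction = 0x1A3C;  /* operation 0x1, operand address 0xA3C */
    printf("operation: %u, operand address: 0x%03X\n",
           (unsigned)OPCODE(instruction), (unsigned)ADDRESS(instruction));
    return 0;
}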

This step-by-step operation is repeated over and over again at awesome speed. A timer called a clock releases precisely timed electrical signals that provide a regular pulse for the processor's work. The term used to measure the computer's speed is borrowed from electrical engineering: the megahertz (MHz), meaning one million cycles per second. For example, in an 8-MHz processor, the computer's clock ticks 8 million times for every one-second tick of an ordinary clock, that is, once every 125 nanoseconds.

A processor is composed of two functional units, a control unit and an arithmetic/logic unit, and a set of special workspaces called registers.

1. The Control Unit

The control unit is the functional unit that is responsible for supervising the operation of the entire computer system. In some ways, it is analogous to a telephone switchboard with intelligence, because it makes the connections between the various functional units of the computer system and calls into operation each unit required by the program currently running.

The control unit fetches instructions from memory and determines their types; that is, it decodes them. It then breaks each instruction into a series of simple, small steps or actions. By doing this, it controls the step-by-step operation of the entire computer system.
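To make the fetch-decode-execute cycle concrete, here is a small C sketch of a toy machine. The four opcodes, the 4-bit opcode/12-bit address split, and the memory layout are all invented for the example, not taken from any real processor.

#include <stdio.h>
#include <stdint.h>

enum { LOAD, ADD, STORE, HALT = 15 };   /* invented 4-bit opcodes */

int main(void) {
    /* A tiny program: load memory[100], add memory[101], store the
       result to memory[102], halt. Top 4 bits: opcode; low 12: address. */
    uint16_t memory[4096] = {
        (LOAD  << 12) | 100,
        (ADD   << 12) | 101,
        (STORE << 12) | 102,
        (HALT  << 12)
    };
    memory[100] = 7;
    memory[101] = 5;

    uint16_t pc = 0, acc = 0;            /* program counter, accumulator */
    for (;;) {
        uint16_t ir   = memory[pc++];    /* fetch the next instruction   */
        uint16_t op   = ir >> 12;        /* decode: operation ...        */
        uint16_t addr = ir & 0x0FFF;     /* ... and operand address      */
        switch (op) {                    /* execute                      */
        case LOAD:  acc = memory[addr];         break;
        case ADD:   acc = acc + memory[addr];   break;
        case STORE: memory[addr] = acc;         break;
        case HALT:  printf("7 + 5 = %u\n", memory[102]); return 0;
        }
    }
}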

2. The Arithmetic and Logic Unit

The arithmetic and logic unit (ALU) is the functional unit that provides the computer with logical and computational capabilities. Data are brought into the ALU by the control unit, and the ALU performs whatever arithmetic or logic operations are required to help carry out the instruction.

Arithmetic operations include adding, subtracting, multiplying, and dividing. Logic operations make a comparison and take action based on the results. For example, two numbers might be compared to determine whether they are equal. If they are equal, processing will continue; if they are not equal, processing will stop.
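Written out in C, the logic operation just described might look like the following sketch; the function name and messages are invented.

#include <stdio.h>
#include <stdlib.h>

/* Compare two numbers; continue if they are equal, stop if not. */
static void compare_and_act(int a, int b) {
    if (a == b) {
        printf("%d == %d: processing continues\n", a, b);
    } else {
        printf("%d != %d: processing stops\n", a, b);
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    compare_and_act(3, 3);   /* equal: continues      */
    compare_and_act(3, 4);   /* not equal: stops here */
    return 0;
}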

3. Registers

A register is a storage location inside the processor. Registers in the control unit are used to keep track of the overall status of the program that is running. Control unit registers store information such as the current instruction, the location of the next instruction to be executed, and the operands of the instruction. In the ALU, registers store data items that are added, subtracted, multiplied, divided, and compared. Other registers store the results of arithmetic and logic operations.

An important factor that affects the speed and performance of a processor is the size of the registers. Technically, the term word size (also called word length) describes the size of an operand register, but it is also used more loosely to describe the size of the pathways to and from the processor. Currently, word sizes in general purpose computers range from 8 to 64 bits. If the operand registers of a processor are 16 bits wide, the processor is said to be a 16-bit processor.

KEYWORDS

instruction, clock, instruction set, megahertz (MHz), processor, control unit, operation (opcode), arithmetic and logic unit (ALU), operand, word size (word length), register, machine language

NOTES

1. processor: the functional unit that interprets and carries out instructions.

2. central processing unit (CPU): what we usually call the CPU is a microprocessor, a miniaturized processor whose elements are all integrated on one or more circuit chips. In a microcomputer with a fixed instruction set, the microprocessor consists of an arithmetic/logic unit and control logic; in a microcomputer with a microprogrammed instruction set, it also contains an additional control-storage unit.

3. operand: also called the object of an operation; the quantity an operation acts on.

4. register: a storage device with a specified storage capacity, such as one bit, one byte, or one computer word, usually intended for a specific purpose.

EXERCISES

1. Match the following terms to the appropriate definition.

(1) Processor.

(2) Instruction set.

(3) Clock.

(4) Machine language.

(5) Operation.

(6) Operand.

(7) Megahertz (MHz).

(8) Control unit.

(9) Arithmetic and logic unit (ALU).

(10) Register.

(11) Word size.

A. The part of an instruction that specifies the function that is to be performed.

B. The binary language in which a computer's instruction set is written.

C. A timer in a processor that releases precisely timed signals that provide a pulse for the processor's work.

D. A functional unit that interprets and carries out instructions.

E. A unique set of operations that comes with every processor.

F. The part of an instruction that tells where the data to be operated on are located.

G. Million cycles per second.

H. The functional unit that is responsible for supervising the operation of the entire computer system.

I. A functional unit that provides the computer with logical and computational capability.

J. The term used to describe the size of operand registers and buses.

K. A storage location inside the processor.

2. Fill in the blanks with appropriate words or phrases.

(1) We usually call our computers ______.

(2) An instruction set can sometimes be referred to as ______.

(3) The binary language is called ______.

(4) We don't confuse the processor's instruction set with the instructions of ______.

(5) An instruction consists of ______.

(6) An operand that refers to data in the memory is called an ______.

(7) A timer can give precisely timed ______.

(8) A processor includes two functional units; they are ______.

(9) The way in which the control unit works is analogous to ______.

(10) The control unit takes out the ______ from memory.

A. address

B. high-level programming languages

C. instructions

D. machines

E. electrical signals

F. machine instructions

G. a telephone switchboard with intelligence

H. operations and operands

I. machine language

J. control unit and ALU

READING MATERIALS

MULTIPROCESSING

1. Multiprocessing versus Uniprocessing

A uniprocessor (UP) can accomplish parallelism inside the processor itself. For example, the fixed point unit and the floating point unit can run several instructions within the same CPU (central processing unit) cycle. Power, Power 2 and Power PC architectures have a very high level of instruction parallelism. However, only one task at a time can be processed.

Uniprocessor designs have built-in bottlenecks. The address and data buses restrict data transfers to a one-at-a-time flow of traffic. The program counter forces instructions to be run in strict sequence. Even if improvements in performance are achieved by means of faster processors and more instruction parallelism, operations are still run in strict sequence. However, in a uniprocessor, an increase in processor speed is not the total answer because other factors, such as the system bus and memory, come into play.

Adding more processors seems to be a good solution to increase the overall performance of a system. Having more processors in the system increases the system throughput because the system can perform more than one task at a time. However, the increase in performance is not directly proportional to an increase in the number of processors because there are many other factors to be taken into consideration, such as resource sharing.

2. Multiprocessor Types

There are basically four different types of multiprocessors (MP).

(1) Shared-Nothing MP.

A shared-nothing MP has some of the following characteristics: Each processor is a stand-alone machine. Processors share nothing; each one has its own caches, memory and disks. Also, each processor runs a copy of the operating system. Processors can be interconnected by a LAN if they are loosely coupled, or by a switch if they are tightly coupled. Communication between processors is done via a message-passing library.

Examples of shared-nothing MPs are the IBM RS/6000 SP range, as well as Tandem, Teradata and most of the massively parallel machines, including Thinking Machines, Intel Touchstone, Ncube and so on.
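As a sketch of the message-passing style used on shared-nothing machines, here is a minimal C program using the MPI library; MPI is an illustrative choice of message-passing library, not one the text prescribes.

#include <mpi.h>
#include <stdio.h>

/* Two processes, each with private memory; the only way data moves
   between them is an explicit message. Run as: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                      /* lives only in process 0 ... */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);     /* ... until it is sent here   */
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}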

(2) Shared-Disks MP.

In a shared-disks MP, each processor has its own caches and memory, but disks are shared. Also, each processor runs a copy of the operating system. Processors are interconnected through a LAN or a switch. Communication between processors may be done via message passing.

Examples of shared-disks MPs are IBM RS/6000s with HACMP and DEC VAX-clusters.

(3) Shared-Memory Multiprocessor.

In this type of MP, all of the processors are tightly coupled inside the same box, with a high-speed bus or a switch between the processors, the I/O subsystem and the memory. Each processor has its own caches, but they all share the same global memory, disks and I/O devices. Only one copy of the operating system runs across all of the processors. This means that the operating system itself has to be designed to exploit this type of architecture (it has to be parallelized). Note that, since this work is done in the operating system, all single-thread applications that were running on uniprocessors can, in general, run without notable changes on a shared-memory MP.

Shared-memory MP is one of the most common multiprocessing implementations for transaction processing.

(4) Non-Uniform Memory Architecture (NUMA).

Here are some of the characteristics of the NUMA model:

·    Each group of processors (typically four) has a shared local memory.

·    Each group can access all memory from remote groups as a normal address space access.

·    Only one version of the operating system is needed.

·    Communication between groups is done via a high-speed bus or ring.

3. Symmetric versus Asymmetric Shared-Memory Multiprocessors

In an asymmetric shared-memory MP, processors are not equal: one processor is designated as the master processor, and the others are slave processors. The master processor is a general-purpose processor which can perform I/O operations as well as computation while running the operating system. Slave processors can only perform computation. On a slave processor, all I/O operations and general kernel requests are routed to the master processor. Utilization of the slave processors might be poor if the master processor does not service slave processor requests efficiently. As an example, I/O-bound jobs may not run efficiently, since only the master CPU runs the I/O operations.

In a symmetric shared-memory multiprocessor (SMP), all of the processors are functionally equivalent; they all can perform kernel, I/O and computational operations. The operating system manages a pool of identical processors, any one of which may be used to control any I/O device or to refer to any storage unit. Since all processors are equal, the system can be reconfigured in the event of a processor failure and, following a reboot, continue to run. Performance can be optimized, and more processors can be used.
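A brief C sketch of the symmetric idea using POSIX threads: the workers are identical, and the operating system's scheduler may run any of them on any processor. The worker's task here is invented for the example.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

/* Identical workers: on an SMP the scheduler may place any of them on
   any processor, and each may do computation, kernel calls, or I/O. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("worker %ld running; any CPU may execute it\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}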

COMPUTER SYSTEM ARCHITECTURE

1. Parallel Processing

Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system. Instead of processing each instruction sequentially, as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve faster execution time. For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system may have two or more ALUs and be able to execute two or more instructions at the same time. Furthermore, the system may have two or more processors operating concurrently. The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time. The amount of hardware increases with parallel processing, and with it, the cost of the system increases. However, technological development has reduced hardware costs to the point where parallel processing techniques are economically feasible.

Parallel processing can be viewed from various levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used. Shift registers operate in serial fashion, one bit at a time, while registers with parallel load operate with all the bits of the word simultaneously. Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. Parallel processing is established by distributing the data among the multiple functional units. For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.

There are a variety of ways that parallel processing can be classified. It can be considered from the internal organization of the processors, from the interconnection structure between processors, or from the flow of information through the system. One classification, introduced by M. J. Flynn, considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously. The normal operation of a computer is to fetch instructions from memory and execute them in the processor. The sequence of instructions read from memory constitutes an instruction stream. The operations performed on the data in the processor constitute a data stream. Parallel processing may occur in the instruction stream, in the data stream, or in both. Flynn's classification divides computers into four major groups as follows:

·    Single instruction stream, single data stream (SISD).

·    Single instruction stream, multiple data stream (SIMD).

·    Multiple instruction stream, single data stream (MISD).

·    Multiple instruction stream, multiple data stream (MIMD).

SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing.

SIMD represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data. The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.

MISD structure is only of theoretical interest since no practical system has been constructed using this organization.

MIMD organization refers to a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems can be classified in this category.

Flynn's classification depends on the distinction between the performance of the control unit and the data-processing unit. It emphasizes the behavioral characteristics of the computer system rather than its operational and structural interconnections.

2. Pipelining

Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned. The result obtained from the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments. The name “pipeline” implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in process in distinct segments at the same time. The overlapping of computation is made possible by associating a register with each segment in the pipeline. The registers provide isolation between each segment so that each can operate on distinct data simultaneously.

Perhaps the simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit. The register holds the data and the combinational circuit performs the suboperation in the particular segment. The output of the combinational circuit in a given segment is applied to the input register of the next segment. A clock is applied to all registers after enough time has elapsed to perform all segment activities. In this way the information flows through the pipeline one step at a time.

The pipeline organization will be demonstrated by means of a simple example. Suppose that we want to perform the combined multiply and add operations with a stream of numbers.

Ai×Bi+Ci    i = 1, 2, 3, 4

Each suboperation is to be implemented in a segment within a pipeline. Each segment has one or two registers and a combinational circuit, as shown in Fig. 1-3. R1 through R5 are registers that receive new data with every clock pulse. The multiplier and adder are combinational circuits. The suboperations performed in each segment of the pipeline are as follows:

R1 ← Ai,  R2 ← Bi                (Segment 1: input Ai and Bi)

R3 ← R1 × R2,  R4 ← Ci           (Segment 2: multiply, input Ci)

R5 ← R3 + R4                     (Segment 3: add)

Fig. 1-3  Pipeline processing

The five registers are loaded with new data every clock pulse. The effect of each clock pulse is shown in Table 1-4. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously. It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and receive the first output from R5. From there on, each clock pulse produces a new output and moves the data one step down the pipeline. This happens as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges from the pipeline.

Table 1-4  Content of registers and process steps

Clock    Segment 1          Segment 2             Segment 3
Pulse    R1      R2         R3         R4         R5
  1      A1      B1         --         --         --
  2      A2      B2         A1×B1      C1         --
  3      A3      B3         A2×B2      C2         A1×B1+C1
  4      A4      B4         A3×B3      C3         A2×B2+C2
  5      --      --         A4×B4      C4         A3×B3+C3
  6      --      --         --         --         A4×B4+C4
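To reproduce Table 1-4 in code, here is a small C simulation of the three-segment pipeline; it clocks all five registers at once by computing the next values before overwriting the current ones. The concrete data values are invented for the example.

#include <stdio.h>

/* Simulates the three-segment pipeline of Fig. 1-3 for Ai×Bi+Ci,
   i = 1..4.  All five registers are clocked together, so the next
   values are computed from the current ones before any register is
   overwritten.  0 stands for an empty register. */
int main(void) {
    double A[] = {1, 2, 3, 4}, B[] = {5, 6, 7, 8}, C[] = {9, 10, 11, 12};
    enum { N = 4 };
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    for (int pulse = 1; pulse <= N + 2; pulse++) {
        double r5 = R3 + R4;                              /* segment 3 */
        double r3 = R1 * R2;                              /* segment 2 */
        double r4 = (pulse >= 2 && pulse <= N + 1) ? C[pulse - 2] : 0;
        double r1 = (pulse <= N) ? A[pulse - 1] : 0;      /* segment 1 */
        double r2 = (pulse <= N) ? B[pulse - 1] : 0;
        R1 = r1; R2 = r2; R3 = r3; R4 = r4; R5 = r5;
        if (pulse >= 3)  /* it takes three pulses to fill the pipe */
            printf("pulse %d: A%d*B%d+C%d = %g\n",
                   pulse, pulse - 2, pulse - 2, pulse - 2, R5);
    }
    return 0;
}

With these inputs the program prints one result per pulse from pulse 3 onward (14, 22, 32, 44), matching the R5 column of Table 1-4.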

3. Vector Processing

There is a class of computational problems that are beyond the capabilities of a conventional computer. These problems are characterized by the fact that they require a vast number of computations that will take a conventional computer days or even weeks to complete. In many science and engineering applications, the problems can be formulated in terms of vectors and matrices that lend themselves to vector processing.

Computers with vector processing capabilities are in demand in specialized applications. The following are representative application areas where vector processing is of the utmost importance.

·    Long-range weather forecasting.

·    Petroleum explorations.

·    Seismic data analysis.

·    Medical diagnosis.

·    Aerodynamics and space flight simulations.

·    Artificial intelligence and expert systems.

·    Mapping the human genome.

·    Image processing.

Without sophisticated computers, many of the required computations cannot be completed within a reasonable amount of time. To achieve the required level of high performance, it is necessary to utilize the fastest and most reliable hardware and to apply innovative procedures from vector and parallel processing techniques.

Many scientific problems require arithmetic operations on large arrays of numbers. These numbers are usually formulated as vectors and matrices of floating-point numbers. A vector is an ordered set of a one-dimensional array of data items. A vector X of length n is represented as a row vector by X = [X1, X2, X3, …, Xn]. It may be represented as a column vector if the data items are listed in a column. A conventional sequential computer is capable of processing operands only one at a time. Consequently, operations on vectors must be broken down into single computations with subscripted variables. The element Xi of vector X is written as X(I), and the index I refers to a memory address or register where the number is stored. To examine the difference between a conventional scalar processor and a vector processor, consider the following Fortran DO loop:

      DO 20 I = 1, 100
20    C(I) = B(I) + A(I)

This is a program for adding two vectors A and B of length 100 to produce a vector C. It is implemented in machine language by the following sequence of operations:

      Initialize I = 0
20    Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I ≤ 100 go to 20
      Continue

This constitutes a program loop that reads a pair of operands from arrays A and B and performs a floating-point addition. The loop control variable is then updated and the steps repeat 100 times.

A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows operations to be specified with a single vector instruction of the form:

C(1:100) = A(1:100) + B(1:100)

The vector instruction includes the initial address of the operands, the length of the vector, and the operation to be performed, all in one composite instruction.

A possible instruction format for a vector instruction is that which is essentially a three-address instruction with three fields specifying the base address of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory. It is also possible to design the processor with a large number of registers and store all operands in registers prior to the addition operation. In that case the base address and length in the vector instruction specify a group of CPU registers.
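Here is a C sketch of the three-address vector instruction format just described: three base-address fields plus a vector-length field packed into one composite instruction. The struct layout and field names are invented for the example.

#include <stdio.h>
#include <stdint.h>

/* Invented encoding of the composite instruction
   C(1:100) = A(1:100) + B(1:100): three base-address fields plus a
   vector-length field, as described in the text. */
struct vector_instruction {
    uint8_t  opcode;    /* vector ADD (value is arbitrary here) */
    uint32_t src1;      /* base address of A  */
    uint32_t src2;      /* base address of B  */
    uint32_t dest;      /* base address of C  */
    uint32_t length;    /* number of elements */
};

/* One instruction covers the whole loop: no per-element instruction
   fetch, decode, or loop-control overhead in the instruction stream. */
static void execute(const struct vector_instruction *vi, float *mem) {
    for (uint32_t i = 0; i < vi->length; i++)
        mem[vi->dest + i] = mem[vi->src1 + i] + mem[vi->src2 + i];
}

int main(void) {
    float mem[1000];
    for (int i = 0; i < 100; i++) {
        mem[i]       = (float)i;          /* A at base address 0   */
        mem[200 + i] = 2.0f * (float)i;   /* B at base address 200 */
    }
    struct vector_instruction add = { 0, 0, 200, 400, 100 };
    execute(&add, mem);
    printf("C(100) = %g\n", mem[400 + 99]);  /* 99 + 198 = 297 */
    return 0;
}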

4. RISC

The world of microprocessors and CPUs can be divided into two parts: complex instruction set computers using CISC processors and reduced instruction set computers with RISC processors. Both seek to improve system performance, though paradoxically using directly opposite approaches.

The first microprocessors ever developed were very simple processors with very simple instruction sets. As microprocessors became more complex, more instructions were incorporated into their instruction sets. Current CISC microprocessor instruction sets may include over 300 instructions. Some of these instructions, such as register moves, are used frequently; others are very specialized and are used only rarely.

In general, the greater the number of instructions in an instruction set, the larger the propagation delay within the CPU. To illustrate this, consider the hardwired control unit for the Relatively Simple CPU. We used a 4-to-16 decoder to generate outputs corresponding to the 16 instructions in the instruction set. If the CPU had 32 instructions, a 5-to-32 decoder would have been needed. This decoder would require more time to generate its outputs than the smaller 4-to-16 decoder would, which would reduce the maximum clock rate of the CPU.

This led some designers to consider eliminating some rarely used instructions from the instruction sets of CPUs. They reasoned that reducing the propagation delay within the CPU would allow the CPU to run at a higher frequency, thus performing each instruction more quickly.

However, as in most engineering designs, there is a trade-off. The eliminated instructions generally correspond to specific statements in higher-level languages. Eliminating these instructions from the microprocessor's instruction set would force the CPU to use several instructions instead of one to perform the same function; this would invariably require more time. Depending on how frequently the eliminated instructions were needed, and the number of instructions needed to perform their functions, this approach might or might not improve system performance. Consider a CPU that has a clock period of 20 ns. It is possible to remove some instructions from its instruction set and reduce its clock period to 18 ns. These instructions comprise 2 percent of all code in a typical program and would each have to be replaced by three of the remaining instructions in an assembly language program. Assume that every instruction requires the same number of clock cycles, c, to be fetched, decoded, and executed. If the instructions were not removed from the CPU's instruction set, 100 percent of the instructions would require (20·c) ns to be processed. If they were removed, 98 percent of the program code would require (18·c) ns, since the CPU would have a shorter clock period. The remaining 2 percent of the code would require three times as many instructions, or (18·c·3) ns = (54·c) ns. Comparing the two yields the following result:

100% × (20·c)  vs.  98% × (18·c) + 2% × (54·c)

20·c  vs.  17.64·c + 1.08·c

20·c  >  18.72·c

On average, the CPU with fewer instructions yields better performance in this case. However, if the removed instructions constituted 10 percent of the typical program, removing them would actually reduce the overall performance of the CPU.
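The break-even arithmetic generalizes to any removed fraction and expansion factor. Here is a small C helper using the example's 20 ns and 18 ns figures (c cancels out of the comparison, so it is omitted); it confirms both the 18.72 result and the claim that a 10 percent fraction tips the balance the other way.

#include <stdio.h>

/* Average time per original instruction, in ns, after removing a
   fraction `removed` of the code: the survivors run at `new_period`,
   and each removed instruction is replaced by `expansion` survivors. */
static double reduced_set_time(double new_period, double removed,
                               double expansion) {
    return (1.0 - removed) * new_period +
           removed * new_period * expansion;
}

int main(void) {
    double full_set = 20.0;  /* 20 ns clock period, nothing removed */
    printf(" 2%% removed: %.2f vs %.2f ns\n",
           full_set, reduced_set_time(18.0, 0.02, 3.0));  /* 18.72, wins  */
    printf("10%% removed: %.2f vs %.2f ns\n",
           full_set, reduced_set_time(18.0, 0.10, 3.0));  /* 21.60, loses */
    return 0;
}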

As their names imply, CISC and RISC differ in the complexities of their instruction sets. CISC processors have larger instruction sets that often include some particularly complex instructions. These instructions usually correspond to specific statements in high-level languages. Intel's Pentium-class microprocessors fall into this category. In contrast, RISC processors exclude these instructions, opting for a smaller instruction set with simpler instructions. There are a number of other differences, which we discuss below.

RISC processors have fewer and simpler instructions than CISC processors. As a result, their control units are less complex and easier to design. This allows them to run at higher clock frequencies than CISC processors and reduces the amount of space needed on the processor chip, so designers can use the extra space for additional registers and other components. Simpler control units can also lead to reduced development costs. With a simpler design, it is easier to incorporate parallelism into the control unit of a RISC CPU.

With fewer instructions in their instruction sets, the compilers for RISC processors are less complex than those for CISC processors. As a general guideline, CISC processors were originally designed for assembly language programming, whereas RISC processors are geared toward compiled, high-level language programs. However, the same compiled high-level program will require more instructions for a RISC CPU than for a CISC CPU.

The CISC methodology offers some advantages as well. Although CISC processors are more complex, this complexity does not necessarily increase development costs. Current CISC processors are often the most recent addition to an entire family of processors, such as Intel's Pentium family. As such, they may incorporate portions of the designs of previous processors in their families. This reduces the cost of design and can improve reliability, since the previous designs have been proven to work.

CISC processors also provide backward compatibility with other processors in their families. If they are pin compatible, it may be possible simply to replace a previous generation processor with the newest model without changing the rest of the computer's design. This same backward compatibility, whether pin compatible or not, allows the CISC CPU to run the same software as used by the predecessors in its family. For instance, a program that runs successfully on a Pentium II should also run successfully on a Pentium III. This can translate into significant savings for the user and can determine the success or failure of a microprocessor.

The fate of the Neanderthals is a topic hotly debated by anthropologists. On one side are those who believe that Homo sapiens and Neanderthals intermingled, eventually becoming a single species. Opposing this view are those who believe that Homo sapiens out-competed Neanderthals, ultimately driving them to extinction. No one can say with certainty which view is correct. Likewise, no one can say who is right: the researchers who believe that RISC processors will replace CISC processors, or those who believe the two will coexist, if not intermingle. Some see the RISC processor as the Homo sapiens, out-competing its Neanderthal CISC counterpart. This topic has been strongly debated, rebutted, and counter-rebutted for the past several years.

However one assigns RISC and CISC to this analogy, intermingling is currently underway. CISC designs generally incorporate instruction pipelines, which have improved performance dramatically in RISC processors. As technology allows more devices to be incorporated into a single microprocessor chip, CISC CPUs are adding more registers to their designs, again to achieve the performance improvements they provide in RISC processors. Newer processor families, such as Power PC microprocessors, draw some features from RISC methodology and others from CISC, making them a hybrid of RISC and CISC.