15A05402 - COMPUTER ORGANIZATION
Prepared by Ms. M. Latha Reddy, Assistant Professor
• JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR
• B. Tech III-I Sem. (ECE) L T P C 3 1 0 3
15A05402 - COMPUTER ORGANIZATION
• UNIT-I: Computer types, functional units, basic operational concepts, bus structures, data types. Software: languages and translators, loaders, linkers, operating systems. Memory locations, addresses and encoding of information; main memory operations; instruction formats and instruction sequences; addressing modes and instructions; simple input programming; pushdown stacks; subroutines.
• UNIT-II: Register transfer language, register transfer, bus and memory transfers, arithmetic micro-operations, logic micro-operations, shift micro-operations, arithmetic logic shift unit. Stack organization, instruction formats, addressing modes, data transfer and manipulation, execution of a complete instruction, sequencing of control signals, program control.
• UNIT-III: Control memory, address sequencing, microprogram example, design of control unit. Addition and subtraction, multiplication algorithms, division algorithms, floating-point arithmetic operations, decimal arithmetic unit, decimal arithmetic operations.
• UNIT-IV: Peripheral devices, input-output interface, asynchronous data transfer, modes of transfer, priority interrupt, direct memory access (DMA), input-output processor (IOP), serial communication. Memory hierarchy, main memory, auxiliary memory, associative memory, cache memory, virtual memory, memory management hardware.
• UNIT-V: Parallel processing, pipelining, arithmetic pipeline, instruction pipeline, RISC pipeline, vector processing, array processors. Characteristics of multiprocessors, interconnection structures, interprocessor arbitration, interprocessor communication and synchronization, cache coherence.
• Text Books:
1. M. Morris Mano, "Computer System Architecture", Prentice Hall of India (PHI), Third edition.
2.
William Stallings, "Computer Organization and Architecture", Pearson Education (PE), Seventh edition, 2006.
COURSE OUTCOMES
• C311.1 Identify functional units, bus structure and addressing modes.
• C311.2 Explain the functional units of the processor such as the register file and ALU.
• C311.3 Make use of memory, I/O devices and virtual memory effectively.
• C311.4 Explain the input/output devices.
• C311.5 Apply the algorithms for exploring pipelining and the basic characteristics of multiprocessors.
Basic Structure of Computers
Content Coverage
[Block diagram: Main Memory System (address, data/instruction) — Central Processing Unit (CPU), containing cache memory, operational registers, program counter, arithmetic and logic unit, instruction sets, and control unit — Input/Output System]
Advanced Reliable Systems (ARES) Lab., Jin-Fu Li, EE, NCU
Functional Units
A computer consists of three main parts:
A processor (CPU)
A main-memory system
An I/O system
The CPU consists of a control unit, registers, the arithmetic and logic unit, the instruction execution unit, and the interconnections among these components.
The information handled by a computer:
Instructions: govern the transfer of information within a computer as well as between the computer and its I/O devices, and specify the arithmetic and logic operations to be performed.
Data: numbers and encoded characters that are used as operands by the instructions.
Program
A list of instructions that performs a task is called a program. The program is usually stored in a memory called program memory. The computer is completely controlled by the stored program, except for possible external interruption by an operator or by I/O devices connected to the machine.
Information handled by a computer must be encoded in a suitable format. Most present-day hardware employs digital circuits that have only two stable states, 0 (OFF) and 1 (ON).
Memory Unit
The memory is the storage area in which programs are kept when they are running and that contains the data needed by the running programs.
Types of memory:
Volatile memory: storage that retains data only while it is receiving power, such as dynamic random access memory (DRAM).
Nonvolatile memory: memory that retains data even in the absence of a power source and that is used to store programs between runs, such as flash memory.
Usually, a computer has two classes of storage: primary memory and secondary memory.
Primary memory: also called main memory. Volatile memory used to hold programs while they are running; typically consists of DRAM in today's computers.
Secondary memory: nonvolatile memory used to store programs and data between runs; typically consists of magnetic disks in today's computers.
The memory consists of storage cells, each capable of storing one bit of information. The storage cells are processed in groups of fixed size called words. To provide easy access to any word in the memory, a distinct address is associated with each word location. The number of bits in each word is often referred to as the word length of the computer; typical word lengths range from 16 to 64 bits. The capacity of the memory is one factor that characterizes the size of a computer.
Instructions and data can be written into the memory or read out under the control of the processor. It is essential to be able to access any word location in the memory as quickly as possible. Memory in which any location can be reached in a short, fixed amount of time after specifying its address is called random-access memory (RAM). The time required to access one word is called the memory access time; this time is fixed, independent of the location of the word being accessed.
The memory of a computer is normally implemented as a memory hierarchy of three or four levels. The small, fast RAM units are called caches; the largest and slowest unit is referred to as the main memory.
Arithmetic and Logic Unit
Most computer operations are performed in the arithmetic and logic unit (ALU) of the processor. For example, consider two numbers stored in the memory that are to be added: they are brought into the processor, and the actual addition is carried out by the ALU. The sum may then be stored in the memory or retained in the processor for immediate use.
Typical arithmetic and logic operations: addition, subtraction, multiplication, division, comparison, complement, etc.
When operands are brought into the processor, they are stored in high-speed storage elements called registers. Each register can store one word of data.
Control Unit
The control unit is the nerve center that sends control signals to other units and senses their states. Thus the control unit serves as a coordinator of the memory, arithmetic and logic, and input/output units.
The operation of a computer can be summarized as follows:
The computer accepts information in the form of programs and data through an input unit and stores it in the memory.
Information stored in the memory is fetched, under program control, into an ALU, where it is processed.
Processed information leaves the computer through an output unit.
All activities inside the machine are directed by the control unit.
Computer Components: Top-Level View / Basic Operational Concepts
[Block diagram: memory and input/output connected over the system bus to the processor, which contains MAR, MDR, PC, IR, the control unit, the ALU, and n general-purpose registers R0, R1, ..., Rn-1]
A Partial Program Execution Example
Memory holds the instructions 1940 (at address 300), 5941 (at 301) and 2941 (at 302), and the data words 0003 (at 940) and 0002 (at 941). Opcode 1 loads a memory word into the accumulator (AC), opcode 5 adds a memory word to AC, and opcode 2 stores AC back to memory.
Step 1: the instruction at [PC] = 300 is fetched into IR (IR = 1940) and PC advances to 301.
Step 2: the load executes, AC ← [940] = 0003.
Step 3: the instruction 5941 is fetched into IR and PC advances to 302.
Step 4: the add executes, AC ← 0003 + 0002 = 0005.
Step 5: the instruction 2941 is fetched into IR and PC advances to 303.
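The fetch-decode-execute trace above can be sketched as a tiny simulator. This is a hypothetical mini-machine, not an implementation from the slides: opcode digit 1 loads AC from memory, 5 adds to AC, and 2 stores AC, matching the example's instruction words.

```python
# Hypothetical mini-machine mirroring the partial program execution example:
# each word is one decimal digit of opcode followed by a 3-digit address.
def run(memory, pc):
    ac = 0
    while pc in memory:
        ir = memory[pc]                  # fetch: instruction into IR
        pc += 1                          # PC now points at the next word
        opcode, addr = divmod(ir, 1000)  # decode: opcode digit, address field
        if opcode == 1:
            ac = memory[addr]            # load AC from memory
        elif opcode == 5:
            ac += memory[addr]           # add memory word to AC
        elif opcode == 2:
            memory[addr] = ac            # store AC back to memory
        else:
            break                        # unknown opcode: halt
    return ac, memory

mem = {300: 1940, 301: 5941, 302: 2941, 940: 3, 941: 2}
ac, mem = run(mem, 300)
print(ac, mem[941])   # 5 5
```

After the run, both AC and location 941 hold 0003 + 0002 = 0005, exactly as in step 6 of the trace.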
Step 6: the store executes, [941] ← 0005.
Interrupt
Normal execution of programs may be preempted if some device requires urgent servicing. To deal with the situation immediately, the normal execution of the current program must be interrupted.
Procedure of an interrupt operation:
The device raises an interrupt signal.
The processor provides the requested service by executing an appropriate interrupt-service routine.
The state of the processor is first saved before servicing the interrupt. Normally, the contents of the PC, the general registers, and some control information are stored in memory.
When the interrupt-service routine is completed, the state of the processor is restored so that the interrupted program may continue.
Classes of Interrupts
Program: generated by some condition that occurs as a result of an instruction execution, such as arithmetic overflow, division by zero, an attempt to execute an illegal machine instruction, or a reference outside a user's allowed memory space.
Timer: generated by a timer within the processor. This allows the operating system to perform certain functions on a regular basis.
I/O: generated by an I/O controller, to signal normal completion of an operation or to signal a variety of error conditions.
Hardware failure: generated by a failure such as a power failure or a memory parity error.
Bus Structures
A group of lines that serves as a connecting path for several devices is called a bus. In addition to the lines that carry the data, the bus must have lines for address and control purposes. The simplest way to interconnect functional units is to use a single bus connecting the input, output, memory, and processor units.
Drawbacks of the Single Bus Structure
The devices connected to a bus vary widely in their speed of operation. Some devices are relatively slow, such as printers and keyboards; some are considerably faster, such as optical disks; the memory and processor units are the fastest parts of a computer. An efficient transfer mechanism is therefore needed to cope with this mismatch.
A common approach is to include buffer registers with the devices to hold the information during transfers.
Another approach is to use a two-bus structure and an additional transfer mechanism: a high-performance bus, a low-performance bus, and a bridge for transferring data between the two buses. The AMBA bus belongs to this structure.
Software
In order for a user to enter and run an application program, the computer must already contain some system software in its memory. System software is a collection of programs that are executed as needed to perform functions such as:
Receiving and interpreting user commands
Running standard application programs such as word processors or games
Managing the storage and retrieval of files on secondary storage devices
Controlling I/O units to receive input information and produce output results
Translating programs from the source form prepared by the user into object form consisting of machine instructions
Linking and running user-written application programs with existing standard library routines, such as numerical computation packages
System software is thus responsible for the coordination of all activities in a computing system.
Operating System
An operating system (OS) is a large program, or actually a collection of routines, that is used to control the sharing of and interaction among various computer units as they perform application programs. The OS routines perform the tasks required to assign computer resources to individual application programs. These tasks include assigning memory and magnetic disk space to program and data files, moving data between memory and disk units, and handling I/O operations.
In the following, a system with one processor, one disk, and one printer is used to explain the basics of an OS. Assume that part of the program's task involves reading a data file from the disk into the memory, performing some computation on the data, and printing the results.
User Program and OS Routine Sharing
[Timeline t0-t5 showing printer, disk, OS-routine, and program activity. t0-t1: an OS routine initiates loading the application program from disk to memory, waits until the transfer is completed, and passes execution control to the application program.]
Multiprogramming or Multitasking
[Same timeline: while a disk or printer transfer is in progress, the processor is not left idle but can execute another program.]
Performance
The speed with which a computer executes programs is affected by the design of its hardware and its machine language instructions. Because programs are usually written in a high-level language, performance is also affected by the compiler that translates programs into machine language. For best performance, the following factors must be considered:
Compiler
Instruction set
Hardware design
Processor circuits are controlled by a timing signal called a clock. The clock defines regular time intervals, called clock cycles. To execute a machine instruction, the processor divides the action to be performed into a sequence of basic steps, such that each step can be completed in one clock cycle. Let P be the length of one clock cycle; its inverse is the clock rate, R = 1/P.
Basic performance equation: T = (N x S) / R, where T is the processor time required to execute a program, N is the number of instruction executions, and S is the average number of basic steps needed to execute one machine instruction.
Faster Clock = Shorter Running Time?
Faster steps do not necessarily mean a shorter total time: a 2 GHz clock that needs 20 steps to reach the solution can lose to a 1 GHz clock that needs only 4. [Source: B. Parhami, UCSB]
System Balance is Essential
Note that system balance is absolutely essential for improving performance. If one replaces a machine's processor with a model having twice the performance, this will not double the overall system performance unless corresponding improvements are made to other parts of the system. A CPU-bound task is dominated by processing time, while an I/O-bound task is dominated by input and output time. [Source: B. Parhami, UCSB]
Performance Improvement
Pipelining and superscalar operation:
Pipelining: overlapping the execution of successive instructions.
Superscalar: different instructions are concurrently executed with multiple instruction pipelines. This means that multiple functional units are needed.
Clock rate improvement:
Improving the integrated-circuit technology makes logic circuits faster, which reduces the time needed to complete a basic step.
Reducing the amount of processing done in one basic step also makes it possible to reduce the clock period, P.
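The basic performance equation T = (N x S) / R above can be checked numerically. The figures below (90 million instructions, 5 steps each, 1 GHz clock) are made-up illustration values, not from the slides:

```python
# Sketch of the basic performance equation T = (N x S) / R, with made-up
# numbers: N instruction executions, S basic steps per instruction,
# and clock rate R in Hz.
def exec_time(n_instr, steps_per_instr, clock_hz):
    return n_instr * steps_per_instr / clock_hz

# 90 million instructions, 5 basic steps each, on a 1 GHz clock (P = 1 ns):
t = exec_time(90e6, 5, 1e9)
print(t)   # 0.45 (seconds)

# Doubling R halves T only if N and S stay the same, which is the point
# made in the "faster clock" caveat below.
assert exec_time(90e6, 5, 2e9) == t / 2
```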
However, if the actions that have to be performed by an instruction remain the same, the number of basic steps needed may increase.
Reducing the number of basic steps needed to execute an instruction: reduced instruction set computers (RISC) versus complex instruction set computers (CISC).
Reporting Computer Performance
Measured or estimated execution times for three programs:

                 Time on machine X   Time on machine Y   Speedup of Y over X
Program A               20                 200                  0.1
Program B             1000                 100                 10.0
Program C             1500                 150                 10.0
All 3 programs        2520                 450                  5.6

Analogy: if a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50)/2 but is obtained from the fact that it travels 200 km in 3 hours. [Source: B. Parhami, UCSB]
Machine Instructions & Programs
Outline:
Numbers, Arithmetic Operations, and Characters
Memory Locations and Addresses
Memory Operations
Instructions and Instruction Sequencing
Addressing Modes
Assembly Language
Basic Input/Output Operations
Stacks and Queues
Subroutines
Linked Lists
Encoding of Machine Instructions
Number Representation
Consider an n-bit vector B = b_{n-1} ... b_1 b_0, where b_i = 0 or 1 for 0 <= i <= n-1. The vector B can represent unsigned integer values V in the range 0 to 2^n - 1, where
V(B) = b_{n-1} x 2^{n-1} + ... + b_1 x 2^1 + b_0 x 2^0
We need to represent positive and negative numbers for most applications. Three systems are used for representing such numbers:
Sign-and-magnitude
1's-complement
2's-complement
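The unsigned interpretation V(B) above can be evaluated directly, term by term. This short sketch covers only the unsigned case; the three signed systems are detailed next:

```python
# Evaluating V(B) = b_{n-1}*2^(n-1) + ... + b_1*2^1 + b_0*2^0 for an
# unsigned n-bit vector given as a bit string, mirroring the formula.
def unsigned_value(bits):
    n = len(bits)
    return sum(int(b) << (n - 1 - i) for i, b in enumerate(bits))

print(unsigned_value("1101"))                 # 13
assert unsigned_value("1" * 8) == 2**8 - 1    # all ones: top of the range
```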
Number Systems
In the sign-and-magnitude system, negative values are represented by changing the most significant bit from 0 to 1.
In the 1's-complement system, negative values are obtained by complementing each bit of the corresponding positive number. Forming the 1's-complement of a given number is equivalent to subtracting that number from 2^n - 1.
In the 2's-complement system, forming the 2's-complement of a given number is done by subtracting that number from 2^n. The 2's-complement of a number is obtained by adding 1 to the 1's-complement of that number.
An Example of Number Representations (4-bit values)

b3 b2 b1 b0   sign and magnitude   1's-complement   2's-complement
 0  1  1  1          +7                  +7               +7
 0  1  1  0          +6                  +6               +6
 0  1  0  1          +5                  +5               +5
 0  1  0  0          +4                  +4               +4
 0  0  1  1          +3                  +3               +3
 0  0  1  0          +2                  +2               +2
 0  0  0  1          +1                  +1               +1
 0  0  0  0          +0                  +0               +0
 1  0  0  0          -0                  -7               -8
 1  0  0  1          -1                  -6               -7
 1  0  1  0          -2                  -5               -6
 1  0  1  1          -3                  -4               -5
 1  1  0  0          -4                  -3               -4
 1  1  0  1          -5                  -2               -3
 1  1  1  0          -6                  -1               -2
 1  1  1  1          -7                  -0               -1

Addition of Numbers in 2's Complement
Example: +7 + (-3) = +4. In 4 bits, 0111 + 1101 = 1 0100; discarding the carry out of the sign position leaves 0100 = +4. (The pattern 1101 is 13 unsigned, which is why the raw sum reads 7 + 13.)
Sign Extension of 2's Complement
To represent a signed number in 2's-complement form using a larger number of bits, repeat the sign bit as many times as needed to the left.
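The 2's-complement rules above (complement-and-add-1, discard-the-carry addition, sign extension) can be exercised with two small helper functions. These helpers are illustrative sketches, not part of the course material:

```python
# Helpers for the n-bit 2's-complement representation in the table above.
def to_twos_complement(value, n):
    """n-bit 2's-complement pattern of a signed integer, as a bit string."""
    return format(value & ((1 << n) - 1), f"0{n}b")

def from_twos_complement(bits):
    """Signed value of a 2's-complement bit string."""
    n = len(bits)
    v = int(bits, 2)
    return v - (1 << n) if bits[0] == "1" else v

print(to_twos_complement(-3, 4))      # 1101
print(from_twos_complement("1000"))   # -8

# Sign extension: repeating the sign bit to the left preserves the value.
assert from_twos_complement("11111101") == from_twos_complement("1101") == -3

# The worked addition: +7 + (-3); the carry out of bit 3 is discarded.
s = (0b0111 + 0b1101) & 0b1111
assert from_twos_complement(format(s, "04b")) == 4
```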
Memory Locations
A memory consists of cells, each of which can store a bit of binary information (0 or 1). Because a single bit represents a very small amount of information, bits are seldom handled individually. The memory is usually organized so that a group of n bits can be stored or retrieved in a single, basic operation. Each group of n bits is referred to as a word of information, and n is called the word length. A unit of 8 bits is called a byte. Modern computers have word lengths that typically range from 16 to 64 bits.
Memory Addresses
Accessing the memory to store or retrieve a single item of information, either a word or a byte, requires distinct names or addresses for each item location. It is customary to use the numbers from 0 to 2^k - 1 as the addresses of successive memory locations; k bits then form the address, and 2^k is the size of the address space. For example, a 24-bit address generates an address space of 2^24 (16,777,216) locations.
Terminology: 2^10 = 1K (kilo), 2^20 = 1M (mega), 2^30 = 1G (giga), 2^40 = 1T (tera).
Memory Words
[Figure: a memory of w words, word 0 through word w-1, each 32 bits wide. A 32-bit word can hold a signed integer b31 b30 ... b1 b0, or four 8-bit ASCII characters.]
Big-Endian & Little-Endian Assignments
Byte addresses can be assigned across words in two ways: big-endian and little-endian. In the big-endian assignment, lower byte addresses are used for the more significant bytes of the word: the word at address 0 holds bytes 0, 1, 2, 3 from most to least significant. In the little-endian assignment, lower byte addresses are used for the less significant bytes: the word at address 0 holds bytes 3, 2, 1, 0 from most to least significant.
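The two byte-ordering schemes can be made concrete with Python's struct module. The 32-bit word below, holding the four ASCII characters "JOHN", is a made-up example value:

```python
# How one 32-bit word maps to byte addresses under each endianness scheme.
import struct

word = 0x4A4F484E                  # the ASCII codes of 'J', 'O', 'H', 'N'

big    = struct.pack(">I", word)   # big-endian: byte address 0 holds the MSB
little = struct.pack("<I", word)   # little-endian: byte address 0 holds the LSB

print(big)      # b'JOHN'
print(little)   # b'NHOJ'

# Reading big-endian data with a little-endian interpretation scrambles it:
assert struct.unpack("<I", big)[0] == 0x4E484F4A
```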
Memory Operations
Random-access memories must support two basic operations:
Write: writes data into the specified location.
Read: reads the data stored in the specified location.
In machine language programs, the two basic operations are usually called:
Store: the write operation.
Load: the read operation.
The Load operation transfers a copy of the contents of a specific memory location to the processor; the memory contents remain unchanged. The Store operation transfers an item of information from the processor to a specific memory location, destroying the former contents of that location.
Instructions
A computer must have instructions capable of performing four types of operations:
Data transfers between the memory and the processor registers
Arithmetic and logic operations on data
Program sequencing and control
I/O transfers
Register transfer notation: the contents of a location are denoted by placing square brackets around the name of the location. For example, R1 ← [LOC] means that the contents of memory location LOC are transferred into processor register R1. As another example, R3 ← [R1] + [R2] adds the contents of registers R1 and R2 and places their sum into register R3.
Assembly Language Notation
Types of instructions:
Zero-address instruction
One-address instruction
Two-address instruction
Three-address instruction
Zero-address instruction: operands are stored in a structure called a pushdown stack.
One-address instruction: instruction form is Operation Destination. For example, Add A adds the contents of memory location A to the contents of the accumulator register and places the sum back into the accumulator. As another example, Load A copies the contents of memory location A into the accumulator.
Two-address instruction: instruction form is Operation Source, Destination. For example, Add A, B performs the operation B ← [A] + [B]; when the sum is calculated, the result is sent to the memory and stored in location B. As another example, Move B, C performs the operation C ← [B], leaving the contents of location B unchanged.
Three-address instruction: instruction form is Operation Source1, Source2, Destination. For example, Add A, B, C adds A and B, and the result is sent to the memory and stored in location C. If k bits are needed to specify the memory address of each operand, the encoded form of the above instruction must contain 3k bits for addressing purposes, in addition to the bits needed to denote the Add operation.
Instruction Execution
How a program is executed: the processor contains a register called the program counter (PC), which holds the address of the instruction to be executed next. To begin executing a program, the address of its first instruction must be placed into the PC; the processor control circuits then use the information in the PC to fetch and execute instructions, one at a time, in order of increasing addresses.
Basic instruction cycle: START → Fetch Instruction → Execute Instruction → (repeat, or HALT).
A Program for C ← [A] + [B]
A three-instruction program segment, starting at address i (4-byte words), with the data for the program in locations A, B, and C:
i      Move A, R0
i+4    Add  B, R0
i+8    Move R0, C
Straight-Line Sequencing
A program to add n numbers, executed in straight-line order:
i        Move NUM1, R0
i+4      Add  NUM2, R0
i+8      Add  NUM3, R0
...
i+4n-4   Add  NUMn, R0
i+4n     Move R0, SUM
Branching
The same task written as a program loop, with locations N, NUM1, ..., NUMn holding the count n and the numbers to be added:
        Move  N, R1
        Clear R0
LOOP    (determine the address of the "next" number and add the "next" number to R0)
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
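The branching loop above can be sketched register by register in Python. The function name and argument layout are invented for illustration; each statement is annotated with the assembly line it mirrors:

```python
# Register-level sketch of the branching loop: R1 counts down from N,
# R0 accumulates the sum, and Branch>0 repeats the body while R1 > 0.
def sum_loop(num, n):
    r1 = n              # Move N, R1
    r0 = 0              # Clear R0
    i = 0
    while True:         # LOOP:
        r0 += num[i]    #   add the "next" number to R0
        i += 1          #   advance to the next number
        r1 -= 1         #   Decrement R1
        if not r1 > 0:  #   Branch>0 LOOP
            break
    return r0           # Move R0, SUM

print(sum_loop([5, 2, 9, 1], 4))   # 17
```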
Condition Codes
The processor keeps track of information about the results of various operations for use by subsequent conditional branch instructions. This is accomplished by recording the required information in individual bits, often called condition code flags. Four commonly used flags are:
N (negative): set to 1 if the result is negative; otherwise cleared to 0
Z (zero): set to 1 if the result is 0; otherwise cleared to 0
V (overflow): set to 1 if arithmetic overflow occurs; otherwise cleared to 0
C (carry): set to 1 if a carry-out results from the operation; otherwise cleared to 0
The N and Z flags are affected by arithmetic and logic operations; the V and C flags are affected by arithmetic operations.
Addressing Modes
Programmers use data structures to represent the data used in computations. These include lists, linked lists, arrays, queues, and so on. A high-level language enables the programmer to use constants, local and global variables, pointers, and arrays. When translating a high-level language program into assembly language, the compiler must be able to implement these constructs using the facilities in the instruction set of the computer. The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes.
Generic Addressing Modes

Name                         Assembler syntax   Addressing function
Immediate                    #Value             Operand = Value
Register                     Ri                 EA = Ri
Absolute (Direct)            LOC                EA = LOC
Indirect                     (Ri)               EA = [Ri]
                             (LOC)              EA = [LOC]
Index                        X(Ri)              EA = [Ri] + X
Base with index              (Ri,Rj)            EA = [Ri] + [Rj]
Base with index and offset   X(Ri,Rj)           EA = [Ri] + [Rj] + X
Relative                     X(PC)              EA = [PC] + X
Autoincrement                (Ri)+              EA = [Ri]; increment Ri
Autodecrement                -(Ri)              Decrement Ri; EA = [Ri]

EA: effective address. Value: a signed number.
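A few rows of the addressing-mode table above can be exercised with a toy register file and memory. All register contents, addresses, and operand values here are made up for illustration:

```python
# Effective-address (EA) calculations for some rows of the table above.
reg = {"R1": 1000, "R2": 40}
mem = {1000: 77, 1020: 88, 1040: 99}

ea_indirect = reg["R1"]                # (R1):    EA = [R1]        -> 1000
ea_index    = reg["R1"] + 20           # 20(R1):  EA = [R1] + 20   -> 1020
ea_base_idx = reg["R1"] + reg["R2"]    # (R1,R2): EA = [R1] + [R2] -> 1040

print(mem[ea_indirect], mem[ea_index], mem[ea_base_idx])   # 77 88 99

# Autoincrement (R1)+: use [R1] as the EA, then advance R1 by the word
# size (4 bytes assumed here) to point at the next item in a list.
ea = reg["R1"]
reg["R1"] += 4
assert ea == 1000 and reg["R1"] == 1004
```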
Register, Absolute and Immediate Modes
Register mode: the operand is the contents of a processor register; the name (address) of the register is given in the instruction. For example, Add Ri, Rj adds the contents of Ri and Rj and stores the result in Rj.
Absolute mode: the operand is in a memory location; the address of this location is given explicitly in the instruction. (In some assembly languages, this mode is called Direct.) For example, Move LOC, R2 moves the contents of the memory location with address LOC into register R2. The Absolute mode can represent global variables in a program, for example variables introduced by a declaration such as Integer A, B;
Immediate mode: the operand is given explicitly in the instruction.
Indirection and Pointers
Indirect mode: the effective address of the operand is the contents of a register or memory location whose address appears in the instruction. Indirection is denoted by placing the name of the register or the memory address given in the instruction in parentheses. The register or memory location that contains the address of an operand is called a pointer.
Two Types of Indirect Addressing
Through a memory location: Add (A), R0 — memory location A contains the address B, and location B contains the operand.
Through a general-purpose register: Add (R1), R0 — register R1 contains the address B, and location B contains the operand.
[Register indirect addressing diagram: the instruction holds an opcode and a register address R; the selected register holds a pointer to the operand in memory.]
Using Indirect Addressing in a Program
        Move  N, R1          (initialization)
        Move  #NUM1, R2
        Clear R0
LOOP    Add   (R2), R0
        Add   #4, R2
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
Indexing and Arrays
Index mode: the effective address of the operand is generated by adding a constant value to the contents of a register. The register used may be either a special register provided for this purpose or, more commonly, any one of a set of general-purpose registers in the processor; it is referred to as an index register. The index mode is useful in dealing with lists and arrays.
We denote the Index mode symbolically as X(Ri), where X denotes the constant value contained in the instruction and Ri is the name of the register involved. The effective address of the operand is given by EA = X + [Ri]. The contents of the index register are not changed in the process of generating the effective address.
Indexed Addressing: Offset Given as a Constant
Add 20(R1), R2 — with [R1] = 1000 and offset 20, the operand is at address 1020.
Indexed Addressing: Offset in the Index Register
Add 1000(R1), R2 — with [R1] = 20 and constant 1000, the operand is again at address 1020.
An Example of Indexed Addressing
A list at LIST holds, for each student, a student ID followed by three test scores (for the first student, at LIST+4, LIST+8, and LIST+12); N holds the number of students. The following loop sums each test over all students:
        Move  #LIST, R0
        Clear R1
        Clear R2
        Clear R3
        Move  N, R4
LOOP    Add   4(R0), R1      (Test 1)
        Add   8(R0), R2      (Test 2)
        Add   12(R0), R3     (Test 3)
        Add   #16, R0
        Decrement R4
        Branch>0  LOOP
        Move  R1, SUM1
        Move  R2, SUM2
        Move  R3, SUM3
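The student-record loop above can be sketched in Python. The flat list stands in for the memory block at LIST (4-byte words, so offsets 4, 8, 12 become element indices 1, 2, 3); the record values are made up:

```python
# Sketch of the indexed-addressing example: each record is four words
# (ID, Test 1, Test 2, Test 3); offsets 4, 8, 12 off the base pick the tests.
def sum_tests(records):
    # records: flat list [id, t1, t2, t3, id, t1, ...], mirroring LIST
    r1 = r2 = r3 = 0                        # Clear R1, R2, R3
    for base in range(0, len(records), 4):  # Add #16, R0 each iteration
        r1 += records[base + 1]             # Add 4(R0), R1  -> Test 1
        r2 += records[base + 2]             # Add 8(R0), R2  -> Test 2
        r3 += records[base + 3]             # Add 12(R0), R3 -> Test 3
    return r1, r2, r3                       # SUM1, SUM2, SUM3

print(sum_tests([101, 60, 70, 80,
                 102, 65, 75, 85]))         # (125, 145, 165)
```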
Variations of Indexed Addressing Mode
A second register may be used to contain the offset X, in which case we can write the Index mode as (Ri,Rj). The effective address is the sum of the contents of registers Ri and Rj. The second register is usually called the base register. This mode implements a two-dimensional array.
Another version of the Index mode uses two registers plus a constant, denoted X(Ri,Rj). The effective address is the sum of the constant X and the contents of registers Ri and Rj. This mode implements a three-dimensional array.
Additional Modes
Autoincrement mode: the effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list. The Autoincrement mode is denoted as (Ri)+.
Autodecrement mode: the contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand. The Autodecrement mode is denoted as -(Ri).
An Example of Autoincrement Addressing
        Move  N, R1
        Move  #NUM1, R2
        Clear R0
LOOP    Add   (R2)+, R0
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
Assembly Language
A complete set of symbolic names and rules for their use constitutes a programming language, generally referred to as an assembly language. Programs written in an assembly language can be automatically translated into a sequence of machine instructions by a program called an assembler. When the assembler program is executed, it reads the user program, analyzes it, and then generates the desired machine language program. The user program in its original alphanumeric text format is called a source program, and the assembled machine language program is called an object program.
Assembler Directives
In addition to providing a mechanism for representing instructions in a program, the assembly language allows the programmer to specify other information needed to translate the source program into the object program. Suppose that the name SUM is used to represent the value 200. This fact may be conveyed to the assembler program through a statement such as
SUM EQU 200
This statement does not denote an instruction that will be executed when the object program is run; it will not even appear in the object program. Such statements are called assembler directives (or commands).
Assembly Language Representation and Memory Arrangement
The indirect-addressing program above, written with assembler directives. In the resulting memory arrangement, the machine instructions occupy addresses 100-128 (Move N, R1 at 100; Move #NUM1, R2 at 104; Clear R0 at 108; Add (R2), R0 at 112; Add #4, R2 at 116; Decrement R1 at 120; Branch>0 LOOP at 124; Move R0, SUM at 128), SUM is location 200, N is at 204 with initial value 100, and NUM1 begins the reserved block at 208:

SUM    EQU      200          (assembler directives)
       ORIGIN   204
N      DATAWORD 100
NUM1   RESERVE  400
       ORIGIN   100
START  MOVE     N, R1        (statements that generate machine instructions)
       MOVE     #NUM1, R2
       CLR      R0
LOOP   ADD      (R2), R0
       ADD      #4, R2
       DEC      R1
       BGTZ     LOOP
       MOVE     R0, SUM
       RETURN
       END      START        (assembler directive)
Number Notation
When dealing with numerical values, most assemblers allow numerical values to be specified in different ways. For example, consider the number 93, which is represented by the 8-bit binary number 01011101. If the value is to be used as an immediate operand:
• It can be given as a decimal number, as in the instruction Add #93, R1.
• It can be given as a binary number, as in the instruction Add #%01011101, R1 (a binary number is identified by a prefix symbol such as the percent sign).
• It can be given as a hexadecimal number, as in the instruction Add #$5D, R1 (a hexadecimal number is identified by a prefix symbol such as the dollar sign).

Basic Input/Output Operations
[Figure: bus connection for the processor, keyboard, and display. DATAIN and DATAOUT are buffer registers; SIN and SOUT are status control flags.]

Wait Loop
In order to perform I/O transfers, we need machine instructions that can check the state of the status flags and transfer data between the processor and an I/O device.
Wait loop for a Read operation:
    READWAIT   Branch to READWAIT if SIN = 0
               Input from DATAIN to R1
Wait loop for a Write operation:
    WRITEWAIT  Branch to WRITEWAIT if SOUT = 0
               Output from R1 to DATAOUT
We assume that the initial state of SIN is 0 and the initial state of SOUT is 1.
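As a quick illustration (not from the slides), the three notations can be checked with Python's base conversions; the % and $ prefix conventions are the ones named in the text:

```python
# Sketch of an assembler's immediate-value conversion, using the prefix
# conventions from the text: % marks binary, $ marks hexadecimal, and a
# bare number is decimal.
def parse_immediate(text):
    if text.startswith("%"):
        return int(text[1:], 2)   # binary, e.g. %01011101
    if text.startswith("$"):
        return int(text[1:], 16)  # hexadecimal, e.g. $5D
    return int(text, 10)          # decimal, e.g. 93

# The three spellings of the operand in Add #93, R1 name the same value:
assert parse_immediate("93") == parse_immediate("%01011101") == parse_immediate("$5D") == 93
```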
Memory-Mapped I/O
Many computers use an arrangement called memory-mapped I/O in which some memory address values are used to refer to peripheral device buffer registers, such as DATAIN and DATAOUT. Thus no special instructions are needed to access the contents of these registers; data can be transferred between these registers and the processor using instructions that we have already discussed, such as Move, Load, or Store. Also, the status flags SIN and SOUT can be handled by including them in device status registers, one for each of the two devices.

Read and Write Programs
Assume that bit b = 3 in registers INSTATUS and OUTSTATUS corresponds to SIN and SOUT, respectively.
Read loop:
    READWAIT   Testbit   #3, INSTATUS
               Branch=0  READWAIT
               MoveByte  DATAIN, R1
Write loop:
    WRITEWAIT  Testbit   #3, OUTSTATUS
               Branch=0  WRITEWAIT
               MoveByte  R1, DATAOUT

Stacks and Queues
A stack is a list of data elements, usually words or bytes, with the accessing restriction that elements can be added or removed at one end of the list only. It is also called a last-in-first-out (LIFO) stack. A stack has two basic operations: push and pop. The terms push and pop describe placing a new item on the stack and removing the top item from the stack, respectively. Another useful data structure that is similar to the stack is the queue. Data are stored in and retrieved from a queue on a first-in-first-out (FIFO) basis. Two pointers are needed to keep track of the two ends of the queue.

A Stack of Words in the Memory
[Figure: the stack grows toward lower addresses. The stack pointer register SP points to the current top element (-28 in the example), with elements 17, 739, ..., down to the bottom element 43 at the high-address end (BOTTOM).]
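A minimal Python sketch of the two disciplines (illustrative only):

```python
from collections import deque

# LIFO stack: items are added and removed at the same end (the top).
stack = []
stack.append(739)            # push
stack.append(17)             # push
top = stack.pop()            # pop returns the most recently pushed item

# FIFO queue: items leave from the end opposite the one they entered,
# which is why two pointers are needed, as the text notes.
queue = deque()
queue.append(739)
queue.append(17)
front = queue.popleft()      # returns the earliest item

assert (top, front) == (17, 739)
```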
Push and Pop Operations
Assume a byte-addressable memory with 32-bit words. The push operation can be implemented as
    Subtract  #4, SP
    Move      NEWITEM, (SP)
The pop operation can be implemented as
    Move      (SP), ITEM
    Add       #4, SP
If the processor has the Autoincrement and Autodecrement addressing modes, then the push operation can be implemented by the single instruction
    Move      NEWITEM, -(SP)
and the pop operation can be implemented as
    Move      (SP)+, ITEM

Examples
[Figure: pushing NEWITEM = 19 moves SP down one word, so the stack holds 19 on top of -28, 17, ..., 43; popping returns 19 to ITEM and moves SP back up to the word containing -28.]

Checking for Empty and Full Errors
When a stack is used in a program, it is usually allocated a fixed amount of space in the memory. We must avoid pushing an item onto the stack when the stack has reached its maximum size, i.e., when the stack is full. On the other hand, we must avoid popping an item off the stack when the stack is empty. Routines for a safe pop or push can use a Compare instruction: Compare src, dst performs [dst] - [src] and sets the condition code flags according to the result.

Subroutines
In a given program, it is often necessary to perform a particular subtask many times on different data values. Such a subtask is called a subroutine.
[Figure: a calling program at memory locations 200-204 executes Call SUB and continues with the next instruction; the subroutine SUB begins at location 1000 and ends with a Return instruction.]
The location where the calling program resumes execution is the location pointed to by the updated PC while the Call instruction is being executed. Hence the contents of the PC must be saved by the Call instruction to enable correct return to the calling program.
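The push and pop sequences above can be modeled in Python; the dictionary stands in for byte-addressable memory, and the starting SP value of 2000 is an arbitrary choice for illustration:

```python
WORD = 4  # byte-addressable memory, 32-bit words

def push(mem, sp, value):          # Subtract #4, SP ; Move NEWITEM, (SP)
    sp -= WORD
    mem[sp] = value
    return sp

def pop(mem, sp):                  # Move (SP), ITEM ; Add #4, SP
    value = mem[sp]
    return value, sp + WORD

mem, sp = {}, 2000                 # 2000 is an arbitrary initial SP
sp = push(mem, sp, -28)
sp = push(mem, sp, 19)
item, sp = pop(mem, sp)
assert (item, sp, mem[1996]) == (19, 1996, -28)
```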
Subroutine Linkage
The way in which a computer makes it possible to call and return from subroutines is referred to as its subroutine linkage method.
[Figure: subroutine linkage using a link register. Call saves the return address 204 in the link register and loads the PC with the subroutine address 1000; Return restores the PC from the link register.]

Subroutine Nesting
A common programming practice, called subroutine nesting, is to have one subroutine call another. Subroutine nesting can be carried out to any depth. Eventually, the last subroutine called completes its computations and returns to the subroutine that called it. The return address needed for this first return is the last one generated in the nested call sequence; that is, return addresses are generated and used in a last-in-first-out order. Many processors do this by using a stack pointer that points to a stack called the processor stack.

Example of Subroutine Nesting
[Figure: a main program calls SUB1, which calls SUB2, which calls SUB3; the return addresses C+4, A+4, and B+4 are saved and consumed in LIFO order.]
[MIPS-style example (source: B. Parhami, UCSB): main executes "jal abc"; procedure abc saves state and executes "jal xyz"; procedure xyz returns with "jr $ra"; abc then restores state and returns with "jr $ra".]

Parameter Passing
When calling a subroutine, a program must provide to the subroutine the parameters, that is, the operands or their addresses, to be used in the computation. Later, the subroutine returns other parameters, in this case the result of the computation. The exchange of information between a calling program and a subroutine is referred to as parameter passing. Parameter passing approaches:
• The parameters may be placed in registers or in memory locations, where they can be accessed by the subroutine.
• The parameters may be placed on the processor stack used for saving the return address.
Passing Parameters with Registers
(Here the count N is passed by value and the list address NUM1 by reference.)
Calling program:
        Move   N, R1          R1 serves as a counter
        Move   #NUM1, R2      R2 points to the list
        Call   LISTADD        Call subroutine
        Move   R0, SUM        Save result
Subroutine:
LISTADD Clear  R0             Initialize sum to 0
LOOP    Add    (R2)+, R0      Add entry from list
        Decrement R1
        Branch>0  LOOP
        Return                Return to calling program

Passing Parameters with the Stack
(Assume the top of the stack is at level 1 below.)
Calling program:
        Move   #NUM1, -(SP)   Push parameters onto stack
        Move   N, -(SP)
        Call   LISTADD        Call subroutine (top of stack at level 2)
        Move   4(SP), SUM     Save result
        Add    #8, SP         Restore top of stack (level 1)
Subroutine:
LISTADD MoveMultiple R0-R2, -(SP)   Save registers (top of stack at level 3)
        Move   16(SP), R1           Initialize counter to N
        Move   20(SP), R2           Initialize pointer to the list
        Clear  R0                   Initialize sum to 0
LOOP    Add    (R2)+, R0            Add entry from list
        Decrement R1
        Branch>0  LOOP
        Move   R0, 20(SP)           Put result on the stack
        MoveMultiple (SP)+, R0-R2   Restore registers
        Return                      Return to calling program
[Stack contents at level 3, top to bottom: [R2], [R1], [R0], return address, N, NUM1.]

Stack Frame
[Figure: SP points to the top of the stack, where saved [R1] and [R0] lie; local variables localvar1, localvar2, localvar3 sit at -4(FP), -8(FP), -12(FP); the frame pointer FP points to the saved [FP], with the return address just above it; parameters param1 to param4 sit at 8(FP), 12(FP), 16(FP), 20(FP).]

Shift Instructions
• Logical shift left, LShiftL #2, R0: the contents of R0 are shifted left two positions; 0s enter at the LSB end and the last bit shifted out is held in the carry flag C. [Figure: before/after bit patterns.]
• Arithmetic shift right, AShiftR #2, R0: the contents are shifted right two positions; the sign bit is replicated into the vacated positions and the last bit shifted out goes to C. [Figure: before/after bit patterns.]

Rotate Instructions
• Rotate left without carry, RotateL #2, R0: the bits of R0 are rotated left two positions; bits leaving at the MSB end re-enter at the LSB end, and the last bit rotated out is also copied into the carry flag C. [Figure: before/after bit patterns.]
• Rotate left with carry, RotateLC #2, R0: the carry flag C is included in the rotation, so the bits circulate through R0 and C together. [Figure: before/after bit patterns.]

Linked List
[Figure: records linked through address fields. The head pointer locates Record 1; each record's link field holds the address of the next record; the link field of the last record (the tail) contains 0. A new record is inserted by adjusting the link fields.]

A List of Student Test Scores
[Table: each record has an address, a key field, and a link field. First record: address 2320, key 27243, link 1040; second record: 1040, 28106, 1200; then 1200, 28370, 2880; 2720, 40632, 1280; last record: 1280, 47871, link 0.]

Encoding of Machine Instructions
To be executed in a processor, an instruction must be encoded in a compact binary pattern. Such encoded instructions are properly referred to as machine instructions. The instructions that use symbolic names and acronyms are called assembly language instructions; they are converted into machine instructions by the assembler program. For a given instruction, the type of operation to be performed and the type of operands used may be specified using an encoded binary pattern referred to as the OP code. In addition to the OP code, the instruction has to specify the source and destination registers, the addressing mode, and so on.

Examples
Assume that 8 bits are allocated for the OP code, 4 bits are needed to identify each register, and 6 bits are needed to specify an addressing mode.
• The instruction Move 24(R0), R5 requires 16 bits to denote the OP code and the two registers, plus 6 bits to choose the addressing mode, leaving only 10 bits to give the index value.
• The instruction LShiftR #2, R0 requires 18 bits to specify the OP code, the addressing modes, and the register, which limits the size of the immediate operand to what is expressible in 14 bits.
In both examples, the instructions can be encoded in a 32-bit word.
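The shift and rotate instructions above can be modeled on an 8-bit register in Python (a sketch; the 8-bit width is chosen to match the figures):

```python
MASK = 0xFF  # model an 8-bit register

def lshiftl(v, n):                       # LShiftL #n: zeros enter on the right
    return (v << n) & MASK

def ashiftr(v, n):                       # AShiftR #n: the sign bit is replicated
    signed = v - 0x100 if v & 0x80 else v
    return (signed >> n) & MASK          # Python's >> on negatives sign-extends

def rotatel(v, n):                       # RotateL #n: bits re-enter on the right
    n %= 8
    return ((v << n) | (v >> (8 - n))) & MASK

assert lshiftl(0b01011101, 2) == 0b01110100
assert ashiftr(0b10011100, 2) == 0b11100111   # sign bit replicated
assert rotatel(0b10000001, 1) == 0b00000011
```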
Encoding Instructions into 32-bit Words
[Figure: a one-word instruction has fields OP code (8 bits), source (7), destination (7), and other info (10). A two-word instruction adds a second word holding a memory address or immediate operand. A three-operand instruction has fields OP code, Ri, Rj, Rk, and other info.]

But what happens if we want to specify a memory operand using the Absolute addressing mode? The instruction Move R2, LOC requires 18 bits to denote the OP code, the addressing modes, and the register. That leaves 14 bits to express the address that corresponds to LOC, which is clearly insufficient. If we want to be able to give a complete 32-bit address in the instruction, the instruction must occupy two words. To handle instructions of the form Move LOC1, LOC2, an instruction must have three words.

CISC & RISC
Using multiple words, we can implement quite complex instructions, closely resembling operations in high-level programming languages. The term complex instruction set computer (CISC) has been used to refer to processors that use instruction sets of this type. The restriction that an instruction must occupy only one word has led to a style of computers that have become known as reduced instruction set computers (RISC).
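A hypothetical packing of the one-word format can be sketched in Python. The field ordering below is an assumption made for illustration; the slides fix only the field widths (8-bit OP code, 6-bit mode, 4 bits per register, 10-bit index):

```python
# Hypothetical 32-bit layout: op(8) | mode(6) | rs(4) | rd(4) | index(10).
# The ordering of fields is illustrative, not the slides' actual encoding.
def encode(op, mode, rs, rd, index):
    assert op < 2**8 and mode < 2**6 and rs < 2**4 and rd < 2**4 and index < 2**10
    return (op << 24) | (mode << 18) | (rs << 14) | (rd << 10) | index

def decode(word):
    return (word >> 24, (word >> 18) & 0x3F, (word >> 14) & 0xF,
            (word >> 10) & 0xF, word & 0x3FF)

# Move 24(R0), R5 with a made-up OP code 0x12 and mode number 5:
word = encode(0x12, 5, 0, 5, 24)
assert decode(word) == (0x12, 5, 0, 5, 24)
assert word < 2**32              # fits in one 32-bit word
```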
Computer Arithmetic

Arithmetic & Logic Unit
• Performs arithmetic and logic operations on data: everything that we think of as "computing."
• Everything else in the computer is there to service this unit.
• All ALUs handle integers; some also handle floating point (real) numbers. There may be a separate FPU (math co-processor), which may be on a separate chip (486DX and later integrate it).
[Figure: ALU inputs and outputs.]

Integer Representation
• We have the smallest possible alphabet: the symbols 0 and 1 represent everything. There is no minus sign and no period.
• Representations in use: signed-magnitude and two's complement.

Benefits of 2's complement
• One representation of zero.
• Arithmetic works easily (see later).
• Negating is fairly easy (complement and add 1):
    3 = 00000011
    Boolean complement gives 11111100
    Add 1 to the LSB:       11111101 (= -3)
[Figure: geometric depiction of two's complement integers.]
• "Taking the 2's complement" (complement, then add 1) computes the arithmetic negation of a number: given x, it computes y = 0 - x, i.e., y such that x + y = 0.

Addition and Subtraction
• For addition use normal binary addition: 0+0 = sum 0 carry 0; 0+1 = sum 1 carry 0; 1+1 = sum 0 carry 1.
• Monitor the MSB for overflow. Overflow cannot occur when adding two operands with different signs; if the two operands have the same sign and the result has a different sign, overflow has occurred.
• Subtraction: take the 2's complement of the subtrahend and add it to the minuend, i.e.
a - b = a + (-b).
• So we only need addition and complement circuits.
[Figure: hardware for addition and subtraction.]

Side note: carry look-ahead
• Binary addition would seem to be dramatically slower for large registers: consider 0111 + 0011, where carries propagate left-to-right, so 64-bit addition would be 8 times slower than 8-bit addition.
• It is possible to build a circuit called a "carry look-ahead adder" that speeds up addition by eliminating the need to "ripple" carries through the word.
• Carry look-ahead is expensive: if n is the number of bits, a ripple adder's circuit complexity (number of gates) is O(n), while full carry look-ahead is O(n^3).
• Complexity can be reduced by rippling smaller look-aheads: e.g., each 16-bit group is handled by four 4-bit adders, and the 16-bit adders are rippled into a 64-bit adder.

Multiplication
• A complex operation compared with addition and subtraction; many algorithms are used, especially for large numbers.
• The simple algorithm is the same long multiplication taught in grade school: compute a partial product for each digit, then add the partial products.

Multiplication Example
      1011    Multiplicand (11 dec)
    x 1101    Multiplier   (13 dec)
      1011    Partial products: if the multiplier bit is 1, copy the
     0000     multiplicand (at its place value); otherwise use zero
    1011
   1011
  10001111    Product (143 dec)
Note: a double-length result is needed.

Simplifications for Binary Arithmetic
• Partial products are easy to compute: if the bit is 0, the partial product is 0; if the bit is 1, it is the multiplicand.
• Each partial product can be added as it is generated, so no storage is needed.
• Binary multiplication of unsigned integers reduces to "shift and add."

Control logic and registers
• 3 n-bit registers, plus a 1-bit carry register CF.
• Register set up: Q <- multiplier, M <- multiplicand, A <- 0, CF <- 0. CF holds carries after addition.
• The product will be the 2n bits in the A,Q registers.

Unsigned Binary Multiplication Algorithm
• Repeat
n times:
    If Q0 = 1, add M into A and store the carry in CF.
    Shift CF, A, Q right one bit so that An-1 <- CF and Qn-1 <- A0; Q0 is lost.
• Note that during execution Q contains bits from both the product and the multiplier.
[Figures: flowchart for unsigned binary multiplication; execution of the example.]

Two's Complement Multiplication
• Shift and add does not work for two's complement numbers. Read as 4-bit two's complement values, the previous example is 1011 (-5) * 1101 (-3), whose product should be +15; shift-and-add instead produces 10001111, which as an 8-bit two's complement number is -113.
• What is the problem? Partial products must be 2n-bit products.

When the multiplicand is negative
• Each addition of the negative multiplicand must add a negative number of 2n bits: sign extend the multiplicand into the partial product, or sign extend both operands to double precision. Not efficient.

When the multiplier is negative
• When the multiplier (Q register) is negative, the bits of the operand do not correspond to the shifts and adds needed.
• Read as unsigned, 1101 = 2^3 + 2^2 + 2^0; but as a two's complement multiplier it denotes -(2^1 + 2^0).

The obvious solution
• Convert multiplier and multiplicand to unsigned integers, multiply, and negate the result if the original signs differed. But there are more efficient ways.

Fast multiplication
• Consider the product 6234 * 99990. We could do 4 single-digit multiplies and add partial sums, or we can express the product as 6234 * (10^5 - 10^1).
• In binary, x * 00111100 can be expressed as x * (2^5 + 2^4 + 2^3 + 2^2) = x * 60.
• We can reduce the number of operations to 2 by observing that 00111100 = 01000000 - 00000100 (64 - 4 = 60), so x * 00111100 = x * 2^6 - x * 2^2.
• Each block of 1s can be reduced to two operations; even in the worst case, 01010101, we still have only 8 operations.

Booth's Algorithm: Register Setup
• 3 n-bit registers, plus a 1-bit register logically to the right of Q (denoted Q-1).
• Register set up: Q <- multiplier, Q-1 <- 0, M <- multiplicand, A <- 0, Count <- n.
• The product will be the 2n bits in the A,Q registers.

Booth's Algorithm: Control Logic
• Bits of the multiplier are scanned one at a
time (the current bit, Q0); as each bit is examined, the bit to its right (the previous bit, Q-1) is considered as well. Then:
    00: middle of a string of 0s, so no arithmetic operation.
    01: end of a string of 1s, so add the multiplicand to the left half of the product (A).
    10: beginning of a string of 1s, so subtract the multiplicand from the left half of the product (A).
    11: middle of a string of 1s, so no arithmetic operation.
• Then shift A, Q, and bit Q-1 right one bit using an arithmetic shift; in an arithmetic shift, the MSB remains unchanged.

Examples of Booth's Algorithm (the step-by-step tables of A, Q, Q-1, and M are omitted here):
• 7 * 3 = 21.
• -3 * 2 = -6: with Q = 1101 (-3) and M = 0010 (2), the steps are subtract and shift, add and shift, subtract and shift, then shift only, leaving A:Q = 1111 1010 = -6.
• 6 * -1 = -6: with Q = 1111 (-1), one subtraction is followed by three shifts.
• 3 * -2 = -6.

Division
• More complex than multiplication to implement (for computers as well as humans!).
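Booth's algorithm as described above can be sketched in Python for n-bit two's complement operands (an illustration; the register names follow the slides):

```python
def booth_multiply(multiplicand, multiplier, n):
    """Booth's algorithm sketch for n-bit two's complement operands."""
    mask = (1 << n) - 1
    a, q, q_1 = 0, multiplier & mask, 0
    m = multiplicand & mask
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):            # beginning of a string of 1s
            a = (a - m) & mask
        elif pair == (0, 1):          # end of a string of 1s
            a = (a + m) & mask
        q_1 = q & 1                   # arithmetic right shift of A, Q, Q-1:
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = (a >> 1) | (a & (1 << (n - 1)))   # MSB (sign) is replicated
    product = (a << n) | q            # 2n-bit result in the A,Q registers
    if product & (1 << (2 * n - 1)):  # reinterpret as signed
        product -= 1 << (2 * n)
    return product

assert booth_multiply(7, 3, 4) == 21
assert booth_multiply(2, -3, 4) == -6
assert booth_multiply(6, -1, 4) == -6
```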
• Some processors designed for embedded applications or digital signal processing lack a divide instruction.
• Division is basically the inverse of add-and-shift: shift and subtract, similar to the long division taught in grade school.

Unsigned Division in Principle
147 / 11 = 13 with remainder 4. In binary, the dividend 10010011 is divided by the divisor 1011 to give the quotient 00001101 and the remainder 100: the divisor is subtracted from the shifted partial remainders (1110, 1111, ...) wherever it fits, producing one quotient bit per position.

Unsigned Division Algorithm
• Uses the same registers (A, M, Q, count) as multiplication. The results of division are a quotient and a remainder: Q will hold the quotient and A the remainder.
• Initial values: A <- 0, Q <- dividend, M <- divisor, Count <- n.
[Flowchart and worked example omitted.]

Two's Complement Division
• More difficult than unsigned division. Algorithm:
1. M <- divisor; A:Q <- dividend, sign extended to 2n bits. For example, 0111 -> 00000111 and 1001 -> 11111001 (note that 0111 = 7 and 1001 = -3).
2. Shift A:Q left 1 bit.
3. If M and A have the same signs, perform A <- A - M; otherwise perform A <- A + M.
4. The preceding operation succeeds if the sign of A is unchanged. If successful, or if (A == 0 and Q == 0), set Q0 <- 1. If not successful, and (A != 0 or Q != 0), set Q0 <- 0 and restore the previous value of A.
5. Repeat steps 2, 3, and 4 for the n bit positions in Q.
6. The remainder is in A.
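A common restoring (shift-and-subtract) form of the unsigned division algorithm can be sketched in Python; this fills in the flowchart referenced above under the stated initial values, and the restore-on-negative step is the standard choice rather than anything the slides spell out:

```python
def restoring_divide(dividend, divisor, n):
    """Restoring unsigned division sketch: A holds the partial remainder,
    Q the dividend (gradually replaced by the quotient), M the divisor."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = (a << 1) | (q >> (n - 1))      # shift A:Q left one bit
        q = (q << 1) & ((1 << n) - 1)
        a -= m                             # trial subtraction
        if a < 0:
            a += m                         # restore; quotient bit is 0
        else:
            q |= 1                         # quotient bit is 1
    return q, a                            # quotient in Q, remainder in A

assert restoring_divide(147, 11, 8) == (13, 4)
```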
If the signs of the divisor and dividend were the same, then the quotient is in Q; otherwise the correct quotient is 0 - Q.
[Worked examples of 2's complement division omitted.]

2's complement remainders
• 7 / 3 = 2 R 1;  7 / -3 = -2 R 1;  -7 / 3 = -2 R -1;  -7 / -3 = 2 R -1.
• Here the remainder is defined by: Dividend = Quotient * Divisor + Remainder.

IEEE-754 Floating Point Numbers
• The format was discussed earlier in class.
• Before IEEE-754, each family of computers had a proprietary format: Cray, VAX, IBM. Some Cray and IBM machines still use these formats.
• Most are similar to the IEEE formats but vary in details (bits in exponent or mantissa): IBM uses a base-16 exponent; the VAX and Cray biases differ from IEEE.
• Precise translations from one format to another cannot always be made, so older binary scientific data is not easily accessible.

IEEE 754
• +/- 1.significand x 2^exponent.
• Standard for floating point storage: 32- and 64-bit formats, with 8- and 11-bit exponents respectively.
• Extended formats (both mantissa and exponent) exist for intermediate results.

FP Ranges
• For a 32-bit number with an 8-bit exponent, the expressible range spans about 2^256, roughly 1.5 x 10^77.
• Accuracy, the effect of changing the LSB of the 23-bit mantissa: 2^-23, about 1.2 x 10^-7, i.e., about 6 decimal places.

Density of Floating Point Numbers
• Note that there is a tradeoff between density and precision. For a floating point representation of n bits, if we increase the precision by using more bits in the mantissa, then we decrease the range; if we increase the range by using more bits for the exponent, then we decrease the density and precision.

FP Arithmetic (+ / -)
• Addition and subtraction are more complex than multiplication and division because the mantissas must be aligned.
• Algorithm: check for zeros; align significands (adjusting exponents); add or subtract significands; normalize the result.
[Flowchart for Z <- X + Y and Z <- X - Y omitted.]

Zero check
• Addition and
subtraction are identical except for a sign change: for subtraction, just negate the subtrahend (Y in Z = X - Y) and then compute Z = X + Y.
• If either operand is 0, report the other as the result.

Significand Alignment
• Manipulate the numbers so that both exponents are equal: shift the number with the smaller exponent to the right (if bits are lost, they are the less significant ones).
    Repeat: shift the mantissa right 1 bit; add 1 to the exponent; until the exponents are equal.
• If the mantissa becomes 0, report the other number as the result.

Addition
• Add the mantissas together, taking sign into account; the result may be 0 if the signs differ.
• The sum can overflow the mantissa by 1 bit (carry): shift the mantissa right and increment the exponent, reporting an error on exponent overflow.

Normalization
• While the MSB of the mantissa is 0: shift the mantissa left one bit, decrement the exponent, and check for exponent underflow.
• Round the mantissa.

FP Arithmetic: Multiplication and Division
• Simpler processes than addition and subtraction: check for zero; add/subtract exponents; multiply/divide significands (watch the sign); normalize; round.

Floating Point Multiplication
• If either operand is 0, report 0.
• Add the exponents. Because the addition doubles the bias, first subtract the bias from one exponent.
• If the exponent underflows or overflows, report an error (underflow may be reported as 0 and overflow as infinity).
• Multiply the mantissas as if they were integers (similar to 2's complement multiplication); note the product is twice as long as the factors.
• Normalize and round, the same process as addition; this could result in exponent underflow.

Floating Point Division
• If the divisor is 0, report an error or infinity; if the dividend is 0, the result is 0.
• Subtract the divisor's exponent from the dividend's exponent. This removes the bias, so add the bias back.
• If the exponent underflows or overflows, report an error (underflow may be reported as 0 and overflow as infinity).
• Divide the mantissas as if they were integers (similar to 2's comp mult.)
• Note that the intermediate result is twice as long as the operands.
• Normalize and round, the same process as addition; this could result in exponent underflow.

IEEE Standard for Binary Arithmetic
• Specifies practices and procedures beyond the format specification: guard bits (intermediate formats), rounding, treatment of infinities, quiet and signaling NaNs, denormalized numbers.

Precision considerations
• Floating point arithmetic is inherently inexact, except where only numbers composed of sums of powers of 2 are used.
• To preserve maximum precision there are two main techniques: guard bits and rounding rules.

Guard bits
• The FPU registers are longer than the mantissa. This allows some preservation of precision when aligning exponents for addition and when multiplying or dividing significands.
• We have seen that results of arithmetic can vary when intermediate stores to memory are made in the course of a computation.

Rounding
• Conventional rounding (round up when the discarded fraction is 0.5) has a slight bias toward the larger number.
• To remove this bias, use round-to-even ("banker's rounding"): 1.5 -> 2, 2.5 -> 2, 3.5 -> 4, 4.5 -> 4, etc.

IEEE Rounding
• Four types are defined: round to nearest (round to even), round to +infinity, round to -infinity, round to 0.

Round to nearest
• If the extra bits beyond the mantissa are 100..1..
then nearest round up • If extra bits are 01… then truncate • Special case: 10000…0 — Round up if last representable bit is 1 — Truncate if last representable bit is 0 Round to + / • Useful for interval arithmetic infinity — Result of fp computation is expressed as an interval with upper and lower endpoints — Width of interval gives measure of precision — In numerical analysis algorithms are designed to minimize width of interval Round to 0 • Simple truncation, obvious bias • May be needed when explicitly rounding following operations with transcendental functions Infinities • Infinity treated as limiting case for real arithmetic • Most arithmetic operations involving infinities produce infinity Quiet and Signaling NaNs • NaN = Not a Number • Signaling NaN causes invalid operation exception if used as operand • Quiet NaN can propagate through arithmetic operations without raising an exception • Signaling NaNs are useful for initial values of uninitialized variables • Actual representation is implementation (processor) specific Quiet NaNs Denormalized Numbers • Handle exponent underflow • Provide values in the “hole around 0” Unnormalized Numbers • Denormalized numbers have fewer bits of precision than normal numbers • When an operation is performed with a denormalized number and a normal number, the result is called an “unnormal” number • Precision is unknown • FPU can be programmed to raise an exception for unnormal computations HARDWIRED CONTROL AND MICROPROGRAMMED CONTROL 172 Connection Between the Processor and the Memory Memory MAR MDR Control PC R0 R1 Processor IR ALU Rn - 1 n general purpose registers Figure 1.2. Connections between the processor and the memory. 173 Overview • Instruction Set Processor (ISP) • Central Processing Unit (CPU) • A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program. • An instruction is executed by carrying out a sequence of more rudimentary operations. 
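The round-to-even rule and the four IEEE rounding directions described earlier can be checked from Python, whose built-in round() and decimal module implement them:

```python
# Round-half-to-even: Python's built-in round() uses exactly this
# tie-breaking rule, matching the 1.5 -> 2, 2.5 -> 2, ... table.
assert [round(x) for x in (1.5, 2.5, 3.5, 4.5)] == [2, 2, 4, 4]

# The four IEEE rounding directions, shown with the decimal module so the
# mode can be selected explicitly:
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

x = Decimal("2.5")
assert x.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) == 2   # to nearest (even)
assert x.quantize(Decimal("1"), rounding=ROUND_CEILING) == 3     # toward +infinity
assert x.quantize(Decimal("1"), rounding=ROUND_FLOOR) == 2       # toward -infinity
assert x.quantize(Decimal("1"), rounding=ROUND_DOWN) == 2        # toward 0
```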
Fundamental Concepts
• The processor fetches one instruction at a time and performs the operation specified.
• Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
• The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC). The current instruction is held in the Instruction Register (IR).

Executing an Instruction
• Fetch the contents of the memory location pointed to by the PC and load them into the IR (fetch phase): IR <- [[PC]]
• Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase): PC <- [PC] + 4
• Carry out the actions specified by the instruction in the IR (execution phase).

[Figure 7.1: single-bus organization of the datapath inside a processor. The internal processor bus connects the PC, MAR, MDR, IR, the instruction decoder and control logic, the general-purpose registers R0 to Rn-1, TEMP, and Y (which, with the constant 4 and a select MUX, feeds ALU input A); the ALU (Add, Sub, XOR, carry-in) writes its result to Z. MAR drives the address lines and MDR the data lines of the memory bus.]

Operations the datapath supports:
• Transfer a word of data from one processor register to another or to the ALU.
• Perform an arithmetic or a logic operation and store the result in a processor register.
• Fetch the contents of a given memory location and load them into a processor register.
• Store a word of data from a processor register into a given memory location.

[Figure 7.2: input and output gating for the registers in Figure 7.1; each register Ri has control signals Riin and Riout, and Y and Z have Yin, Zin, and Zout.]

Performing an Arithmetic or Logic Operation
• The ALU is a combinational circuit that has no internal storage. It gets its two operands from the MUX and the bus, and the result is temporarily stored in register Z.
• What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
1.
R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in

Fetching a Word from Memory
• Load the address into MAR, issue a Read operation, and receive the data into MDR.
[Figure 7.4: connection and control signals for register MDR; MDRin and MDRout gate the internal processor bus side, MDRinE and MDRoutE the memory-bus data lines.]
• The response time of each memory access varies, so the processor waits until it receives a Memory Function Completed (MFC) indication.
• Move (R1), R2 is performed as: MAR <- [R1]; start a Read operation on the memory bus; wait for the MFC response from the memory; load MDR from the memory bus; R2 <- [MDR].

Execution of a Complete Instruction: Add (R3), R1
• Fetch the instruction; fetch the first operand (the contents of the memory location pointed to by R3); perform the addition; load the result into R1.

Control sequence for execution of the instruction Add (R3), R1 (Figure 7.6):
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End

What about Add R2, R1?
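The seven-step control sequence for Add (R3), R1 can be replayed as plain register transfers in Python (a sketch; the WMFC waits are modeled as instantaneous memory reads):

```python
# Sketch: replaying the control sequence of Figure 7.6 as register transfers.
def execute_add_r3_r1(regs, mem):
    z = regs["PC"] + 4            # 1: PCout, MARin, Read, Select4, Add, Zin
    ir = mem[regs["PC"]]          #    (memory returns the instruction word)
    regs["PC"] = z                # 2: Zout, PCin, Yin, WMFC
    # 3: MDRout, IRin -- ir now holds "Add (R3), R1"
    operand = mem[regs["R3"]]     # 4: R3out, MARin, Read
    y = regs["R1"]                # 5: R1out, Yin, WMFC
    z = y + operand               # 6: MDRout, SelectY, Add, Zin
    regs["R1"] = z                # 7: Zout, R1in, End
    return regs

regs = {"PC": 100, "R1": 5, "R3": 2000}
mem = {100: "Add (R3), R1", 2000: 37}
regs = execute_add_r3_r1(regs, mem)
assert regs["R1"] == 42 and regs["PC"] == 104
```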
Execution of a Complete Instruction Internal processor bus Add R2, R1 Step PCout , MAR in , Read, Select4,Add, Zin 2 Zout , PCin , Y in , WMF C 3 MDR out , IR in 4 R3out , MAR in , Read 5 R1out , Y in , WMF C 7 PC Action 1 6 Control signals decoder and MAR control logic Memory bus MDR Data lines IR Y R0 Constant 4 R2outMDR out , SelectY, Add, Zin Zout , R1in , End Instruction Address lines Select MUX Add ALU control lines Sub A B R n - 1 ALU Carry-in Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1. XOR TEMP Z 186 Figure 7.1. Single-bus organization of the datapath inside a processor. Execution of Branch Instructions • A branch instruction replaces the contents of PC with the branch target address, which is usually obtained by adding an offset X given in the branch instruction. • The offset X is usually the difference between the branch target address and the address immediately following the branch instruction. • Conditional branch 187 Execution of Branch Instructions Step Action 1 PCout , MAR in , Read, Select4,Add, Z in 2 Zout , PCin , Yin , WMF C 3 MDR out , IR in 4 Offset-field-of-IRout, Add, Z in 5 Z out , PCin , End Figure 7.7. Control sequence for an unconditional branch instruction. 188 Exercise Internal processor bus • What is the control sequence for execution of the instruction Add R1, R2 including the instruction fetch phase? (Assume single bus architecture) Control signals PC Instruction Address lines decoder and MAR control logic Memory bus MDR Data lines IR Y R0 Constant 4 Select MUX Add ALU control lines Sub A B R n - 1 ALU Carry-in XOR TEMP Z 189 Figure 7.1. Single-bus organization of the datapath inside a processor. Hardwired Control 190 Overview • To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence. • Two categories: hardwired control and microprogrammed control • Hardwired system can operate at high speed; but with little flexibility. 
Control Unit Organization
Figure 7.10. Control unit organization: the instruction register (IR), a clock-driven control step counter, external inputs, and condition codes feed a decoder/encoder block that generates the control signals.
Figure 7.11. Separation of the decoding and encoding functions: a step decoder produces timing signals T1, T2, …, Tn; an instruction decoder produces INS1, …, INSm; the encoder combines these with external inputs, condition codes, and the Run/End signals to generate the control signals.
Generating Zin
• Zin = T1 + T6 · ADD + T4 · BR + …
Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
Generating End
• End = T7 · ADD + T5 · BR + (T5 · N + T4 · N′) · BRN + …, where N′ is the complement of the N (Branch<0) condition flag.
Figure 7.13. Generation of the End control signal.
A Complete Processor
Figure 7.14. Block diagram of a complete processor: an instruction unit with an instruction cache, integer and floating-point units sharing a data cache, and a bus interface connecting the processor over the system bus to main memory and input/output.
Microprogrammed Control
• Control signals are generated by a program similar to machine language programs: each control step is encoded as a control word (CW), a sequence of CWs forms a microroutine, and each individual CW is a microinstruction. (Textbook page 430.)
Figure 7.15. An example of microinstructions for Figure 7.6: each row is a control word whose bits assert the signals PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, and End for the corresponding step.
Overview (Textbook page 421)
Figure 7.6. Control sequence for execution of the instruction Add (R3), R1:
Step 1: PCout, MARin, Read, Select4, Add, Zin
Step 2: Zout, PCin, Yin, WMFC
Step 3: MDRout, IRin
Step 4: R3out, MARin, Read
Step 5: R1out, Yin, WMFC
Step 6: MDRout, SelectY, Add, Zin
Step 7: Zout, R1in, End
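The one-bit-per-signal control words of Figure 7.15 can be sketched as bit vectors. The signal list follows the figure; the helper simply builds a control word from the set of signal names asserted in a step:

```python
# One control-store bit per signal, in the order used by Figure 7.15.
SIGNALS = ["PCin", "PCout", "MARin", "Read", "MDRout", "IRin", "Yin",
           "Select", "Add", "Zin", "Zout", "R1out", "R1in", "R3out",
           "WMFC", "End"]

def control_word(asserted):
    """Return the control word (bit vector) asserting the named signals."""
    return [1 if s in asserted else 0 for s in SIGNALS]

# Step 1 of Figure 7.6: PCout, MARin, Read, Select4, Add, Zin
cw1 = control_word({"PCout", "MARin", "Read", "Select", "Add", "Zin"})
assert sum(cw1) == 6 and cw1[SIGNALS.index("PCout")] == 1
print(cw1)
```

With one bit per signal the control store is wide but the decoding is trivial; the field-encoded format discussed next trades width for a little decoding hardware.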
Basic Organization of a Microprogrammed Control Unit
• The control store holds the microprogram; a microprogram counter (µPC) steps through it, and a starting address generator loads the µPC from IR at the beginning of each instruction.
Figure 7.16. Basic organization of a microprogrammed control unit.
• One function cannot be carried out by this simple organization: conditional branching.
Conditional Branch
• The previous organization cannot handle the situation in which the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action. The solution is to use conditional branch microinstructions.
Figure 7.17. Microroutine for the instruction Branch<0:
Address 0: PCout, MARin, Read, Select4, Add, Zin
Address 1: Zout, PCin, Yin, WMFC
Address 2: MDRout, IRin
Address 3: Branch to starting address of appropriate microroutine
…
Address 25: If N = 0, then branch to microinstruction 0
Address 26: Offset-field-of-IRout, SelectY, Add, Zin
Address 27: Zout, PCin, End
Figure 7.18. Organization of the control unit to allow conditional branching in the microprogram: a starting and branch address generator takes the IR, external inputs, and condition codes, and loads the µPC that addresses the control store.
Microinstructions
• A straightforward way to structure microinstructions is to assign one bit position to each control signal. However, this is very inefficient.
• The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
• All mutually exclusive signals are placed in the same group and encoded in binary.
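Grouping mutually exclusive signals into binary-coded fields can be sketched as packing and unpacking integers. Field widths and codes below follow the field-encoded format of Figure 7.19 (F1 register-out, F2 register-in, F3 MAR/MDR/TEMP/Y-in, F4 ALU function, F5 memory command, F6–F8 single bits); the pack/unpack helpers themselves are illustrative:

```python
# Field widths from Figure 7.19, F1 occupying the most significant bits.
FIELD_WIDTHS = [("F1", 4), ("F2", 3), ("F3", 3), ("F4", 4),
                ("F5", 2), ("F6", 1), ("F7", 1), ("F8", 1)]

def pack(fields):
    """Pack a dict of field codes into one microinstruction word."""
    word = 0
    for name, width in FIELD_WIDTHS:
        word = (word << width) | fields.get(name, 0)  # absent field -> all zeros
    return word

def unpack(word):
    """Recover the field codes from a packed microinstruction word."""
    fields = {}
    for name, width in reversed(FIELD_WIDTHS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

# Step 1 of Figure 7.6 (PCout, MARin, Read, Select4, Add, Zin):
step1 = pack({"F1": 0b0001,   # PCout
              "F2": 0b011,    # Zin
              "F3": 0b001,    # MARin
              "F4": 0b0000,   # Add
              "F5": 0b01,     # Read
              "F6": 1})       # Select4
assert unpack(step1)["F1"] == 0b0001 and unpack(step1)["F5"] == 0b01
```

The word shrinks from one bit per signal to the sum of the field widths; the price, as the slide notes, is the field decoders needed to regenerate the individual signals.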
Partial Format for the Microinstructions
Figure 7.19. An example of a partial format for field-encoded microinstructions. A microinstruction is divided into fields F1–F8:
F1 (4 bits): 0000 No transfer; 0001 PCout; 0010 MDRout; 0011 Zout; 0100 R0out; 0101 R1out; 0110 R2out; 0111 R3out; 1010 TEMPout; 1011 Offsetout
F2 (3 bits): 000 No transfer; 001 PCin; 010 IRin; 011 Zin; 100 R0in; 101 R1in; 110 R2in; 111 R3in
F3 (3 bits): 000 No transfer; 001 MARin; 010 MDRin; 011 TEMPin; 100 Yin
F4 (4 bits): 0000 Add; 0001 Sub; …; 1111 XOR (16 ALU functions)
F5 (2 bits): 00 No action; 01 Read; 10 Write
F6 (1 bit): 0 SelectY; 1 Select4
F7 (1 bit): 0 No action; 1 WMFC
F8 (1 bit): 0 Continue; 1 End
• What is the price paid for this scheme? It requires a little more hardware, to decode the fields back into individual signals.
Further Improvement
• Enumerate the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can then be assigned a distinct code.
• Vertical organization versus horizontal organization (Textbook page 434).
Microprogram Sequencing
• If all microprograms required only straightforward sequential execution of microinstructions, except for branches, letting a µPC govern the sequencing would be efficient.
• However, there are two disadvantages: having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store; and execution time is longer because it takes more time to carry out the required branches.
Microinstructions with Next-Address Field
• A microprogram may require several branch microinstructions. A powerful alternative approach is to include an address field as part of every microinstruction, indicating the location of the next microinstruction to be fetched.
Figure 7.22. Microinstruction-sequencing organization: external inputs, condition codes, and the IR feed decoding circuits that, together with the next-address field of the current microinstruction, determine the control-store address; the microinstruction decoder then issues the control signals.
Memory
Chapter 7 - Memory
Chapter Contents
1. The Memory Hierarchy
2. Random-Access Memory
3. Memory Chip Organization
4. Case Study: Rambus Memory
5. Cache Memory
6. Virtual Memory
7. Advanced Topics
8. Case Study: Associative Memory in Routers
9. Case Study: The Intel Pentium 4 Memory System
The Memory Hierarchy
(Slides in this chapter are from Computer Architecture and Organization by M. Murdocca and V. Heuring, © 2007 M. Murdocca and V. Heuring.)
Functional Behavior of a RAM Cell
• Static RAM cell (a) and dynamic RAM cell (b).
Simplified RAM Chip Pinout
A Four-Word Memory with Four Bits per Word in a 2D Organization
A Simplified Representation of the Four-Word by Four-Bit RAM
2-1/2D Organization of a 64-Word by One-Bit RAM
Two Four-Word by Four-Bit RAMs Are Used in Creating a Four-Word by Eight-Bit RAM
Two Four-Word by Four-Bit RAMs Make Up an Eight-Word by Four-Bit RAM
Single-In-Line Memory Module
• A 256 MB dual in-line memory module organized for a 64-bit word with 16 16M × 8-bit RAM chips (eight chips on each side of the DIMM).
• Schematic diagram of the 256 MB dual in-line memory module. (Source: adapted from http://wwws.ti.com/sc/ds/tm4en64 kpu.pdf.)
A ROM Stores Four Four-Bit Words
A Lookup Table (LUT) Implements an Eight-Bit ALU
Flash Memory
• (a) External view of a flash memory module and (b) flash module internals. (Source: adapted from HowStuffWorks.com.)
Cell Structure for Flash Memory
• When a sufficient negative charge is placed on the dielectric material, current flow between the bit and word lines is blocked; this is the logical 0 state. When the dielectric material is not charged, current flows between the bit and word lines, which is the logical 1 state.
Rambus Memory
• Comparison of DRAM and RDRAM configurations.
• Rambus technology on the Nintendo 64 game console motherboard (left) enables cost savings over the conventional Sega Saturn motherboard design (right).
Placement of Cache Memory in a Computer System
• The locality principle: a recently referenced memory location is likely to be referenced again (temporal locality), and a neighbor of a recently referenced memory location is likely to be referenced (spatial locality).
An Associative Mapping Scheme for a Cache Memory
Associative Mapping Example
• Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots.
• If the addressed word is in the cache, it will be found in word (14)16 of a slot that has tag (501AF80)16, which is made up of the 27 most significant bits of the address. If the addressed word is not in the cache, then the block corresponding to tag field (501AF80)16 is brought into an available slot in the cache from the main memory, and the memory reference is then satisfied from the cache.
Associative Mapping Area Allocation
• Area allocation for the associative mapping scheme based on bits stored.
Replacement Policies
• When there are no available slots in which to place a block, a replacement policy is implemented. The replacement policy governs the choice of which slot is freed up for the new block.
• Replacement policies are used for associative and set-associative mapping schemes, and also for virtual memory.
• Least recently used (LRU)
• First-in/first-out (FIFO)
• Least frequently used (LFU)
• Random
• Optimal (used for analysis only: look backward in time and reverse-engineer the best possible strategy for a particular sequence of memory references)
A Direct Mapping Scheme for Cache Memory
Direct Mapping Example
• For a direct-mapped cache, each main memory block can be mapped to only one slot, but each slot can receive more than one block. Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots.
• If the addressed word is in the cache, it will be found in word (14)16 of slot (2F80)16, which will have a tag of (1406)16.
Direct Mapping Area Allocation
• Area allocation for the direct mapping scheme based on bits stored.
A Set-Associative Mapping Scheme for a Cache Memory
Set-Associative Mapping Example
• Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, there are two blocks per set, and the cache consists of 2^14 slots.
• The leftmost 14 bits form the tag field, followed by 13 bits for the set field, followed by five bits for the word field.
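The address decompositions in the three mapping examples can be checked with a small helper. The field widths (27/5 for associative, 13/14/5 for direct, 14/13/5 for set-associative) come from the worked examples; the `split` function itself is just an illustrative bit-slicing utility:

```python
def split(addr, *widths):
    """Split a 32-bit address into fields of the given bit widths, MSB field first."""
    fields, shift = [], sum(widths)
    for w in widths:
        shift -= w
        fields.append((addr >> shift) & ((1 << w) - 1))
    return tuple(fields)

addr = 0xA035F014
# Associative mapping: 27-bit tag, 5-bit word field.
assert split(addr, 27, 5) == (0x501AF80, 0x14)
# Direct mapping: 13-bit tag, 14-bit slot, 5-bit word field.
assert split(addr, 13, 14, 5) == (0x1406, 0x2F80, 0x14)
# Set-associative mapping: 14-bit tag, 13-bit set, 5-bit word field.
tag, set_field, word = split(addr, 14, 13, 5)
print(hex(tag), hex(set_field), hex(word))
```

The same word field (14)16 appears in every scheme, since the 32-word block size is common; only the way the remaining 27 bits are divided between tag and slot/set changes.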
Set-Associative Mapping Area Allocation
• Area allocation for the set-associative mapping scheme based on bits stored.
Cache Read and Write Policies
Hit Ratios and Effective Access Times
• Hit ratio and effective access time for a single-level cache; hit ratios and effective access time for a multi-level cache.
Direct-Mapped Cache Example
• Compute the hit ratio and effective access time for a program that executes from memory locations 48 to 95, and then loops 10 times from 15 to 31.
• The direct-mapped cache has four 16-word slots, a hit time of 80 ns, and a miss time of 2500 ns. Load-through is used. The cache is initially empty.
Table of Events for Example Program
Calculation of Hit Ratio and Effective Access Time for Example Program
Multi-Level Cache Memory
• As an example, consider a two-level cache in which the L1 hit time is 5 ns, the L2 hit time is 20 ns, and the L2 miss time is 100 ns. There are 10,000 memory references, of which 10 cause L2 misses and 90 cause L1 misses that are satisfied by L2. Compute the hit ratios of the L1 and L2 caches and the overall effective access time.
• H1 is the ratio of the number of times the accessed word is in the L1 cache to the total number of memory accesses. There are a total of 100 L1 misses (90 satisfied by L2 plus the 10 that miss in L2 as well), and so H1 = (10,000 − 100)/10,000 = 0.99.
• H2 is the ratio of the number of times the accessed word is in the L2 cache to the number of times the L2 cache is accessed, and so H2 = 90/100 = 0.90.
• The effective access time is then (0.99)(5 ns) + (0.01)(0.90)(20 ns) + (0.01)(0.10)(100 ns) = 5.23 ns per access.
Neat Little LRU Algorithm
• A sequence is shown for the Neat Little LRU algorithm for a cache with four slots. Main memory blocks are accessed in the sequence 0, 2, 3, 1, 5, 4.
Cache Coherency
• The goal of cache coherence is to ensure that every cache sees the same value for a referenced location, which means making sure that any shared operand that is changed is updated throughout the system.
• This brings us to the issue of false sharing, which reduces cache performance when two operands that are not shared between processes share the same cache line. The problem is that each process will invalidate the other's cache line when writing data, without a real need, unless the compiler prevents this.
Overlays
• A partition graph for a program with a main routine and three subroutines.
Virtual Memory
• Virtual memory is stored in a hard disk image.
• The physical memory holds a small number of virtual pages in physical page frames.
• A mapping is maintained between virtual memory and physical memory.
Page Table
• The page table maps between virtual memory and physical memory.
Using the Page Table
• A virtual address is translated into a physical address; a typical page table entry is shown.
• The configuration of a page table changes as a program executes. Initially, the page table is empty; in the final configuration, four pages are in physical memory.
Segmentation
• A segmented memory allows two users to share the same word processor code, with different data spaces.
Fragmentation
• (a) Free area of memory after initialization; (b) after fragmentation; (c) after coalescing.
Translation Lookaside Buffer
• An example TLB holds 8 entries for a system with 32 virtual pages and 16 page frames.
Putting It All Together
• The TLB, page table, and cache are combined for the same example system with 32 virtual pages and 16 page frames.
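The page-table translation above can be sketched in a few lines. This is an illustrative model, not any particular machine's format: a 4 KB page size is assumed, each entry holds only a present bit and a page-frame number, and a miss stands in for a page fault.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch

def translate(vaddr, page_table):
    """Translate a virtual address using a {vpn: (present, frame)} page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split into page number and offset
    present, frame = page_table[vpn]
    if not present:
        # In a real system this traps to the OS, which brings the page in.
        raise LookupError(f"page fault on virtual page {vpn}")
    return frame * PAGE_SIZE + offset        # offset is unchanged by translation

page_table = {0: (True, 7), 1: (False, 0), 2: (True, 3)}
assert translate(2 * PAGE_SIZE + 100, page_table) == 3 * PAGE_SIZE + 100
```

A TLB is then just a small cache of the most recently used (vpn → frame) pairs, consulted before this table lookup.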
Content Addressable Memory
• Relationships between random access memory and content addressable memory.
Overview of CAM
• Source: Foster, C. C., Content Addressable Parallel Processors, Van Nostrand Reinhold Company, 1976.
Addressing Subtrees for a CAM
Associative Memory in Routers
• A simple network with three routers.
• The use of associative memories in high-end routers reduces the lookup time by allowing a search to be performed in a single operation.
• The search is based on the destination address, rather than the physical memory address.
• Access methods for this memory have been standardized into an interface interoperability agreement by the Network Processing Forum.
Block Diagram of Dual-Read RAM
• A dual-read or dual-port RAM allows any two words to be simultaneously read from the same memory.
The Intel Pentium 4 Memory System
Input/Output Organization
Outline
• Accessing I/O Devices
• Interrupts
• Direct Memory Access
• Buses
• Interface Circuits
• Standard I/O Interfaces
Advanced Reliable Systems (ARES) Lab.
Accessing I/O Devices
• Single-bus structure: the bus enables all the devices connected to it to exchange information.
• Typically, the bus consists of three sets of lines used to carry address, data, and control signals.
• Each I/O device is assigned a unique set of addresses.
I/O Mapping
• Memory-mapped I/O: devices and memory share an address space; I/O looks just like memory read/write; no special commands are needed for I/O; the large selection of memory access commands is available.
• Isolated I/O: separate address spaces; needs I/O or memory select lines; special commands for I/O, with a limited set.
Memory-Mapped I/O
• When I/O devices and the memory share the same address space, the arrangement is called memory-mapped I/O.
• With memory-mapped I/O, any machine instruction that can access memory can be used to transfer data to or from an I/O device.
• Most computer systems use memory-mapped I/O. Some processors have special IN and OUT instructions to perform I/O transfers.
• When building a computer system based on these processors, the designer has the option of connecting I/O devices to use the special I/O address space or simply incorporating them as part of the memory address space.
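Memory-mapped I/O can be sketched as a single address space in which some addresses reach a device instead of RAM. Everything here is illustrative: the register address 0xFFF0 and the `AddressSpace` model are made up for the example.

```python
class AddressSpace:
    DATAOUT = 0xFFF0                 # hypothetical device-register address

    def __init__(self):
        self.ram = {}
        self.printed = []            # stands in for the output device

    def store(self, addr, value):
        """One ordinary store path serves both memory and the device."""
        if addr == self.DATAOUT:
            self.printed.append(value)   # the "write" drives the device
        else:
            self.ram[addr] = value

    def load(self, addr):
        return self.ram.get(addr, 0)

bus = AddressSpace()
bus.store(0x1000, 42)                        # ordinary memory write
bus.store(AddressSpace.DATAOUT, ord("A"))    # same instruction performs I/O
assert bus.load(0x1000) == 42 and bus.printed == [ord("A")]
```

This is exactly the property the slide describes: no special I/O instructions are needed, because the address decoder, not the instruction, decides whether a transfer goes to memory or to a device.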
I/O Interface for an Input Device
• The address decoder, the data and status registers, and the control circuitry required to coordinate I/O transfers constitute the device's interface circuit.
I/O Techniques
• Programmed I/O
• Interrupt-driven I/O
• Direct Memory Access (DMA)
Program-Controlled I/O
• Consider a simple example of I/O operations involving a keyboard and a display device in a computer system. Four registers are used in the data transfer operations: DATAIN, DATAOUT, STATUS, and CONTROL. The two flags KIRQ and DIRQ in the STATUS register are used in conjunction with interrupts; SIN and SOUT indicate device readiness, and KEN and DEN in CONTROL enable the devices.
An Example
• A program that reads one line from the keyboard, stores it in a memory buffer, and echoes it back to the display:
       Move #LINE, R0        Initialize memory pointer
WAITK  TestBit #0, STATUS    Test SIN
       Branch=0 WAITK        Wait for character to be entered
       Move DATAIN, R1       Read character
WAITD  TestBit #1, STATUS    Test SOUT
       Branch=0 WAITD        Wait for display to become ready
       Move R1, DATAOUT      Send character to display
       Move R1, (R0)+        Store character and advance pointer
       Compare #$0D, R1      Check if Carriage Return
       Branch≠0 WAITK        If not, get another character
       Move #$0A, DATAOUT    Otherwise, send Line Feed
       Call PROCESS          Call a subroutine to process the input line
Program-Controlled I/O (continued)
• The example above illustrates program-controlled I/O, in which the processor repeatedly checks a status flag to achieve the required synchronization between the processor and an input or output device.
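The polled keyboard-echo loop above can be rendered in Python (an illustrative model, not the book's assembly): each character taken from the iterator models a successful SIN poll, the display list stands in for DATAOUT after a successful SOUT poll, and the loop stops at the carriage return.

```python
CR, LF = "\r", "\n"   # carriage return ($0D) and line feed ($0A)

def read_line(keyboard, display):
    """keyboard yields characters; display is a list standing in for DATAOUT."""
    line = []
    for ch in keyboard:        # each iteration models a successful SIN poll
        display.append(ch)     # echo the character (after the SOUT poll)
        line.append(ch)        # Move R1, (R0)+ : store and advance pointer
        if ch == CR:           # Compare #$0D, R1
            display.append(LF) # otherwise, send Line Feed
            break
    return "".join(line)

shown = []
assert read_line(iter("hi\rjunk"), shown) == "hi\r"
assert "".join(shown) == "hi\r\n"
```

The busy-wait branches (WAITK, WAITD) have no counterpart here precisely because they do no useful work; that wasted polling time is what motivates interrupts in the next section.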
• We say that the processor polls the devices.
• There are two other commonly used mechanisms for implementing I/O operations: interrupts and direct memory access.
• Interrupts: synchronization is achieved by having the I/O device send a special signal over the bus whenever it is ready for a data transfer operation.
• Direct memory access: the device interface transfers data directly to or from the memory.
Interrupts
• To avoid keeping the processor busy polling while it performs no useful computation, the I/O device can alert the processor with a hardware signal called an interrupt. At least one of the bus control lines, called an interrupt-request line, is usually dedicated for this purpose.
• An interrupt-service routine is executed when an interrupt request is issued.
• The processor must inform the device that its request has been recognized so that the device may remove its interrupt-request signal. An interrupt-acknowledge signal serves this function.
Example
• Program 1 (the COMPUTE routine) is interrupted at instruction i; the processor executes Program 2 (the PRINT routine) and then returns to instruction i + 1.
Interrupt-Service Routine vs. Subroutine
• Treatment of an interrupt-service routine is very similar to that of a subroutine, but an important departure from the similarity should be noted.
• A subroutine performs a function required by the program from which it is called. The interrupt-service routine may not have anything in common with the program being executed at the time the interrupt request is received. In fact, the two programs often belong to different users.
• Before executing the interrupt-service routine, any information that may be altered during the execution of that routine must be saved. This information must be restored before the interrupted program is resumed.
Interrupt Latency
• The information that needs to be saved and restored typically includes the condition code flags and the contents of any registers used by both the interrupted program and the interrupt-service routine.
• Saving registers increases the delay between the time an interrupt request is received and the start of execution of the interrupt-service routine. This delay is called interrupt latency.
• Typically, the processor saves only the contents of the program counter and the processor status register. Any additional information that needs to be saved must be saved by program instructions at the beginning of the interrupt-service routine.
Interrupt Hardware
• An equivalent circuit for an open-drain bus is used to implement a common interrupt-request line: INTR = INTR1 + INTR2 + … + INTRn.
Handling Multiple Devices
• Handling multiple devices gives rise to a number of questions:
• How can the processor recognize the device requesting an interrupt?
• Given that different devices are likely to require different interrupt-service routines, how can the processor obtain the starting address of the appropriate routine in each case?
• Should a device be allowed to interrupt the processor while another interrupt is being serviced?
• How should two or more simultaneous interrupt requests be handled?
• The information needed to determine whether a device is requesting an interrupt is available in its status register. When a device raises an interrupt request, it sets to 1 one of the bits in its status register, which we will call the IRQ bit.
Identifying the Interrupting Device
• The simplest way to identify the interrupting device is to have the interrupt-service routine poll all the I/O devices connected to the bus. The polling scheme is easy to implement; its main disadvantage is the time spent interrogating all the devices.
• Alternatively, a device requesting an interrupt may identify itself directly to the processor, which can then immediately start executing the corresponding interrupt-service routine. This is called vectored interrupts.
• An interrupt request from a high-priority device should be accepted while the processor is servicing a request from a lower-priority device.
Interrupt Priority
• The processor's priority is usually encoded in a few bits of the processor status word. It can be changed by program instructions that write into the processor status register (PS). These are privileged instructions, which can be executed only while the processor is running in the supervisor mode.
• The processor is in the supervisor mode only when executing operating system routines. It switches to the user mode before beginning to execute application programs.
• An attempt to execute a privileged instruction while in the user mode leads to a special type of interrupt.
Implementation of Interrupt Priority
• An example of the implementation of a multiple-priority scheme: each device has its own interrupt-request (INTR) and interrupt-acknowledge (INTA) lines, and a priority arbitration circuit decides which request to honor.
Simultaneous Requests
• Consider the problem of simultaneous arrivals of interrupt requests from two or more devices. The processor must have some means of deciding which request to service first.
• Interrupt priority scheme with a daisy chain: the interrupt-acknowledge signal propagates serially through the devices.
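The daisy-chain scheme can be sketched in a few lines: INTA propagates from the processor along the chain, and the first (electrically closest) device with a pending request claims it and blocks further propagation, so position in the chain encodes priority. The model below is illustrative.

```python
def acknowledge(requests):
    """requests: booleans in chain order (closest to the processor first).

    Return the index of the device that claims INTA, or None if no
    device is requesting an interrupt.
    """
    for i, pending in enumerate(requests):
        if pending:
            return i      # this device absorbs INTA; it goes no further
    return None

# Devices 2 and 3 both request; device 2 is closer to the processor and wins:
assert acknowledge([False, True, True]) == 1
assert acknowledge([False, False, False]) is None
```

The trade-off mirrors the slides: the chain needs no arbitration logic per device beyond the pass/claim decision, but a device far down the chain can be starved by busier devices nearer the processor.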
Priority Groups
• The interrupt priority scheme with a daisy chain can be combined with individual interrupt-request and interrupt-acknowledge lines: devices are organized into groups, each group connected at a different priority level, with the devices within a group connected in a daisy chain.
Direct Memory Access
• To transfer large blocks of data at high speed, a special control unit may be provided between an external device and the main memory, without continuous intervention by the processor. This approach is called direct memory access (DMA).
• DMA transfers are performed by a control circuit that is part of the I/O device interface. We refer to this circuit as a DMA controller. Since it has to transfer blocks of data, the DMA controller must increment the memory address for successive words and keep track of the number of transfers.
DMA Controller
• Although a DMA controller can transfer data without intervention by the processor, its operation must be under the control of a program executed by the processor.
• An example register set: a status and control register (with IRQ, Done, IE, and R/W bits), a starting address register, and a word count register.
DMA Controller in a Computer System
• A disk/DMA controller serving two disks, and a DMA controller serving a printer, a keyboard, and a network interface, are attached with the processor and main memory to the system bus.
Memory Access Priority
• Memory accesses by the processor and the DMA controllers are interwoven. Requests by DMA devices for using the bus are always given higher priority than processor requests.
• Among different DMA devices, top priority is given to high-speed peripherals such as a disk or a high-speed network interface.
• Since the processor originates most memory access cycles, the DMA controller can be said to "steal" memory cycles from the processor.
Hence, this interweaving technique is usually called cycle stealing. Alternatively, the DMA controller may transfer a block of data without interruption; this is called block or burst mode.

Bus Arbitration
A conflict may arise if both the processor and a DMA controller, or two DMA controllers, try to use the bus at the same time to access the main memory. To resolve this problem, a bus arbitration procedure is needed.
The device that is allowed to initiate data transfers on the bus at any given time is called the bus master. When the current master relinquishes control of the bus, another device can acquire this status. Bus arbitration is the process by which the next device to become the bus master is selected; it takes into account the needs of the various devices by establishing a priority system for gaining access to the bus.
There are two approaches to bus arbitration: centralized and distributed. In centralized arbitration, a single bus arbiter performs the required arbitration. In distributed arbitration, all devices participate in the selection of the next bus master.

Centralized Arbitration
[Figure: the processor acts as the arbiter; DMA controllers 1 and 2 share a bus-request line (BR) and a bus-busy line (BBSY), and bus-grant signals (BG1, BG2) are daisy-chained through the controllers. The timing diagram shows BR, BG1, BG2 and BBSY as mastership passes from the processor to DMA controller 2 and back.]

Distributed Arbitration
Each device drives its 4-bit ID onto the open-collector arbitration lines ARB3-ARB0; the lines carry the wired-OR of all contending IDs, and a device that sees a 1 on a line where its own ID has a 0 drops out. Assume the IDs of devices A and B are 5 and 6. The code seen by both devices is then 0111, and device B, with the larger ID, wins the arbitration.

Buses
A bus protocol is the set of rules that govern the behavior of the various devices connected to the bus: when to place information on the bus, assert control signals, and so on. In a synchronous bus, all devices derive timing information from a common clock line.
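Returning to the distributed-arbitration example above, the drop-out rule can be sketched as follows (a simulation under the stated assumptions: 4-bit IDs, wired-OR lines, most-significant bit examined first).

```python
# Sketch of the distributed-arbitration drop-out rule described above.

def arbitrate(ids, width=4):
    """Return the winning ID: scanning from the MSB, a device with a 0 where
    the wired-OR composite has a 1 withdraws, so the highest ID wins."""
    contenders = list(ids)
    for bit in reversed(range(width)):          # MSB first
        composite = 0
        for d in contenders:
            composite |= d                      # wired-OR of the drivers
        if composite & (1 << bit):
            contenders = [d for d in contenders if d & (1 << bit)]
    return contenders[0]

# Devices A and B with IDs 5 (0101) and 6 (0110): the composite code is 0111,
# A drops out at bit position 1, and B wins.
print(arbitrate([5, 6]))   # -> 6
```

The net effect is that the device with the numerically largest ID becomes the next bus master, without any central arbiter.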
Equally spaced pulses on this line define equal time intervals. In the simplest form of a synchronous bus, each of these intervals constitutes a bus cycle during which one data transfer can take place.

A Synchronous Bus Example
[Figure: timing of an input transfer on a synchronous bus; the master places the address and command on the bus at t0, data are transferred, and the bus cycle ends at t2]
[Figure: detailed timing diagram; because of propagation delays, the address and command are seen by the master at tAM and by the slave at tAS, and the data sent by the slave are seen at tDS and tDM respectively]

Input Transfer Using Multiple Clock Cycles
[Figure: the address and command are placed on the bus in clock cycle 1; the slave places the data and asserts Slave-ready in cycle 3, and the transfer completes in cycle 4]

Asynchronous Bus
An alternative scheme for controlling data transfers on the bus is based on the use of a handshake between the master and the slave.
[Figure: input transfer; the master places the address and command and asserts Master-ready (t0, t1), the slave places the data and asserts Slave-ready (t2, t3), and both then withdraw their signals (t4, t5), completing the bus cycle]
[Figure: handshake control of data transfer during an output operation, with the same Master-ready/Slave-ready sequence from t0 to t5]
Discussion
The choice of a particular design involves trade-offs among factors such as:
- Simplicity of the device interface
- Ability to accommodate device interfaces that introduce different amounts of delay
- Total time required for a bus transfer
- Ability to detect errors resulting from addressing a nonexistent device or from an interface malfunction
For an asynchronous bus, the handshake process eliminates the need to synchronize the sender and receiver clocks, thus simplifying the timing design. For a synchronous bus, the clock circuitry must be designed carefully to ensure proper synchronization, and delays must be kept within strict bounds.

Interface Circuits
Keyboard-to-processor connection: when a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII code to be loaded into DATAIN and the status flag SIN to be set to 1. SIN is cleared to 0 when the processor reads the contents of the DATAIN register.
[Figure: the processor connects to the input interface over data, address, R/W, Master-ready and Slave-ready lines; the interface receives the data and Valid signals from the encoder and debouncing circuit attached to the keyboard switches]

Input Interface Circuit
[Figure: the keyboard data are latched into DATAIN (outputs Q7-Q0 driving bus lines D7-D0); an address decoder on A31-A1, together with A0, R/W and Master-ready, generates the Read-data and Read-status signals, and the status flag SIN is returned to the processor via Slave-ready]

Circuit for the Status Flag Block
[Figure: SIN is implemented with a flip-flop that is set when Valid goes to 1 and cleared when Read-data and Master-ready are both asserted]
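The DATAIN/SIN behavior just described can be sketched as a small model (the register names come from the text; the class and method names are illustrative). A key press loads DATAIN and sets SIN; the processor's read of DATAIN clears SIN, which is exactly the condition a program-controlled input loop polls on.

```python
# Illustrative model of the keyboard input interface described above.

class KeyboardInterface:
    def __init__(self):
        self.datain = 0           # DATAIN register
        self.sin = 0              # SIN status flag

    def key_pressed(self, ascii_code):
        """Valid goes 0 -> 1: load the character and set the flag."""
        self.datain = ascii_code
        self.sin = 1

    def read_datain(self):
        """A processor read returns the character and clears SIN."""
        self.sin = 0
        return self.datain

kbd = KeyboardInterface()
kbd.key_pressed(ord('A'))
assert kbd.sin == 1               # a character is waiting
ch = kbd.read_datain()
print(chr(ch), kbd.sin)           # the character is read; SIN is back to 0
```

A polling driver would simply spin on `kbd.sin` before each read, mirroring the wait loop in simple program-controlled I/O.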
Printer-to-Processor Connection
The interface contains a data register, DATAOUT, and a status flag, SOUT. The SOUT flag is set to 1 when the printer is ready to accept another character, and it is cleared to 0 when a new character is loaded into DATAOUT by the processor. When the printer is ready to accept a character, it asserts its Idle signal.
[Figure: the processor connects to the output interface over data, address, R/W, Master-ready and Slave-ready lines; the interface sends data and Valid to the printer and receives Idle]

Output Interface Circuit
[Figure: bus data D7-D0 are latched into DATAOUT (Q7-Q0 driving the printer data lines); a handshake control block exchanges Valid and Idle with the printer; the address decoder, together with A0, R/W and Master-ready, generates the Load-data and Read-status signals, and SOUT drives Slave-ready]

A General 8-Bit Parallel Interface
[Figure: DATAIN and DATAOUT registers connect the bus lines D7-D0 to the port pins P7-P0, with a data direction register selecting input or output operation for each pin]

Output Interface Circuit for a Bus Protocol
[Figure: as in the output interface above, but with timing logic driven by the bus clock; the address decoder produces My-address, and the Go signal, together with Idle, controls the Respond and Slave-ready timing]

Timing Diagram for an Output Operation
[Figure: the address and R/W are placed on the bus in clock cycle 1, Go is asserted, the data are transferred, and Slave-ready is asserted in cycle 3]

Serial Port
A serial port is used to connect the processor to I/O devices that require transmission of data one bit at a time. The key feature of an interface circuit for a serial port is that it is capable of communicating in a bit-serial fashion on the device side and in a bit-parallel fashion on the bus side. The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability.
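The parallel/serial transformation attributed to the shift registers can be sketched in a few lines (the LSB-first bit order is an assumption for illustration; the text does not specify one).

```python
# Sketch of the shift-register parallel/serial conversion described above.

def serialize(byte):
    """Shift a byte out one bit per 'clock', LSB first (device side)."""
    return [(byte >> i) & 1 for i in range(8)]

def deserialize(bits):
    """Shift incoming bits into a register, then read it in parallel (bus side)."""
    value = 0
    for i, b in enumerate(bits):
        value |= b << i
    return value

data = 0b01100001                       # ASCII 'a'
stream = serialize(data)
print(stream)                           # the bit-serial stream
print(deserialize(stream) == data)      # reassembled on the parallel side: True
```

In hardware, `serialize` corresponds to a parallel load of DATAOUT into the output shift register followed by eight shifts, and `deserialize` to eight shifts into the input shift register followed by a parallel read of DATAIN.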
A Serial Interface
[Figure: the serial input feeds an input shift register, which is transferred in parallel to DATAIN; DATAOUT is transferred in parallel to an output shift register, which drives the serial output]

Standard I/O Interfaces
The processor bus is the bus defined by the signals on the processor chip itself. Devices that require a very high-speed connection to the processor, such as the main memory, may be connected directly to this bus.
The motherboard usually provides another bus that can support more devices. The two buses are interconnected by a circuit, called a bridge, that translates the signals and protocols of one bus into those of the other.
It is impossible to define a uniform standard for the processor bus, because the structure of this bus is closely tied to the architecture of the processor. The expansion bus is not subject to these limitations, and therefore it can use a standardized signaling structure.

Peripheral Component Interconnect (PCI) Bus
[Figure: use of a PCI bus in a computer system; a PCI bridge connects the host and main memory to the PCI bus, to which the disk, printer, and Ethernet interface are attached]

PCI Bus
The bus supports three independent address spaces: memory, I/O, and configuration. The I/O address space is intended for use with processors, such as the Pentium, that have a separate I/O address space; however, the system designer may choose to use memory-mapped I/O even when a separate I/O address space is available. The configuration space is intended to give the PCI its plug-and-play capability. A 4-bit command that accompanies the address identifies which of the three spaces is being used in a given data transfer operation.
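The role of the 4-bit command in selecting one of the three PCI address spaces can be illustrated with a small decoder sketch (the command encodings below are illustrative assumptions, not the actual PCI command codes).

```python
# Illustrative sketch: a 4-bit command selects which PCI address space a
# transaction uses. The encodings here are assumed for the example only.

COMMAND_SPACE = {
    0b0010: "I/O",             # hypothetical I/O read command
    0b0110: "memory",          # hypothetical memory read command
    0b1010: "configuration",   # hypothetical configuration read command
}

def address_space(command):
    """Return the address space a transaction with this command targets."""
    try:
        return COMMAND_SPACE[command]
    except KeyError:
        raise ValueError(f"unrecognized command {command:04b}")

print(address_space(0b0110))   # memory
print(address_space(0b1010))   # configuration
```

The point of the sketch is structural: the same AD lines carry the address for all three spaces, and only the accompanying command distinguishes them.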
Data Transfer Signals on the PCI Bus
- CLK: a 33-MHz or 66-MHz clock
- FRAME#: sent by the initiator to indicate the duration of a transaction
- AD: 32 address/data lines, which may optionally be increased to 64
- C/BE#: 4 command/byte-enable lines (8 for a 64-bit bus)
- IRDY#, TRDY#: initiator-ready and target-ready signals
- DEVSEL#: a response from the device indicating that it has recognized its address and is ready for a data transfer transaction
- IDSEL#: initialization device select

A Read Operation on the PCI Bus
[Figure: the initiator asserts FRAME# and drives the address and command in clock cycle 1; byte enables follow on C/BE#; the target asserts DEVSEL# and TRDY#, and data words #1 through #4 are transferred in successive cycles while IRDY# is asserted]

Universal Serial Bus (USB)
The USB has been designed to meet several key objectives:
- Provide a simple, low-cost, and easy-to-use interconnection system that overcomes the difficulties due to the limited number of I/O ports available on a computer
- Accommodate a wide range of data transfer characteristics for I/O devices, including telephone and Internet connections
- Enhance user convenience through a "plug-and-play" mode of operation

USB Structure
A serial transmission format has been chosen for the USB because a serial bus satisfies the low-cost and flexibility requirements. Clock and data information are encoded together and transmitted as a single signal; hence, there are no limitations on clock frequency or distance arising from data skew.
To accommodate a large number of devices that can be added or removed at any time, the USB has a tree structure. Each node of the tree has a device called a hub, which acts as an intermediate control point between the host and the I/O devices. At the root of the tree, a root hub connects the entire tree to the host computer.
USB Tree Structure
[Figure: the host computer connects to the root hub; further hubs attach below it, and I/O devices attach to the hub ports]
The tree structure enables many devices to be connected while using only simple point-to-point serial links. Each hub has a number of ports where devices may be connected, including other hubs. In normal operation, a hub copies a message that it receives from its upstream connection to all its downstream ports. As a result, a message sent by the host computer is broadcast to all I/O devices, but only the addressed device will respond to that message. A message sent from an I/O device travels only upstream towards the root of the tree and is not seen by other devices. Hence, the USB enables the host to communicate with the I/O devices, but it does not enable these devices to communicate with each other.

USB Protocols
All information transferred over the USB is organized in packets, where a packet consists of one or more bytes of information. The information transferred on the USB can be divided into two broad categories: control and data. Control packets perform such tasks as addressing a device to initiate data transfer, acknowledging that data have been received correctly, or indicating an error. Data packets carry information that is delivered to a device; for example, input and output data are transferred inside data packets.
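The hub routing rule described above, broadcast downstream, respond only if addressed, can be sketched as a toy model (the classes and the device addresses are illustrative assumptions, not part of the USB specification as presented in the text).

```python
# Toy model of the USB tree routing rule described above.

class Hub:
    def __init__(self):
        self.ports = []                  # downstream hubs or devices

    def downstream(self, message, received):
        """A hub copies an upstream message to all its downstream ports."""
        for port in self.ports:
            port.downstream(message, received)

class Device:
    def __init__(self, address):
        self.address = address

    def downstream(self, message, received):
        addr, payload = message
        if addr == self.address:         # only the addressed device responds
            received.append((self.address, payload))

root = Hub()
hub = Hub()
root.ports = [Device(1), hub]            # device 1 and a second-level hub
hub.ports = [Device(2), Device(3)]

received = []
root.downstream((3, "data"), received)   # the host's packet reaches every device
print(received)                          # but only device 3 responds: [(3, 'data')]
```

Device-to-device traffic is simply impossible in this model: a device can only answer upstream, which mirrors the text's observation that USB devices cannot talk to each other directly.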
Interconnection Networks
• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium

Switch Network Topologies
• View a switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect

Direct Topology
• Ratio of switch nodes to processor nodes is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node
• Example: 2-D meshes

Indirect Topology
• Ratio of switch nodes to processor nodes is greater than 1:1
• Some switches simply connect other switches
• Example: binary tree network, with n = 2^d processor nodes and n − 1 switches; hypertree and butterfly networks are related indirect topologies

Routing: Hypercube Addressing
[Figure: a 4-dimensional hypercube with nodes labeled 0000 through 1111; neighboring nodes differ in exactly one bit position]

Shuffle-Exchange Network
[Figure: an 8-node shuffle-exchange network (nodes 0 through 7) and the 16-node addressing 0000 through 1111; the shuffle connection maps each node to the node whose address is a left cyclic rotation of its own, and the exchange connection joins nodes differing in the least significant bit]

Why Processor Arrays?
• Historically, the high cost of a control unit
• Scientific applications have data parallelism

Processor Array
• Front-end computer
– Holds the program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

Processor Array Performance
• Performance: work done per time unit
• The performance of a processor array depends on
– Speed of the processing elements
– Utilization of the processing elements

2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements.

Flynn's Taxonomy
• Instruction stream
• Data stream
• Single vs.
multiple
• Four combinations
– SISD
– SIMD
– MISD
– MIMD

SISD
• Single Instruction, Single Data
• Single-CPU systems
• Note: co-processors don't count
– Functional
– I/O
• Example: PCs

SIMD
• Single Instruction, Multiple Data
• Two architectures fit this category
– Pipelined vector processor (e.g., Cray-1)
– Processor array (e.g., Connection Machine)

MISD
• Multiple Instruction, Single Data
• Example: systolic array

Pipelining Overview
• Pipelining is widely used in modern processors.
• Pipelining improves system performance in terms of throughput.
• A pipelined organization requires sophisticated compilation techniques.

Making the Execution of Programs Faster
• Use faster circuit technology to build the processor and the main memory.
• Arrange the hardware so that more than one operation can be performed at the same time.
• In the latter approach, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

Traditional Pipeline Concept: Laundry Example
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
• Sequential laundry takes 6 hours for 4 loads: each load takes 30 + 40 + 20 = 90 minutes, and the four loads run back to back
• If they learned pipelining, how long would laundry take?
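The laundry schedule can be worked out in a short sketch: once the pipeline is full, a new load finishes every 40 minutes, the time of the slowest stage (the dryer).

```python
# Sketch of the laundry pipelining arithmetic from the example above.

def sequential_time(loads, stages):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stages)

def pipelined_time(loads, stages):
    """The first load fills the pipeline; after that, the slowest stage
    paces the completion of each remaining load."""
    return sum(stages) + (loads - 1) * max(stages)

stages = [30, 40, 20]                    # washer, dryer, folder (minutes)
print(sequential_time(4, stages))        # 360 minutes = 6 hours
print(pipelined_time(4, stages))         # 210 minutes = 3.5 hours
```

This matches the slides: sequential laundry takes 6 hours, pipelined laundry 3.5 hours, and the speedup is limited by the 40-minute dryer stage rather than reaching the ideal factor of 3.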
Traditional Pipeline Concept (continued)
• Pipelined laundry takes 3.5 hours for 4 loads: after the first load finishes at 90 minutes, a new load completes every 40 minutes
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• The pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce the speedup
• The time to "fill" the pipeline and the time to "drain" it also reduce the speedup
• Stalls for dependences

Using the Idea of Pipelining in a Computer: Fetch + Execution
[Figure 8.1. Basic idea of instruction pipelining: (a) sequential execution of instructions I1, I2, I3 as F1 E1, F2 E2, F3 E3; (b) hardware organization with an instruction fetch unit and an execution unit separated by interstage buffer B1; (c) pipelined execution, in which the fetch of each instruction overlaps the execution of its predecessor]

A 4-Stage Pipeline: Fetch + Decode + Execute + Write
[Figure 8.2. A 4-stage pipeline (textbook page 457). (a) Instruction execution divided into four steps: F (fetch instruction), D (decode instruction and fetch operands), E (execute operation), W (write results); instructions I1 to I4 complete one per cycle from cycle 4 onward. (b) Hardware organization with interstage buffers B1, B2, and B3 between the stages.]

Role of Cache Memory
• Each pipeline stage is expected to complete in one clock cycle.
• The clock period should be long enough to let the slowest pipeline stage complete.
• Faster stages can only wait for the slowest one to complete.
• Since main memory is very slow compared to execution, if each instruction needed to be fetched from main memory, the pipeline would be almost useless.
• Fortunately, we have cache memory.

Pipeline Performance
• The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
• However, this increase would be achieved only if all pipeline stages require the same time to complete and there is no interruption throughout program execution.
• Unfortunately, this is not true.

[Figure 8.3. Effect of an execution operation taking more than one clock cycle: E2 of instruction I2 occupies cycles 4 to 6, so the execution of I3 is delayed, and the later stages of I3, I4 and I5 stall accordingly.]

• The previous pipeline is said to have been stalled for two clock cycles.
• Any condition that causes a pipeline to stall is called a hazard.
• Data hazard – any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline; some operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions require the use of a given hardware resource at the same time.

Instruction Hazard
[Figure 8.4. Pipeline stall caused by a cache miss in F2: (a) fetching I2 takes cycles 2 to 5, so D2, E2 and W2 are delayed; (b) the decode, execute and write stages sit idle during cycles 3 to 5. These idle periods are called stalls, or bubbles.]

Structural Hazard
Load X(R1), R2
[Figure 8.5. Effect of a Load instruction on pipeline timing: I2 (the Load) needs an extra memory-access step M2 in cycle 5, so W2 occurs in cycle 6, and the stages of the following instructions are pushed back by one cycle.]

Pipeline Performance
• Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases.
• Throughput is measured by the rate at which instruction execution is completed.
• A pipeline stall causes degradation in pipeline performance.
• We need to identify all hazards that may cause the pipeline to stall and find ways to minimize their impact.

Quiz
• Four instructions; I2 takes two clock cycles for execution. Draw the figure for the 4-stage pipeline and work out the total number of cycles needed for the four instructions to complete.

Data Hazards
• We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
• Hazard occurs:
A ← 3 + A
B ← 4 × A
• No hazard:
A ← 5 × C
B ← 20 + C
• When two operations depend on each other, they must be executed sequentially in the correct order.
• Another example:
Mul R2, R3, R4
Add R5, R4, R6

[Figure 8.6. Pipeline stalled by the data dependency between D2 and W1: the Add cannot finish fetching its operands (D2A) until the Mul has written R4, so E2 is delayed and the following instructions stall.]

Operand Forwarding
• Instead of reading from the register file, the second instruction can get the data directly from the output of the ALU once the previous instruction has produced it.
• A special arrangement needs to be made to "forward" the output of the ALU to its input.
[Figure 8.7. Operand forwarding in a pipelined processor: (a) datapath with source registers SRC1 and SRC2, the ALU, and result register RSLT; (b) the forwarding path carries the result from the Execute stage back to the ALU inputs, bypassing the Write stage.]

Handling Data Hazards in Software
• Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
• The compiler can reorder instructions to perform some useful work during the NOP slots.

Side Effects
• The previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a register other than the one named as the destination.
• When a location other than the one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect.
• Example: condition code flags:
Add R1, R3
AddWithCarry R2, R4
• Instructions designed for execution on pipelined hardware should have few side effects.

Instruction Hazards
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls.
• Causes: cache misses and branches.

Unconditional Branches
[Figure 8.8. An idle cycle caused by a branch instruction: instruction I3, fetched after the branch I2, is discarded (X), and the execution unit is idle for one cycle before Ik, the branch target, is fetched.]

Branch Timing
[Figure 8.9. Branch timing: (a) if the branch address is computed in the Execute stage, the instructions fetched after the branch (I3 and I4) are discarded, giving a two-cycle branch penalty; (b) if the branch address is computed in the Decode stage, only I3 is discarded, and the penalty is reduced to one cycle.]

Instruction Queue and Prefetching
[Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b: the fetch unit (F) fills an instruction queue, from which a dispatch/decode unit (D) feeds the execute (E) and write (W) stages.]

Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
• The decision to branch cannot be made until the execution of that instruction has been completed.
• Branch instructions represent about 20% of the dynamic instruction count of most programs.

Delayed Branch
• The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
• The objective is to place useful instructions in these slots.
• The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.

[Figure 8.12. Reordering of instructions for a delayed branch.
(a) Original program loop:
LOOP  Shift_left R1
      Decrement R2
      Branch=0  LOOP
NEXT  Add R1, R3
(b) Reordered instructions:
LOOP  Decrement R2
      Branch=0  LOOP
      Shift_left R1
NEXT  Add R1, R3]

[Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12: while the branch is taken, Decrement, Branch and Shift (in the delay slot) repeat; when the branch is not taken, the Add follows the delay-slot Shift without a wasted cycle.]

Branch Prediction
• Predict whether or not a particular branch will be taken.
• Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order.
• Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis.
• Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence.
• Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.

[Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken: I3 and I4, fetched speculatively after the branch I2, are discarded (X) once the branch resolves, and fetching resumes at Ik.]

• Better performance can be achieved if we arrange for some branch instructions to be predicted as taken and others as not taken.
• Use hardware to observe whether the target address is lower or higher than that of the branch instruction.
• Let the compiler include a branch prediction bit.
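The cost of the simple "predict not taken" scheme can be sketched as a cycle count (the assumptions are illustrative: one cycle per instruction once the pipeline is full, and a fixed two-cycle penalty whenever a branch turns out to be taken, as in Figure 8.14's Execute-stage resolution).

```python
# Sketch of static "predict not taken" branch prediction cost.

def cycles_with_static_prediction(outcomes, penalty=2):
    """outcomes: True for each instruction that is a taken branch.
    One cycle per instruction, plus the misprediction penalty for every
    taken branch (the speculatively fetched instructions are discarded)."""
    cycles = len(outcomes)
    for taken in outcomes:
        if taken:
            cycles += penalty
    return cycles

# 10 instructions, 2 of which are taken branches:
trace = [False] * 8 + [True, True]
print(cycles_with_static_prediction(trace))   # 10 + 2*2 = 14 cycles
```

The sketch makes the slides' point concrete: with branches around 20% of the dynamic instruction count, mispredictions add a significant fraction to the cycle count, which motivates the smarter prediction schemes discussed next.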
• So far, the branch prediction decision is always the same every time a given instruction is executed – static branch prediction.

Influence on Instruction Sets
• Some instructions are much better suited to pipelined execution than others.
• Key issues: addressing modes and condition code flags.

Addressing Modes
• Addressing modes include simple ones and complex ones.
• In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:
– Side effects
– The extent to which complex addressing modes cause the pipeline to stall
– Whether a given mode is likely to be used by compilers
• Recall Load X(R1), R2 from Figure 8.5, where the extra memory access delayed the pipeline; Load (R1), R2 needs no such extra cycle.

Complex Addressing Mode
Load (X(R1)), R2
[Figure: the Load computes X + [R1], reads [X + [R1]], and then reads [[X + [R1]]], occupying the pipeline for several extra cycles before the result is written and forwarded; the next instruction's progress is delayed accordingly.]

Simple Addressing Mode
The same operation can be expressed with simple modes:
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
[Figure: the Add computes X + [R1], the first Load reads [X + [R1]], and the second Load reads [[X + [R1]]]; each instruction flows through the pipeline normally.]

Addressing Modes (comparison)
• In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
• Advantage: reducing the number of instructions and the program space.
• Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.
• Conclusion: complex addressing modes are not suitable for pipelined execution.
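That the complex-mode Load and the three-instruction simple-mode sequence compute the same value can be checked in a few lines (the memory contents and register values below are illustrative assumptions).

```python
# Sketch showing the equivalence of the two sequences above:
# Load (X(R1)), R2   versus   Add #X, R1, R2 ; Load (R2), R2 ; Load (R2), R2
# Both deliver [[X + [R1]]] to R2.

X = 8                              # assumed displacement
mem = {20: 100, 100: 42}           # assumed memory: mem[X + R1] = 100, mem[100] = 42
R1 = 12                            # assumed register contents

# Complex addressing mode: one instruction, two memory accesses
R2_complex = mem[mem[X + R1]]

# Simple addressing modes: three instructions, one memory access each
R2 = X + R1                        # Add  #X, R1, R2
R2 = mem[R2]                       # Load (R2), R2
R2 = mem[R2]                       # Load (R2), R2

print(R2_complex, R2)              # both yield 42
```

The computation is identical; the difference is only in how the pipeline absorbs the two memory accesses, which is exactly the trade-off the comparison above describes.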
Addressing Modes (requirements)
• Good addressing modes should have the following properties:
– Access to an operand does not require more than one access to the memory
– Only load and store instructions access memory operands
– The addressing modes used do not have side effects
• Register, register indirect, and index modes satisfy these conditions.

Condition Codes
• If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that the reordering does not cause a change in the outcome of a computation.
• The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.

[Figure 8.17. Instruction reordering.
(a) A program fragment:
Add      R1, R2
Compare  R3, R4
Branch=0 ...
(b) Instructions reordered:
Compare  R3, R4
Add      R1, R2
Branch=0 ...]

• Two conclusions:
– To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
– The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.

Datapath and Control Considerations
Original Design
[Figure 7.8. Three-bus organization of the datapath: the register file, the PC (with an incrementer and a MUX selecting the constant 4), the IR and instruction decoder, the ALU with input buffers A and B and result register R, and the MDR and MAR connecting to the memory bus data and address lines, all attached to buses A, B, and C.]
Pipelined Design
Changes relative to the three-bus organization:
- Separate instruction and data caches
- The PC is connected to an instruction memory address register (IMAR)
- A separate data memory address register (DMAR)
- Separate MDRs (MDR/Read and MDR/Write)
- Buffers at the input and output of the ALU
- An instruction queue
- Buffered instruction decoder output
Operations that can then proceed in parallel:
- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation
[Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.]

Superscalar Operation
• The maximum throughput of a pipelined processor is one instruction per clock cycle.
• If we equip the processor with multiple processing units to handle several instructions in parallel in each processing stage, several instructions start execution in the same clock cycle – multiple issue.
• Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are called superscalar processors.
• Multiple issue requires a wider path to the cache and multiple execution units.
[Figure 8.19. A processor with two execution units: the instruction fetch unit (F) fills an instruction queue; a dispatch unit issues instructions to a floating-point unit and an integer unit, and the W stage writes the results.]

Timing
[Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered: the floating-point instructions I1 (Fadd) and I3 (Fsub) each need three execution cycles (E1A, E1B, E1C), while the integer instructions I2 (Add) and I4 (Sub) execute in one cycle, so the integer instructions complete ahead of the floating-point instructions issued before them.]
Out-of-Order Execution
• Hazards
• Exceptions
• Imprecise exceptions vs. precise exceptions
[Figure (a): delayed write – I1 (Fadd) executes in E1A, E1B, E1C; the writes of the faster integer instructions are delayed so that results are written to the register file in program order.]

Execution Completion
• It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible.
• At the same time, instructions must be completed in program order to allow precise exceptions.
• This can be achieved with temporary registers and a commitment unit.
[Figure (b): using temporary registers – I2 (Add) and I4 (Sub) first write into temporary registers (TW2, TW4); the permanent writes W2, W3, W4 are then performed in program order.]

Performance Considerations
• The execution time T of a program that has a dynamic instruction count N is given by

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate.
• Instruction throughput is defined as the number of instructions executed per second:

Ps = R / S

• An n-stage pipeline has the potential to increase the throughput by n times.
• However, the only real measure of performance is the total execution time of a program.
• Higher instruction throughput will not necessarily lead to higher performance.
• Two questions regarding pipelining:
– How much of this potential increase in instruction throughput can be realized in practice?
– What is a good value of n?

Number of Pipeline Stages
• Since an n-stage pipeline has the potential to increase throughput by n times, why not use a 10,000-stage pipeline?
• As the number of stages increases, the probability of the pipeline being stalled increases.
• The inherent delay in the basic operations increases.
• Hardware considerations (area, power, complexity, ...)
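The formulas T = (N × S) / R and Ps = R / S can be exercised with a short worked sketch; the instruction count and clock rate below are illustrative assumptions, not figures from the text.

```python
# Worked sketch of the performance formulas above.

def execution_time(N, S, R):
    """T = N * S / R, with N the dynamic instruction count, S the average
    cycles per instruction, and R the clock rate in Hz."""
    return N * S / R

def throughput(R, S):
    """Ps = R / S, instructions executed per second."""
    return R / S

N = 1_000_000          # assumed dynamic instruction count
R = 500_000_000        # assumed 500 MHz clock

# Without pipelining, a 4-step instruction takes S = 4 cycles;
# an ideal 4-stage pipeline brings the average down to S = 1.
print(execution_time(N, S=4, R=R))   # 0.008 s
print(execution_time(N, S=1, R=R))   # 0.002 s
print(throughput(R, S=1))            # 5e8 instructions per second
```

Note what the sketch does and does not claim: the fourfold improvement assumes no stalls at all (S = 1), so it is the potential n-fold increase the text mentions; any hazard that raises the average S eats directly into it.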