15A05402 - COMPUTER ORGANIZATION
Prepared by Ms. M. Latha Reddy, Assistant Professor
• JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR
• B. Tech III-I Sem. (ECE) L T P C 3 1 0 3
15A05402 - COMPUTER ORGANIZATION
• UNIT-I: Computer types, functional units, basic operational concepts, bus structures, data types. Software: languages and translators, loaders, linkers, operating systems. Memory locations, addresses and encoding of information; main memory operations; instruction formats and instruction sequences; addressing modes and instructions; simple input programming; pushdown stacks; subroutines.
• UNIT-II: Register transfer language, register transfer, bus and memory transfers, arithmetic micro-operations, logic micro-operations, shift micro-operations, arithmetic logic shift unit. Stack organization, instruction formats, addressing modes, data transfer and manipulation, execution of a complete instruction, sequencing of control signals, program control.
• UNIT-III: Control memory, address sequencing, microprogram example, design of control unit. Addition and subtraction, multiplication algorithms, division algorithms, floating-point arithmetic operations, decimal arithmetic unit, decimal arithmetic operations.
• UNIT-IV: Peripheral devices, input-output interface, asynchronous data transfer, modes of transfer, priority interrupt, direct memory access (DMA), input-output processor (IOP), serial communication. Memory hierarchy, main memory, auxiliary memory, associative memory, cache memory, virtual memory, memory management hardware.
• UNIT-V: Parallel processing, pipelining, arithmetic pipeline, instruction pipeline, RISC pipeline, vector processing, array processors. Characteristics of multiprocessors, interconnection structures, interprocessor arbitration, interprocessor communication and synchronization, cache coherence.
• Text Books:
1. M. Morris Mano, "Computer System Architecture", Prentice Hall of India (PHI), Third edition.
2.
William Stallings, "Computer Organization and Architecture", Pearson Education (PE), Seventh edition, 2006.
COURSE OUTCOMES
• C311.1 Identify functional units, bus structure and addressing modes.
• C311.2 Explain the functional units of the processor such as the register file and ALU.
• C311.3 Make use of memory, I/O devices and virtual memory effectively.
• C311.4 Explain the input/output devices.
• C311.5 Apply the algorithms for exploring pipelining and the basic characteristics of multiprocessors.
Basic Structure of Computers
Content Coverage
[Block diagram: Main Memory System (address, data/instruction) — Central Processing Unit (CPU), containing cache memory, operational registers, program counter, arithmetic and logic unit, instruction sets, and control unit — Input/Output System]
Advanced Reliable Systems (ARES) Lab., Jin-Fu Li, EE, NCU
Functional Units
A computer consists of three main parts:
A processor (CPU)
A main-memory system
An I/O system
The CPU consists of a control unit, registers, the arithmetic and logic unit, the instruction execution unit, and the interconnections among these components.
The information handled by a computer:
Instructions: govern the transfer of information within a computer as well as between the computer and its I/O devices, and specify the arithmetic and logic operations to be performed.
Data: numbers and encoded characters that are used as operands by the instructions.
Program
A list of instructions that performs a task is called a program. The program is usually stored in a memory called program memory. The computer is completely controlled by the stored program, except for possible external interruption by an operator or by I/O devices connected to the machine.
Information handled by a computer must be encoded in a suitable format. Most present-day hardware employs digital circuits that have only two stable states, 0 (OFF) and 1 (ON).
Memory Unit
The memory is the storage area in which programs are kept when they are running and that contains the data needed by the running programs.
Types of memory:
Volatile memory: storage that retains data only while it is receiving power, such as dynamic random access memory (DRAM).
Nonvolatile memory: memory that retains data even in the absence of a power source and that is used to store programs between runs, such as flash memory.
Usually, a computer has two classes of storage: primary memory and secondary memory.
Primary memory: also called main memory. Volatile memory used to hold programs while they are running; typically consists of DRAM in today's computers.
Secondary memory: nonvolatile memory used to store programs and data between runs; typically consists of magnetic disks in today's computers.
The memory consists of storage cells, each capable of storing one bit of information. The storage cells are processed in groups of fixed size called words. To provide easy access to any word in the memory, a distinct address is associated with each word location. The number of bits in each word is often referred to as the word length of the computer; typical word lengths range from 16 to 64 bits. The capacity of the memory is one factor that characterizes the size of a computer.
Instructions and data can be written into the memory or read out under the control of the processor. It is essential to be able to access any word location in the memory as quickly as possible. Memory in which any location can be reached in a short, fixed amount of time after specifying its address is called random-access memory (RAM). The time required to access one word is called the memory access time; this time is fixed, independent of the location of the word being accessed.
The memory of a computer is normally implemented as a memory hierarchy of three or four levels. The small, fast RAM units are called caches; the largest and slowest unit is referred to as the main memory.
Arithmetic and Logic Unit
Most computer operations are performed in the arithmetic and logic unit (ALU) of the processor. For example, consider two numbers stored in the memory that are to be added: they are brought into the processor, and the actual addition is carried out by the ALU. The sum may then be stored in the memory or retained in the processor for immediate use.
Typical arithmetic and logic operations: addition, subtraction, multiplication, division, comparison, complement, etc.
When operands are brought into the processor, they are stored in high-speed storage elements called registers. Each register can store one word of data.
Control Unit
The control unit is the nerve center that sends control signals to other units and senses their states. Thus the control unit serves as a coordinator of the memory, arithmetic and logic, and input/output units.
The operation of a computer can be summarized as follows:
The computer accepts information in the form of programs and data through an input unit and stores it in the memory.
Information stored in the memory is fetched, under program control, into an ALU, where it is processed.
Processed information leaves the computer through an output unit.
All activities inside the machine are directed by the control unit.
Computer Components: Top-Level View / Basic Operational Concepts
[Block diagram: memory and input/output connected over the system bus to the processor, which contains MAR, MDR, PC, IR, the control unit, the ALU, and n general-purpose registers R0, R1, ..., Rn-1]
A Partial Program Execution Example
Memory holds the instructions 1940 (at address 300), 5941 (at 301) and 2941 (at 302), and the data words 0003 (at 940) and 0002 (at 941). Opcode 1 loads a memory word into the accumulator (AC), opcode 5 adds a memory word to AC, and opcode 2 stores AC back to memory.
Step 1: the instruction at [PC] = 300 is fetched into IR (IR = 1940) and PC advances to 301.
Step 2: the load executes, AC ← [940] = 0003.
Step 3: the instruction 5941 is fetched into IR and PC advances to 302.
Step 4: the add executes, AC ← 0003 + 0002 = 0005.
Step 5: the instruction 2941 is fetched into IR and PC advances to 303.
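The fetch-decode-execute trace above can be sketched as a tiny simulator. This is a hypothetical mini-machine, not an implementation from the slides: opcode digit 1 loads AC from memory, 5 adds to AC, and 2 stores AC, matching the example's instruction words.

```python
# Hypothetical mini-machine mirroring the partial program execution example:
# each word is one decimal digit of opcode followed by a 3-digit address.
def run(memory, pc):
    ac = 0
    while pc in memory:
        ir = memory[pc]                  # fetch: instruction into IR
        pc += 1                          # PC now points at the next word
        opcode, addr = divmod(ir, 1000)  # decode: opcode digit, address field
        if opcode == 1:
            ac = memory[addr]            # load AC from memory
        elif opcode == 5:
            ac += memory[addr]           # add memory word to AC
        elif opcode == 2:
            memory[addr] = ac            # store AC back to memory
        else:
            break                        # unknown opcode: halt
    return ac, memory

mem = {300: 1940, 301: 5941, 302: 2941, 940: 3, 941: 2}
ac, mem = run(mem, 300)
print(ac, mem[941])   # 5 5
```

After the run, both AC and location 941 hold 0003 + 0002 = 0005, exactly as in step 6 of the trace.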
Step 6: the store executes, [941] ← 0005.
Interrupt
Normal execution of programs may be preempted if some device requires urgent servicing. To deal with the situation immediately, the normal execution of the current program must be interrupted.
Procedure of an interrupt operation:
The device raises an interrupt signal.
The processor provides the requested service by executing an appropriate interrupt-service routine.
The state of the processor is first saved before servicing the interrupt. Normally, the contents of the PC, the general registers, and some control information are stored in memory.
When the interrupt-service routine is completed, the state of the processor is restored so that the interrupted program may continue.
Classes of Interrupts
Program: generated by some condition that occurs as a result of an instruction execution, such as arithmetic overflow, division by zero, an attempt to execute an illegal machine instruction, or a reference outside a user's allowed memory space.
Timer: generated by a timer within the processor. This allows the operating system to perform certain functions on a regular basis.
I/O: generated by an I/O controller, to signal normal completion of an operation or to signal a variety of error conditions.
Hardware failure: generated by a failure such as a power failure or a memory parity error.
Bus Structures
A group of lines that serves as a connecting path for several devices is called a bus. In addition to the lines that carry the data, the bus must have lines for address and control purposes. The simplest way to interconnect functional units is to use a single bus connecting the input, output, memory, and processor units.
Drawbacks of the Single Bus Structure
The devices connected to a bus vary widely in their speed of operation. Some devices are relatively slow, such as printers and keyboards; some are considerably faster, such as optical disks; the memory and processor units are the fastest parts of a computer. An efficient transfer mechanism is therefore needed to cope with this mismatch.
A common approach is to include buffer registers with the devices to hold the information during transfers.
Another approach is to use a two-bus structure and an additional transfer mechanism: a high-performance bus, a low-performance bus, and a bridge for transferring data between the two buses. The AMBA bus belongs to this structure.
Software
In order for a user to enter and run an application program, the computer must already contain some system software in its memory. System software is a collection of programs that are executed as needed to perform functions such as:
Receiving and interpreting user commands
Running standard application programs such as word processors or games
Managing the storage and retrieval of files on secondary storage devices
Controlling I/O units to receive input information and produce output results
Translating programs from the source form prepared by the user into object form consisting of machine instructions
Linking and running user-written application programs with existing standard library routines, such as numerical computation packages
System software is thus responsible for the coordination of all activities in a computing system.
Operating System
An operating system (OS) is a large program, or actually a collection of routines, that is used to control the sharing of and interaction among various computer units as they perform application programs. The OS routines perform the tasks required to assign computer resources to individual application programs. These tasks include assigning memory and magnetic disk space to program and data files, moving data between memory and disk units, and handling I/O operations.
In the following, a system with one processor, one disk, and one printer is used to explain the basics of an OS. Assume that part of the program's task involves reading a data file from the disk into the memory, performing some computation on the data, and printing the results.
User Program and OS Routine Sharing
[Timeline t0-t5 showing printer, disk, OS-routine, and program activity. t0-t1: an OS routine initiates loading the application program from disk to memory, waits until the transfer is completed, and passes execution control to the application program.]
Multiprogramming or Multitasking
[Same timeline: while a disk or printer transfer is in progress, the processor is not left idle but can execute another program.]
Performance
The speed with which a computer executes programs is affected by the design of its hardware and its machine language instructions. Because programs are usually written in a high-level language, performance is also affected by the compiler that translates programs into machine language. For best performance, the following factors must be considered:
Compiler
Instruction set
Hardware design
Processor circuits are controlled by a timing signal called a clock. The clock defines regular time intervals, called clock cycles. To execute a machine instruction, the processor divides the action to be performed into a sequence of basic steps, such that each step can be completed in one clock cycle. Let P be the length of one clock cycle; its inverse is the clock rate, R = 1/P.
Basic performance equation: T = (N x S) / R, where T is the processor time required to execute a program, N is the number of instruction executions, and S is the average number of basic steps needed to execute one machine instruction.
Faster Clock = Shorter Running Time?
Faster steps do not necessarily mean a shorter total time: a 2 GHz clock that needs 20 steps to reach the solution can lose to a 1 GHz clock that needs only 4. [Source: B. Parhami, UCSB]
System Balance is Essential
Note that system balance is absolutely essential for improving performance. If one replaces a machine's processor with a model having twice the performance, this will not double the overall system performance unless corresponding improvements are made to other parts of the system. A CPU-bound task is dominated by processing time, while an I/O-bound task is dominated by input and output time. [Source: B. Parhami, UCSB]
Performance Improvement
Pipelining and superscalar operation:
Pipelining: overlapping the execution of successive instructions.
Superscalar: different instructions are concurrently executed with multiple instruction pipelines. This means that multiple functional units are needed.
Clock rate improvement:
Improving the integrated-circuit technology makes logic circuits faster, which reduces the time needed to complete a basic step.
Reducing the amount of processing done in one basic step also makes it possible to reduce the clock period, P.
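The basic performance equation T = (N x S) / R above can be checked numerically. The figures below (90 million instructions, 5 steps each, 1 GHz clock) are made-up illustration values, not from the slides:

```python
# Sketch of the basic performance equation T = (N x S) / R, with made-up
# numbers: N instruction executions, S basic steps per instruction,
# and clock rate R in Hz.
def exec_time(n_instr, steps_per_instr, clock_hz):
    return n_instr * steps_per_instr / clock_hz

# 90 million instructions, 5 basic steps each, on a 1 GHz clock (P = 1 ns):
t = exec_time(90e6, 5, 1e9)
print(t)   # 0.45 (seconds)

# Doubling R halves T only if N and S stay the same, which is the point
# made in the "faster clock" caveat below.
assert exec_time(90e6, 5, 2e9) == t / 2
```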
However, if the actions that have to be performed by an instruction remain the same, the number of basic steps needed may increase.
Reducing the number of basic steps needed to execute an instruction: reduced instruction set computers (RISC) versus complex instruction set computers (CISC).
Reporting Computer Performance
Measured or estimated execution times for three programs:

                 Time on machine X   Time on machine Y   Speedup of Y over X
Program A               20                 200                  0.1
Program B             1000                 100                 10.0
Program C             1500                 150                 10.0
All 3 programs        2520                 450                  5.6

Analogy: if a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50)/2 but is obtained from the fact that it travels 200 km in 3 hours. [Source: B. Parhami, UCSB]
Machine Instructions & Programs
Outline:
Numbers, Arithmetic Operations, and Characters
Memory Locations and Addresses
Memory Operations
Instructions and Instruction Sequencing
Addressing Modes
Assembly Language
Basic Input/Output Operations
Stacks and Queues
Subroutines
Linked Lists
Encoding of Machine Instructions
Number Representation
Consider an n-bit vector B = b_{n-1} ... b_1 b_0, where b_i = 0 or 1 for 0 <= i <= n-1. The vector B can represent unsigned integer values V in the range 0 to 2^n - 1, where
V(B) = b_{n-1} x 2^{n-1} + ... + b_1 x 2^1 + b_0 x 2^0
We need to represent positive and negative numbers for most applications. Three systems are used for representing such numbers:
Sign-and-magnitude
1's-complement
2's-complement
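The unsigned interpretation V(B) above can be evaluated directly, term by term. This short sketch covers only the unsigned case; the three signed systems are detailed next:

```python
# Evaluating V(B) = b_{n-1}*2^(n-1) + ... + b_1*2^1 + b_0*2^0 for an
# unsigned n-bit vector given as a bit string, mirroring the formula.
def unsigned_value(bits):
    n = len(bits)
    return sum(int(b) << (n - 1 - i) for i, b in enumerate(bits))

print(unsigned_value("1101"))                 # 13
assert unsigned_value("1" * 8) == 2**8 - 1    # all ones: top of the range
```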
Number Systems
In the sign-and-magnitude system, negative values are represented by changing the most significant bit from 0 to 1.
In the 1's-complement system, negative values are obtained by complementing each bit of the corresponding positive number. Forming the 1's-complement of a given number is equivalent to subtracting that number from 2^n - 1.
In the 2's-complement system, forming the 2's-complement of a given number is done by subtracting that number from 2^n. The 2's-complement of a number is obtained by adding 1 to the 1's-complement of that number.
An Example of Number Representations (4-bit values)

b3 b2 b1 b0   sign and magnitude   1's-complement   2's-complement
 0  1  1  1          +7                  +7               +7
 0  1  1  0          +6                  +6               +6
 0  1  0  1          +5                  +5               +5
 0  1  0  0          +4                  +4               +4
 0  0  1  1          +3                  +3               +3
 0  0  1  0          +2                  +2               +2
 0  0  0  1          +1                  +1               +1
 0  0  0  0          +0                  +0               +0
 1  0  0  0          -0                  -7               -8
 1  0  0  1          -1                  -6               -7
 1  0  1  0          -2                  -5               -6
 1  0  1  1          -3                  -4               -5
 1  1  0  0          -4                  -3               -4
 1  1  0  1          -5                  -2               -3
 1  1  1  0          -6                  -1               -2
 1  1  1  1          -7                  -0               -1

Addition of Numbers in 2's Complement
Example: +7 + (-3) = +4. In 4 bits, 0111 + 1101 = 1 0100; discarding the carry out of the sign position leaves 0100 = +4. (The pattern 1101 is 13 unsigned, which is why the raw sum reads 7 + 13.)
Sign Extension of 2's Complement
To represent a signed number in 2's-complement form using a larger number of bits, repeat the sign bit as many times as needed to the left.
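The 2's-complement rules above (complement-and-add-1, discard-the-carry addition, sign extension) can be exercised with two small helper functions. These helpers are illustrative sketches, not part of the course material:

```python
# Helpers for the n-bit 2's-complement representation in the table above.
def to_twos_complement(value, n):
    """n-bit 2's-complement pattern of a signed integer, as a bit string."""
    return format(value & ((1 << n) - 1), f"0{n}b")

def from_twos_complement(bits):
    """Signed value of a 2's-complement bit string."""
    n = len(bits)
    v = int(bits, 2)
    return v - (1 << n) if bits[0] == "1" else v

print(to_twos_complement(-3, 4))      # 1101
print(from_twos_complement("1000"))   # -8

# Sign extension: repeating the sign bit to the left preserves the value.
assert from_twos_complement("11111101") == from_twos_complement("1101") == -3

# The worked addition: +7 + (-3); the carry out of bit 3 is discarded.
s = (0b0111 + 0b1101) & 0b1111
assert from_twos_complement(format(s, "04b")) == 4
```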
Memory Locations
A memory consists of cells, each of which can store a bit of binary information (0 or 1). Because a single bit represents a very small amount of information, bits are seldom handled individually. The memory is usually organized so that a group of n bits can be stored or retrieved in a single, basic operation. Each group of n bits is referred to as a word of information, and n is called the word length. A unit of 8 bits is called a byte. Modern computers have word lengths that typically range from 16 to 64 bits.
Memory Addresses
Accessing the memory to store or retrieve a single item of information, either a word or a byte, requires distinct names or addresses for each item location. It is customary to use the numbers from 0 to 2^k - 1 as the addresses of successive memory locations; k bits then form the address, and 2^k is the size of the address space. For example, a 24-bit address generates an address space of 2^24 (16,777,216) locations.
Terminology: 2^10 = 1K (kilo), 2^20 = 1M (mega), 2^30 = 1G (giga), 2^40 = 1T (tera).
Memory Words
[Figure: a memory of w words, word 0 through word w-1, each 32 bits wide. A 32-bit word can hold a signed integer b31 b30 ... b1 b0, or four 8-bit ASCII characters.]
Big-Endian & Little-Endian Assignments
Byte addresses can be assigned across words in two ways: big-endian and little-endian. In the big-endian assignment, lower byte addresses are used for the more significant bytes of the word: the word at address 0 holds bytes 0, 1, 2, 3 from most to least significant. In the little-endian assignment, lower byte addresses are used for the less significant bytes: the word at address 0 holds bytes 3, 2, 1, 0 from most to least significant.
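The two byte-ordering schemes can be made concrete with Python's struct module. The 32-bit word below, holding the four ASCII characters "JOHN", is a made-up example value:

```python
# How one 32-bit word maps to byte addresses under each endianness scheme.
import struct

word = 0x4A4F484E                  # the ASCII codes of 'J', 'O', 'H', 'N'

big    = struct.pack(">I", word)   # big-endian: byte address 0 holds the MSB
little = struct.pack("<I", word)   # little-endian: byte address 0 holds the LSB

print(big)      # b'JOHN'
print(little)   # b'NHOJ'

# Reading big-endian data with a little-endian interpretation scrambles it:
assert struct.unpack("<I", big)[0] == 0x4E484F4A
```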
Memory Operations
Random-access memories must support two basic operations:
Write: writes data into the specified location.
Read: reads the data stored in the specified location.
In machine language programs, the two basic operations are usually called:
Store: the write operation.
Load: the read operation.
The Load operation transfers a copy of the contents of a specific memory location to the processor; the memory contents remain unchanged. The Store operation transfers an item of information from the processor to a specific memory location, destroying the former contents of that location.
Instructions
A computer must have instructions capable of performing four types of operations:
Data transfers between the memory and the processor registers
Arithmetic and logic operations on data
Program sequencing and control
I/O transfers
Register transfer notation: the contents of a location are denoted by placing square brackets around the name of the location. For example, R1 ← [LOC] means that the contents of memory location LOC are transferred into processor register R1. As another example, R3 ← [R1] + [R2] adds the contents of registers R1 and R2 and places their sum into register R3.
Assembly Language Notation
Types of instructions:
Zero-address instruction
One-address instruction
Two-address instruction
Three-address instruction
Zero-address instruction: operands are stored in a structure called a pushdown stack.
One-address instruction: instruction form is Operation Destination. For example, Add A adds the contents of memory location A to the contents of the accumulator register and places the sum back into the accumulator. As another example, Load A copies the contents of memory location A into the accumulator.
Two-address instruction: instruction form is Operation Source, Destination. For example, Add A, B performs the operation B ← [A] + [B]; when the sum is calculated, the result is sent to the memory and stored in location B. As another example, Move B, C performs the operation C ← [B], leaving the contents of location B unchanged.
Three-address instruction: instruction form is Operation Source1, Source2, Destination. For example, Add A, B, C adds A and B, and the result is sent to the memory and stored in location C. If k bits are needed to specify the memory address of each operand, the encoded form of the above instruction must contain 3k bits for addressing purposes, in addition to the bits needed to denote the Add operation.
Instruction Execution
How a program is executed: the processor contains a register called the program counter (PC), which holds the address of the instruction to be executed next. To begin executing a program, the address of its first instruction must be placed into the PC; the processor control circuits then use the information in the PC to fetch and execute instructions, one at a time, in order of increasing addresses.
Basic instruction cycle: START → Fetch Instruction → Execute Instruction → (repeat, or HALT).
A Program for C ← [A] + [B]
A three-instruction program segment, starting at address i (4-byte words), with the data for the program in locations A, B, and C:
i      Move A, R0
i+4    Add  B, R0
i+8    Move R0, C
Straight-Line Sequencing
A program to add n numbers, executed in straight-line order:
i        Move NUM1, R0
i+4      Add  NUM2, R0
i+8      Add  NUM3, R0
...
i+4n-4   Add  NUMn, R0
i+4n     Move R0, SUM
Branching
The same task written as a program loop, with locations N, NUM1, ..., NUMn holding the count n and the numbers to be added:
        Move  N, R1
        Clear R0
LOOP    (determine the address of the "next" number and add the "next" number to R0)
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
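The branching loop above can be sketched register by register in Python. The function name and argument layout are invented for illustration; each statement is annotated with the assembly line it mirrors:

```python
# Register-level sketch of the branching loop: R1 counts down from N,
# R0 accumulates the sum, and Branch>0 repeats the body while R1 > 0.
def sum_loop(num, n):
    r1 = n              # Move N, R1
    r0 = 0              # Clear R0
    i = 0
    while True:         # LOOP:
        r0 += num[i]    #   add the "next" number to R0
        i += 1          #   advance to the next number
        r1 -= 1         #   Decrement R1
        if not r1 > 0:  #   Branch>0 LOOP
            break
    return r0           # Move R0, SUM

print(sum_loop([5, 2, 9, 1], 4))   # 17
```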
Condition Codes
The processor keeps track of information about the results of various operations for use by subsequent conditional branch instructions. This is accomplished by recording the required information in individual bits, often called condition code flags. Four commonly used flags are:
N (negative): set to 1 if the result is negative; otherwise cleared to 0
Z (zero): set to 1 if the result is 0; otherwise cleared to 0
V (overflow): set to 1 if arithmetic overflow occurs; otherwise cleared to 0
C (carry): set to 1 if a carry-out results from the operation; otherwise cleared to 0
The N and Z flags are affected by arithmetic and logic operations; the V and C flags are affected by arithmetic operations.
Addressing Modes
Programmers use data structures to represent the data used in computations. These include lists, linked lists, arrays, queues, and so on. A high-level language enables the programmer to use constants, local and global variables, pointers, and arrays. When translating a high-level language program into assembly language, the compiler must be able to implement these constructs using the facilities in the instruction set of the computer. The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes.
Generic Addressing Modes

Name                         Assembler syntax   Addressing function
Immediate                    #Value             Operand = Value
Register                     Ri                 EA = Ri
Absolute (Direct)            LOC                EA = LOC
Indirect                     (Ri)               EA = [Ri]
                             (LOC)              EA = [LOC]
Index                        X(Ri)              EA = [Ri] + X
Base with index              (Ri,Rj)            EA = [Ri] + [Rj]
Base with index and offset   X(Ri,Rj)           EA = [Ri] + [Rj] + X
Relative                     X(PC)              EA = [PC] + X
Autoincrement                (Ri)+              EA = [Ri]; increment Ri
Autodecrement                -(Ri)              Decrement Ri; EA = [Ri]

EA: effective address. Value: a signed number.
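A few rows of the addressing-mode table above can be exercised with a toy register file and memory. All register contents, addresses, and operand values here are made up for illustration:

```python
# Effective-address (EA) calculations for some rows of the table above.
reg = {"R1": 1000, "R2": 40}
mem = {1000: 77, 1020: 88, 1040: 99}

ea_indirect = reg["R1"]                # (R1):    EA = [R1]        -> 1000
ea_index    = reg["R1"] + 20           # 20(R1):  EA = [R1] + 20   -> 1020
ea_base_idx = reg["R1"] + reg["R2"]    # (R1,R2): EA = [R1] + [R2] -> 1040

print(mem[ea_indirect], mem[ea_index], mem[ea_base_idx])   # 77 88 99

# Autoincrement (R1)+: use [R1] as the EA, then advance R1 by the word
# size (4 bytes assumed here) to point at the next item in a list.
ea = reg["R1"]
reg["R1"] += 4
assert ea == 1000 and reg["R1"] == 1004
```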
Register, Absolute and Immediate Modes
Register mode: the operand is the contents of a processor register; the name (address) of the register is given in the instruction. For example, Add Ri, Rj adds the contents of Ri and Rj and stores the result in Rj.
Absolute mode: the operand is in a memory location; the address of this location is given explicitly in the instruction. (In some assembly languages, this mode is called Direct.) For example, Move LOC, R2 moves the contents of the memory location with address LOC into register R2. The Absolute mode can represent global variables in a program, for example variables introduced by a declaration such as Integer A, B;
Immediate mode: the operand is given explicitly in the instruction.
Indirection and Pointers
Indirect mode: the effective address of the operand is the contents of a register or memory location whose address appears in the instruction. Indirection is denoted by placing the name of the register or the memory address given in the instruction in parentheses. The register or memory location that contains the address of an operand is called a pointer.
Two Types of Indirect Addressing
Through a memory location: Add (A), R0 — memory location A contains the address B, and location B contains the operand.
Through a general-purpose register: Add (R1), R0 — register R1 contains the address B, and location B contains the operand.
[Register indirect addressing diagram: the instruction holds an opcode and a register address R; the selected register holds a pointer to the operand in memory.]
Using Indirect Addressing in a Program
        Move  N, R1          (initialization)
        Move  #NUM1, R2
        Clear R0
LOOP    Add   (R2), R0
        Add   #4, R2
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
Indexing and Arrays
Index mode: the effective address of the operand is generated by adding a constant value to the contents of a register. The register used may be either a special register provided for this purpose or, more commonly, any one of a set of general-purpose registers in the processor; it is referred to as an index register. The index mode is useful in dealing with lists and arrays.
We denote the Index mode symbolically as X(Ri), where X denotes the constant value contained in the instruction and Ri is the name of the register involved. The effective address of the operand is given by EA = X + [Ri]. The contents of the index register are not changed in the process of generating the effective address.
Indexed Addressing: Offset Given as a Constant
Add 20(R1), R2 — with [R1] = 1000 and offset 20, the operand is at address 1020.
Indexed Addressing: Offset in the Index Register
Add 1000(R1), R2 — with [R1] = 20 and constant 1000, the operand is again at address 1020.
An Example of Indexed Addressing
A list at LIST holds, for each student, a student ID followed by three test scores (for the first student, at LIST+4, LIST+8, and LIST+12); N holds the number of students. The following loop sums each test over all students:
        Move  #LIST, R0
        Clear R1
        Clear R2
        Clear R3
        Move  N, R4
LOOP    Add   4(R0), R1      (Test 1)
        Add   8(R0), R2      (Test 2)
        Add   12(R0), R3     (Test 3)
        Add   #16, R0
        Decrement R4
        Branch>0  LOOP
        Move  R1, SUM1
        Move  R2, SUM2
        Move  R3, SUM3
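The student-record loop above can be sketched in Python. The flat list stands in for the memory block at LIST (4-byte words, so offsets 4, 8, 12 become element indices 1, 2, 3); the record values are made up:

```python
# Sketch of the indexed-addressing example: each record is four words
# (ID, Test 1, Test 2, Test 3); offsets 4, 8, 12 off the base pick the tests.
def sum_tests(records):
    # records: flat list [id, t1, t2, t3, id, t1, ...], mirroring LIST
    r1 = r2 = r3 = 0                        # Clear R1, R2, R3
    for base in range(0, len(records), 4):  # Add #16, R0 each iteration
        r1 += records[base + 1]             # Add 4(R0), R1  -> Test 1
        r2 += records[base + 2]             # Add 8(R0), R2  -> Test 2
        r3 += records[base + 3]             # Add 12(R0), R3 -> Test 3
    return r1, r2, r3                       # SUM1, SUM2, SUM3

print(sum_tests([101, 60, 70, 80,
                 102, 65, 75, 85]))         # (125, 145, 165)
```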
Variations of Indexed Addressing Mode
A second register may be used to contain the offset X, in which case we can write the Index mode as (Ri,Rj). The effective address is the sum of the contents of registers Ri and Rj. The second register is usually called the base register. This mode implements a two-dimensional array.
Another version of the Index mode uses two registers plus a constant, denoted X(Ri,Rj). The effective address is the sum of the constant X and the contents of registers Ri and Rj. This mode implements a three-dimensional array.
Additional Modes
Autoincrement mode: the effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list. The Autoincrement mode is denoted as (Ri)+.
Autodecrement mode: the contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand. The Autodecrement mode is denoted as -(Ri).
An Example of Autoincrement Addressing
        Move  N, R1
        Move  #NUM1, R2
        Clear R0
LOOP    Add   (R2)+, R0
        Decrement R1
        Branch>0  LOOP
        Move  R0, SUM
Assembly Language
A complete set of symbolic names and rules for their use constitutes a programming language, generally referred to as an assembly language. Programs written in an assembly language can be automatically translated into a sequence of machine instructions by a program called an assembler. When the assembler program is executed, it reads the user program, analyzes it, and then generates the desired machine language program. The user program in its original alphanumeric text format is called a source program, and the assembled machine language program is called an object program.
Assembler Directives
In addition to providing a mechanism for representing instructions in a program, the assembly language allows the programmer to specify other information needed to translate the source program into the object program. Suppose that the name SUM is used to represent the value 200. This fact may be conveyed to the assembler program through a statement such as
SUM EQU 200
This statement does not denote an instruction that will be executed when the object program is run; it will not even appear in the object program. Such statements are called assembler directives (or commands).
Assembly Language Representation and Memory Arrangement
The indirect-addressing program above, written with assembler directives. In the resulting memory arrangement, the machine instructions occupy addresses 100-128 (Move N, R1 at 100; Move #NUM1, R2 at 104; Clear R0 at 108; Add (R2), R0 at 112; Add #4, R2 at 116; Decrement R1 at 120; Branch>0 LOOP at 124; Move R0, SUM at 128), SUM is location 200, N is at 204 with initial value 100, and NUM1 begins the reserved block at 208:

SUM    EQU      200          (assembler directives)
       ORIGIN   204
N      DATAWORD 100
NUM1   RESERVE  400
       ORIGIN   100
START  MOVE     N, R1        (statements that generate machine instructions)
       MOVE     #NUM1, R2
       CLR      R0
LOOP   ADD      (R2), R0
       ADD      #4, R2
       DEC      R1
       BGTZ     LOOP
       MOVE     R0, SUM
       RETURN
       END      START        (assembler directive)
Number Notation
When dealing with numerical values, most assemblers allow numerical values to be specified in different ways. For example, consider the number 93, which is represented by the 8-bit binary number 01011101. If the value is to be used as an immediate operand:
• It can be given as a decimal number, as in the instruction Add #93, R1.
• It can be given as a binary number, as in the instruction Add #%01011101, R1 (a binary number is identified by a prefix symbol such as the percent sign).
• It can be given as a hexadecimal number, as in the instruction Add #$5D, R1 (a hexadecimal number is identified by a prefix symbol such as the dollar sign).

Basic Input/Output Operations
[Figure: bus connection for the processor, keyboard, and display. DATAIN and DATAOUT are buffer registers; SIN and SOUT are status control flags.]

Wait Loop
In order to perform I/O transfers, we need machine instructions that can check the state of the status flags and transfer data between the processor and an I/O device.
Wait loop for a Read operation:
    READWAIT   Branch to READWAIT if SIN = 0
               Input from DATAIN to R1
Wait loop for a Write operation:
    WRITEWAIT  Branch to WRITEWAIT if SOUT = 0
               Output from R1 to DATAOUT
We assume that the initial state of SIN is 0 and the initial state of SOUT is 1.
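As a quick illustration (not from the slides), the three notations can be checked with Python's base conversions; the % and $ prefix conventions are the ones named in the text:

```python
# Sketch of an assembler's immediate-value conversion, using the prefix
# conventions from the text: % marks binary, $ marks hexadecimal, and a
# bare number is decimal.
def parse_immediate(text):
    if text.startswith("%"):
        return int(text[1:], 2)   # binary, e.g. %01011101
    if text.startswith("$"):
        return int(text[1:], 16)  # hexadecimal, e.g. $5D
    return int(text, 10)          # decimal, e.g. 93

# The three spellings of the operand in Add #93, R1 name the same value:
assert parse_immediate("93") == parse_immediate("%01011101") == parse_immediate("$5D") == 93
```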
Memory-Mapped I/O
Many computers use an arrangement called memory-mapped I/O in which some memory address values are used to refer to peripheral device buffer registers, such as DATAIN and DATAOUT. Thus no special instructions are needed to access the contents of these registers; data can be transferred between these registers and the processor using instructions that we have already discussed, such as Move, Load, or Store. Also, the status flags SIN and SOUT can be handled by including them in device status registers, one for each of the two devices.

Read and Write Programs
Assume that bit b = 3 in registers INSTATUS and OUTSTATUS corresponds to SIN and SOUT, respectively.
Read loop:
    READWAIT   Testbit   #3, INSTATUS
               Branch=0  READWAIT
               MoveByte  DATAIN, R1
Write loop:
    WRITEWAIT  Testbit   #3, OUTSTATUS
               Branch=0  WRITEWAIT
               MoveByte  R1, DATAOUT

Stacks and Queues
A stack is a list of data elements, usually words or bytes, with the accessing restriction that elements can be added or removed at one end of the list only. It is also called a last-in-first-out (LIFO) stack. A stack has two basic operations: push and pop. The terms push and pop describe placing a new item on the stack and removing the top item from the stack, respectively. Another useful data structure that is similar to the stack is the queue. Data are stored in and retrieved from a queue on a first-in-first-out (FIFO) basis. Two pointers are needed to keep track of the two ends of the queue.

A Stack of Words in the Memory
[Figure: the stack grows toward lower addresses. The stack pointer register SP points to the current top element (-28 in the example), with elements 17, 739, ..., down to the bottom element 43 at the high-address end (BOTTOM).]
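A minimal Python sketch of the two disciplines (illustrative only):

```python
from collections import deque

# LIFO stack: items are added and removed at the same end (the top).
stack = []
stack.append(739)            # push
stack.append(17)             # push
top = stack.pop()            # pop returns the most recently pushed item

# FIFO queue: items leave from the end opposite the one they entered,
# which is why two pointers are needed, as the text notes.
queue = deque()
queue.append(739)
queue.append(17)
front = queue.popleft()      # returns the earliest item

assert (top, front) == (17, 739)
```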
Push and Pop Operations
Assume a byte-addressable memory with 32-bit words. The push operation can be implemented as
    Subtract  #4, SP
    Move      NEWITEM, (SP)
The pop operation can be implemented as
    Move      (SP), ITEM
    Add       #4, SP
If the processor has the Autoincrement and Autodecrement addressing modes, then the push operation can be implemented by the single instruction
    Move      NEWITEM, -(SP)
and the pop operation can be implemented as
    Move      (SP)+, ITEM

Examples
[Figure: pushing NEWITEM = 19 moves SP down one word, so the stack holds 19 on top of -28, 17, ..., 43; popping returns 19 to ITEM and moves SP back up to the word containing -28.]

Checking for Empty and Full Errors
When a stack is used in a program, it is usually allocated a fixed amount of space in the memory. We must avoid pushing an item onto the stack when the stack has reached its maximum size, i.e., when the stack is full. On the other hand, we must avoid popping an item off the stack when the stack is empty. Routines for a safe pop or push can use a Compare instruction: Compare src, dst performs [dst] - [src] and sets the condition code flags according to the result.

Subroutines
In a given program, it is often necessary to perform a particular subtask many times on different data values. Such a subtask is called a subroutine.
[Figure: a calling program at memory locations 200-204 executes Call SUB and continues with the next instruction; the subroutine SUB begins at location 1000 and ends with a Return instruction.]
The location where the calling program resumes execution is the location pointed to by the updated PC while the Call instruction is being executed. Hence the contents of the PC must be saved by the Call instruction to enable correct return to the calling program.
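The push and pop sequences above can be modeled in Python; the dictionary stands in for byte-addressable memory, and the starting SP value of 2000 is an arbitrary choice for illustration:

```python
WORD = 4  # byte-addressable memory, 32-bit words

def push(mem, sp, value):          # Subtract #4, SP ; Move NEWITEM, (SP)
    sp -= WORD
    mem[sp] = value
    return sp

def pop(mem, sp):                  # Move (SP), ITEM ; Add #4, SP
    value = mem[sp]
    return value, sp + WORD

mem, sp = {}, 2000                 # 2000 is an arbitrary initial SP
sp = push(mem, sp, -28)
sp = push(mem, sp, 19)
item, sp = pop(mem, sp)
assert (item, sp, mem[1996]) == (19, 1996, -28)
```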
Subroutine Linkage
The way in which a computer makes it possible to call and return from subroutines is referred to as its subroutine linkage method.
[Figure: subroutine linkage using a link register. Call saves the return address 204 in the link register and loads the PC with the subroutine address 1000; Return restores the PC from the link register.]

Subroutine Nesting
A common programming practice, called subroutine nesting, is to have one subroutine call another. Subroutine nesting can be carried out to any depth. Eventually, the last subroutine called completes its computations and returns to the subroutine that called it. The return address needed for this first return is the last one generated in the nested call sequence; that is, return addresses are generated and used in a last-in-first-out order. Many processors do this by using a stack pointer that points to a stack called the processor stack.

Example of Subroutine Nesting
[Figure: a main program calls SUB1, which calls SUB2, which calls SUB3; the return addresses C+4, A+4, and B+4 are saved and consumed in LIFO order.]
[MIPS-style example (source: B. Parhami, UCSB): main executes "jal abc"; procedure abc saves state and executes "jal xyz"; procedure xyz returns with "jr $ra"; abc then restores state and returns with "jr $ra".]

Parameter Passing
When calling a subroutine, a program must provide to the subroutine the parameters, that is, the operands or their addresses, to be used in the computation. Later, the subroutine returns other parameters, in this case the result of the computation. The exchange of information between a calling program and a subroutine is referred to as parameter passing. Parameter passing approaches:
• The parameters may be placed in registers or in memory locations, where they can be accessed by the subroutine.
• The parameters may be placed on the processor stack used for saving the return address.
Passing Parameters with Registers
(Here the count N is passed by value and the list address NUM1 by reference.)
Calling program:
        Move   N, R1          R1 serves as a counter
        Move   #NUM1, R2      R2 points to the list
        Call   LISTADD        Call subroutine
        Move   R0, SUM        Save result
Subroutine:
LISTADD Clear  R0             Initialize sum to 0
LOOP    Add    (R2)+, R0      Add entry from list
        Decrement R1
        Branch>0  LOOP
        Return                Return to calling program

Passing Parameters with the Stack
(Assume the top of the stack is at level 1 below.)
Calling program:
        Move   #NUM1, -(SP)   Push parameters onto stack
        Move   N, -(SP)
        Call   LISTADD        Call subroutine (top of stack at level 2)
        Move   4(SP), SUM     Save result
        Add    #8, SP         Restore top of stack (level 1)
Subroutine:
LISTADD MoveMultiple R0-R2, -(SP)   Save registers (top of stack at level 3)
        Move   16(SP), R1           Initialize counter to N
        Move   20(SP), R2           Initialize pointer to the list
        Clear  R0                   Initialize sum to 0
LOOP    Add    (R2)+, R0            Add entry from list
        Decrement R1
        Branch>0  LOOP
        Move   R0, 20(SP)           Put result on the stack
        MoveMultiple (SP)+, R0-R2   Restore registers
        Return                      Return to calling program
[Stack contents at level 3, top to bottom: [R2], [R1], [R0], return address, N, NUM1.]

Stack Frame
[Figure: SP points to the top of the stack, where saved [R1] and [R0] lie; local variables localvar1, localvar2, localvar3 sit at -4(FP), -8(FP), -12(FP); the frame pointer FP points to the saved [FP], with the return address just above it; parameters param1 to param4 sit at 8(FP), 12(FP), 16(FP), 20(FP).]

Shift Instructions
• Logical shift left, LShiftL #2, R0: the contents of R0 are shifted left two positions; 0s enter at the LSB end and the last bit shifted out is held in the carry flag C. [Figure: before/after bit patterns.]
• Arithmetic shift right, AShiftR #2, R0: the contents are shifted right two positions; the sign bit is replicated into the vacated positions and the last bit shifted out goes to C. [Figure: before/after bit patterns.]

Rotate Instructions
• Rotate left without carry, RotateL #2, R0: the bits of R0 are rotated left two positions; bits leaving at the MSB end re-enter at the LSB end, and the last bit rotated out is also copied into the carry flag C. [Figure: before/after bit patterns.]
• Rotate left with carry, RotateLC #2, R0: the carry flag C is included in the rotation, so the bits circulate through R0 and C together. [Figure: before/after bit patterns.]

Linked List
[Figure: records linked through address fields. The head pointer locates Record 1; each record's link field holds the address of the next record; the link field of the last record (the tail) contains 0. A new record is inserted by adjusting the link fields.]

A List of Student Test Scores
[Table: each record has an address, a key field, and a link field. First record: address 2320, key 27243, link 1040; second record: 1040, 28106, 1200; then 1200, 28370, 2880; 2720, 40632, 1280; last record: 1280, 47871, link 0.]

Encoding of Machine Instructions
To be executed in a processor, an instruction must be encoded in a compact binary pattern. Such encoded instructions are properly referred to as machine instructions. The instructions that use symbolic names and acronyms are called assembly language instructions; they are converted into machine instructions by the assembler program. For a given instruction, the type of operation to be performed and the type of operands used may be specified using an encoded binary pattern referred to as the OP code. In addition to the OP code, the instruction has to specify the source and destination registers, the addressing mode, and so on.

Examples
Assume that 8 bits are allocated for the OP code, 4 bits are needed to identify each register, and 6 bits are needed to specify an addressing mode.
• The instruction Move 24(R0), R5 requires 16 bits to denote the OP code and the two registers, plus 6 bits to choose the addressing mode, leaving only 10 bits to give the index value.
• The instruction LShiftR #2, R0 requires 18 bits to specify the OP code, the addressing modes, and the register, which limits the size of the immediate operand to what is expressible in 14 bits.
In both examples, the instructions can be encoded in a 32-bit word.
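The shift and rotate instructions above can be modeled on an 8-bit register in Python (a sketch; the 8-bit width is chosen to match the figures):

```python
MASK = 0xFF  # model an 8-bit register

def lshiftl(v, n):                       # LShiftL #n: zeros enter on the right
    return (v << n) & MASK

def ashiftr(v, n):                       # AShiftR #n: the sign bit is replicated
    signed = v - 0x100 if v & 0x80 else v
    return (signed >> n) & MASK          # Python's >> on negatives sign-extends

def rotatel(v, n):                       # RotateL #n: bits re-enter on the right
    n %= 8
    return ((v << n) | (v >> (8 - n))) & MASK

assert lshiftl(0b01011101, 2) == 0b01110100
assert ashiftr(0b10011100, 2) == 0b11100111   # sign bit replicated
assert rotatel(0b10000001, 1) == 0b00000011
```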
Encoding Instructions into 32-bit Words
[Figure: a one-word instruction has fields OP code (8 bits), source (7), destination (7), and other info (10). A two-word instruction adds a second word holding a memory address or immediate operand. A three-operand instruction has fields OP code, Ri, Rj, Rk, and other info.]

But what happens if we want to specify a memory operand using the Absolute addressing mode? The instruction Move R2, LOC requires 18 bits to denote the OP code, the addressing modes, and the register. That leaves 14 bits to express the address that corresponds to LOC, which is clearly insufficient. If we want to be able to give a complete 32-bit address in the instruction, the instruction must occupy two words. To handle instructions of the form Move LOC1, LOC2, an instruction must have three words.

CISC & RISC
Using multiple words, we can implement quite complex instructions, closely resembling operations in high-level programming languages. The term complex instruction set computer (CISC) has been used to refer to processors that use instruction sets of this type. The restriction that an instruction must occupy only one word has led to a style of computers that have become known as reduced instruction set computers (RISC).
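A hypothetical packing of the one-word format can be sketched in Python. The field ordering below is an assumption made for illustration; the slides fix only the field widths (8-bit OP code, 6-bit mode, 4 bits per register, 10-bit index):

```python
# Hypothetical 32-bit layout: op(8) | mode(6) | rs(4) | rd(4) | index(10).
# The ordering of fields is illustrative, not the slides' actual encoding.
def encode(op, mode, rs, rd, index):
    assert op < 2**8 and mode < 2**6 and rs < 2**4 and rd < 2**4 and index < 2**10
    return (op << 24) | (mode << 18) | (rs << 14) | (rd << 10) | index

def decode(word):
    return (word >> 24, (word >> 18) & 0x3F, (word >> 14) & 0xF,
            (word >> 10) & 0xF, word & 0x3FF)

# Move 24(R0), R5 with a made-up OP code 0x12 and mode number 5:
word = encode(0x12, 5, 0, 5, 24)
assert decode(word) == (0x12, 5, 0, 5, 24)
assert word < 2**32              # fits in one 32-bit word
```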
Computer Arithmetic

Arithmetic & Logic Unit
• Performs arithmetic and logic operations on data: everything that we think of as "computing."
• Everything else in the computer is there to service this unit.
• All ALUs handle integers; some also handle floating point (real) numbers. There may be a separate FPU (math co-processor), which may be on a separate chip (486DX and later integrate it).
[Figure: ALU inputs and outputs.]

Integer Representation
• We have the smallest possible alphabet: the symbols 0 and 1 represent everything. There is no minus sign and no period.
• Representations in use: signed-magnitude and two's complement.

Benefits of 2's complement
• One representation of zero.
• Arithmetic works easily (see later).
• Negating is fairly easy (complement and add 1):
    3 = 00000011
    Boolean complement gives 11111100
    Add 1 to the LSB:       11111101 (= -3)
[Figure: geometric depiction of two's complement integers.]
• "Taking the 2's complement" (complement, then add 1) computes the arithmetic negation of a number: given x, it computes y = 0 - x, i.e., y such that x + y = 0.

Addition and Subtraction
• For addition use normal binary addition: 0+0 = sum 0 carry 0; 0+1 = sum 1 carry 0; 1+1 = sum 0 carry 1.
• Monitor the MSB for overflow. Overflow cannot occur when adding two operands with different signs; if the two operands have the same sign and the result has a different sign, overflow has occurred.
• Subtraction: take the 2's complement of the subtrahend and add it to the minuend, i.e.
a - b = a + (-b).
• So we only need addition and complement circuits.
[Figure: hardware for addition and subtraction.]

Side note: carry look-ahead
• Binary addition would seem to be dramatically slower for large registers: consider 0111 + 0011, where carries propagate left-to-right, so 64-bit addition would be 8 times slower than 8-bit addition.
• It is possible to build a circuit called a "carry look-ahead adder" that speeds up addition by eliminating the need to "ripple" carries through the word.
• Carry look-ahead is expensive: if n is the number of bits, a ripple adder's circuit complexity (number of gates) is O(n), while full carry look-ahead is O(n^3).
• Complexity can be reduced by rippling smaller look-aheads: e.g., each 16-bit group is handled by four 4-bit adders, and the 16-bit adders are rippled into a 64-bit adder.

Multiplication
• A complex operation compared with addition and subtraction; many algorithms are used, especially for large numbers.
• The simple algorithm is the same long multiplication taught in grade school: compute a partial product for each digit, then add the partial products.

Multiplication Example
      1011    Multiplicand (11 dec)
    x 1101    Multiplier   (13 dec)
      1011    Partial products: if the multiplier bit is 1, copy the
     0000     multiplicand (at its place value); otherwise use zero
    1011
   1011
  10001111    Product (143 dec)
Note: a double-length result is needed.

Simplifications for Binary Arithmetic
• Partial products are easy to compute: if the bit is 0, the partial product is 0; if the bit is 1, it is the multiplicand.
• Each partial product can be added as it is generated, so no storage is needed.
• Binary multiplication of unsigned integers reduces to "shift and add."

Control logic and registers
• 3 n-bit registers, plus a 1-bit carry register CF.
• Register set up: Q <- multiplier, M <- multiplicand, A <- 0, CF <- 0. CF holds carries after addition.
• The product will be the 2n bits in the A,Q registers.

Unsigned Binary Multiplication Algorithm
• Repeat
n times:
    If Q0 = 1, add M into A and store the carry in CF.
    Shift CF, A, Q right one bit so that An-1 <- CF and Qn-1 <- A0; Q0 is lost.
• Note that during execution Q contains bits from both the product and the multiplier.
[Figures: flowchart for unsigned binary multiplication; execution of the example.]

Two's Complement Multiplication
• Shift and add does not work for two's complement numbers. Read as 4-bit two's complement values, the previous example is 1011 (-5) * 1101 (-3), whose product should be +15; shift-and-add instead produces 10001111, which as an 8-bit two's complement number is -113.
• What is the problem? Partial products must be 2n-bit products.

When the multiplicand is negative
• Each addition of the negative multiplicand must add a negative number of 2n bits: sign extend the multiplicand into the partial product, or sign extend both operands to double precision. Not efficient.

When the multiplier is negative
• When the multiplier (Q register) is negative, the bits of the operand do not correspond to the shifts and adds needed.
• Read as unsigned, 1101 = 2^3 + 2^2 + 2^0; but as a two's complement multiplier it denotes -(2^1 + 2^0).

The obvious solution
• Convert multiplier and multiplicand to unsigned integers, multiply, and negate the result if the original signs differed. But there are more efficient ways.

Fast multiplication
• Consider the product 6234 * 99990. We could do 4 single-digit multiplies and add partial sums, or we can express the product as 6234 * (10^5 - 10^1).
• In binary, x * 00111100 can be expressed as x * (2^5 + 2^4 + 2^3 + 2^2) = x * 60.
• We can reduce the number of operations to 2 by observing that 00111100 = 01000000 - 00000100 (64 - 4 = 60), so x * 00111100 = x * 2^6 - x * 2^2.
• Each block of 1s can be reduced to two operations; even in the worst case, 01010101, we still have only 8 operations.

Booth's Algorithm: Register Setup
• 3 n-bit registers, plus a 1-bit register logically to the right of Q (denoted Q-1).
• Register set up: Q <- multiplier, Q-1 <- 0, M <- multiplicand, A <- 0, Count <- n.
• The product will be the 2n bits in the A,Q registers.

Booth's Algorithm: Control Logic
• Bits of the multiplier are scanned one at a
time (the current bit, Q0); as each bit is examined, the bit to its right (the previous bit, Q-1) is considered as well. Then:
    00: middle of a string of 0s, so no arithmetic operation.
    01: end of a string of 1s, so add the multiplicand to the left half of the product (A).
    10: beginning of a string of 1s, so subtract the multiplicand from the left half of the product (A).
    11: middle of a string of 1s, so no arithmetic operation.
• Then shift A, Q, and bit Q-1 right one bit using an arithmetic shift; in an arithmetic shift, the MSB remains unchanged.

Examples of Booth's Algorithm (the step-by-step tables of A, Q, Q-1, and M are omitted here):
• 7 * 3 = 21.
• -3 * 2 = -6: with Q = 1101 (-3) and M = 0010 (2), the steps are subtract and shift, add and shift, subtract and shift, then shift only, leaving A:Q = 1111 1010 = -6.
• 6 * -1 = -6: with Q = 1111 (-1), one subtraction is followed by three shifts.
• 3 * -2 = -6.

Division
• More complex than multiplication to implement (for computers as well as humans!).
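Booth's algorithm as described above can be sketched in Python for n-bit two's complement operands (an illustration; the register names follow the slides):

```python
def booth_multiply(multiplicand, multiplier, n):
    """Booth's algorithm sketch for n-bit two's complement operands."""
    mask = (1 << n) - 1
    a, q, q_1 = 0, multiplier & mask, 0
    m = multiplicand & mask
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):            # beginning of a string of 1s
            a = (a - m) & mask
        elif pair == (0, 1):          # end of a string of 1s
            a = (a + m) & mask
        q_1 = q & 1                   # arithmetic right shift of A, Q, Q-1:
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = (a >> 1) | (a & (1 << (n - 1)))   # MSB (sign) is replicated
    product = (a << n) | q            # 2n-bit result in the A,Q registers
    if product & (1 << (2 * n - 1)):  # reinterpret as signed
        product -= 1 << (2 * n)
    return product

assert booth_multiply(7, 3, 4) == 21
assert booth_multiply(2, -3, 4) == -6
assert booth_multiply(6, -1, 4) == -6
```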
• Some processors designed for embedded applications or digital signal processing lack a divide instruction.
• Division is basically the inverse of add-and-shift: shift and subtract, similar to the long division taught in grade school.

Unsigned Division in Principle
147 / 11 = 13 with remainder 4. In binary, the dividend 10010011 is divided by the divisor 1011 to give the quotient 00001101 and the remainder 100: the divisor is subtracted from the shifted partial remainders (1110, 1111, ...) wherever it fits, producing one quotient bit per position.

Unsigned Division Algorithm
• Uses the same registers (A, M, Q, count) as multiplication. The results of division are a quotient and a remainder: Q will hold the quotient and A the remainder.
• Initial values: A <- 0, Q <- dividend, M <- divisor, Count <- n.
[Flowchart and worked example omitted.]

Two's Complement Division
• More difficult than unsigned division. Algorithm:
1. M <- divisor; A:Q <- dividend, sign extended to 2n bits. For example, 0111 -> 00000111 and 1001 -> 11111001 (note that 0111 = 7 and 1001 = -3).
2. Shift A:Q left 1 bit.
3. If M and A have the same signs, perform A <- A - M; otherwise perform A <- A + M.
4. The preceding operation succeeds if the sign of A is unchanged. If successful, or if (A == 0 and Q == 0), set Q0 <- 1. If not successful, and (A != 0 or Q != 0), set Q0 <- 0 and restore the previous value of A.
5. Repeat steps 2, 3, and 4 for the n bit positions in Q.
6. The remainder is in A.
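A common restoring (shift-and-subtract) form of the unsigned division algorithm can be sketched in Python; this fills in the flowchart referenced above under the stated initial values, and the restore-on-negative step is the standard choice rather than anything the slides spell out:

```python
def restoring_divide(dividend, divisor, n):
    """Restoring unsigned division sketch: A holds the partial remainder,
    Q the dividend (gradually replaced by the quotient), M the divisor."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = (a << 1) | (q >> (n - 1))      # shift A:Q left one bit
        q = (q << 1) & ((1 << n) - 1)
        a -= m                             # trial subtraction
        if a < 0:
            a += m                         # restore; quotient bit is 0
        else:
            q |= 1                         # quotient bit is 1
    return q, a                            # quotient in Q, remainder in A

assert restoring_divide(147, 11, 8) == (13, 4)
```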
If the signs of the divisor and dividend were the same, then the quotient is in Q; otherwise the correct quotient is 0 - Q.
[Worked examples of 2's complement division omitted.]

2's complement remainders
• 7 / 3 = 2 R 1;  7 / -3 = -2 R 1;  -7 / 3 = -2 R -1;  -7 / -3 = 2 R -1.
• Here the remainder is defined by: Dividend = Quotient * Divisor + Remainder.

IEEE-754 Floating Point Numbers
• The format was discussed earlier in class.
• Before IEEE-754, each family of computers had a proprietary format: Cray, VAX, IBM. Some Cray and IBM machines still use these formats.
• Most are similar to the IEEE formats but vary in details (bits in exponent or mantissa): IBM uses a base-16 exponent; the VAX and Cray biases differ from IEEE.
• Precise translations from one format to another cannot always be made, so older binary scientific data is not easily accessible.

IEEE 754
• +/- 1.significand x 2^exponent.
• Standard for floating point storage: 32- and 64-bit formats, with 8- and 11-bit exponents respectively.
• Extended formats (both mantissa and exponent) exist for intermediate results.

FP Ranges
• For a 32-bit number with an 8-bit exponent, the expressible range spans about 2^256, roughly 1.5 x 10^77.
• Accuracy, the effect of changing the LSB of the 23-bit mantissa: 2^-23, about 1.2 x 10^-7, i.e., about 6 decimal places.

Density of Floating Point Numbers
• Note that there is a tradeoff between density and precision. For a floating point representation of n bits, if we increase the precision by using more bits in the mantissa, then we decrease the range; if we increase the range by using more bits for the exponent, then we decrease the density and precision.

FP Arithmetic (+ / -)
• Addition and subtraction are more complex than multiplication and division because the mantissas must be aligned.
• Algorithm: check for zeros; align significands (adjusting exponents); add or subtract significands; normalize the result.
[Flowchart for Z <- X + Y and Z <- X - Y omitted.]

Zero check
• Addition and
subtraction are identical except for a sign change: for subtraction, just negate the subtrahend (Y in Z = X - Y) and then compute Z = X + Y.
• If either operand is 0, report the other as the result.

Significand Alignment
• Manipulate the numbers so that both exponents are equal: shift the number with the smaller exponent to the right (if bits are lost, they are the less significant ones).
    Repeat: shift the mantissa right 1 bit; add 1 to the exponent; until the exponents are equal.
• If the mantissa becomes 0, report the other number as the result.

Addition
• Add the mantissas together, taking sign into account; the result may be 0 if the signs differ.
• The sum can overflow the mantissa by 1 bit (carry): shift the mantissa right and increment the exponent, reporting an error on exponent overflow.

Normalization
• While the MSB of the mantissa is 0: shift the mantissa left one bit, decrement the exponent, and check for exponent underflow.
• Round the mantissa.

FP Arithmetic: Multiplication and Division
• Simpler processes than addition and subtraction: check for zero; add/subtract exponents; multiply/divide significands (watch the sign); normalize; round.

Floating Point Multiplication
• If either operand is 0, report 0.
• Add the exponents. Because the addition doubles the bias, first subtract the bias from one exponent.
• If the exponent underflows or overflows, report an error (underflow may be reported as 0 and overflow as infinity).
• Multiply the mantissas as if they were integers (similar to 2's complement multiplication); note the product is twice as long as the factors.
• Normalize and round, the same process as addition; this could result in exponent underflow.

Floating Point Division
• If the divisor is 0, report an error or infinity; if the dividend is 0, the result is 0.
• Subtract the divisor's exponent from the dividend's exponent. This removes the bias, so add the bias back.
• If the exponent underflows or overflows, report an error (underflow may be reported as 0 and overflow as infinity).
• Divide the mantissas as if they were integers (similar to 2's comp mult.)
• Note that the intermediate result is twice as long as the operands.
• Normalize and round, the same process as addition; this could result in exponent underflow.

IEEE Standard for Binary Arithmetic
• Specifies practices and procedures beyond the format specification: guard bits (intermediate formats), rounding, treatment of infinities, quiet and signaling NaNs, denormalized numbers.

Precision considerations
• Floating point arithmetic is inherently inexact, except where only numbers composed of sums of powers of 2 are used.
• To preserve maximum precision there are two main techniques: guard bits and rounding rules.

Guard bits
• The FPU registers are longer than the mantissa. This allows some preservation of precision when aligning exponents for addition and when multiplying or dividing significands.
• We have seen that results of arithmetic can vary when intermediate stores to memory are made in the course of a computation.

Rounding
• Conventional rounding (round up when the discarded fraction is 0.5) has a slight bias toward the larger number.
• To remove this bias, use round-to-even ("banker's rounding"): 1.5 -> 2, 2.5 -> 2, 3.5 -> 4, 4.5 -> 4, etc.

IEEE Rounding
• Four types are defined: round to nearest (round to even), round to +infinity, round to -infinity, round to 0.

Round to nearest
• If the extra bits beyond the mantissa are 100..1..
then nearest round up • If extra bits are 01… then truncate • Special case: 10000…0 — Round up if last representable bit is 1 — Truncate if last representable bit is 0 Round to + / • Useful for interval arithmetic infinity — Result of fp computation is expressed as an interval with upper and lower endpoints — Width of interval gives measure of precision — In numerical analysis algorithms are designed to minimize width of interval Round to 0 • Simple truncation, obvious bias • May be needed when explicitly rounding following operations with transcendental functions Infinities • Infinity treated as limiting case for real arithmetic • Most arithmetic operations involving infinities produce infinity Quiet and Signaling NaNs • NaN = Not a Number • Signaling NaN causes invalid operation exception if used as operand • Quiet NaN can propagate through arithmetic operations without raising an exception • Signaling NaNs are useful for initial values of uninitialized variables • Actual representation is implementation (processor) specific Quiet NaNs Denormalized Numbers • Handle exponent underflow • Provide values in the “hole around 0” Unnormalized Numbers • Denormalized numbers have fewer bits of precision than normal numbers • When an operation is performed with a denormalized number and a normal number, the result is called an “unnormal” number • Precision is unknown • FPU can be programmed to raise an exception for unnormal computations HARDWIRED CONTROL AND MICROPROGRAMMED CONTROL 172 Connection Between the Processor and the Memory Memory MAR MDR Control PC R0 R1 Processor IR ALU Rn - 1 n general purpose registers Figure 1.2. Connections between the processor and the memory. 173 Overview • Instruction Set Processor (ISP) • Central Processing Unit (CPU) • A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program. • An instruction is executed by carrying out a sequence of more rudimentary operations. 
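The round-to-even rule and the four IEEE rounding directions described earlier can be checked from Python, whose built-in round() and decimal module implement them:

```python
# Round-half-to-even: Python's built-in round() uses exactly this
# tie-breaking rule, matching the 1.5 -> 2, 2.5 -> 2, ... table.
assert [round(x) for x in (1.5, 2.5, 3.5, 4.5)] == [2, 2, 4, 4]

# The four IEEE rounding directions, shown with the decimal module so the
# mode can be selected explicitly:
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

x = Decimal("2.5")
assert x.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) == 2   # to nearest (even)
assert x.quantize(Decimal("1"), rounding=ROUND_CEILING) == 3     # toward +infinity
assert x.quantize(Decimal("1"), rounding=ROUND_FLOOR) == 2       # toward -infinity
assert x.quantize(Decimal("1"), rounding=ROUND_DOWN) == 2        # toward 0
```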
Fundamental Concepts
• The processor fetches one instruction at a time and performs the operation specified.
• Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
• The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC). The current instruction is held in the Instruction Register (IR).

Executing an Instruction
• Fetch the contents of the memory location pointed to by the PC and load them into the IR (fetch phase): IR <- [[PC]]
• Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase): PC <- [PC] + 4
• Carry out the actions specified by the instruction in the IR (execution phase).

[Figure 7.1: single-bus organization of the datapath inside a processor. The internal processor bus connects the PC, MAR, MDR, IR, the instruction decoder and control logic, the general-purpose registers R0 to Rn-1, TEMP, and Y (which, with the constant 4 and a select MUX, feeds ALU input A); the ALU (Add, Sub, XOR, carry-in) writes its result to Z. MAR drives the address lines and MDR the data lines of the memory bus.]

Operations the datapath supports:
• Transfer a word of data from one processor register to another or to the ALU.
• Perform an arithmetic or a logic operation and store the result in a processor register.
• Fetch the contents of a given memory location and load them into a processor register.
• Store a word of data from a processor register into a given memory location.

[Figure 7.2: input and output gating for the registers in Figure 7.1; each register Ri has control signals Riin and Riout, and Y and Z have Yin, Zin, and Zout.]

Performing an Arithmetic or Logic Operation
• The ALU is a combinational circuit that has no internal storage. It gets its two operands from the MUX and the bus, and the result is temporarily stored in register Z.
• What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
1.
R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in

Fetching a Word from Memory
• Load the address into MAR, issue a Read operation, and receive the data into MDR.
[Figure 7.4: connection and control signals for register MDR; MDRin and MDRout gate the internal processor bus side, MDRinE and MDRoutE the memory-bus data lines.]
• The response time of each memory access varies, so the processor waits until it receives a Memory Function Completed (MFC) indication.
• Move (R1), R2 is performed as: MAR <- [R1]; start a Read operation on the memory bus; wait for the MFC response from the memory; load MDR from the memory bus; R2 <- [MDR].

Execution of a Complete Instruction: Add (R3), R1
• Fetch the instruction; fetch the first operand (the contents of the memory location pointed to by R3); perform the addition; load the result into R1.

Control sequence for execution of the instruction Add (R3), R1 (Figure 7.6):
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End

What about Add R2, R1?
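The seven-step control sequence for Add (R3), R1 can be replayed as plain register transfers in Python (a sketch; the WMFC waits are modeled as instantaneous memory reads):

```python
# Sketch: replaying the control sequence of Figure 7.6 as register transfers.
def execute_add_r3_r1(regs, mem):
    z = regs["PC"] + 4            # 1: PCout, MARin, Read, Select4, Add, Zin
    ir = mem[regs["PC"]]          #    (memory returns the instruction word)
    regs["PC"] = z                # 2: Zout, PCin, Yin, WMFC
    # 3: MDRout, IRin -- ir now holds "Add (R3), R1"
    operand = mem[regs["R3"]]     # 4: R3out, MARin, Read
    y = regs["R1"]                # 5: R1out, Yin, WMFC
    z = y + operand               # 6: MDRout, SelectY, Add, Zin
    regs["R1"] = z                # 7: Zout, R1in, End
    return regs

regs = {"PC": 100, "R1": 5, "R3": 2000}
mem = {100: "Add (R3), R1", 2000: 37}
regs = execute_add_r3_r1(regs, mem)
assert regs["R1"] == 42 and regs["PC"] == 104
```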
Execution of a Complete Instruction Internal processor bus Add R2, R1 Step PCout , MAR in , Read, Select4,Add, Zin 2 Zout , PCin , Y in , WMF C 3 MDR out , IR in 4 R3out , MAR in , Read 5 R1out , Y in , WMF C 7 PC Action 1 6 Control signals decoder and MAR control logic Memory bus MDR Data lines IR Y R0 Constant 4 R2outMDR out , SelectY, Add, Zin Zout , R1in , End Instruction Address lines Select MUX Add ALU control lines Sub A B R n - 1 ALU Carry-in Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1. XOR TEMP Z 186 Figure 7.1. Single-bus organization of the datapath inside a processor. Execution of Branch Instructions • A branch instruction replaces the contents of PC with the branch target address, which is usually obtained by adding an offset X given in the branch instruction. • The offset X is usually the difference between the branch target address and the address immediately following the branch instruction. • Conditional branch 187 Execution of Branch Instructions Step Action 1 PCout , MAR in , Read, Select4,Add, Z in 2 Zout , PCin , Yin , WMF C 3 MDR out , IR in 4 Offset-field-of-IRout, Add, Z in 5 Z out , PCin , End Figure 7.7. Control sequence for an unconditional branch instruction. 188 Exercise Internal processor bus • What is the control sequence for execution of the instruction Add R1, R2 including the instruction fetch phase? (Assume single bus architecture) Control signals PC Instruction Address lines decoder and MAR control logic Memory bus MDR Data lines IR Y R0 Constant 4 Select MUX Add ALU control lines Sub A B R n - 1 ALU Carry-in XOR TEMP Z 189 Figure 7.1. Single-bus organization of the datapath inside a processor. Hardwired Control 190 Overview • To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence. • Two categories: hardwired control and microprogrammed control • Hardwired system can operate at high speed; but with little flexibility. 
Control Unit Organization
Figure 7.10. Control unit organization: the instruction register (IR), a clock-driven control step counter, external inputs, and condition codes feed a decoder/encoder block that generates the control signals.
Figure 7.11. Separation of the decoding and encoding functions: a step decoder produces timing signals T1, T2, …, Tn; an instruction decoder produces INS1, …, INSm; the encoder combines these with external inputs, condition codes, and the Run/End signals to generate the control signals.
Generating Zin
• Zin = T1 + T6 · ADD + T4 · BR + …
Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
Generating End
• End = T7 · ADD + T5 · BR + (T5 · N + T4 · N′) · BRN + …, where N′ is the complement of the N (Branch<0) condition flag.
Figure 7.13. Generation of the End control signal.
A Complete Processor
Figure 7.14. Block diagram of a complete processor: an instruction unit with an instruction cache, integer and floating-point units sharing a data cache, and a bus interface connecting the processor over the system bus to main memory and input/output.
Microprogrammed Control
• Control signals are generated by a program similar to machine language programs: each control step is encoded as a control word (CW), a sequence of CWs forms a microroutine, and each individual CW is a microinstruction. (Textbook page 430.)
Figure 7.15. An example of microinstructions for Figure 7.6: each row is a control word whose bits assert the signals PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, and End for the corresponding step.
Overview (Textbook page 421)
Figure 7.6. Control sequence for execution of the instruction Add (R3), R1:
Step 1: PCout, MARin, Read, Select4, Add, Zin
Step 2: Zout, PCin, Yin, WMFC
Step 3: MDRout, IRin
Step 4: R3out, MARin, Read
Step 5: R1out, Yin, WMFC
Step 6: MDRout, SelectY, Add, Zin
Step 7: Zout, R1in, End
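The one-bit-per-signal control words of Figure 7.15 can be sketched as bit vectors. The signal list follows the figure; the helper simply builds a control word from the set of signal names asserted in a step:

```python
# One control-store bit per signal, in the order used by Figure 7.15.
SIGNALS = ["PCin", "PCout", "MARin", "Read", "MDRout", "IRin", "Yin",
           "Select", "Add", "Zin", "Zout", "R1out", "R1in", "R3out",
           "WMFC", "End"]

def control_word(asserted):
    """Return the control word (bit vector) asserting the named signals."""
    return [1 if s in asserted else 0 for s in SIGNALS]

# Step 1 of Figure 7.6: PCout, MARin, Read, Select4, Add, Zin
cw1 = control_word({"PCout", "MARin", "Read", "Select", "Add", "Zin"})
assert sum(cw1) == 6 and cw1[SIGNALS.index("PCout")] == 1
print(cw1)
```

With one bit per signal the control store is wide but the decoding is trivial; the field-encoded format discussed next trades width for a little decoding hardware.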
Basic Organization of a Microprogrammed Control Unit
• The control store holds the microprogram; a microprogram counter (µPC) steps through it, and a starting address generator loads the µPC from IR at the beginning of each instruction.
Figure 7.16. Basic organization of a microprogrammed control unit.
• One function cannot be carried out by this simple organization: conditional branching.
Conditional Branch
• The previous organization cannot handle the situation in which the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action. The solution is to use conditional branch microinstructions.
Figure 7.17. Microroutine for the instruction Branch<0:
Address 0: PCout, MARin, Read, Select4, Add, Zin
Address 1: Zout, PCin, Yin, WMFC
Address 2: MDRout, IRin
Address 3: Branch to starting address of appropriate microroutine
…
Address 25: If N = 0, then branch to microinstruction 0
Address 26: Offset-field-of-IRout, SelectY, Add, Zin
Address 27: Zout, PCin, End
Figure 7.18. Organization of the control unit to allow conditional branching in the microprogram: a starting and branch address generator takes the IR, external inputs, and condition codes, and loads the µPC that addresses the control store.
Microinstructions
• A straightforward way to structure microinstructions is to assign one bit position to each control signal. However, this is very inefficient.
• The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
• All mutually exclusive signals are placed in the same group and encoded in binary.
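Grouping mutually exclusive signals into binary-coded fields can be sketched as packing and unpacking integers. Field widths and codes below follow the field-encoded format of Figure 7.19 (F1 register-out, F2 register-in, F3 MAR/MDR/TEMP/Y-in, F4 ALU function, F5 memory command, F6–F8 single bits); the pack/unpack helpers themselves are illustrative:

```python
# Field widths from Figure 7.19, F1 occupying the most significant bits.
FIELD_WIDTHS = [("F1", 4), ("F2", 3), ("F3", 3), ("F4", 4),
                ("F5", 2), ("F6", 1), ("F7", 1), ("F8", 1)]

def pack(fields):
    """Pack a dict of field codes into one microinstruction word."""
    word = 0
    for name, width in FIELD_WIDTHS:
        word = (word << width) | fields.get(name, 0)  # absent field -> all zeros
    return word

def unpack(word):
    """Recover the field codes from a packed microinstruction word."""
    fields = {}
    for name, width in reversed(FIELD_WIDTHS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

# Step 1 of Figure 7.6 (PCout, MARin, Read, Select4, Add, Zin):
step1 = pack({"F1": 0b0001,   # PCout
              "F2": 0b011,    # Zin
              "F3": 0b001,    # MARin
              "F4": 0b0000,   # Add
              "F5": 0b01,     # Read
              "F6": 1})       # Select4
assert unpack(step1)["F1"] == 0b0001 and unpack(step1)["F5"] == 0b01
```

The word shrinks from one bit per signal to the sum of the field widths; the price, as the slide notes, is the field decoders needed to regenerate the individual signals.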
Partial Format for the Microinstructions
Figure 7.19. An example of a partial format for field-encoded microinstructions. A microinstruction is divided into fields F1–F8:
F1 (4 bits): 0000 No transfer; 0001 PCout; 0010 MDRout; 0011 Zout; 0100 R0out; 0101 R1out; 0110 R2out; 0111 R3out; 1010 TEMPout; 1011 Offsetout
F2 (3 bits): 000 No transfer; 001 PCin; 010 IRin; 011 Zin; 100 R0in; 101 R1in; 110 R2in; 111 R3in
F3 (3 bits): 000 No transfer; 001 MARin; 010 MDRin; 011 TEMPin; 100 Yin
F4 (4 bits): 0000 Add; 0001 Sub; …; 1111 XOR (16 ALU functions)
F5 (2 bits): 00 No action; 01 Read; 10 Write
F6 (1 bit): 0 SelectY; 1 Select4
F7 (1 bit): 0 No action; 1 WMFC
F8 (1 bit): 0 Continue; 1 End
• What is the price paid for this scheme? It requires a little more hardware, to decode the fields back into individual signals.
Further Improvement
• Enumerate the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can then be assigned a distinct code.
• Vertical organization versus horizontal organization (Textbook page 434).
Microprogram Sequencing
• If all microprograms required only straightforward sequential execution of microinstructions, except for branches, letting a µPC govern the sequencing would be efficient.
• However, there are two disadvantages: having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store; and execution time is longer because it takes more time to carry out the required branches.
Microinstructions with Next-Address Field
• A microprogram may require several branch microinstructions. A powerful alternative approach is to include an address field as part of every microinstruction, indicating the location of the next microinstruction to be fetched.
Figure 7.22. Microinstruction-sequencing organization: external inputs, condition codes, and the IR feed decoding circuits that, together with the next-address field of the current microinstruction, determine the control-store address; the microinstruction decoder then issues the control signals.
Memory
Chapter 7 - Memory
Chapter Contents
1. The Memory Hierarchy
2. Random-Access Memory
3. Memory Chip Organization
4. Case Study: Rambus Memory
5. Cache Memory
6. Virtual Memory
7. Advanced Topics
8. Case Study: Associative Memory in Routers
9. Case Study: The Intel Pentium 4 Memory System
The Memory Hierarchy
(Slides in this chapter are from Computer Architecture and Organization by M. Murdocca and V. Heuring, © 2007 M. Murdocca and V. Heuring.)
Functional Behavior of a RAM Cell
• Static RAM cell (a) and dynamic RAM cell (b).
Simplified RAM Chip Pinout
A Four-Word Memory with Four Bits per Word in a 2D Organization
A Simplified Representation of the Four-Word by Four-Bit RAM
2-1/2D Organization of a 64-Word by One-Bit RAM
Two Four-Word by Four-Bit RAMs Are Used in Creating a Four-Word by Eight-Bit RAM
Two Four-Word by Four-Bit RAMs Make Up an Eight-Word by Four-Bit RAM
Single-In-Line Memory Module
• A 256 MB dual in-line memory module organized for a 64-bit word with 16 16M × 8-bit RAM chips (eight chips on each side of the DIMM).
• Schematic diagram of the 256 MB dual in-line memory module. (Source: adapted from http://wwws.ti.com/sc/ds/tm4en64 kpu.pdf.)
A ROM Stores Four Four-Bit Words
A Lookup Table (LUT) Implements an Eight-Bit ALU
Flash Memory
• (a) External view of a flash memory module and (b) flash module internals. (Source: adapted from HowStuffWorks.com.)
Cell Structure for Flash Memory
• When a sufficient negative charge is placed on the dielectric material, current flow between the bit and word lines is blocked; this is the logical 0 state. When the dielectric material is not charged, current flows between the bit and word lines, which is the logical 1 state.
Rambus Memory
• Comparison of DRAM and RDRAM configurations.
• Rambus technology on the Nintendo 64 game console motherboard (left) enables cost savings over the conventional Sega Saturn motherboard design (right).
Placement of Cache Memory in a Computer System
• The locality principle: a recently referenced memory location is likely to be referenced again (temporal locality), and a neighbor of a recently referenced memory location is likely to be referenced (spatial locality).
An Associative Mapping Scheme for a Cache Memory
Associative Mapping Example
• Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots.
• If the addressed word is in the cache, it will be found in word (14)16 of a slot that has tag (501AF80)16, which is made up of the 27 most significant bits of the address. If the addressed word is not in the cache, then the block corresponding to tag field (501AF80)16 is brought into an available slot in the cache from the main memory, and the memory reference is then satisfied from the cache.
Associative Mapping Area Allocation
• Area allocation for the associative mapping scheme based on bits stored.
Replacement Policies
• When there are no available slots in which to place a block, a replacement policy is implemented. The replacement policy governs the choice of which slot is freed up for the new block.
• Replacement policies are used for associative and set-associative mapping schemes, and also for virtual memory.
• Least recently used (LRU)
• First-in/first-out (FIFO)
• Least frequently used (LFU)
• Random
• Optimal (used for analysis only: look backward in time and reverse-engineer the best possible strategy for a particular sequence of memory references)
A Direct Mapping Scheme for Cache Memory
Direct Mapping Example
• For a direct-mapped cache, each main memory block can be mapped to only one slot, but each slot can receive more than one block. Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, and the cache consists of 2^14 slots.
• If the addressed word is in the cache, it will be found in word (14)16 of slot (2F80)16, which will have a tag of (1406)16.
Direct Mapping Area Allocation
• Area allocation for the direct mapping scheme based on bits stored.
A Set-Associative Mapping Scheme for a Cache Memory
Set-Associative Mapping Example
• Consider how an access to memory location (A035F014)16 is mapped to the cache for a 2^32-word memory. The memory is divided into 2^27 blocks of 2^5 = 32 words per block, there are two blocks per set, and the cache consists of 2^14 slots.
• The leftmost 14 bits form the tag field, followed by 13 bits for the set field, followed by five bits for the word field.
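The address decompositions in the three mapping examples can be checked with a small helper. The field widths (27/5 for associative, 13/14/5 for direct, 14/13/5 for set-associative) come from the worked examples; the `split` function itself is just an illustrative bit-slicing utility:

```python
def split(addr, *widths):
    """Split a 32-bit address into fields of the given bit widths, MSB field first."""
    fields, shift = [], sum(widths)
    for w in widths:
        shift -= w
        fields.append((addr >> shift) & ((1 << w) - 1))
    return tuple(fields)

addr = 0xA035F014
# Associative mapping: 27-bit tag, 5-bit word field.
assert split(addr, 27, 5) == (0x501AF80, 0x14)
# Direct mapping: 13-bit tag, 14-bit slot, 5-bit word field.
assert split(addr, 13, 14, 5) == (0x1406, 0x2F80, 0x14)
# Set-associative mapping: 14-bit tag, 13-bit set, 5-bit word field.
tag, set_field, word = split(addr, 14, 13, 5)
print(hex(tag), hex(set_field), hex(word))
```

The same word field (14)16 appears in every scheme, since the 32-word block size is common; only the way the remaining 27 bits are divided between tag and slot/set changes.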
Set-Associative Mapping Area Allocation
• Area allocation for the set-associative mapping scheme based on bits stored.
Cache Read and Write Policies
Hit Ratios and Effective Access Times
• Hit ratio and effective access time for a single-level cache; hit ratios and effective access time for a multi-level cache.
Direct-Mapped Cache Example
• Compute the hit ratio and effective access time for a program that executes from memory locations 48 to 95, and then loops 10 times from 15 to 31.
• The direct-mapped cache has four 16-word slots, a hit time of 80 ns, and a miss time of 2500 ns. Load-through is used. The cache is initially empty.
Table of Events for Example Program
Calculation of Hit Ratio and Effective Access Time for Example Program
Multi-Level Cache Memory
• As an example, consider a two-level cache in which the L1 hit time is 5 ns, the L2 hit time is 20 ns, and the L2 miss time is 100 ns. There are 10,000 memory references, of which 10 cause L2 misses and 90 cause L1 misses that are satisfied by L2. Compute the hit ratios of the L1 and L2 caches and the overall effective access time.
• H1 is the ratio of the number of times the accessed word is in the L1 cache to the total number of memory accesses. There are a total of 100 L1 misses (90 satisfied by L2 plus the 10 that miss in L2 as well), and so H1 = (10,000 − 100)/10,000 = 0.99.
• H2 is the ratio of the number of times the accessed word is in the L2 cache to the number of times the L2 cache is accessed, and so H2 = 90/100 = 0.90.
• The effective access time is then (0.99)(5 ns) + (0.01)(0.90)(20 ns) + (0.01)(0.10)(100 ns) = 5.23 ns per access.
Neat Little LRU Algorithm
• A sequence is shown for the Neat Little LRU algorithm for a cache with four slots. Main memory blocks are accessed in the sequence 0, 2, 3, 1, 5, 4.
Cache Coherency
• The goal of cache coherence is to ensure that every cache sees the same value for a referenced location, which means making sure that any shared operand that is changed is updated throughout the system.
• This brings us to the issue of false sharing, which reduces cache performance when two operands that are not shared between processes share the same cache line. The problem is that each process will invalidate the other's cache line when writing data, without a real need, unless the compiler prevents this.
Overlays
• A partition graph for a program with a main routine and three subroutines.
Virtual Memory
• Virtual memory is stored in a hard disk image.
• The physical memory holds a small number of virtual pages in physical page frames.
• A mapping is maintained between virtual memory and physical memory.
Page Table
• The page table maps between virtual memory and physical memory.
Using the Page Table
• A virtual address is translated into a physical address; a typical page table entry is shown.
• The configuration of a page table changes as a program executes. Initially, the page table is empty; in the final configuration, four pages are in physical memory.
Segmentation
• A segmented memory allows two users to share the same word processor code, with different data spaces.
Fragmentation
• (a) Free area of memory after initialization; (b) after fragmentation; (c) after coalescing.
Translation Lookaside Buffer
• An example TLB holds 8 entries for a system with 32 virtual pages and 16 page frames.
Putting It All Together
• The TLB, page table, and cache are combined for the same example system with 32 virtual pages and 16 page frames.
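The page-table translation above can be sketched in a few lines. This is an illustrative model, not any particular machine's format: a 4 KB page size is assumed, each entry holds only a present bit and a page-frame number, and a miss stands in for a page fault.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch

def translate(vaddr, page_table):
    """Translate a virtual address using a {vpn: (present, frame)} page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split into page number and offset
    present, frame = page_table[vpn]
    if not present:
        # In a real system this traps to the OS, which brings the page in.
        raise LookupError(f"page fault on virtual page {vpn}")
    return frame * PAGE_SIZE + offset        # offset is unchanged by translation

page_table = {0: (True, 7), 1: (False, 0), 2: (True, 3)}
assert translate(2 * PAGE_SIZE + 100, page_table) == 3 * PAGE_SIZE + 100
```

A TLB is then just a small cache of the most recently used (vpn → frame) pairs, consulted before this table lookup.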
Content Addressable Memory
• Relationships between random access memory and content addressable memory.
Overview of CAM
• Source: Foster, C. C., Content Addressable Parallel Processors, Van Nostrand Reinhold Company, 1976.
Addressing Subtrees for a CAM
Associative Memory in Routers
• A simple network with three routers.
• The use of associative memories in high-end routers reduces the lookup time by allowing a search to be performed in a single operation.
• The search is based on the destination address, rather than the physical memory address.
• Access methods for this memory have been standardized into an interface interoperability agreement by the Network Processing Forum.
Block Diagram of Dual-Read RAM
• A dual-read or dual-port RAM allows any two words to be simultaneously read from the same memory.
The Intel Pentium 4 Memory System
Input/Output Organization
Outline
• Accessing I/O Devices
• Interrupts
• Direct Memory Access
• Buses
• Interface Circuits
• Standard I/O Interfaces
Advanced Reliable Systems (ARES) Lab.
Accessing I/O Devices
• Single-bus structure: the bus enables all the devices connected to it to exchange information.
• Typically, the bus consists of three sets of lines used to carry address, data, and control signals.
• Each I/O device is assigned a unique set of addresses.
I/O Mapping
• Memory-mapped I/O: devices and memory share an address space; I/O looks just like memory read/write; no special commands are needed for I/O; the large selection of memory access commands is available.
• Isolated I/O: separate address spaces; needs I/O or memory select lines; special commands for I/O, with a limited set.
Memory-Mapped I/O
• When I/O devices and the memory share the same address space, the arrangement is called memory-mapped I/O.
• With memory-mapped I/O, any machine instruction that can access memory can be used to transfer data to or from an I/O device.
• Most computer systems use memory-mapped I/O. Some processors have special IN and OUT instructions to perform I/O transfers.
• When building a computer system based on these processors, the designer has the option of connecting I/O devices to use the special I/O address space or simply incorporating them as part of the memory address space.
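Memory-mapped I/O can be sketched as a single address space in which some addresses reach a device instead of RAM. Everything here is illustrative: the register address 0xFFF0 and the `AddressSpace` model are made up for the example.

```python
class AddressSpace:
    DATAOUT = 0xFFF0                 # hypothetical device-register address

    def __init__(self):
        self.ram = {}
        self.printed = []            # stands in for the output device

    def store(self, addr, value):
        """One ordinary store path serves both memory and the device."""
        if addr == self.DATAOUT:
            self.printed.append(value)   # the "write" drives the device
        else:
            self.ram[addr] = value

    def load(self, addr):
        return self.ram.get(addr, 0)

bus = AddressSpace()
bus.store(0x1000, 42)                        # ordinary memory write
bus.store(AddressSpace.DATAOUT, ord("A"))    # same instruction performs I/O
assert bus.load(0x1000) == 42 and bus.printed == [ord("A")]
```

This is exactly the property the slide describes: no special I/O instructions are needed, because the address decoder, not the instruction, decides whether a transfer goes to memory or to a device.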
I/O Interface for an Input Device
• The address decoder, the data and status registers, and the control circuitry required to coordinate I/O transfers constitute the device's interface circuit.
I/O Techniques
• Programmed I/O
• Interrupt-driven I/O
• Direct Memory Access (DMA)
Program-Controlled I/O
• Consider a simple example of I/O operations involving a keyboard and a display device in a computer system. Four registers are used in the data transfer operations: DATAIN, DATAOUT, STATUS, and CONTROL. The two flags KIRQ and DIRQ in the STATUS register are used in conjunction with interrupts; SIN and SOUT indicate device readiness, and KEN and DEN in CONTROL enable the devices.
An Example
• A program that reads one line from the keyboard, stores it in a memory buffer, and echoes it back to the display:
       Move #LINE, R0        Initialize memory pointer
WAITK  TestBit #0, STATUS    Test SIN
       Branch=0 WAITK        Wait for character to be entered
       Move DATAIN, R1       Read character
WAITD  TestBit #1, STATUS    Test SOUT
       Branch=0 WAITD        Wait for display to become ready
       Move R1, DATAOUT      Send character to display
       Move R1, (R0)+        Store character and advance pointer
       Compare #$0D, R1      Check if Carriage Return
       Branch≠0 WAITK        If not, get another character
       Move #$0A, DATAOUT    Otherwise, send Line Feed
       Call PROCESS          Call a subroutine to process the input line
Program-Controlled I/O (continued)
• The example above illustrates program-controlled I/O, in which the processor repeatedly checks a status flag to achieve the required synchronization between the processor and an input or output device.
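The polled keyboard-echo loop above can be rendered in Python (an illustrative model, not the book's assembly): each character taken from the iterator models a successful SIN poll, the display list stands in for DATAOUT after a successful SOUT poll, and the loop stops at the carriage return.

```python
CR, LF = "\r", "\n"   # carriage return ($0D) and line feed ($0A)

def read_line(keyboard, display):
    """keyboard yields characters; display is a list standing in for DATAOUT."""
    line = []
    for ch in keyboard:        # each iteration models a successful SIN poll
        display.append(ch)     # echo the character (after the SOUT poll)
        line.append(ch)        # Move R1, (R0)+ : store and advance pointer
        if ch == CR:           # Compare #$0D, R1
            display.append(LF) # otherwise, send Line Feed
            break
    return "".join(line)

shown = []
assert read_line(iter("hi\rjunk"), shown) == "hi\r"
assert "".join(shown) == "hi\r\n"
```

The busy-wait branches (WAITK, WAITD) have no counterpart here precisely because they do no useful work; that wasted polling time is what motivates interrupts in the next section.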
• We say that the processor polls the devices.
• There are two other commonly used mechanisms for implementing I/O operations: interrupts and direct memory access.
• Interrupts: synchronization is achieved by having the I/O device send a special signal over the bus whenever it is ready for a data transfer operation.
• Direct memory access: the device interface transfers data directly to or from the memory.
Interrupts
• To avoid keeping the processor busy polling while it performs no useful computation, the I/O device can alert the processor with a hardware signal called an interrupt. At least one of the bus control lines, called an interrupt-request line, is usually dedicated for this purpose.
• An interrupt-service routine is executed when an interrupt request is issued.
• The processor must inform the device that its request has been recognized so that the device may remove its interrupt-request signal. An interrupt-acknowledge signal serves this function.
Example
• Program 1 (the COMPUTE routine) is interrupted at instruction i; the processor executes Program 2 (the PRINT routine) and then returns to instruction i + 1.
Interrupt-Service Routine vs. Subroutine
• Treatment of an interrupt-service routine is very similar to that of a subroutine, but an important departure from the similarity should be noted.
• A subroutine performs a function required by the program from which it is called. The interrupt-service routine may not have anything in common with the program being executed at the time the interrupt request is received. In fact, the two programs often belong to different users.
• Before executing the interrupt-service routine, any information that may be altered during the execution of that routine must be saved. This information must be restored before the interrupted program is resumed.
Interrupt Latency
• The information that needs to be saved and restored typically includes the condition code flags and the contents of any registers used by both the interrupted program and the interrupt-service routine.
• Saving registers increases the delay between the time an interrupt request is received and the start of execution of the interrupt-service routine. This delay is called interrupt latency.
• Typically, the processor saves only the contents of the program counter and the processor status register. Any additional information that needs to be saved must be saved by program instructions at the beginning of the interrupt-service routine.
Interrupt Hardware
• An equivalent circuit for an open-drain bus is used to implement a common interrupt-request line: INTR = INTR1 + INTR2 + … + INTRn.
Handling Multiple Devices
• Handling multiple devices gives rise to a number of questions:
• How can the processor recognize the device requesting an interrupt?
• Given that different devices are likely to require different interrupt-service routines, how can the processor obtain the starting address of the appropriate routine in each case?
• Should a device be allowed to interrupt the processor while another interrupt is being serviced?
• How should two or more simultaneous interrupt requests be handled?
• The information needed to determine whether a device is requesting an interrupt is available in its status register. When a device raises an interrupt request, it sets to 1 one of the bits in its status register, which we will call the IRQ bit.
Identifying the Interrupting Device
• The simplest way to identify the interrupting device is to have the interrupt-service routine poll all the I/O devices connected to the bus. The polling scheme is easy to implement; its main disadvantage is the time spent interrogating all the devices.
• Alternatively, a device requesting an interrupt may identify itself directly to the processor, which can then immediately start executing the corresponding interrupt-service routine. This is called vectored interrupts.
• An interrupt request from a high-priority device should be accepted while the processor is servicing a request from a lower-priority device.
Interrupt Priority
• The processor's priority is usually encoded in a few bits of the processor status word. It can be changed by program instructions that write into the processor status register (PS). These are privileged instructions, which can be executed only while the processor is running in the supervisor mode.
• The processor is in the supervisor mode only when executing operating system routines. It switches to the user mode before beginning to execute application programs.
• An attempt to execute a privileged instruction while in the user mode leads to a special type of interrupt.
Implementation of Interrupt Priority
• An example of the implementation of a multiple-priority scheme: each device has its own interrupt-request (INTR) and interrupt-acknowledge (INTA) lines, and a priority arbitration circuit decides which request to honor.
Simultaneous Requests
• Consider the problem of simultaneous arrivals of interrupt requests from two or more devices. The processor must have some means of deciding which request to service first.
• Interrupt priority scheme with a daisy chain: the interrupt-acknowledge signal propagates serially through the devices.
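The daisy-chain scheme can be sketched in a few lines: INTA propagates from the processor along the chain, and the first (electrically closest) device with a pending request claims it and blocks further propagation, so position in the chain encodes priority. The model below is illustrative.

```python
def acknowledge(requests):
    """requests: booleans in chain order (closest to the processor first).

    Return the index of the device that claims INTA, or None if no
    device is requesting an interrupt.
    """
    for i, pending in enumerate(requests):
        if pending:
            return i      # this device absorbs INTA; it goes no further
    return None

# Devices 2 and 3 both request; device 2 is closer to the processor and wins:
assert acknowledge([False, True, True]) == 1
assert acknowledge([False, False, False]) is None
```

The trade-off mirrors the slides: the chain needs no arbitration logic per device beyond the pass/claim decision, but a device far down the chain can be starved by busier devices nearer the processor.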
Priority Groups
• The interrupt priority scheme with a daisy chain can be combined with individual interrupt-request and interrupt-acknowledge lines: devices are organized into groups, each group connected at a different priority level, with the devices within a group connected in a daisy chain.
Direct Memory Access
• To transfer large blocks of data at high speed, a special control unit may be provided between an external device and the main memory, without continuous intervention by the processor. This approach is called direct memory access (DMA).
• DMA transfers are performed by a control circuit that is part of the I/O device interface. We refer to this circuit as a DMA controller. Since it has to transfer blocks of data, the DMA controller must increment the memory address for successive words and keep track of the number of transfers.
DMA Controller
• Although a DMA controller can transfer data without intervention by the processor, its operation must be under the control of a program executed by the processor.
• An example register set: a status and control register (with IRQ, Done, IE, and R/W bits), a starting address register, and a word count register.
DMA Controller in a Computer System
• A disk/DMA controller serving two disks, and a DMA controller serving a printer, a keyboard, and a network interface, are attached with the processor and main memory to the system bus.
Memory Access Priority
• Memory accesses by the processor and the DMA controllers are interwoven. Requests by DMA devices for using the bus are always given higher priority than processor requests.
• Among different DMA devices, top priority is given to high-speed peripherals such as a disk or a high-speed network interface.
• Since the processor originates most memory access cycles, the DMA controller can be said to "steal" memory cycles from the processor.
Hence, this interweaving technique is usually called cycle stealing. Alternatively, the DMA controller may transfer a block of data without interruption; this is called block or burst mode.

Bus Arbitration
A conflict may arise if both the processor and a DMA controller, or two DMA controllers, try to use the bus at the same time to access the main memory. To resolve this problem, a bus arbitration procedure is needed.
The device that is allowed to initiate data transfers on the bus at any given time is called the bus master. When the current master relinquishes control of the bus, another device can acquire this status. Bus arbitration is the process by which the next device to become the bus master is selected; it takes into account the needs of the various devices by establishing a priority system for gaining access to the bus.
There are two approaches to bus arbitration: centralized and distributed. In centralized arbitration, a single bus arbiter performs the required arbitration. In distributed arbitration, all devices participate in the selection of the next bus master.

Centralized Arbitration
[Figure: the processor acts as the arbiter; DMA controllers 1 and 2 share a bus-request line (BR) and a bus-busy line (BBSY), and bus-grant signals (BG1, BG2) are daisy-chained through the controllers. The timing diagram shows BR, BG1, BG2 and BBSY as mastership passes from the processor to DMA controller 2 and back.]

Distributed Arbitration
Each device drives its 4-bit ID onto the open-collector arbitration lines ARB3-ARB0; the lines carry the wired-OR of all contending IDs, and a device that sees a 1 on a line where its own ID has a 0 drops out. Assume the IDs of devices A and B are 5 and 6. The code seen by both devices is then 0111, and device B, with the larger ID, wins the arbitration.

Buses
A bus protocol is the set of rules that govern the behavior of the various devices connected to the bus: when to place information on the bus, assert control signals, and so on. In a synchronous bus, all devices derive timing information from a common clock line.
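Returning to the distributed-arbitration example above, the drop-out rule can be sketched as follows (a simulation under the stated assumptions: 4-bit IDs, wired-OR lines, most-significant bit examined first).

```python
# Sketch of the distributed-arbitration drop-out rule described above.

def arbitrate(ids, width=4):
    """Return the winning ID: scanning from the MSB, a device with a 0 where
    the wired-OR composite has a 1 withdraws, so the highest ID wins."""
    contenders = list(ids)
    for bit in reversed(range(width)):          # MSB first
        composite = 0
        for d in contenders:
            composite |= d                      # wired-OR of the drivers
        if composite & (1 << bit):
            contenders = [d for d in contenders if d & (1 << bit)]
    return contenders[0]

# Devices A and B with IDs 5 (0101) and 6 (0110): the composite code is 0111,
# A drops out at bit position 1, and B wins.
print(arbitrate([5, 6]))   # -> 6
```

The net effect is that the device with the numerically largest ID becomes the next bus master, without any central arbiter.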
Equally spaced pulses on this line define equal time intervals. In the simplest form of a synchronous bus, each of these intervals constitutes a bus cycle during which one data transfer can take place.

A Synchronous Bus Example
[Figure: timing of an input transfer on a synchronous bus; the master places the address and command on the bus at t0, data are transferred, and the bus cycle ends at t2]
[Figure: detailed timing diagram; because of propagation delays, the address and command are seen by the master at tAM and by the slave at tAS, and the data sent by the slave are seen at tDS and tDM respectively]

Input Transfer Using Multiple Clock Cycles
[Figure: the address and command are placed on the bus in clock cycle 1; the slave places the data and asserts Slave-ready in cycle 3, and the transfer completes in cycle 4]

Asynchronous Bus
An alternative scheme for controlling data transfers on the bus is based on the use of a handshake between the master and the slave.
[Figure: input transfer; the master places the address and command and asserts Master-ready (t0, t1), the slave places the data and asserts Slave-ready (t2, t3), and both then withdraw their signals (t4, t5), completing the bus cycle]
[Figure: handshake control of data transfer during an output operation, with the same Master-ready/Slave-ready sequence from t0 to t5]
Discussion
The choice of a particular design involves trade-offs among factors such as:
- Simplicity of the device interface
- Ability to accommodate device interfaces that introduce different amounts of delay
- Total time required for a bus transfer
- Ability to detect errors resulting from addressing a nonexistent device or from an interface malfunction
For an asynchronous bus, the handshake process eliminates the need to synchronize the sender and receiver clocks, thus simplifying the timing design. For a synchronous bus, the clock circuitry must be designed carefully to ensure proper synchronization, and delays must be kept within strict bounds.

Interface Circuits
Keyboard-to-processor connection: when a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII code to be loaded into DATAIN and the status flag SIN to be set to 1. SIN is cleared to 0 when the processor reads the contents of the DATAIN register.
[Figure: the processor connects to the input interface over data, address, R/W, Master-ready and Slave-ready lines; the interface receives the data and Valid signals from the encoder and debouncing circuit attached to the keyboard switches]

Input Interface Circuit
[Figure: the keyboard data are latched into DATAIN (outputs Q7-Q0 driving bus lines D7-D0); an address decoder on A31-A1, together with A0, R/W and Master-ready, generates the Read-data and Read-status signals, and the status flag SIN is returned to the processor via Slave-ready]

Circuit for the Status Flag Block
[Figure: SIN is implemented with a flip-flop that is set when Valid goes to 1 and cleared when Read-data and Master-ready are both asserted]
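The DATAIN/SIN behavior just described can be sketched as a small model (the register names come from the text; the class and method names are illustrative). A key press loads DATAIN and sets SIN; the processor's read of DATAIN clears SIN, which is exactly the condition a program-controlled input loop polls on.

```python
# Illustrative model of the keyboard input interface described above.

class KeyboardInterface:
    def __init__(self):
        self.datain = 0           # DATAIN register
        self.sin = 0              # SIN status flag

    def key_pressed(self, ascii_code):
        """Valid goes 0 -> 1: load the character and set the flag."""
        self.datain = ascii_code
        self.sin = 1

    def read_datain(self):
        """A processor read returns the character and clears SIN."""
        self.sin = 0
        return self.datain

kbd = KeyboardInterface()
kbd.key_pressed(ord('A'))
assert kbd.sin == 1               # a character is waiting
ch = kbd.read_datain()
print(chr(ch), kbd.sin)           # the character is read; SIN is back to 0
```

A polling driver would simply spin on `kbd.sin` before each read, mirroring the wait loop in simple program-controlled I/O.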
Printer-to-Processor Connection
The interface contains a data register, DATAOUT, and a status flag, SOUT. The SOUT flag is set to 1 when the printer is ready to accept another character, and it is cleared to 0 when a new character is loaded into DATAOUT by the processor. When the printer is ready to accept a character, it asserts its Idle signal.
[Figure: the processor connects to the output interface over data, address, R/W, Master-ready and Slave-ready lines; the interface sends data and Valid to the printer and receives Idle]

Output Interface Circuit
[Figure: bus data D7-D0 are latched into DATAOUT (Q7-Q0 driving the printer data lines); a handshake control block exchanges Valid and Idle with the printer; the address decoder, together with A0, R/W and Master-ready, generates the Load-data and Read-status signals, and SOUT drives Slave-ready]

A General 8-Bit Parallel Interface
[Figure: DATAIN and DATAOUT registers connect the bus lines D7-D0 to the port pins P7-P0, with a data direction register selecting input or output operation for each pin]

Output Interface Circuit for a Bus Protocol
[Figure: as in the output interface above, but with timing logic driven by the bus clock; the address decoder produces My-address, and the Go signal, together with Idle, controls the Respond and Slave-ready timing]

Timing Diagram for an Output Operation
[Figure: the address and R/W are placed on the bus in clock cycle 1, Go is asserted, the data are transferred, and Slave-ready is asserted in cycle 3]

Serial Port
A serial port is used to connect the processor to I/O devices that require transmission of data one bit at a time. The key feature of an interface circuit for a serial port is that it is capable of communicating in a bit-serial fashion on the device side and in a bit-parallel fashion on the bus side. The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability.
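The parallel/serial transformation attributed to the shift registers can be sketched in a few lines (the LSB-first bit order is an assumption for illustration; the text does not specify one).

```python
# Sketch of the shift-register parallel/serial conversion described above.

def serialize(byte):
    """Shift a byte out one bit per 'clock', LSB first (device side)."""
    return [(byte >> i) & 1 for i in range(8)]

def deserialize(bits):
    """Shift incoming bits into a register, then read it in parallel (bus side)."""
    value = 0
    for i, b in enumerate(bits):
        value |= b << i
    return value

data = 0b01100001                       # ASCII 'a'
stream = serialize(data)
print(stream)                           # the bit-serial stream
print(deserialize(stream) == data)      # reassembled on the parallel side: True
```

In hardware, `serialize` corresponds to a parallel load of DATAOUT into the output shift register followed by eight shifts, and `deserialize` to eight shifts into the input shift register followed by a parallel read of DATAIN.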
A Serial Interface
[Figure: the serial input feeds an input shift register, which is transferred in parallel to DATAIN; DATAOUT is transferred in parallel to an output shift register, which drives the serial output]

Standard I/O Interfaces
The processor bus is the bus defined by the signals on the processor chip itself. Devices that require a very high-speed connection to the processor, such as the main memory, may be connected directly to this bus.
The motherboard usually provides another bus that can support more devices. The two buses are interconnected by a circuit, called a bridge, that translates the signals and protocols of one bus into those of the other.
It is impossible to define a uniform standard for the processor bus, because the structure of this bus is closely tied to the architecture of the processor. The expansion bus is not subject to these limitations, and therefore it can use a standardized signaling structure.

Peripheral Component Interconnect (PCI) Bus
[Figure: use of a PCI bus in a computer system; a PCI bridge connects the host and main memory to the PCI bus, to which the disk, printer, and Ethernet interface are attached]

PCI Bus
The bus supports three independent address spaces: memory, I/O, and configuration. The I/O address space is intended for use with processors, such as the Pentium, that have a separate I/O address space; however, the system designer may choose to use memory-mapped I/O even when a separate I/O address space is available. The configuration space is intended to give the PCI its plug-and-play capability. A 4-bit command that accompanies the address identifies which of the three spaces is being used in a given data transfer operation.
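The role of the 4-bit command in selecting one of the three PCI address spaces can be illustrated with a small decoder sketch (the command encodings below are illustrative assumptions, not the actual PCI command codes).

```python
# Illustrative sketch: a 4-bit command selects which PCI address space a
# transaction uses. The encodings here are assumed for the example only.

COMMAND_SPACE = {
    0b0010: "I/O",             # hypothetical I/O read command
    0b0110: "memory",          # hypothetical memory read command
    0b1010: "configuration",   # hypothetical configuration read command
}

def address_space(command):
    """Return the address space a transaction with this command targets."""
    try:
        return COMMAND_SPACE[command]
    except KeyError:
        raise ValueError(f"unrecognized command {command:04b}")

print(address_space(0b0110))   # memory
print(address_space(0b1010))   # configuration
```

The point of the sketch is structural: the same AD lines carry the address for all three spaces, and only the accompanying command distinguishes them.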
Data Transfer Signals on the PCI Bus
- CLK: a 33-MHz or 66-MHz clock
- FRAME#: sent by the initiator to indicate the duration of a transaction
- AD: 32 address/data lines, which may optionally be increased to 64
- C/BE#: 4 command/byte-enable lines (8 for a 64-bit bus)
- IRDY#, TRDY#: initiator-ready and target-ready signals
- DEVSEL#: a response from the device indicating that it has recognized its address and is ready for a data transfer transaction
- IDSEL#: initialization device select

A Read Operation on the PCI Bus
[Figure: the initiator asserts FRAME# and drives the address and command in clock cycle 1; byte enables follow on C/BE#; the target asserts DEVSEL# and TRDY#, and data words #1 through #4 are transferred in successive cycles while IRDY# is asserted]

Universal Serial Bus (USB)
The USB has been designed to meet several key objectives:
- Provide a simple, low-cost, and easy-to-use interconnection system that overcomes the difficulties due to the limited number of I/O ports available on a computer
- Accommodate a wide range of data transfer characteristics for I/O devices, including telephone and Internet connections
- Enhance user convenience through a "plug-and-play" mode of operation

USB Structure
A serial transmission format has been chosen for the USB because a serial bus satisfies the low-cost and flexibility requirements. Clock and data information are encoded together and transmitted as a single signal; hence, there are no limitations on clock frequency or distance arising from data skew.
To accommodate a large number of devices that can be added or removed at any time, the USB has a tree structure. Each node of the tree has a device called a hub, which acts as an intermediate control point between the host and the I/O devices. At the root of the tree, a root hub connects the entire tree to the host computer.
USB Tree Structure
[Figure: the host computer connects to the root hub; further hubs attach below it, and I/O devices attach to the hub ports]
The tree structure enables many devices to be connected while using only simple point-to-point serial links. Each hub has a number of ports where devices may be connected, including other hubs. In normal operation, a hub copies a message that it receives from its upstream connection to all its downstream ports. As a result, a message sent by the host computer is broadcast to all I/O devices, but only the addressed device will respond to that message. A message sent from an I/O device travels only upstream towards the root of the tree and is not seen by other devices. Hence, the USB enables the host to communicate with the I/O devices, but it does not enable these devices to communicate with each other.

USB Protocols
All information transferred over the USB is organized in packets, where a packet consists of one or more bytes of information. The information transferred on the USB can be divided into two broad categories: control and data. Control packets perform such tasks as addressing a device to initiate data transfer, acknowledging that data have been received correctly, or indicating an error. Data packets carry information that is delivered to a device; for example, input and output data are transferred inside data packets.
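The hub routing rule described above, broadcast downstream, respond only if addressed, can be sketched as a toy model (the classes and the device addresses are illustrative assumptions, not part of the USB specification as presented in the text).

```python
# Toy model of the USB tree routing rule described above.

class Hub:
    def __init__(self):
        self.ports = []                  # downstream hubs or devices

    def downstream(self, message, received):
        """A hub copies an upstream message to all its downstream ports."""
        for port in self.ports:
            port.downstream(message, received)

class Device:
    def __init__(self, address):
        self.address = address

    def downstream(self, message, received):
        addr, payload = message
        if addr == self.address:         # only the addressed device responds
            received.append((self.address, payload))

root = Hub()
hub = Hub()
root.ports = [Device(1), hub]            # device 1 and a second-level hub
hub.ports = [Device(2), Device(3)]

received = []
root.downstream((3, "data"), received)   # the host's packet reaches every device
print(received)                          # but only device 3 responds: [(3, 'data')]
```

Device-to-device traffic is simply impossible in this model: a device can only answer upstream, which mirrors the text's observation that USB devices cannot talk to each other directly.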
Interconnection Networks
• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium

Switch Network Topologies
• View a switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect

Direct Topology
• Ratio of switch nodes to processor nodes is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node
• Example: 2-D meshes

Indirect Topology
• Ratio of switch nodes to processor nodes is greater than 1:1
• Some switches simply connect other switches
• Example: binary tree network, with n = 2^d processor nodes and n − 1 switches; hypertree and butterfly networks are related indirect topologies

Routing: Hypercube Addressing
[Figure: a 4-dimensional hypercube with nodes labeled 0000 through 1111; neighboring nodes differ in exactly one bit position]

Shuffle-Exchange Network
[Figure: an 8-node shuffle-exchange network (nodes 0 through 7) and the 16-node addressing 0000 through 1111; the shuffle connection maps each node to the node whose address is a left cyclic rotation of its own, and the exchange connection joins nodes differing in the least significant bit]

Why Processor Arrays?
• Historically, the high cost of a control unit
• Scientific applications have data parallelism

Processor Array
• Front-end computer
– Holds the program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

Processor Array Performance
• Performance: work done per time unit
• The performance of a processor array depends on
– Speed of the processing elements
– Utilization of the processing elements

2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements.

Flynn's Taxonomy
• Instruction stream
• Data stream
• Single vs.
multiple
• Four combinations
– SISD
– SIMD
– MISD
– MIMD

SISD
• Single Instruction, Single Data
• Single-CPU systems
• Note: co-processors don't count
– Functional
– I/O
• Example: PCs

SIMD
• Single Instruction, Multiple Data
• Two architectures fit this category
– Pipelined vector processor (e.g., Cray-1)
– Processor array (e.g., Connection Machine)

MISD
• Multiple Instruction, Single Data
• Example: systolic array

Pipelining Overview
• Pipelining is widely used in modern processors.
• Pipelining improves system performance in terms of throughput.
• A pipelined organization requires sophisticated compilation techniques.

Making the Execution of Programs Faster
• Use faster circuit technology to build the processor and the main memory.
• Arrange the hardware so that more than one operation can be performed at the same time.
• In the latter approach, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

Traditional Pipeline Concept: Laundry Example
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
• Sequential laundry takes 6 hours for 4 loads: each load takes 30 + 40 + 20 = 90 minutes, and the four loads run back to back
• If they learned pipelining, how long would laundry take?
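The laundry schedule can be worked out in a short sketch: once the pipeline is full, a new load finishes every 40 minutes, the time of the slowest stage (the dryer).

```python
# Sketch of the laundry pipelining arithmetic from the example above.

def sequential_time(loads, stages):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stages)

def pipelined_time(loads, stages):
    """The first load fills the pipeline; after that, the slowest stage
    paces the completion of each remaining load."""
    return sum(stages) + (loads - 1) * max(stages)

stages = [30, 40, 20]                    # washer, dryer, folder (minutes)
print(sequential_time(4, stages))        # 360 minutes = 6 hours
print(pipelined_time(4, stages))         # 210 minutes = 3.5 hours
```

This matches the slides: sequential laundry takes 6 hours, pipelined laundry 3.5 hours, and the speedup is limited by the 40-minute dryer stage rather than reaching the ideal factor of 3.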
Traditional Pipeline Concept (continued)
• Pipelined laundry takes 3.5 hours for 4 loads: after the first load finishes at 90 minutes, a new load completes every 40 minutes
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• The pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce the speedup
• The time to "fill" the pipeline and the time to "drain" it also reduce the speedup
• Stalls for dependences

Using the Idea of Pipelining in a Computer: Fetch + Execution
[Figure 8.1. Basic idea of instruction pipelining: (a) sequential execution of instructions I1, I2, I3 as F1 E1, F2 E2, F3 E3; (b) hardware organization with an instruction fetch unit and an execution unit separated by interstage buffer B1; (c) pipelined execution, in which the fetch of each instruction overlaps the execution of its predecessor]

A 4-Stage Pipeline: Fetch + Decode + Execute + Write
[Figure 8.2. A 4-stage pipeline (textbook page 457). (a) Instruction execution divided into four steps: F (fetch instruction), D (decode instruction and fetch operands), E (execute operation), W (write results); instructions I1 to I4 complete one per cycle from cycle 4 onward. (b) Hardware organization with interstage buffers B1, B2, and B3 between the stages.]

Role of Cache Memory
• Each pipeline stage is expected to complete in one clock cycle.
• The clock period should be long enough to let the slowest pipeline stage complete.
• Faster stages can only wait for the slowest one to complete.
• Since main memory is very slow compared to execution, if each instruction needed to be fetched from main memory, the pipeline would be almost useless.
• Fortunately, we have cache memory.

Pipeline Performance
• The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
• However, this increase would be achieved only if all pipeline stages require the same time to complete and there is no interruption throughout program execution.
• Unfortunately, this is not true.

[Figure 8.3. Effect of an execution operation taking more than one clock cycle: E2 of instruction I2 occupies cycles 4 to 6, so the execution of I3 is delayed, and the later stages of I3, I4 and I5 stall accordingly.]

• The previous pipeline is said to have been stalled for two clock cycles.
• Any condition that causes a pipeline to stall is called a hazard.
• Data hazard – any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline; some operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions require the use of a given hardware resource at the same time.

Instruction Hazard
[Figure 8.4. Pipeline stall caused by a cache miss in F2: (a) fetching I2 takes cycles 2 to 5, so D2, E2 and W2 are delayed; (b) the decode, execute and write stages sit idle during cycles 3 to 5. These idle periods are called stalls, or bubbles.]

Structural Hazard
Load X(R1), R2
[Figure 8.5. Effect of a Load instruction on pipeline timing: I2 (the Load) needs an extra memory-access step M2 in cycle 5, so W2 occurs in cycle 6, and the stages of the following instructions are pushed back by one cycle.]

Pipeline Performance
• Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases.
• Throughput is measured by the rate at which instruction execution is completed.
• A pipeline stall causes degradation in pipeline performance.
• We need to identify all hazards that may cause the pipeline to stall and find ways to minimize their impact.

Quiz
• Four instructions; I2 takes two clock cycles for execution. Draw the figure for the 4-stage pipeline and work out the total number of cycles needed for the four instructions to complete.

Data Hazards
• We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
• Hazard occurs:
A ← 3 + A
B ← 4 × A
• No hazard:
A ← 5 × C
B ← 20 + C
• When two operations depend on each other, they must be executed sequentially in the correct order.
• Another example:
Mul R2, R3, R4
Add R5, R4, R6

[Figure 8.6. Pipeline stalled by the data dependency between D2 and W1: the Add cannot finish fetching its operands (D2A) until the Mul has written R4, so E2 is delayed and the following instructions stall.]

Operand Forwarding
• Instead of reading from the register file, the second instruction can get the data directly from the output of the ALU once the previous instruction has produced it.
• A special arrangement needs to be made to "forward" the output of the ALU to its input.
[Figure 8.7. Operand forwarding in a pipelined processor: (a) datapath with source registers SRC1 and SRC2, the ALU, and result register RSLT; (b) the forwarding path carries the result from the Execute stage back to the ALU inputs, bypassing the Write stage.]

Handling Data Hazards in Software
• Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
• The compiler can reorder instructions to perform some useful work during the NOP slots.

Side Effects
• The previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a register other than the one named as the destination.
• When a location other than the one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect.
• Example: condition code flags:
Add R1, R3
AddWithCarry R2, R4
• Instructions designed for execution on pipelined hardware should have few side effects.

Instruction Hazards
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls.
• Causes: cache misses and branches.

Unconditional Branches
[Figure 8.8. An idle cycle caused by a branch instruction: instruction I3, fetched after the branch I2, is discarded (X), and the execution unit is idle for one cycle before Ik, the branch target, is fetched.]

Branch Timing
[Figure 8.9. Branch timing: (a) if the branch address is computed in the Execute stage, the instructions fetched after the branch (I3 and I4) are discarded, giving a two-cycle branch penalty; (b) if the branch address is computed in the Decode stage, only I3 is discarded, and the penalty is reduced to one cycle.]

Instruction Queue and Prefetching
[Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b: the fetch unit (F) fills an instruction queue, from which a dispatch/decode unit (D) feeds the execute (E) and write (W) stages.]

Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
• The decision to branch cannot be made until the execution of that instruction has been completed.
• Branch instructions represent about 20% of the dynamic instruction count of most programs.

Delayed Branch
• The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
• The objective is to place useful instructions in these slots.
• The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.

[Figure 8.12. Reordering of instructions for a delayed branch.
(a) Original program loop:
LOOP  Shift_left R1
      Decrement R2
      Branch=0  LOOP
NEXT  Add R1, R3
(b) Reordered instructions:
LOOP  Decrement R2
      Branch=0  LOOP
      Shift_left R1
NEXT  Add R1, R3]

[Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12: while the branch is taken, Decrement, Branch and Shift (in the delay slot) repeat; when the branch is not taken, the Add follows the delay-slot Shift without a wasted cycle.]

Branch Prediction
• Predict whether or not a particular branch will be taken.
• Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order.
• Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis.
• Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence.
• Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.

[Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken: I3 and I4, fetched speculatively after the branch I2, are discarded (X) once the branch resolves, and fetching resumes at Ik.]

• Better performance can be achieved if we arrange for some branch instructions to be predicted as taken and others as not taken.
• Use hardware to observe whether the target address is lower or higher than that of the branch instruction.
• Let the compiler include a branch prediction bit.
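The cost of the simple "predict not taken" scheme can be sketched as a cycle count (the assumptions are illustrative: one cycle per instruction once the pipeline is full, and a fixed two-cycle penalty whenever a branch turns out to be taken, as in Figure 8.14's Execute-stage resolution).

```python
# Sketch of static "predict not taken" branch prediction cost.

def cycles_with_static_prediction(outcomes, penalty=2):
    """outcomes: True for each instruction that is a taken branch.
    One cycle per instruction, plus the misprediction penalty for every
    taken branch (the speculatively fetched instructions are discarded)."""
    cycles = len(outcomes)
    for taken in outcomes:
        if taken:
            cycles += penalty
    return cycles

# 10 instructions, 2 of which are taken branches:
trace = [False] * 8 + [True, True]
print(cycles_with_static_prediction(trace))   # 10 + 2*2 = 14 cycles
```

The sketch makes the slides' point concrete: with branches around 20% of the dynamic instruction count, mispredictions add a significant fraction to the cycle count, which motivates the smarter prediction schemes discussed next.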
• So far, the branch prediction decision is always the same every time a given instruction is executed – static branch prediction.

Influence on Instruction Sets
• Some instructions are much better suited to pipelined execution than others.
• Key issues: addressing modes and condition code flags.

Addressing Modes
• Addressing modes include simple ones and complex ones.
• In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:
– Side effects
– The extent to which complex addressing modes cause the pipeline to stall
– Whether a given mode is likely to be used by compilers
• Recall Load X(R1), R2 from Figure 8.5, where the extra memory access delayed the pipeline; Load (R1), R2 needs no such extra cycle.

Complex Addressing Mode
Load (X(R1)), R2
[Figure: the Load computes X + [R1], reads [X + [R1]], and then reads [[X + [R1]]], occupying the pipeline for several extra cycles before the result is written and forwarded; the next instruction's progress is delayed accordingly.]

Simple Addressing Mode
The same operation can be expressed with simple modes:
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
[Figure: the Add computes X + [R1], the first Load reads [X + [R1]], and the second Load reads [[X + [R1]]]; each instruction flows through the pipeline normally.]

Addressing Modes (comparison)
• In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
• Advantage: reducing the number of instructions and the program space.
• Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.
• Conclusion: complex addressing modes are not suitable for pipelined execution.
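That the complex-mode Load and the three-instruction simple-mode sequence compute the same value can be checked in a few lines (the memory contents and register values below are illustrative assumptions).

```python
# Sketch showing the equivalence of the two sequences above:
# Load (X(R1)), R2   versus   Add #X, R1, R2 ; Load (R2), R2 ; Load (R2), R2
# Both deliver [[X + [R1]]] to R2.

X = 8                              # assumed displacement
mem = {20: 100, 100: 42}           # assumed memory: mem[X + R1] = 100, mem[100] = 42
R1 = 12                            # assumed register contents

# Complex addressing mode: one instruction, two memory accesses
R2_complex = mem[mem[X + R1]]

# Simple addressing modes: three instructions, one memory access each
R2 = X + R1                        # Add  #X, R1, R2
R2 = mem[R2]                       # Load (R2), R2
R2 = mem[R2]                       # Load (R2), R2

print(R2_complex, R2)              # both yield 42
```

The computation is identical; the difference is only in how the pipeline absorbs the two memory accesses, which is exactly the trade-off the comparison above describes.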
Addressing Modes (requirements)
• Good addressing modes should have the following properties:
– Access to an operand does not require more than one access to the memory
– Only load and store instructions access memory operands
– The addressing modes used do not have side effects
• Register, register indirect, and index modes satisfy these conditions.

Condition Codes
• If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that the reordering does not cause a change in the outcome of a computation.
• The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.

[Figure 8.17. Instruction reordering.
(a) A program fragment:
Add      R1, R2
Compare  R3, R4
Branch=0 ...
(b) Instructions reordered:
Compare  R3, R4
Add      R1, R2
Branch=0 ...]

• Two conclusions:
– To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
– The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.

Datapath and Control Considerations
Original Design
[Figure 7.8. Three-bus organization of the datapath: the register file, the PC (with an incrementer and a MUX selecting the constant 4), the IR and instruction decoder, the ALU with input buffers A and B and result register R, and the MDR and MAR connecting to the memory bus data and address lines, all attached to buses A, B, and C.]
Pipelined Design
Changes relative to the three-bus organization:
- Separate instruction and data caches
- The PC is connected to an instruction memory address register (IMAR)
- A separate data memory address register (DMAR)
- Separate MDRs (MDR/Read and MDR/Write)
- Buffers at the input and output of the ALU
- An instruction queue
- Buffered instruction decoder output
Operations that can then proceed in parallel:
- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation
[Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.]

Superscalar Operation
• The maximum throughput of a pipelined processor is one instruction per clock cycle.
• If we equip the processor with multiple processing units to handle several instructions in parallel in each processing stage, several instructions start execution in the same clock cycle – multiple issue.
• Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are called superscalar processors.
• Multiple issue requires a wider path to the cache and multiple execution units.
[Figure 8.19. A processor with two execution units: the instruction fetch unit (F) fills an instruction queue; a dispatch unit issues instructions to a floating-point unit and an integer unit, and the W stage writes the results.]

Timing
[Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered: the floating-point instructions I1 (Fadd) and I3 (Fsub) each need three execution cycles (E1A, E1B, E1C), while the integer instructions I2 (Add) and I4 (Sub) execute in one cycle, so the integer instructions complete ahead of the floating-point instructions issued before them.]
Out-of-Order Execution
• Hazards
• Exceptions
• Imprecise exceptions vs. precise exceptions
[Figure (a): delayed write – I1 (Fadd) executes in E1A, E1B, E1C; the writes of the faster integer instructions are delayed so that results are written to the register file in program order.]

Execution Completion
• It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible.
• At the same time, instructions must be completed in program order to allow precise exceptions.
• This can be achieved with temporary registers and a commitment unit.
[Figure (b): using temporary registers – I2 (Add) and I4 (Sub) first write into temporary registers (TW2, TW4); the permanent writes W2, W3, W4 are then performed in program order.]

Performance Considerations
• The execution time T of a program that has a dynamic instruction count N is given by

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate.
• Instruction throughput is defined as the number of instructions executed per second:

Ps = R / S

• An n-stage pipeline has the potential to increase the throughput by n times.
• However, the only real measure of performance is the total execution time of a program.
• Higher instruction throughput will not necessarily lead to higher performance.
• Two questions regarding pipelining:
– How much of this potential increase in instruction throughput can be realized in practice?
– What is a good value of n?

Number of Pipeline Stages
• Since an n-stage pipeline has the potential to increase throughput by n times, why not use a 10,000-stage pipeline?
• As the number of stages increases, the probability of the pipeline being stalled increases.
• The inherent delay in the basic operations increases.
• Hardware considerations (area, power, complexity, ...)
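The formulas T = (N × S) / R and Ps = R / S can be exercised with a short worked sketch; the instruction count and clock rate below are illustrative assumptions, not figures from the text.

```python
# Worked sketch of the performance formulas above.

def execution_time(N, S, R):
    """T = N * S / R, with N the dynamic instruction count, S the average
    cycles per instruction, and R the clock rate in Hz."""
    return N * S / R

def throughput(R, S):
    """Ps = R / S, instructions executed per second."""
    return R / S

N = 1_000_000          # assumed dynamic instruction count
R = 500_000_000        # assumed 500 MHz clock

# Without pipelining, a 4-step instruction takes S = 4 cycles;
# an ideal 4-stage pipeline brings the average down to S = 1.
print(execution_time(N, S=4, R=R))   # 0.008 s
print(execution_time(N, S=1, R=R))   # 0.002 s
print(throughput(R, S=1))            # 5e8 instructions per second
```

Note what the sketch does and does not claim: the fourfold improvement assumes no stalls at all (S = 1), so it is the potential n-fold increase the text mentions; any hazard that raises the average S eats directly into it.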