计算机代写|ECE 570 High-Performance Computer Architecture Assignment #1

这是一个美国的计算机代写架构Assignment

1- For this problem, consider the following two code sequences executing on the 5-stage datapath with normal forwarding and bypassing shown below:

Code 1

LD R4, 0(R2)
ADD R1, R4, R2

Code 2

SUB R4, R3, R2
BNEZ R4, Loop

(a) The first code is an example of a “load followed by its immediate use”. Show and explain the correct execution timing that would allow the datapath to properly execute Code 1.

(b) The second code is an example where MEM-to-ID forwarding (also known as branch forwarding) will benefit on the 5-stage pipeline.

(i) Show and explain the execution timing as to why the datapath would not properly execute the above code sequence without the MEM-ID forwarding.

(ii) Make the necessary modifications to the datapath to implement the MEM-ID forwarding and allow Code 2 to run with minimal stalling and show the execution timing for the modified design. Clearly explain your design.

2- Consider the following code, which represents Y = a´X+Y operation for a vector length of 100. Assume the pipeline latencies from Figure 3.2 in the text and a 1-cycle delay branch that is resolved in the ID stage. In addition, as discussed in Problem #1, the pipeline uses branch forwarding that forwards the result of an ALU operation from the MEM stage to the ID stage (i.e., MEM-to-ID forwarding).

DADDIU R4, R1, #800 ; R1 = X, R4 = upper bound for X
foo: L.D F2, 0(R1) ; load X(i)
MUL.D F4, F2, F0 ; F0 = a, perform a*x(i)
L.D F6, 0(R2) ; load Y(i)
ADD.D F6, F4, F6 ; perform a*X(i) + Y(i)
S.D F6, 0(R2) ; store Y(i)
DADDIU R1, R1, #8 ; increment X index
DADDIU R2, R2, #8 ; increment Y index
DSUBU R3, R4, R1 ; test if done
BNEZ R3, foo ; loop if not done

(a) Show how this loop would execute without any scheduling. Maximize the performance of this code by applying both instruction reordering (also known as pipeline scheduling) and delay branch techniques.
Ignoring the startup delays and assuming the loop executes 100 times, determine the number of cycles required to execute the code before and after the optimizations. Do not be concerned about what happens after the loop.

(b) Unroll the loop as many times as necessary to schedule it without stalls and show the instruction schedule.
Again, assuming the loop executes 100 times, determine the number of cycles required to execute the code before and after unrolling

3- Consider the following code segment within a loop body:

if (d==0)
d=1;
if (d==1)

Here is the equivalent MIPS code assuming d is assigned to registers R1.

bnez R1,L1 ; branch b1 (d!=0)
addi R1,R0,#1 ; d==0, so d=1
L1: subi R3,R1,#1
bnez R3,L2 ; branch b2 (d!=1)

L2:

Assume that the following list of 8 values of d is to be processed:

0, 2, 0, 2, 0, 2, 0, 2

(a) Suppose 2-bit BPBs are used to predict the execution of the two branches in this loop. 2-bit BPBs consists of four states: strongly not taken (sNT), weakly not taken (wNT), strongly taken (sT), and weakly taken (wT).
Show the trace of predictions and the actual outcomes of branches b1 and b2. Assume initial values of the 2–bit predictors are sNT. What are the prediction accuracies for b1 and b2? What is the overall prediction accuracy?

(b) Suppose two-level (1, 2) branch prediction is used predict the execution of the two branches in this loop. That is, in addition to the 2-bit predictor, a 1-bit global register (g) is used. Assume the 2-bit predictors are initialized to sNT and g is initialized to NT. Show the trace of predictions and the actual outcomes of branches b1 and b2. What are the prediction accuracies for b1 and b2? What is the overall prediction accuracy?

(c) Suppose a two-level (2, 2) branch prediction is used predict the execution of the two branches in this loop. That is, in addition to the 2-bit predictor, a 2-bit global register (g) is used. Assume the 2-bit predictors are all initialized to sNT and g is initialized to NT, NT, with the LSB representing the most recent branch outcome.
Therefore, initial predictions, g, and their meaning are given below