This lab will help you understand how cache memory works and the impact that cache memories can have on the performance of your C programs. The lab consists of two parts. In the fifirst part you will write a small C program (about 200-400 lines) that simulates the behavior of a cache memory. In the second part, you will optimize a small matrix transpose function, with the goal of minimizing the number of cache misses.
This is an individual project. All handins are electronic via GitHub and Autolab. Code will be vigorously tested for integrity violations; honor code violations will be reported and handled through offificial channels.
Clarififications and corrections will be posted on the course Piazza page.
2 Submitting Your Code
2.1 Git Instructions
We will use git repository hosted by GitHub to distribute the baseline code for the project and for receiving submissions.
You can accept the assignment and obtain your starter code by going to the following url:
2.2 Handin Instructions
You will be modifying two fifiles: csim.c and trans.c. These two fifiles are the only fifiles that the au
tograder will collect. Two binary executables test-csim and test-trans have been provided for you to test the correctness of your cache simulator and transpose functions respectively. You can also run driver.py to test both csim and trans at the same time.
To turn in your solution, you must (1) submit the <user>-handin.tar fifile that the make instruction produces to the Autolab server, and (2) commit and push your repository to GitHub.
To compile, type:
linux> make clean
NOTE: Your assignment will earn zero credit if your code does not compile on a Linuxlab machine using the above commands.
To make sure that you have submitted your code correctly, be sure to verify that your submission to the Autolab server worked (i.e., compiles and runs) and that your code is visible on github.com.
We will use Autolab to grade your submission for correctness and use the GitHub submission to grade your style. Note that the Autolab part is not optional, and how your submission scores on Autolab will be your fifinal lab 4 grade for Part A and Part B. For information on how Part A and Part B are graded, see Section 4
(i.e., don’t be surprised if the scores from Autolab is different from what you get by invoking driver.py).
You will fifirst need to register with Autolab. To register, visit https://cse361s.engr.wustl.edu/
and create an account with your wustl email. After registering, you should see 361 on your home page. You can click on Cache Lab to upload your tar fifile, and the Autolab server will unpack and run your code, and tell you your score. You can also visit the scoreboard page to see where you stand compared to others who have submitted a solution.
3 Project Description
The lab has two parts. In Part A you will implement a cache simulator. In Part B you will write a matrix transpose function optimized for cache performance.
3.1 Reference Trace Files
The traces subdirectory of your repository contains a collection of reference trace fifiles that we will use to evaluate the correctness of the cache simulator you write in Part A. The trace fifiles are generated by a Linux program called valgrind. For example, typing linux> valgrind –log-fd=1 –tool=lackey -v –trace-mem=yes ls -l on the command line runs the executable program “ls -l”, captures a trace of each of its memory accesses in the order they occur, and prints them on stdout.
Valgrind memory traces have the following form:
Each line denotes one or two memory accesses. The format of each line is [space]operation address,size
The operation fifield denotes the type of memory access: “I” denotes an instruction load, “L” a data load,“S” a data store, and “M” a data modify (i.e., a data load followed by a data store). There is never a space before an “I”. There is always a space before each “M”, “L”, and “S”. The address fifield specififies a 64-bit hexadecimal memory address. The size fifield specififies the number of bytes accessed by the operation.
3.2 Part A: Writing a Cache Simulator
In Part A you will write a cache simulator in csim.c that takes a valgrind memory trace as input,simulates the hit/miss behavior of a cache memory on this trace, and outputs the total number of hits,misses, evictions, dirty bytes evicted, dirty bytes active in the cache, and back-to-back references to the same address within a given set.
We have provided you with the binary executable of a reference cache simulator, called csim-ref, that simulates the behavior of a cache with arbitrary size and associativity on a valgrind trace fifile. It uses the LRU (least-recently used) replacement policy when choosing which cache line to evict.
The reference simulator takes the following command-line arguments:
Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>
- -h: Optional help flflag that prints usage info
- -v: Optional verbose flflag that displays trace info
- -s <s>: Number of set index bits (S = 2s is the number of sets)
- -E <E>: Associativity (number of lines per set)
- -b <b>: Number of block bits (B = 2b is the block size)
- -t <tracefile>: Name of the valgrind trace to replay
The command-line arguments are based on the notation (s, E, and b) from the textbook. For example:
linux> ./csim-ref -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3 dirty_evicted:32 dirty_active:16 double_refs:4
The same example in verbose mode:
linux> ./csim-ref -v -s 4 -E 1 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit-double_ref
L 22,1 hit-double_ref
S 18,1 hit-double_ref
L 110,1 miss dirty-eviction
L 210,1 miss eviction
M 12,1 miss eviction hit-double_ref
hits:4 misses:5 evictions:3 dirty_bytes_evicted:32 dirty_bytes_active:16 double_refs:4
Your job for Part A is to fifill in the csim.c fifile so that it takes the same command-line arguments and produces the identical output as the reference simulator. Notice that this fifile is almost completely empty.
You’ll need to write it from scratch. Note you do not need to match the usage or verbosity output. You only need to match the results of printSummary.
Programming Rules for Part A
- Your csim.c fifile must compile without warnings in order to receive credit.
- Your simulator must work correctly for arbitrary values of s, E, and b. This means that you will need to allocate storage for your simulator’s data structures using the malloc function. Type “man malloc” for information about this function.
- For this lab, we are interested only in data cache performance, so your simulator should ignore all instruction cache accesses (lines starting with “I”). Recall that valgrind always puts “I” in the first column (with no preceding space), and “M”, “L”, and “S” in the second column (with a preceding space). This may help you parse the trace.
- Keep in mind that the dirty bytes active and dirty bytes evicted metrics are byte counts, not counts of the number of dirty lines or dirty lines that have been evicted. Also note that the dirty bytes active metric is a count of how many dirty bytes are currently in the cache at the end of the simulation.
- The double references metric counts how often a cache hit corresponds to the most recently accessed line for a given set. Each set must therefore have a way of tracking this information.
- To receive credit for Part A, you must call the function printSummary, with the total number of hits, misses, evictions, dirty bytes evicted, dirty bytes active in cache, and double references within the same set, at the end of your main function:
printSummary(hit_count, miss_count, eviction_count,
- For this lab, you may assume that memory accesses are aligned properly, so that a single memory access never crosses block boundaries.
3.3 Part B: Optimizing Matrix Transpose
In Part B you will write a transpose function in trans.c that causes as few cache misses as possible.
Let A denote a matrix, and Aij denote the component on the ith row and jth column. The transpose of A,denoted AT , is a matrix such that Aij = Aji
To help you get started, we have given you an example transpose function in trans.c that computes the transpose of N × M matrix A and stores the results in M × N matrix B:
char trans_desc = “Simple row-wise scan transpose”;
void trans(int M, int N, int A[N][M], int B[M][N])
The example transpose function is correct, but it is ineffificient because the access pattern results in relatively many cache misses.
Your job in Part B is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different-sized matrices:
char transpose_submit_desc = “Transpose submission”;
void transpose_submit(int M, int N, int A[N][M], int B[M][N]);
Do not change the description string (“Transpose submission”) for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.
Programming Rules for Part B
- Your code in trans.c must compile without warnings to receive credit.
- You are allowed to defifine at most 12 local variables of type int per transpose function.1
- You are not allowed to side-step the previous rule by using any variables of type long or by using any bit tricks to store more than one value to a single variable.
- Your transpose function may not use recursion.
- If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.
- Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.
- You are NOT allowed to defifine any arrays in your code or to use any variant of malloc.
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: firstname.lastname@example.org 微信:itcsdx