Butoxors [25]
2 years ago
4

You are designing a write buffer between a write-through L1 cache and a write-back L2 cache. The L2 cache write data bus is 16 B wide and can perform a write to an independent cache address every 4 processor cycles.
a. How many bytes wide should each write buffer entry be?
b. What speedup could be expected in the steady state by using a merging write buffer instead of a non-merging buffer when zeroing memory by the execution of 64-bit stores if all other instructions could be issued in parallel with the stores and the blocks are present in the L2 cache?
c. What would be the effect of possible L1 misses on the number of required write buffer entries for systems with blocking and non-blocking caches?
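
The numbers in the question allow a quick steady-state estimate for part b. The sketch below is only a back-of-the-envelope model, not a worked solution from the answer: it assumes each buffer entry is as wide as the 16 B bus, that zeroing memory produces stores to adjacent addresses, and that the processor can issue stores at least as fast as the L2 drains them. All names are illustrative.

```python
# Back-of-the-envelope model of part (b); names are illustrative, not from the source.
L2_BUS_BYTES = 16       # L2 write data bus width, from the question
L2_WRITE_CYCLES = 4     # one L2 write every 4 processor cycles, from the question
STORE_BYTES = 8         # a 64-bit store

def stores_per_l2_write(merging: bool) -> int:
    """Stores drained by a single L2 write in the steady state."""
    if merging:
        # Assumption: sequential stores to adjacent addresses share one entry.
        return L2_BUS_BYTES // STORE_BYTES
    # A non-merging buffer uses one entry, and so one L2 write, per store.
    return 1

def cycles_per_store(merging: bool) -> float:
    """Average processor cycles needed to retire one store to the L2."""
    return L2_WRITE_CYCLES / stores_per_l2_write(merging)

speedup = cycles_per_store(False) / cycles_per_store(True)
print(cycles_per_store(False), cycles_per_store(True), speedup)
```

Under these assumptions the non-merging buffer retires one store every 4 cycles and the merging buffer one every 2 cycles, which is where a factor of two would come from.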
Computers and Technology
1 answer:
Bezzdna [24] 2 years ago
3 0

Answer:

Clock: 2.5 GHz

L1 I-cache: 32 KB, 8-way, 64 B line size, 4-cycle access latency

L1 D-cache: write-back, write-allocate; MSHR with 0 (lockup cache), 1, 2, and 64 (unconstrained non-blocking cache) entries; write-back buffer with 16 entries

L2 cache: 256 KB, 8-way, 64 B line size, 10-cycle access latency

L3 cache: 2 MB per core, 64 B line size, 36-cycle access latency

Memory: DDR3-1600, 90-cycle access latency

Issue width: 4

Instruction window size: 36

ROB size: 128

Load buffer size: 48

Store buffer size: 32

b)

parallelism took this a step further by providing more parallelism and hence more latency-hiding opportunities. It is likely that the use of instruction- and thread-level parallelism will be the primary tool to combat whatever memory delays are encountered in modern multilevel cache systems.

that of the lockup cache setup (hit-under-0-miss). For the integer programs, the average performance improvement (measured as CPI) is 7.08% for hit-under-1-miss, 8.36% for hit-under-2-misses, and 9.02% for hit-under-64-misses (essentially the unconstrained non-blocking cache), compared to the lockup cache. For the floating-point programs, the three numbers are 12.69%, 16.22%, and 17.76%, respectively.

c)

Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed they could achieve significant performance gains in comparison to blocking caches. However, those experiments were performed with benchmarks that are now over a decade old. Furthermore, the processor that was simulated was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through and write-no-allocate caches. These assumptions are very different from today's high-performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the impacts of non-blocking data caches using the latest SPEC CPU2006 benchmark suite on practical high-performance out-of-order (OOO) processors. Simulations show that a data cache that supports hit-under-2-misses can provide a 17.76% performance gain for a typical high-performance OOO processor running the SPEC CPU2006 benchmarks in comparison to a similar machine with a blocking cache.


You might be interested in
Assume a program requires the execution of 50 × 10^6 FP instructions, 110 × 10^6 INT instructions, 80 × 10^6 L/S instructions, and
Molodets [167]

Explanation:

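FP: 50 \times 10^6 instructions, CPI = 1

INT: 110 \times 10^6 instructions, CPI = 1

L/S: 80 \times 10^6 instructions, CPI = 4

Branch: 16 \times 10^6 instructions, CPI = 2

Clock rate: 2 \times 10^9 Hz

Time(old) = \frac{50 \times 10^6 + 110 \times 10^6 + 4 \times (80 \times 10^6) + 2 \times (16 \times 10^6)}{2 \times 10^9} = 256 \times 10^{-3} s

Time(new) = \frac{256 \times 10^{-3}}{2} = 128 \times 10^{-3} s

With CPI(new) as the new CPI of the FP instructions:

\frac{CPI(new) \times (50 \times 10^6) + 110 \times 10^6 + 4 \times (80 \times 10^6) + 2 \times (16 \times 10^6)}{2 \times 10^9} = 128 \times 10^{-3}

CPI(new) \times (50 \times 10^6) = 256 \times 10^6 - 462 \times 10^6 = -206 \times 10^6

CPI(new) = \frac{-206}{50} = -4.12

A negative CPI is impossible, so the run time cannot be halved by improving the FP instructions alone.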
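
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the instruction counts, CPIs, and 2 GHz clock listed in the explanation, and that only the FP CPI is allowed to change; the variable names are illustrative.

```python
# Instruction counts and CPIs as listed in the explanation (integers keep the math exact).
counts = {"FP": 50_000_000, "INT": 110_000_000, "LS": 80_000_000, "BRANCH": 16_000_000}
cpi = {"FP": 1, "INT": 1, "LS": 4, "BRANCH": 2}
CLOCK_HZ = 2_000_000_000

# Total cycles and run time of the original program.
cycles_old = sum(counts[k] * cpi[k] for k in counts)
time_old = cycles_old / CLOCK_HZ

# Halving the run time halves the cycle budget at a fixed clock rate.
cycle_budget = cycles_old // 2
time_new = cycle_budget / CLOCK_HZ

# Cycles spent in everything except FP, which is assumed not to change.
other_cycles = sum(counts[k] * cpi[k] for k in counts if k != "FP")

# FP CPI that would be needed to fit inside the halved budget.
fp_cpi_needed = (cycle_budget - other_cycles) / counts["FP"]

print(time_old, time_new, fp_cpi_needed)
```

The non-FP instructions alone already need 462 million cycles, more than the 256 million cycle budget, which is why the required FP CPI comes out negative.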

3 0
2 years ago
Your computer has a quad-core processor that supports multithreading installed. given that the system is running windows, how ca
AlexFokin [52]
By using CPU-Z or the Performance tab in Task Manager (taskmgr).
8 0
2 years ago
Write a function named printtriangle that receives a parameter that holds a non-negative integer value and prints a triangle of
NemiM [27]
static void PrintTriangle(int n) {
    // Print rows of n, n-1, ..., 1 asterisks (requires `using System;`).
    for (; n > 0; n--) {
        Console.WriteLine(new String('*', n));
    }
}
5 0
2 years ago
Describe the output when the following code executes in 64-bit mode: .data dividend_hi QWORD 00000108h dividend_lo QWORD 3330002
iragen [17]

Answer:

RAX = 333000h (16 bits with preceding zeros removed)

RDX = 20h (also 16 bits with preceding zeros removed)

Explanation:

The "div" instruction performs unsigned division. It takes the divisor (the denominator) as its operand and divides the implicit dividend held in the DX:AX register pair (RDX:RAX for a 64-bit operand). The quotient is saved in the AX register (RAX in 64-bit mode) and the remainder, if any, in the DX register (RDX), overwriting whatever those registers held before.

The code above divides the dividend in RDX:RAX by the divisor variable and saves the quotient and remainder in RAX and RDX, respectively.
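
The register behavior described above can be modeled in a few lines of Python. This is a rough sketch of an unsigned 64-bit "div", not the code from the question: the function name `div64` and the operand values in the example are made up for illustration, and the two exceptions stand in for the divide-error fault that the processor raises.

```python
MASK64 = (1 << 64) - 1

def div64(rdx: int, rax: int, divisor: int) -> tuple[int, int]:
    """Model unsigned 64-bit DIV: (RDX:RAX) / divisor -> (new RAX, new RDX)."""
    if divisor == 0:
        raise ZeroDivisionError("divide error: divisor is zero")
    # The dividend is the 128-bit value with RDX as the high half and RAX as the low half.
    dividend = ((rdx & MASK64) << 64) | (rax & MASK64)
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        # On real hardware a quotient that does not fit in RAX is also a divide error.
        raise OverflowError("divide error: quotient does not fit in 64 bits")
    return quotient, remainder

# Illustrative operands only (not the values from the truncated question).
new_rax, new_rdx = div64(0x0, 0x1234, 0x100)
print(hex(new_rax), hex(new_rdx))  # 0x12 0x34
```

Dividing by a power of 16 such as 100h simply splits the hexadecimal digits: the upper digits become the quotient and the lower digits the remainder.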

7 0
2 years ago
Haley is helping to choose members for a customer satisfaction team. Which of the following employees demonstrate skill in focus
ivanzaharov [21]
Jesse constantly looks for better ways to solve problems for customers
7 0
2 years ago