Cachegrind: Why so many cache misses?









up vote
6
down vote

favorite












I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.



I have following toy program:



#include <iostream>
#include <vector>

int
main()
const unsigned int COUNT = 1000000;

std::vector<double> v;

for(int i=0;i<COUNT;i++)
v.push_back(i);


double counter = 0;
for(int i=0;i<COUNT;i+=8)
counter += v[i+0];
counter += v[i+1];
counter += v[i+2];
counter += v[i+3];
counter += v[i+4];
counter += v[i+5];
counter += v[i+6];
counter += v[i+7];


std::cout << counter << std::endl;



Compiling this program with g++ -O2 -g main.cpp and running valgrind --tool=cachegrind ./a.out, then cg_annotate cachegrind.out.31694 --auto=yes produces following result:



 --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw

. . . . . . . . . #include <iostream>
. . . . . . . . . #include <vector>
. . . . . . . . .
. . . . . . . . . int
7 1 1 1 0 0 4 0 0 main()
. . . . . . . . . const unsigned int COUNT = 1000000;
. . . . . . . . .
. . . . . . . . . std::vector<double> v;
. . . . . . . . .
5,000,000 0 0 1,999,999 0 0 0 0 0 for(int i=0;i<COUNT;i++)
3,000,000 0 0 0 0 0 1,000,000 0 0 v.push_back(i);
. . . . . . . . .
. . . . . . . . .
3 0 0 0 0 0 0 0 0 double counter = 0;
250,000 0 0 0 0 0 0 0 0 for(int i=0;i<COUNT;i+=8)
250,000 0 0 125,000 1 1 0 0 0 counter += v[i+0];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+1];
125,000 1 1 125,000 0 0 0 0 0 counter += v[i+2];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+3];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+4];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+5];
125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+7];
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . std::cout << counter << std::endl;
11 0 0 6 1 1 0 0 0


What I'm worried about is this line:



125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];


Why this line has so many cache-misses? The data are in contiguous memory, each iteration I'm reading 64-bytes of data (assuming the cache line is 64 bytes long).



I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0.
Computer is AMD 2400G.










share|improve this question





















  • What is column legend? What is D1mr?
    – VTT
    Nov 9 at 18:51










  • @VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
    – Andrej Kesely
    Nov 9 at 18:53










  • I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
    – Galik
    Nov 9 at 19:00










  • @Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
    – NathanOliver
    Nov 9 at 19:01






  • 1




    Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
    – Galik
    Nov 9 at 19:09














up vote
6
down vote

favorite












I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.



I have following toy program:



#include <iostream>
#include <vector>

int
main()
const unsigned int COUNT = 1000000;

std::vector<double> v;

for(int i=0;i<COUNT;i++)
v.push_back(i);


double counter = 0;
for(int i=0;i<COUNT;i+=8)
counter += v[i+0];
counter += v[i+1];
counter += v[i+2];
counter += v[i+3];
counter += v[i+4];
counter += v[i+5];
counter += v[i+6];
counter += v[i+7];


std::cout << counter << std::endl;



Compiling this program with g++ -O2 -g main.cpp and running valgrind --tool=cachegrind ./a.out, then cg_annotate cachegrind.out.31694 --auto=yes produces following result:



 --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw

. . . . . . . . . #include <iostream>
. . . . . . . . . #include <vector>
. . . . . . . . .
. . . . . . . . . int
7 1 1 1 0 0 4 0 0 main()
. . . . . . . . . const unsigned int COUNT = 1000000;
. . . . . . . . .
. . . . . . . . . std::vector<double> v;
. . . . . . . . .
5,000,000 0 0 1,999,999 0 0 0 0 0 for(int i=0;i<COUNT;i++)
3,000,000 0 0 0 0 0 1,000,000 0 0 v.push_back(i);
. . . . . . . . .
. . . . . . . . .
3 0 0 0 0 0 0 0 0 double counter = 0;
250,000 0 0 0 0 0 0 0 0 for(int i=0;i<COUNT;i+=8)
250,000 0 0 125,000 1 1 0 0 0 counter += v[i+0];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+1];
125,000 1 1 125,000 0 0 0 0 0 counter += v[i+2];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+3];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+4];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+5];
125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+7];
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . std::cout << counter << std::endl;
11 0 0 6 1 1 0 0 0


What I'm worried about is this line:



125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];


Why this line has so many cache-misses? The data are in contiguous memory, each iteration I'm reading 64-bytes of data (assuming the cache line is 64 bytes long).



I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0.
Computer is AMD 2400G.










share|improve this question





















  • What is column legend? What is D1mr?
    – VTT
    Nov 9 at 18:51










  • @VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
    – Andrej Kesely
    Nov 9 at 18:53










  • I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
    – Galik
    Nov 9 at 19:00










  • @Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
    – NathanOliver
    Nov 9 at 19:01






  • 1




    Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
    – Galik
    Nov 9 at 19:09












up vote
6
down vote

favorite









up vote
6
down vote

favorite











I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.



I have following toy program:



#include <iostream>
#include <vector>

int
main()
const unsigned int COUNT = 1000000;

std::vector<double> v;

for(int i=0;i<COUNT;i++)
v.push_back(i);


double counter = 0;
for(int i=0;i<COUNT;i+=8)
counter += v[i+0];
counter += v[i+1];
counter += v[i+2];
counter += v[i+3];
counter += v[i+4];
counter += v[i+5];
counter += v[i+6];
counter += v[i+7];


std::cout << counter << std::endl;



Compiling this program with g++ -O2 -g main.cpp and running valgrind --tool=cachegrind ./a.out, then cg_annotate cachegrind.out.31694 --auto=yes produces following result:



 --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw

. . . . . . . . . #include <iostream>
. . . . . . . . . #include <vector>
. . . . . . . . .
. . . . . . . . . int
7 1 1 1 0 0 4 0 0 main()
. . . . . . . . . const unsigned int COUNT = 1000000;
. . . . . . . . .
. . . . . . . . . std::vector<double> v;
. . . . . . . . .
5,000,000 0 0 1,999,999 0 0 0 0 0 for(int i=0;i<COUNT;i++)
3,000,000 0 0 0 0 0 1,000,000 0 0 v.push_back(i);
. . . . . . . . .
. . . . . . . . .
3 0 0 0 0 0 0 0 0 double counter = 0;
250,000 0 0 0 0 0 0 0 0 for(int i=0;i<COUNT;i+=8)
250,000 0 0 125,000 1 1 0 0 0 counter += v[i+0];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+1];
125,000 1 1 125,000 0 0 0 0 0 counter += v[i+2];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+3];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+4];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+5];
125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+7];
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . std::cout << counter << std::endl;
11 0 0 6 1 1 0 0 0


What I'm worried about is this line:



125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];


Why this line has so many cache-misses? The data are in contiguous memory, each iteration I'm reading 64-bytes of data (assuming the cache line is 64 bytes long).



I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0.
Computer is AMD 2400G.










share|improve this question













I'm currently learning about various profiling and performance utilities under Linux, notably valgrind/cachegrind.



I have following toy program:



#include <iostream>
#include <vector>

int
main()
const unsigned int COUNT = 1000000;

std::vector<double> v;

for(int i=0;i<COUNT;i++)
v.push_back(i);


double counter = 0;
for(int i=0;i<COUNT;i+=8)
counter += v[i+0];
counter += v[i+1];
counter += v[i+2];
counter += v[i+3];
counter += v[i+4];
counter += v[i+5];
counter += v[i+6];
counter += v[i+7];


std::cout << counter << std::endl;



Compiling this program with g++ -O2 -g main.cpp and running valgrind --tool=cachegrind ./a.out, then cg_annotate cachegrind.out.31694 --auto=yes produces following result:



 --------------------------------------------------------------------------------
-- Auto-annotated source: /home/andrej/Data/projects/pokusy/dod.cpp
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw

. . . . . . . . . #include <iostream>
. . . . . . . . . #include <vector>
. . . . . . . . .
. . . . . . . . . int
7 1 1 1 0 0 4 0 0 main()
. . . . . . . . . const unsigned int COUNT = 1000000;
. . . . . . . . .
. . . . . . . . . std::vector<double> v;
. . . . . . . . .
5,000,000 0 0 1,999,999 0 0 0 0 0 for(int i=0;i<COUNT;i++)
3,000,000 0 0 0 0 0 1,000,000 0 0 v.push_back(i);
. . . . . . . . .
. . . . . . . . .
3 0 0 0 0 0 0 0 0 double counter = 0;
250,000 0 0 0 0 0 0 0 0 for(int i=0;i<COUNT;i+=8)
250,000 0 0 125,000 1 1 0 0 0 counter += v[i+0];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+1];
125,000 1 1 125,000 0 0 0 0 0 counter += v[i+2];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+3];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+4];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+5];
125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];
125,000 0 0 125,000 0 0 0 0 0 counter += v[i+7];
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . std::cout << counter << std::endl;
11 0 0 6 1 1 0 0 0


What I'm worried about is this line:



125,000 0 0 125,000 125,000 125,000 0 0 0 counter += v[i+6];


Why this line has so many cache-misses? The data are in contiguous memory, each iteration I'm reading 64-bytes of data (assuming the cache line is 64 bytes long).



I'm running this program on Ubuntu Linux 18.04.1, kernel 4.19, g++ 7.3.0.
Computer is AMD 2400G.







c++ performance profiling cpu-cache cachegrind






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 9 at 18:48









Andrej Kesely

7,2692728




7,2692728











  • What is column legend? What is D1mr?
    – VTT
    Nov 9 at 18:51










  • @VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
    – Andrej Kesely
    Nov 9 at 18:53










  • I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
    – Galik
    Nov 9 at 19:00










  • @Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
    – NathanOliver
    Nov 9 at 19:01






  • 1




    Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
    – Galik
    Nov 9 at 19:09
















  • What is column legend? What is D1mr?
    – VTT
    Nov 9 at 18:51










  • @VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
    – Andrej Kesely
    Nov 9 at 18:53










  • I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
    – Galik
    Nov 9 at 19:00










  • @Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
    – NathanOliver
    Nov 9 at 19:01






  • 1




    Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
    – Galik
    Nov 9 at 19:09















What is column legend? What is D1mr?
– VTT
Nov 9 at 18:51




What is column legend? What is D1mr?
– VTT
Nov 9 at 18:51












@VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
– Andrej Kesely
Nov 9 at 18:53




@VTT From the valgrind.org/docs/manual/cg-manual.html : I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr). D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr). D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
– Andrej Kesely
Nov 9 at 18:53












I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
– Galik
Nov 9 at 19:00




I am not seeing many missed reads here. That one line... you are bound to get a missed cache after so many times round the loop (near the end looks convincing to me). The number of misses compared to the number of reads doesn seem so bad (125000/8million+?) did I read that correctly?
– Galik
Nov 9 at 19:00












@Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
– NathanOliver
Nov 9 at 19:01




@Galik While it doesn't seem bad, it's not 0, so it's not good enough ;)
– NathanOliver
Nov 9 at 19:01




1




1




Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
– Galik
Nov 9 at 19:09




Out of interest did you try incrementing the loop by just 1 and letting the optimizer do its own loop unrolling to see how that performs? (with -O3 also)?
– Galik
Nov 9 at 19:09












2 Answers
2






active

oldest

votes

















up vote
3
down vote













It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:



.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28


There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.



Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:



for(int i=0;i<COUNT;i++) 
v.push_back(i);



will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).



Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.






share|improve this answer






















  • Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
    – Andrej Kesely
    Nov 9 at 20:43











  • @AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
    – Hadi Brais
    Nov 9 at 20:56


















up vote
1
down vote













I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.






share|improve this answer




















  • Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
    – NathanOliver
    Nov 9 at 19:05










  • @NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
    – VTT
    Nov 9 at 19:08











  • Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
    – Galik
    Nov 9 at 19:23







  • 1




    @Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
    – Andrej Kesely
    Nov 9 at 19:25










  • @Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
    – VTT
    Nov 9 at 19:33










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53231681%2fcachegrind-why-so-many-cache-misses%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
3
down vote













It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:



.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28


There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.



Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:



for(int i=0;i<COUNT;i++) 
v.push_back(i);



will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).



Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.






share|improve this answer






















  • Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
    – Andrej Kesely
    Nov 9 at 20:43











  • @AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
    – Hadi Brais
    Nov 9 at 20:56















up vote
3
down vote













It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:



.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28


There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.



Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:



for(int i=0;i<COUNT;i++) 
v.push_back(i);



will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).



Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.






share|improve this answer






















  • Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
    – Andrej Kesely
    Nov 9 at 20:43











  • @AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
    – Hadi Brais
    Nov 9 at 20:56













up vote
3
down vote










up vote
3
down vote









It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:



.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28


There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.



Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:



for(int i=0;i<COUNT;i++) 
v.push_back(i);



will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).



Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.






share|improve this answer














It's important to first check the generated assembly code because that's what cachegrind is going to simulate. The loop that you are interested in gets compiled into the following code:



.L28:
addsd xmm0, QWORD PTR [rax]
add rax, 64
addsd xmm0, QWORD PTR [rax-56]
addsd xmm0, QWORD PTR [rax-48]
addsd xmm0, QWORD PTR [rax-40]
addsd xmm0, QWORD PTR [rax-32]
addsd xmm0, QWORD PTR [rax-24]
addsd xmm0, QWORD PTR [rax-16]
addsd xmm0, QWORD PTR [rax-8]
cmp rdx, rax
jne .L28


There are 8 read accesses per iteration and each is 8-byte in size. In C++, it's guaranteed that each element is 8-byte aligned, but up to two cache lines could be accessed per iteration depending on the address of the array of the v vector. cachegrind uses dynamic binary instrumentation to obtain the address of each memory access and apply its cache hierarchy model to determine whether an access is a hit or miss at each level in the hierarchy (it supports only L1 and LLC though). In this particular instance, it happens that a new cache line is accessed at counter += v[i+6];. Then, the next 7 accesses would be to the same 64-byte cache line. The source code line at which a new cache line is accessed doesn't impact the total number of misses reported by cachegrind. It will just tell you that a different source code line incurs that many misses.



Note that cachegrind simulates a very simplified cache hierarchy based on the machine it's running on. In this case, it is AMD 2400G, which has a 64-byte line size at all cache levels. In addition, the size of the L3 is 4MB. But since the total array size is 8MB, then the following loop:



for(int i=0;i<COUNT;i++) 
v.push_back(i);



will leave only the second half of the array in the LLC. Now in the very first iteration of the second loop in which counter is calculated, the first line accessed will not be in the L1 or LLC. This explains the 1 in D1mr and DLmr columns. Then at counter += v[i+6];, another line is accessed, which is also a miss in both levels of the cache. However, in this case, the next 7 accesses will all be hits. At this point, only the access from counter += v[i+6]; will miss and there are 125,000 such accesses (1 million / 8).



Note that cachegrind is just a simulator and what actually happens on a real processor can be and most probably is very different. For example, on my Haswell processor, by using perf, the total number of L1D misses from all of the code (both loops) is only 65,796. So cachegrind may either significantly overestimate or underestimate the miss and hit counts.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 9 at 20:27

























answered Nov 9 at 20:19









Hadi Brais

8,56911636




8,56911636











  • Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
    – Andrej Kesely
    Nov 9 at 20:43











  • @AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
    – Hadi Brais
    Nov 9 at 20:56

















  • Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
    – Andrej Kesely
    Nov 9 at 20:43











  • @AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
    – Hadi Brais
    Nov 9 at 20:56
















Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
– Andrej Kesely
Nov 9 at 20:43





Yes, you have right. When I use reverse iterator: for(auto it = v.rbegin(); it != v.rend(); ++it) counter += *it; , the cachegrind reports only half total misses. I also tried perf stat -e L1-dcache-load-misses ./a.out, and it oscillates around ~105,000 on my computer (for both loops). Does that mean that AMD has worse prefetcher than Intel? I also didn't know that cachegrind is only simulator...
– Andrej Kesely
Nov 9 at 20:43













@AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
– Hadi Brais
Nov 9 at 20:56





@AndrejKesely It not only depends on the hardware cache prefetchers, but also on cache replacement policies and physical addresses of the locations accessed (cachegrind only uses virtual addresses). If I disable prefetchers on Haswell, I get 191,997 L1D misses for the whole program (compared to 65,796 when prefetchers are enabled).
– Hadi Brais
Nov 9 at 20:56













up vote
1
down vote













I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.






share|improve this answer




















  • Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
    – NathanOliver
    Nov 9 at 19:05










  • @NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
    – VTT
    Nov 9 at 19:08











  • Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
    – Galik
    Nov 9 at 19:23







  • 1




    @Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
    – Andrej Kesely
    Nov 9 at 19:25










  • @Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
    – VTT
    Nov 9 at 19:33














up vote
1
down vote













I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.






share|improve this answer




















  • Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
    – NathanOliver
    Nov 9 at 19:05










  • @NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
    – VTT
    Nov 9 at 19:08











  • Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
    – Galik
    Nov 9 at 19:23







  • 1




    @Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
    – Andrej Kesely
    Nov 9 at 19:25










  • @Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
    – VTT
    Nov 9 at 19:33












up vote
1
down vote










up vote
1
down vote









I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.






share|improve this answer












I suspect that this happens because vector buffer is not aligned on cache line boundary. That is sudden jump in cache misses marks a point when we proceed to a next line. So I suggest to check v.data() value.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 9 at 18:55









VTT

23k32345




23k32345











  • Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
    – NathanOliver
    Nov 9 at 19:05










  • @NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
    – VTT
    Nov 9 at 19:08











  • Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
    – Galik
    Nov 9 at 19:23







  • 1




    @Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
    – Andrej Kesely
    Nov 9 at 19:25










  • @Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
    – VTT
    Nov 9 at 19:33
















  • Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
    – NathanOliver
    Nov 9 at 19:05










  • @NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
    – VTT
    Nov 9 at 19:08











  • Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
    – Galik
    Nov 9 at 19:23







  • 1




    @Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
    – Andrej Kesely
    Nov 9 at 19:25










  • @Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
    – VTT
    Nov 9 at 19:33















Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
– NathanOliver
Nov 9 at 19:05




Besides writing a custom allocator, is there any way the OP could get a vector with the right alignment?
– NathanOliver
Nov 9 at 19:05












@NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
– VTT
Nov 9 at 19:08





@NathanOliver I think OP could just allocate an array of bytes and select an aligned starting position. Or just select an aligned starting position in vector of double.
– VTT
Nov 9 at 19:08













Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
– Galik
Nov 9 at 19:23





Isn't aligning the vector to a cache line boundary just going to make the position where the cache misses move to a different place? Aren't you still going to get a cache miss just as regularly as you iterate through memory?
– Galik
Nov 9 at 19:23





1




1




@Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
– Andrej Kesely
Nov 9 at 19:25




@Galik Yes, now I tried to allocate the vector with double* v = (double *)aligned_alloc(64, COUNT * 8); instead of std::vector, and the cache miss is on line counter += v[i+0];. I thought that CPU prefetcher will fetch data as they go.
– Andrej Kesely
Nov 9 at 19:25












@Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
– VTT
Nov 9 at 19:33




@Galik Yes, but this misalignment explains why cache misses happen in the middle of the loop.
– VTT
Nov 9 at 19:33

















 

draft saved


draft discarded















































 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53231681%2fcachegrind-why-so-many-cache-misses%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

Syphilis

Darth Vader #20