How the cache works

Started by NightWare, June 04, 2009, 10:16:29 PM

NightWare

hi,

I'm quite tired of seeing pointless "speed tests" (instruction tests, or worse, API call tests...). Speed tests should be used for algos and nothing else. So I've decided to explain how the cache works, and how you should write and evaluate algos in light of the cache's influence.

What is the cache?
The cache (L1/L2/L3), like memory or even hard drives, is STORAGE space, nothing ELSE! The difference is the level of proximity to the CPU (and that proximity has a direct influence on speed). The L1 cache (also called the code/trace cache) is the nearest; it is only a few kilobytes and essentially contains the most frequently used addresses (the essential part, because everything depends on the address...), code fragments and the most frequently used data. The L2/L3 cache (also called the data cache) is a bit farther away, a few megabytes, and contains 4 KB pages of code/data.

As you can see, both contain code AND data, so why are they called code cache and data cache? Because the first contains the code+data frequently used to execute instructions, while the second contains code+data not yet used by the L1 cache; maybe it will never be used, maybe it will be used soon. In a way it is a guess at what could be needed in the near future (which is why 4 KB is enough...).

How does it work?
The principle is quite simple: you STORE the code/data in the nearest possible location, depending on how often it is needed (read the librarian story about the cache on Wikipedia, and just add a room for the L2/L3 cache); a small timing sketch follows step 5.
Step 1. When you read data from memory (~100 cycles) it's slow, because you don't only read the data, you also read the rest of the 4 KB (to put a copy in the L2 cache).
Step 2. When you read the next data, it's a bit faster (~10 cycles) because now you read from the L2 cache (remember? the 4 KB).
Step 3. When you need to re-read data you have already read, it's also ~10 cycles (L2 cache), but now the address is copied into the L1 cache (and a counter is initialized).
Step 4. When you re-re-read data, there are two possibilities. 4.1 Some data is no longer needed in the L1 cache, so your new data naturally replaces it.
4.2 All the data in the L1 cache is still in use, so your new data will be inserted into the L1 cache ONLY if its counter is higher than one of the other address counters in the L1 cache.
Step 5. Your data is now in the L1 cache, so from now on (until it is replaced by something else) it will cost you only ~1 extra cycle to read.
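To see the cold-read / warm-read difference in practice, here is a minimal sketch (my own illustration, not from any manual): it times the first and the second read of the same location with RDTSC. The buffer name, the procedure name and the 4 KB size are only assumptions, and a serious measurement would also serialize with CPUID around each RDTSC.

.data?
TestBuf   db 4096 dup(?)                     ; hypothetical 4 kb test area

.code
CacheReadDemo PROC
      push esi                               ; preserve esi
      rdtsc                                  ; start stamp (low dword in eax)
      mov esi,eax
      mov ecx,DWORD PTR TestBuf              ; cold read : the line is fetched from memory / l2
      rdtsc
      sub eax,esi                            ; eax = rough cycle count of the cold read
      push eax                               ; keep the cold count

      rdtsc                                  ; start stamp for the second read
      mov esi,eax
      mov ecx,DWORD PTR TestBuf              ; warm read : same line, now served by the l1 cache
      rdtsc
      sub eax,esi                            ; eax = rough cycle count of the warm read
      pop ecx                                ; ecx = cold count, compare with eax (warm count)
      pop esi                                ; restore esi
      ret
CacheReadDemo ENDP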

I've explained the principle for reading data; for writing it's a bit more complex, I will explain how that works later.

L1 cache specific stuff:
The L1 cache is not a storage space like the others. It contains 3 different parts (for storing; there are certainly others internally for execution...):
the first part is for the addresses (remember, everything depends on the address), classic size here;
the second part is for the code (in fact fragments of code and NOT ENTIRE ALGOs (generally), I will explain later...), of undetermined size, because it depends on the code;
the third part is for the data, where an entry can go up to 64 bytes (why 64? 16 for a SIMD register, 16+16 for hardware prefetching in both directions (yes, both), and I'm sure you understand the use of the last 16 bytes... (if not, read up on alignment; in any case it's needed for instructions like lddqu)).

I remind you that for those 3 parts the L1 cache is only a few KB (8 per core on my Core2 Duo), so there is no space to waste.

Prefetching:
Prefetching allows you to pre-read data. The principle is to let you alter the normal processing by telling the CPU what will be read next (to avoid a wrong guess by the CPU). In this case you need to understand that a PSD (prefetch scheduling distance) is required to really obtain a benefit; it's easy to apply outside a loop, but inside a loop it's more problematic... Prefetching is also used automatically to pre-read code; the prefetch instructions you can use for data are just variants of the internal automatic code prefetching.
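As an illustration only (this loop is not from the original post), here is how a software prefetch with a PSD of 128 bytes (two 64-byte lines ahead) might look; the procedure name, the register usage and the distance are assumptions, and the best PSD depends on the CPU.

; in  : esi = source address (16-byte aligned), ecx = byte count (non-zero multiple of 16)
; out : XMM0 = packed dword sums of the buffer
SumWithPrefetch PROC
      pxor XMM0,XMM0                         ; clear the accumulator
Label1:   prefetchnta BYTE PTR [esi+128]     ; ask for data ~2 cache lines ahead (the PSD)
      paddd XMM0,OWORD PTR [esi]             ; meanwhile, work on data that is already close
      add esi,OWORD                          ; advance by 16 bytes
      sub ecx,OWORD                          ; 16 bytes done
      jnz Label1                             ; while ecx is not 0, go to Label1
      ret
SumWithPrefetch ENDP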

The influence of the cache on your algos (I'm going to take a simple algo as an example):

ALIGN 16
;
; clear a memory area (by filling it with zeros)
; note : the area must be 16-byte aligned, otherwise it crashes...
; and XMM0 is modified
;
; Syntax :
; mov eax,{size to clear, in bytes}
; mov edi,{address of the destination block}
; call ZeroMem_Sse2
;
; Return : nothing
;
ZeroMem_Sse2 PROC
      push ecx                              ; save ecx on the stack

; here we handle the 4x oword blocks
Label1:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,11111111111111111111111111000000b         ; keep only the 4x-oword bits in ecx
      jz Label3                              ; if ecx is 0, go to Label3
      pxor XMM0,XMM0                           ; clear XMM0
Label2:   movdqa OWORD PTR[edi],XMM0                  ; store 0 at edi
      movdqa OWORD PTR[edi+16],XMM0                  ; store 0 at edi+16
      movdqa OWORD PTR[edi+32],XMM0                  ; store 0 at edi+32
      movdqa OWORD PTR[edi+48],XMM0                  ; store 0 at edi+48
      add edi,OWORD*4                        ; add 64 to edi
      sub ecx,OWORD*4                        ; subtract 64 from ecx (our loop step)
      jnz Label2                           ; while ecx is not 0, go to Label2
; here we handle the owords
Label3:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000110000b         ; keep only the oword bits in ecx
      jz Label5                              ; if ecx is 0, go to Label5
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
      pxor XMM0,XMM0                           ; clear XMM0
Label4:   movdqa OWORD PTR[edi+ecx],XMM0               ; store 0 at edi+ecx
      add ecx,OWORD                           ; add 16 to ecx (our loop step)
      jnz Label4                           ; while ecx is not 0, go to Label4
; here we handle the dwords
Label5:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000001100b         ; keep only the dword bits in ecx
      jz Label7                              ; if ecx is 0, go to Label7
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
Label6:   mov DWORD PTR [edi+ecx],0                  ; store 0 at edi+ecx
      add ecx,DWORD                           ; add 4 to ecx (our loop step)
      jnz Label6                           ; while ecx is not 0, go to Label6
; here we handle the bytes
Label7:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000000011b         ; keep only the byte bits in ecx
      jz Label9                              ; if ecx is 0, go to Label9
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
Label8:   mov BYTE PTR [edi+ecx],0                     ; store 0 at edi+ecx
      inc ecx                              ; add 1 to ecx (our loop step)
      jnz Label8                           ; while ecx is not 0, go to Label8
Label9:   sub edi,eax                           ; subtract the size from edi (to recover the original address)

      pop ecx                              ; restore ecx from the stack
   ret                                    ; return (exit the procedure)
ZeroMem_Sse2 ENDP 


Note: it's entirely possible that the whole algo ends up in yellow (in a speed test, or with multiple calls of the function), but that is another use of the function. And it's something you must decide when you're coding your algo: what will the function be used for?

Otherwise, for the red part (the part that will not enter the L1 cache):
here your choice should be to avoid jumps, and to use a masking scheme or the instructions Intel designed specifically for this (bt/adc/sbb, cmovcc, rep movs.., etc...), because those instructions were developed exactly for that reason! (No, Intel does not have the habit of creating useless instructions! You're confusing them with Microsoft!) It's short, it can appear slower in a speed test, but it avoids undesirable effects such as instruction size, branch mispredictions (~50 cycles), etc...
Example: bt eax,1 is faster than test eax,1 (because, to determine the state of a bit, reading 4 bytes (0FBAE001) costs less than reading 5 bytes (A901000000); it's just a question of logic, even if a speed test says the contrary (understand what is behind the results!)).
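A tiny sketch of what I mean for the red part (my own example, the procedure name is made up): a cmovcc keeps a rarely-executed path free of conditional jumps, so there is nothing to mispredict.

; in  : eax and ebx = two signed values
; out : eax = the larger of the two, without any conditional jump
MaxNoBranch PROC
      cmp eax,ebx                            ; compare the two values
      cmovl eax,ebx                          ; if eax < ebx, take ebx ; no branch, no misprediction
      ret
MaxNoBranch ENDP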

For the yellow part (the part that will enter the L1 cache):
here you can use jumps, pick the fastest instruction from a speed test, etc... (even if that's not always right, there are other factors...).
Example: test eax,1 is faster than bt eax,1 (because the extra byte has been absorbed by the loop, so after a few iterations test BECOMES faster, and NOW you fully understand the importance of the cache).
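For the yellow case, a small illustrative loop (again my own example, the names are made up): once the body sits in the L1 cache the extra byte of test eax,1 costs nothing to fetch, so the simpler instruction wins inside the loop.

; in  : esi = dword array, ecx = element count (>0)
; out : eax = number of odd values
CountOdd PROC
      xor eax,eax                            ; odd counter = 0
Label1:   mov edx,DWORD PTR [esi]            ; read the current element
      test edx,1                             ; parity check ; the 5-byte encoding is absorbed by the cached loop
      jz Label2                              ; skip the counter if even
      inc eax                                ; one more odd value
Label2:   add esi,DWORD                      ; next element
      sub ecx,1                              ; one element done
      jnz Label1                             ; while ecx is not 0, go to Label1
      ret
CountOdd ENDP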

Write back?
Writing to/from the cache: you remember the story about the librarian? Well, now imagine what happens if one of the books is altered (a page missing, written on, etc...).
The librarian will not hand out a new copy; he will hand out the altered book as it is. It's the same for the CPU: the data will not be written to memory until it's necessary (anything else would be an aberration, speed-wise...). The cache waits until the data leaves the L1 cache to write it to memory (using an approach similar to NT (non-temporal) stores).
Technically, once data leaves the cache it can't return to the L1 cache immediately (it must pass through the L2 cache first). But you need to understand one thing: to read the data again, the CPU will WAIT until the value has been written to memory. And that is something that can generate BIG SLOWDOWNS (writing is always slower than reading!).
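A related sketch (an assumption on my side, not something stated above): when you know the destination will not be re-read soon, non-temporal stores (movntdq) let you skip the cache-fill and write-back traffic and write more or less straight toward memory. The procedure name is made up.

; in  : edi = destination (16-byte aligned), ecx = number of 64-byte blocks (>0)
; note: XMM0 is modified
ZeroMemNT PROC
      pxor XMM0,XMM0                         ; clear XMM0
Label1:   movntdq OWORD PTR [edi],XMM0       ; non-temporal store, no cache line allocated
      movntdq OWORD PTR [edi+16],XMM0
      movntdq OWORD PTR [edi+32],XMM0
      movntdq OWORD PTR [edi+48],XMM0
      add edi,OWORD*4                        ; add 64 to edi
      sub ecx,1                              ; one 64-byte block done
      jnz Label1                             ; while ecx is not 0, go to Label1
      sfence                                 ; make the streaming stores globally visible
      ret
ZeroMemNT ENDP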

Now, a small game...  :bdg With the example algo, in your opinion, could it be more interesting to drop the part from Label3 to Label7?


PS: yeah, I know, I speak English like a Spanish cow, but even so you now certainly understand better how the cache really works.


mitchi

Hi Nightware,

I've read it all. I like the first part of your post (before the ZeroMem SSE2 code) because it's easy to read and you give a good example of a typical memory operation. This part is easy to understand.
However, what comes after is harder to understand. The code sample has two colours and you don't tell us which one is which... And the rest is also hard to follow, simply because I don't understand the colour system.
Maybe you should use only one color and rewrite that last part?

dedndave

Quote from: NightWare
Note: it's entirely possible that the whole algo ends up in yellow (in a speed test, or with multiple calls of the function), but that is another use of the function. And it's something you must decide when you're coding your algo: what will the function be used for?

Otherwise, for the red part (the part that will not enter the L1 cache)

Quote from: NightWare
For the yellow part (the part that will enter the L1 cache)

BogdanOntanu

Sorry...

I hope your post was made with good intentions, BUT it is utterly naive, makes huge mistakes, and will mislead other people IF they choose to believe your dreams about how the "cache" works in a CPU.

Edit: not to mention that with the default board colour scheme the yellow text is very, very hard to read... if it is readable at all.

dedndave

Of course, different microprocessors have different sets of rules for how the cache operates.
Here is a simple description for the Pentium 4, with some detail.
At the bottom of the page are several links to more in-depth documents.

in MS word format:
https://users.cs.jmu.edu/abzugcx/public/Computer-Architecture/Term-Projects/Intel-Pentium-Series-by-Tim-Barto-2002-SUMMER-CS-585.doc

in HTML format:
http://74.125.95.132/search?q=cache:VPLvxZMahNIJ:https://users.cs.jmu.edu/abzugcx/public/Computer-Architecture/Term-Projects/Intel-Pentium-Series-by-Tim-Barto-2002-SUMMER-CS-585.doc+intel+cache+operation&cd=12&hl=en&ct=clnk&gl=us

The article describes the NW bit.
I have to find out which mode XP uses.
I suppose it may be different for other OSes.
I wonder if we are allowed to alter that bit.

sinsi

Sorry...
I've got to agree with Bogdan, you are all over the place e.g.
- core duo/core2/i7 have separate code/data caches
- cache lines are 16 bytes, sometimes 2 lines are read to decode 1 instruction
- some sse+ instructions read an extra 2 lines *ahead* just in case
- the new colours (purple and grey) are worse than the previous (red and yellow)

Have you read the intel manual (3a, chapter 10 I think)? Heavy... :'(

d0d0

Bogdan and sinsi,

Please kindly educate us on how you think/understand the cache works. It's easy to criticize someone, considering he/she has taken the time to try to explain it.

Thanks

NightWare

hi,

Hmm, I must admit that most of my information about the cache comes from the "Intel 64 and IA-32 Architectures Optimization Reference Manual", 09/11/2007 (IA32_Opt.book), chapter 7, "Optimizing Cache Usage"; not too heavy there. It's a protected document so I'm not allowed to quote it, but I think it can be downloaded from the Intel site, http://www.intel.com/Assets/PDF/manual/248966.pdf (hmm, a newer version, but it must surely say the same thing...).

Since I haven't read Intel's technical documents on the CPU architecture itself, I may well have misunderstood some concepts (it's possible; I code, and I'm not very interested in how it works internally...). But from what I've read, it seems quite clear...

EDIT :
Quote from: sinsi on June 05, 2009, 10:24:52 AM
- core duo/core2/i7 have separate code/data caches
- cache lines are 16 bytes, sometimes 2 lines are read to decode 1 instruction
- some sse+ instructions read an extra 2 lines *ahead* just in case
- the new colours (purple and grey) are worse than the previous (red and yellow)
Hmm, I re-read the doc, but:
- each core has its own code cache (I don't think I said the contrary...), but the data cache is shared (ch 2.1.4/5); I know, it's a bit different for quad cores.
- a data cache line is 64 bytes (ch 2.3); 16 bytes (or 128 bits) is the core memory cluster width (not exactly the same thing).
- hmm, the constant attempt to pre-read ahead (other than the automatic 128-bit hardware data prefetching) is 256 bytes for the L2 (ch 7.2), but maybe we aren't talking about the same thing here, or it's relative to the previous point...
- yeah, the colours are not very well chosen... I plead guilty here...  :bg


hutch--

While I do thank NightWare for having made the effort, I tend to agree with Bogdan here, and in my case it's for a couple of reasons. The first is that you can objectively test code combinations in real-life conditions in real time; if you could not, computers would be useless. The other is that the actual details of cache size and how it works differ from processor to processor.

I would suggest that it is more useful to address multiple levels of cache in the abstract: different-sized buffers for both code and data at different access stages in a processor's operation. Intel, and I guess AMD, do in fact publish some of the data on cache design, but the sheer volume of detail at a true hardware level is well beyond the information in the technical user manuals; it relates to computerised chip design, not user information.

mitchi

A question here. Is there something about the cache that an average ASM programmer like me does not know and SHOULD know?

sinsi

Has anyone played around with RDPMC? It would be interesting to get some of those numbers.
I know it's all CPU dependent, but someone like jj (and myself) would have an interest.

NightWare - sorry, my numbers are wrong, yours seem to be right. It all depends on which part of the Intel docs you read.
Stuff is scattered all over the place in 10 different PDFs  :( (I still don't know where '16' came from - the 486 maybe  :bdg).

jj2007

Quote from: sinsi on June 06, 2009, 06:52:01 AM
Has anyone played around with RDPMC?

Chokes. Olly says privileged instruction :(

NightWare

Quote from: mitchi on June 06, 2009, 05:50:53 AM
A question here. Is there something about the cache that an average ASM programmer like me does not know and SHOULD know?

Hmm, the answer to the small game (since nobody wanted to play anyway): YES, it's certainly more interesting to drop the part from Label3 to Label7 (by replacing the last mask with ...00111111b). Why?
1. With only 3 POSSIBLE iterations for that code, the benefit of the cache will be close to zero, IF it enters the cache at all (there is no guarantee it will be 3).
2. The mispredictions of the jumps (Label4 and Label6, ~50 cycles each) will not be advantageously absorbed with just 3 iterations: 50/3 = 16.6 extra cycles per iteration (×2 for the 2 parts).
3. You also avoid the mispredictions for Label5 and Label7, a saving of ~100 cycles.
4. You avoid some stalls.
5. The absorbed misprediction for the jump to Label8 becomes more profitable: 50/63 = 0.79 extra cycles per iteration, instead of another 50/3.
6. As a bonus, the code is shorter by a reasonable number of bytes.

So it should be more interesting, even in the worst case of 63 bytes (a sketch of the simplified tail is just below). A speed test will tell you the contrary, because in a speed test the entire algo will be in yellow,
and the benefit of the "bigger" instructions will appear more advantageous. Even if you don't believe me, you should think a bit about that.
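Here is a sketch of what I mean (an illustration derived from the algo above, not tested): the oword and dword passes disappear, and the byte loop absorbs the whole 0..63-byte remainder.

; replaces everything from Label3 to Label8 in ZeroMem_Sse2 ;
; the first mask (...11000000b) stays as it is, the tail mask becomes ...00111111b
Label3:   mov ecx,eax                           ; put the size in ecx
      and ecx,00000000000000000000000000111111b         ; keep the 0..63 byte remainder
      jz Label9                              ; if ecx is 0, go to Label9
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
Label8:   mov BYTE PTR [edi+ecx],0                     ; store 0 at edi+ecx
      inc ecx                              ; add 1 to ecx (our loop step)
      jnz Label8                           ; while ecx is not 0, go to Label8
Label9:   sub edi,eax                           ; subtract the size from edi (original address back)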

jj2007

Quote from: NightWare on June 07, 2009, 02:13:00 AM
a speed test will tell you the contrary, because in the case of a speed test the entire algo will be in yellow.

Yes, in a speed test with tens of thousands of loops, everything will be in the cache. In real life, however, we will lose a few hundred cycles (I don't have the energy right now to translate that into nanoseconds) :(

Now the crucial question: In which real life situation would it matter to be a few hundred cycles slower, once? Can you give an example?

MichaelW

There are bound to be complex algorithms that cannot be split into smaller pieces and may not fit entirely in the cache, and situations where such an algorithm would need to run a large number of times to accomplish some task.