The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: NightWare on June 04, 2009, 10:16:29 PM

Title: how the cache work.
Post by: NightWare on June 04, 2009, 10:16:29 PM
hi,

i'm quite tired of seeing stupid "speed tests" (instruction tests, or worse, api call tests...); speed tests should be used for algos, and nothing else. so i've decided to explain how the cache works, and how you should write/evaluate algos because of the cache's influence.

What is the cache ?
the cache (l1/l2/l3), like memory or even hard drives, is STORAGE space, nothing ELSE ! the difference is the level of proximity to the cpu (and the level of proximity has a direct influence on speed). the l1 cache (also named code/trace cache) is the nearest; only a few kilobytes, it essentially contains the most frequently used addresses (and that's the essential part, because everything depends on the address...), parts of code, and the most frequently used data. the l2/l3 cache (also named data cache) is a bit further away, a few megabytes, containing 4kb pages of code/data.

as you can notice, both contain code AND data, so why are they named code and data cache ? because the first contains the code+data frequently used for the execution of instructions, and the second contains code+data not used yet by the l1 cache; maybe it will never be used, maybe it will be used soon. in some way it's a guess at what could be used in the near future (which is why 4kb is enough...).

How does it work ?
the principle is quite simple: you STORE the code/data in the nearest possible location, depending on how often it is needed (read the librarian story about the cache on wikipedia, and just add a room for the l2/l3 cache). a small timing sketch follows the steps below.
step 1. when you read data from memory (~100 cycles) it's slow because you don't only read the data, but also the rest of the 4kb (to put a copy in the l2 cache).
step 2. when you read the next data, it's a bit faster (~10 cycles) because here you read from the l2 cache (remember ? the 4kb).
step 3. when you need to re-read data previously read, it's also ~10 cycles (l2 cache), but now the address is copied into the l1 cache (and a counter is initialized).
step 4. when you re-re-read data, there are 2 possibilities. 4.1 some data isn't needed anymore in the l1 cache, so your new data naturally replaces it.
4.2 all the data in the l1 cache is still in use, so your new data will be inserted into the l1 cache ONLY if its counter is higher than one of the other address counters in the l1 cache.
step 5. your data is now in the l1 cache, so from now on (until it is replaced by something else), reading it will only cost you ~1 cycle extra.
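to make the steps concrete, here is a minimal timing sketch (my illustration, not a tested benchmark) : it reads the same dword twice and compares rdtsc deltas; the first read typically pays the memory/l2 cost, the second is served from the cache. SomeData is an assumed variable, and a serious measurement would serialize with cpuid; this only shows the shape of the experiment.

.data
SomeData dd 12345                        ; an assumed dword somewhere in memory
.code
      rdtsc                              ; read the time stamp counter (clobbers edx too)
      mov esi,eax                        ; keep the low 32 bits of the start stamp
      mov ebx,SomeData                   ; first read : pays the memory/l2 cost
      rdtsc
      sub eax,esi                        ; eax = rough cost of the first read
      rdtsc
      mov esi,eax
      mov ebx,SomeData                   ; second read : served from the cache
      rdtsc
      sub eax,esi                        ; a noticeably smaller delta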

i've explained the principle for reading data; for writing it's a bit more complex, i will explain how it works later.

L1 cache specific stuff :
the l1 cache is not a storage space like the others. it contains 3 different parts (for storing; there are certainly others internally for execution...)
the first part is for the addresses (remember, everything depends on the address), classic size here.
the second part is for the code (in fact parts of code and NOT ENTIRE ALGOS (generally), i will explain later...), undetermined size here, because it depends on the code.
the third part is for the data; here the size can go up to 64 bytes (why 64 ? 16 for a simd register, 16+16 for hardware prefetching in both directions (yes, both), and i'm sure you understand the usage of the last 16 bytes... (if not, read up on alignment; anyway it's needed for instructions like lddqu))

i remind you that for those 3 parts the l1 cache is only a few kb (8 per core, on my core2 duo), so there is no space to waste.

Prefetching :
prefetching allows you to pre-read data. the principle is to let you alter the normal processing by indicating to the cpu what will be read next (to avoid a wrong guess by the cpu). in this case you need to understand that a PSD (prefetch scheduling distance) is required to really obtain a benefit; it's easy to apply outside of a loop, but in a loop it's more problematic... prefetching is also used automatically to pre-read code; the prefetch instructions you can use for data are just variants of the internal automatic code prefetching.
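as a hedged sketch of the PSD idea (mine, not from the manual) : while a copy loop works on the current oword, prefetchnta asks the cpu to start fetching a line some distance ahead, so it arrives before the loop needs it. the 256-byte distance and the esi/edi/ecx setup (16-aligned, ecx a multiple of 16) are assumptions; the right PSD depends on how much work each iteration does.

CopyLoop: prefetchnta BYTE PTR [esi+256]      ; hint : start pulling in a line ~256 bytes ahead
      movdqa XMM0,OWORD PTR [esi]             ; read the current oword
      movdqa OWORD PTR [edi],XMM0             ; write it to the destination
      add esi,OWORD                           ; advance the source
      add edi,OWORD                           ; advance the destination
      sub ecx,OWORD                           ; ecx = bytes remaining
      jnz CopyLoop                            ; until nothing remains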

The influence of the cache on your algos (i'm going to take the example of a simple algo) :

ALIGN 16
;
; clean a mem area (by filling with zeros)
; note : the area must be aligned 16, otherwise crash...
; and XMM0 is modified
;
; Syntax :
; mov eax,{size to clear, in bytes}
; mov edi,{address of the destination block}
; call ZeroMem_Sse2
;
; Return : nothing
;
ZeroMem_Sse2 PROC
      push ecx                              ; save ecx

; here we process the 4x owords
Label1:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,11111111111111111111111111000000b         ; keep only the 4x-oword bits in ecx
      jz Label3                              ; if ecx is equal to 0, go to Label3
      pxor XMM0,XMM0                           ; clear XMM0
Label2:   movdqa OWORD PTR[edi],XMM0                  ; store 0 at [edi]
      movdqa OWORD PTR[edi+16],XMM0                  ; store 0 at [edi+16]
      movdqa OWORD PTR[edi+32],XMM0                  ; store 0 at [edi+32]
      movdqa OWORD PTR[edi+48],XMM0                  ; store 0 at [edi+48]
      add edi,OWORD*4                        ; add 64 to edi
      sub ecx,OWORD*4                        ; subtract 64 from ecx (our loop step)
      jnz Label2                           ; while ecx is not 0, go to Label2
; here we process the owords
Label3:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000110000b         ; keep only the oword bits in ecx
      jz Label5                              ; if ecx is equal to 0, go to Label5
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
      pxor XMM0,XMM0                           ; clear XMM0
Label4:   movdqa OWORD PTR[edi+ecx],XMM0               ; store 0 at [edi+ecx]
      add ecx,OWORD                           ; add 16 to ecx (our loop step)
      jnz Label4                           ; while ecx is not 0, go to Label4
; here we process the dwords
Label5:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000001100b         ; keep only the dword bits in ecx
      jz Label7                              ; if ecx is equal to 0, go to Label7
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
Label6:   mov DWORD PTR [edi+ecx],0                  ; store 0 at [edi+ecx]
      add ecx,DWORD                           ; add 4 to ecx (our loop step)
      jnz Label6                           ; while ecx is not 0, go to Label6
; here we process the bytes
Label7:   mov ecx,eax                           ; put the size to clear in ecx
      and ecx,00000000000000000000000000000011b         ; keep only the byte bits in ecx
      jz Label9                              ; if ecx is equal to 0, go to Label9
      add edi,ecx                           ; add ecx to edi
      neg ecx                              ; negate ecx
Label8:   mov BYTE PTR [edi+ecx],0                     ; store 0 at [edi+ecx]
      inc ecx                              ; add 1 to ecx (our loop step)
      jnz Label8                           ; while ecx is not 0, go to Label8
Label9:   sub edi,eax                           ; subtract the size from edi (to restore the original address)

      pop ecx                              ; restore ecx
      ret                                    ; return (exit the procedure)
ZeroMem_Sse2 ENDP 

note : it's totally possible that the entire algo will be in yellow (in a speed test, or with multiple calls of the function), but that's another usage of the function. and it's something you must decide when you're coding your algo : what will the use of the function be ?

otherwise, for the red part (the part that will not enter the l1 cache) :
here, your choice will be to avoid jumps and use a masking system, or the instructions specifically designed by intel (bt/adc/sbb, cmovcc, rep mov.., etc...) because those instructions have been developed for exactly that reason ! (no ! intel doesn't have the habit of creating useless instructions ! you're confusing them with microsoft here !). it's short, can appear slower in a speed test, but it avoids undesirable effects like instruction size, branch mispredictions (~50 cycles), etc...
example : bt eax,1 is faster than test eax,1 (because to determine the state of a bit, reading 4 bytes (0FBAE001) costs less than reading 5 bytes (A901000000); it's just a question of logic, even if a speed test says the contrary (understand what's behind the results !))
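for illustration, a minimal masking sketch (my own, not from the post) in the spirit of the sbb technique named above : an unsigned min of eax and ecx with no jump at all, so there is nothing to mispredict whatever the data looks like.

      sub eax,ecx                        ; eax = a - b, carry set if a < b (unsigned)
      sbb edx,edx                        ; edx = 0FFFFFFFFh if a < b, else 0
      and eax,edx                        ; keep (a - b) only when a < b
      add eax,ecx                        ; eax = min(a,b), edx is trashed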

for the yellow part (the part that will enter the l1 cache) :
here you can use jumps, choose the fastest instruction from a speed test, etc... (even if it's not always true, there are other factors...).
example : test eax,1 is faster than bt eax,1 (because the extra byte has been absorbed by the loop, so after a few loops test BECOMES faster, and NOW you fully understand the importance of the cache).
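a small sketch of the yellow case (mine, for illustration) : a tight bit-counting loop that lives entirely in the l1 cache once fetched, where the 1-byte size difference between test and bt is absorbed and the cheaper-to-execute instruction wins.

      xor edx,edx                        ; edx will count the set bits of eax
BitLoop:  test eax,1                     ; the 5-byte encoding is fetched once, then served from l1
      jz NoBit                           ; low bit clear : skip the count
      inc edx                            ; low bit set : count it
NoBit: shr eax,1                         ; move to the next bit
      jnz BitLoop                        ; until no bits remain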

Write back ?
writing to/from the cache : you remember the story with the librarian ? well, now imagine what happens if one of the books is altered (page missing, written on, etc...).
the librarian will not hand out a new book, he will hand out the altered book nearby. it's the same for the cpu : the data will not be written to memory until it's necessary (it would be a speed aberration otherwise...). the cache will wait until the data leaves the l1 cache to write to memory (using an approach similar to NT (non temporal) stores).
technically, if data leaves the cache, it can't suddenly return to the l1 cache (it must pass through the l2 cache first). but you need to understand one thing : to read the data again, the cpu will WAIT until the value has been written to memory. and that is something that can generate BIG SLOWDOWNS (writing is slower than reading, always !).
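since the NT mechanism is mentioned above, here is a hedged sketch (mine, not from the post) of the 4x-oword loop rewritten with movntdq : the stores go around the cache, so a big fill doesn't evict code/data the l1/l2 are already holding, and no write-back of the filled lines is needed later. it assumes edi is 16-aligned and ecx holds a byte count that is a non-zero multiple of 64, like the Label2 loop above.

      pxor XMM0,XMM0                           ; clear XMM0
NtLoop:   movntdq OWORD PTR[edi],XMM0            ; store 0 at [edi], around the cache
      movntdq OWORD PTR[edi+16],XMM0             ; store 0 at [edi+16]
      movntdq OWORD PTR[edi+32],XMM0             ; store 0 at [edi+32]
      movntdq OWORD PTR[edi+48],XMM0             ; store 0 at [edi+48]
      add edi,OWORD*4                        ; add 64 to edi
      sub ecx,OWORD*4                        ; subtract 64 from ecx
      jnz NtLoop                             ; while ecx is not 0, loop
      sfence                                 ; make the NT stores globally visible before reuse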

now, a small game...  :bdg with the example algo, in your opinion, could it be more interesting to avoid the part from Label3 to Label7 ?


ps : yeah i know, i speak english like a spanish cow, but even so you'll certainly understand better how the cache really works.

Title: Re: how the cache work.
Post by: mitchi on June 04, 2009, 10:33:04 PM
Hi Nightware,

I've read it all. I like the first part of your post (before the ZeroMem SSE2 code) because it's easy to read and you give a good example of a typical memory operation. This part is easy to understand.
However, what comes after is harder to understand. The code sample has 2 colors and you don't tell us which one is which... And the rest is also hard to follow just because I don't understand the color system.
Maybe you should use only one color and rewrite that last part?
Title: Re: how the cache work.
Post by: dedndave on June 04, 2009, 10:36:38 PM
Quote: "note : it's totally possible that the entire algo will be in yellow (in a speed test, or with multiple calls of the function), but that's another usage of the function. and it's something you must decide when you're coding your algo : what will the use of the function be ?

otherwise, for the red part (the part that will not enter the l1 cache)"

Quote: "for the yellow part (the part that will enter the l1 cache)"
Title: Re: how the cache work.
Post by: BogdanOntanu on June 05, 2009, 04:52:43 AM
Sorry...

I hope your post was made with good intentions BUT it is utterly naive, makes huge mistakes, and will mislead other people IF they choose to believe your dreams about how the "cache" works in a CPU.

Edit: not to mention that on the default board colour scheme the yellow text is very, very hard to read... if even possible.
Title: Re: how the cache work.
Post by: dedndave on June 05, 2009, 05:11:36 AM
of course different microprocessors have different sets of rules for how the cache operates
here is a simple description for the pentium 4 - some detail
at the bottom of the page are several links to more in-depth documents

in MS word format:
https://users.cs.jmu.edu/abzugcx/public/Computer-Architecture/Term-Projects/Intel-Pentium-Series-by-Tim-Barto-2002-SUMMER-CS-585.doc

in HTML format:
http://74.125.95.132/search?q=cache:VPLvxZMahNIJ:https://users.cs.jmu.edu/abzugcx/public/Computer-Architecture/Term-Projects/Intel-Pentium-Series-by-Tim-Barto-2002-SUMMER-CS-585.doc+intel+cache+operation&cd=12&hl=en&ct=clnk&gl=us

the article describes the NW bit
i hafta find out which mode XP uses
i suppose it may be different for other OS's
i wonder if we are allowed to alter that bit
Title: Re: how the cache work.
Post by: sinsi on June 05, 2009, 10:24:52 AM
Sorry...
I've got to agree with Bogdan, you are all over the place e.g.
- core duo/core2/i7 have separate code/data caches
- cache lines are 16 bytes, sometimes 2 lines are read to decode 1 instruction
- some sse+ instructions read an extra 2 lines *ahead* just in case
- the new colours (purple and grey) are worse than the previous (red and yellow)

Have you read the intel manual (3a, chapter 10 I think)? Heavy... :'(
Title: Re: how the cache work.
Post by: d0d0 on June 05, 2009, 01:50:18 PM
Bogdan and sinsi,

Please kindly educate us on how you think/understand the cache works. It's easy to criticize someone, considering he/she has taken the time to try to explain it.

Thanks
Title: Re: how the cache work.
Post by: NightWare on June 05, 2009, 11:12:39 PM
hi,

hmm, i must admit that, concerning the cache, most of my info comes from the "intel 64 and IA32 architectures optimization reference manual" 09/11/2007 (IA32_Opt.book), chapter 7 optimize cache usage, not too heavy here. it's a protected document so i'm not allowed to quote it, but i think it can be downloaded from the intel site, http://www.intel.com/Assets/PDF/manual/248966.pdf (hmm, new version, but it must certainly say the same thing...).

since i haven't read the technical documents from intel concerning the cpu architecture, i may perfectly well have misunderstood some concepts (it's possible, i code and i'm not very interested in how it works internally...). but from what i've read, it seems quite clear...

EDIT :
Quote from: sinsi on June 05, 2009, 10:24:52 AM
- core duo/core2/i7 have separate code/data caches
- cache lines are 16 bytes, sometimes 2 lines are read to decode 1 instruction
- some sse+ instructions read an extra 2 lines *ahead* just in case
- the new colours (purple and grey) are worse than the previous (red and yellow)
hmm, i re-read the doc, but
- each core has its own code cache (i don't think i've said the contrary...), but the data cache is shared (ch 2.1.4/5); i know, it's a bit different for quad core.
- a data cache line is 64 bytes (ch 2.3); 16 bytes (or 128 bits) is for the core memory cluster (not exactly the same thing).
- hmm, the constant attempt to pre-read ahead (other than the hardware 128 bit automatic data prefetching) is 256 bytes for l2 (ch 7.2), but maybe we're not speaking of the same thing here, or it's relative to the previous sentence...
- yeah, the colors were not very well chosen... i plead guilty here...  :bg

Title: Re: how the cache work.
Post by: hutch-- on June 06, 2009, 03:15:31 AM
While I do thank NightWare for having made the effort, I tend to agree with Bogdan here, and in my case it's for a couple of reasons. The first is that you can objectively test code combinations in real life conditions in real time; if you could not, then computers would be useless. The other is that the actual details of cache size and how it works differ from processor to processor.

I would suggest that it is more useful to address multiple levels of cache in the abstract: different sized buffers for both code and data at different access stages in a processor's operation. Intel, and I guess AMD, do in fact publish the data on cache design in part, but the sheer volume of detail at a true hardware level is well beyond the information in the technical users manuals and is related to computerised chip design, not user information.
Title: Re: how the cache work.
Post by: mitchi on June 06, 2009, 05:50:53 AM
A question here. Is there something about the cache that an average ASM programmer like me does not know and SHOULD know?
Title: Re: how the cache work.
Post by: sinsi on June 06, 2009, 06:52:01 AM
Has anyone played around with RDPMC? It would be interesting to get some of those numbers.
I know it's all cpu dependent but someone like jj (and myself) would have an interest.

NightWare - sorry, my numbers are wrong, yours seem to be right. It all depends on which part of the intel docs you read.
Stuff is scattered all over the place in 10 different pdfs  :( (I still don't know where '16' came from - the 486 maybe  :bdg).
Title: Re: how the cache work.
Post by: jj2007 on June 06, 2009, 09:08:51 AM
Quote from: sinsi on June 06, 2009, 06:52:01 AM
Has anyone played around with RDPMC?

Chokes. Olly says privileged instruction :(
Title: Re: how the cache work.
Post by: NightWare on June 07, 2009, 02:13:00 AM
Quote from: mitchi on June 06, 2009, 05:50:53 AM
A question here. Is there something about the cache that an average ASM programmer like me does not know and SHOULD know?

hmm, answer to the small game (anyway, no one wanted to play) : YES, it's certainly more interesting to avoid the part from Label3 to Label7 (by replacing the last mask value with ..0111111b), why ?
1. with 3 POSSIBLE iterations per loop, the benefit of the cache will be near zero, IF the code enters the cache (no guarantee it will, with 3).
2. the mispredictions of the jumps (Label4 and Label6, each ~50 cycles) will not be advantageously absorbed with just 3 iterations: 50/3 = 16.6 extra cycles per iteration (*2 for the 2 parts).
3. you also avoid the mispredictions for Label5 and Label7, a saving of ~100 cycles.
4. you will avoid some stalls.
5. the absorbed misprediction for the jump to Label8 will be more profitable, 50/63 = 0.79 extra cycles per iteration, instead of another 50/3.
6. as a bonus, the code is shorter by a reasonable amount of bytes.

so it should be more interesting, even in the 63-byte worst case (a sketch of the shortened version follows). a speed test will tell you the contrary, because in the case of a speed test the entire algo will be in yellow,
and the benefit of using "bigger" instructions will appear more advantageous. even if you don't believe me, you should think a bit about that.
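for reference, a hedged sketch of the shortened version described above (my reconstruction, with assumed label names, not NightWare's own code) : the oword and dword tails (Label3 to Label7) are dropped and the byte loop's mask becomes ..0111111b, so it sweeps the remaining 0..63 bytes.

ZeroMem_Sse2_Short PROC
      push ecx                                          ; save ecx
      mov ecx,eax                                       ; size to clear
      and ecx,11111111111111111111111111000000b         ; keep the 64-byte chunks
      jz Bytes                                          ; no full chunk : go to the byte loop
      pxor XMM0,XMM0                                    ; clear XMM0
Chunk: movdqa OWORD PTR [edi],XMM0                      ; store 0 at [edi]
      movdqa OWORD PTR [edi+16],XMM0                    ; store 0 at [edi+16]
      movdqa OWORD PTR [edi+32],XMM0                    ; store 0 at [edi+32]
      movdqa OWORD PTR [edi+48],XMM0                    ; store 0 at [edi+48]
      add edi,OWORD*4                                   ; add 64 to edi
      sub ecx,OWORD*4                                   ; subtract 64 from ecx
      jnz Chunk                                         ; while ecx is not 0, loop
Bytes: mov ecx,eax                                      ; size to clear
      and ecx,00000000000000000000000000111111b         ; remaining 0..63 bytes
      jz Done                                           ; nothing left : exit
      add edi,ecx                                       ; add ecx to edi
      neg ecx                                           ; negate ecx
BLoop: mov BYTE PTR [edi+ecx],0                         ; store 0 at [edi+ecx]
      inc ecx                                           ; add 1 to ecx
      jnz BLoop                                         ; while ecx is not 0, loop
Done:  sub edi,eax                                      ; restore the original address
      pop ecx                                           ; restore ecx
      ret
ZeroMem_Sse2_Short ENDP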
Title: Re: how the cache work.
Post by: jj2007 on June 07, 2009, 08:07:23 AM
Quote from: NightWare on June 07, 2009, 02:13:00 AM
a speed test will tell you the contrary, because in the case of a speed test the entire algo will be in yellow.

Yes, in a speed test with tens of thousands of loops, everything will be in the cache. In real life, however, we will lose a few hundred cycles (don't have the energy right now to translate that into nanoseconds) :(

Now the crucial question: In which real life situation would it matter to be a few hundred cycles slower, once? Can you give an example?
Title: Re: how the cache work.
Post by: MichaelW on June 07, 2009, 09:07:57 AM
There are bound to be complex algorithms that cannot be split into smaller pieces and may not fit entirely in the cache, and situations where such an algorithm would need to run a large number of times to accomplish some task.
Title: Re: how the cache work.
Post by: jj2007 on June 07, 2009, 02:27:44 PM
Sounds possible although a bit "constructed". Filling 32 or 64k of cache with the kind of code that is painted in yellow above requires 500...1000 inner loops ::)
Title: Re: how the cache work.
Post by: Mark Jones on June 07, 2009, 04:45:54 PM
Perhaps a dumb question, but can modern cache be disabled in the BIOS (like in the days of old)? If not, how about "filling it" with junk from a loop of 1,000, THEN run the algo under test once, rinse & repeat...

One thing is certain: the cache adds a very complex dynamic to the timing results.
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 04:47:42 PM
there is a CD (Cache Disable) control bit - i dunno if windows wants you to diddle with it or not
(same as the NW, Not Write-through bit)
these bits are in CR0
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 05:37:14 PM
apparently, it is safe to modify these bits
i guess the OS doesn't really know the difference, other than by reading the CR0 register
CR0 - bit 29 - NW - Not Write-through (1 = not write-through or write-back cache, 0 = write-through cache)
CR0 - bit 30 - CD Cache Disable (1 = cache disabled, 0 = cache enabled)

both bits are set to 1 at reset - the OS sets them both to 0

disable the cache:

        mov     eax,CR0
        or      eax,40000000h       ; set CD (bit 30)
        mov     CR0,eax

enable the cache:

        mov     eax,CR0
        and     eax,0BFFFFFFFh      ; clear CD (bit 30)
        mov     CR0,eax

might be a good idea to flush the cache before disabling it - REALTIME_PRIORITY_CLASS during the switch also
it would be interesting to see some timing comparisons on different processors (of both bits)
Jochen will be on this like a kid at christmas time
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 06:10:53 PM
nope - lol
XP rejected my attempt to disable the cache - let me play with NW
it may not allow me to play with CR0 at all for security reasons

EDIT
XP won't even let me read CR0 - hasta be a work-around someplace
Title: Re: how the cache work.
Post by: drizz on June 07, 2009, 06:42:28 PM
Quote from: dedndave on June 07, 2009, 06:10:53 PMXP won't even let me read CR0 - hasta be a work-around someplace
Ring0
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 06:45:03 PM
ty Drizz - i am reading now - it is do-able
Title: Re: how the cache work.
Post by: BlackVortex on June 07, 2009, 06:50:08 PM
One method I use to get as much info as possible for my own thread is with an exception handler. You get the full (?) thread context, even debug registers.
I'm kinda bored now, look here :
http://msdn.microsoft.com/en-us/library/ms679331(VS.85).aspx

Is the register you're looking for there ?
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 06:58:01 PM
well - that will tell me about the exception
it appears you need to enter ring 0, make changes in the msr, then leave ring 0
it can be done, but i think this may be a topic that the forum avoids
too bad, as it would be interesting to see some comparisons, especially with the NW bit toggled
but, temporarily disabling the cache would also be nice to test theories of cache operation
Title: Re: how the cache work.
Post by: NightWare on June 07, 2009, 10:09:13 PM
Quote from: jj2007 on June 07, 2009, 08:07:23 AM
Now the crucial question: In which real life situation would it matter to be a few hundred cycles slower, once? Can you give an example?
it's never executed just once : your app/program is a loop, nothing else. so it will ALWAYS call a function an enormous number of times during its use. the problem is : your app (and your app is also not the only one running...), in most cases, uses a lot of different algos to accomplish a task. and the cache is constantly updated, so it's not just a few hundred cycles you lose once, but a few hundred cycles you lose on every iteration (and you can multiply that when your main loop (your app) has several loop levels...).
Title: Re: how the cache work.
Post by: jj2007 on June 07, 2009, 10:44:55 PM
Quote from: dedndave on June 07, 2009, 06:58:01 PM
well - that will tell me about the exception
it appears you need to enter ring 0, make changes in the msr, then leave ring 0
it can be done, but i think this may be a topic that the forum avoids

You would need a device driver and a call gate to use ring 0 functions from userland. Feasible, apparently, but not my cup of tea either...
Title: Re: how the cache work.
Post by: dedndave on June 07, 2009, 10:47:22 PM
nah - i have some code that will do it i found on a RE forum site
in fact, it will do some other stuff i didn't want to do - lol
Title: Re: how the cache work.
Post by: BlackVortex on June 07, 2009, 11:36:20 PM
Quote from: dedndave on June 07, 2009, 10:47:22 PM
nah - i have some code that will do it i found on a RE forum site
in fact, it will do some other stuff i didn't want to do - lol

Link plz ?   :green

Or PM if you prefer.

EDIT: Thanks
Title: Re: how the cache work.
Post by: drizz on June 08, 2009, 12:06:05 AM
There is a very similar project (driver) coded by Opcode. Look for "iopl_module_732.zip" in:
http://ghirai.com/hutch/files/win32asmboard_code_arhive.tar.bz2
Title: Re: how the cache work.
Post by: NightWare on June 08, 2009, 02:56:23 AM
Quote from: jj2007 on June 07, 2009, 02:27:44 PM
Filling 32 or 64k of cache with the kind of code that is painted in yellow above requires 500...1000 inner loops ::)
why do that ? :lol, maybe you've planned to scroll the page afterwards...

Quote from: dedndave on June 07, 2009, 06:58:01 PM
temporarily disabling the cache would also be nice to test theories of cache operation
yep, but if you place yourself in the cpu's point of view, you can deduce a lot of things... for example, when you read from memory it's easy because memory is naturally indexed by its size, but when you put portions of this memory in the l2 cache there is no natural index anymore. yet the cpu NEEDS a way to quickly know whether a value is in the cache or not, so you can SEE how it could work (try to code an algo for the possible options and you will understand quickly...).
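to make that concrete, a tiny sketch (my illustration; the geometry is an assumption, 64-byte lines and 64 sets) : a set-associative cache can derive a "quick index" from an address with nothing but shifts and masks, exactly the kind of cheap operation meant above.

      mov eax,Address                    ; Address : an assumed dword holding the address to look up
      mov edx,eax
      shr eax,6                          ; drop the 6 offset bits inside a 64-byte line
      and eax,63                         ; the next 6 bits select one of 64 sets
      shr edx,12                         ; the rest is the tag, compared against the
                                         ; tags already stored in that set's ways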
Title: Re: how the cache work.
Post by: jj2007 on June 08, 2009, 06:59:50 AM
Quote from: NightWare on June 08, 2009, 02:56:23 AM
Quote from: jj2007 on June 07, 2009, 02:27:44 PM
Filling 32 or 64k of cache with the kind of code that is painted in yellow above requires 500...1000 inner loops ::)
why doing that ?

Because you assume that your 100-byte algo 1. is not in the cache and 2. is called many times, so every time you lose 100 cycles. Let's assume a worst case scenario:

- You have been granted 32k of virgin cache and a 20 ms time slice for your exclusive use.
- You have a very complex graphics routine of say, 32k.
- You call memfill to clear the screen. For the sake of the argument, do it three times to make sure it's in the cache.
- The complex algo could fill the cache entirely and thus evict the 100 bytes of memfill.
- However, running it once is not sufficient: The watchdog at the cache entry lets in only those with a frequent visitor card. The others pay a penalty :wink
- So you need to run first the 32k algo three times to make sure memfill is evicted.
- Then memfill comes as infrequent visitor and has to pay 100 cycles.

Now, in my naive understanding of things, 32k*3 means roughly 100,000 cycles, and 100 cycles (= 0.1%) more would not hurt me. But I have probably overlooked something - you are the expert. Seriously. I want to learn and understand this.
Title: Re: how the cache work.
Post by: NightWare on June 09, 2009, 09:16:03 PM
> You have been granted 32k of virgin cache and a 20 ms time slice for your exclusive use.
- technically, there is no "virgin" cache, and you have considerably less time per OS iteration, because the OS hands you control every ~15 ms (so this slice of time covers your app, the others, and the OS functionality).

> You have a very complex graphics routine of say, 32k.
- ok,

> You call memfill to clear the screen. For the sake of the argument, do it three times to make sure it's in the cache.
- not exactly. in my explanation i've reduced the info to the minimum (to make the principle understandable), but in reality, for just 3 calls the memfill algo will not be entirely in the cache, because there is an ORDER to the execution. yep, place yourself in the cpu's point of view : in your case YOU know where the loops are, NOT the cpu (in fact, there is a system to avoid mispredictions, but it only works when the jumps always go in the same direction, or when the direction always alternates.
it's why i prefer to consider every jump a misprediction, because in most cases it's impossible to respect this principle...), so (in the example) the part from Label2 to Label3 will be considered by the cpu as THE loop to store (before considering the entire function).

> The complex algo could fill the cache entirely and thus evict the 100 bytes of memfill.
- no and yes. no, the complex algo will not fill the cache; there are requirements for that, linked to loop status, execution order, frequency, size, etc... and 32k is clearly beyond the cache size.
example : for the librarian, if someone wants a trilogy but he only has 2 empty slots free, then it's useless/unproductive to store 2 books and make the trip for the last one.
it's the same for the cpu, for an additional reason : there is a COST to filling a large number of bytes. if you look at ch 2.1.4 (the dtlb part) you will see that there are locations reserved for large pages; in your opinion, why ?
and yes, it will probably disappear from the cache (here it's a question of logistics/logic).

> However, running it once is not sufficient: The watchdog at the cache entry lets in only those with a frequent visitor card. The others pay a penalty
- more or less. there are requirements to be in the cache; IF other algos don't enter the cache, it's because there is no reason to, so they don't pay a penalty (here, the penalty would have been the extra work to put the code in the cache); they are just executed normally.

> So you need to run first the 32k algo three times to make sure memfill is evicted.
- no, it depends on the requirements. IF your 32k algo has no loop, it will not fit in the cache, so something else is used. IF your 32k algo has loops, then those loops will be stored (in the order of execution).

> Then memfill comes as infrequent visitor and has to pay 100 cycles.
- it depends; here again it's a logistics problem. the memfill algo is condemned to disappear in ALL cases, it's just a question of time/frequency (the principle of the cache itself).
example : for the librarian, if a book is less used, the book will return to its "normal" place/location and a more used book will replace it. AND IF the book is used more often (again), the process will restart.

> you are the expert. Seriously. I want to learn and understand this.
- i'm not more of an expert than you are; the difference is i've just spent some time understanding how it could work, with speed optimization in mind, because the cpu can't use complex operations for that (by this i mean division or multiplication, anything other than register shifts), at THIS level.