I used a method that fills memory with a value using SSE instructions.
The code is below:
fmaszero proc pDest:DWORD, pLength:DWORD
xorps xmm0, xmm0 ; Fill memory as zero
mov edi, dword ptr [pDest] ; Put the address of buffer into EDI
mov ecx, dword ptr [pLength] ; Put the length of buffer into ECX
mov edx, ecx ; Copy ECX to EDX
shr ecx, 4d ; Divide by 16 (128-bit processing)
L_1:
movdqu oword ptr [edi], xmm0 ; Store zero data (null) at the address in EDI
add edi, 16d ; Increase the pointer
dec ecx ; Decrease ECX
jnz L_1 ; If not zero, keep looping
mov ecx, edx ; Reload count from EDX register
and ecx, 15d ; Remainder of the division by 16
xor al, al ; Fill AL as zero
rep stosb ; Fill memory
ret ; Return
fmaszero endp
I would like to get other people's opinions on this method.
In my opinion, it is the best method for speed.
What do you think?
1. Those unaligned writes (movdqu) are a performance killer. Align the pointer before starting the SSE loop (a rough sketch of one way to do that follows this list).
2. The benefits of SSE vs. memset() vary by buffer length. Is this really faster for the buffers you'll use it on?
3. The break-even length for SSE vs stosd varies by processor.
4. Have you tried temporal vs non-temporal writes in your environment?
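Something like this should do it (a rough, untested sketch; the fmaszeroA name is just for the example, and it assumes the block is at least 16 bytes): zero the stray head bytes with stosb until EDI reaches a 16-byte boundary, then switch to movdqa for the body.
fmaszeroA proc pDest:DWORD, pLength:DWORD   ; example name only
    xorps   xmm0, xmm0                ; 128 bits of zero
    xor     eax, eax                  ; AL = 0 for the byte fills
    mov     edi, dword ptr [pDest]
    mov     ecx, dword ptr [pLength]
    mov     edx, edi
    neg     edx
    and     edx, 15                   ; bytes up to the next 16-byte boundary
    jz      A_body
    sub     ecx, edx                  ; take the head bytes off the total
    push    ecx
    mov     ecx, edx
    rep     stosb                     ; zero the unaligned head
    pop     ecx
A_body:
    mov     edx, ecx
    shr     ecx, 4                    ; number of aligned 16-byte stores
    jz      A_tail
A_loop:
    movdqa  oword ptr [edi], xmm0     ; aligned 128-bit store
    add     edi, 16
    dec     ecx
    jnz     A_loop
A_tail:
    mov     ecx, edx
    and     ecx, 15                   ; 0..15 leftover bytes
    rep     stosb
    ret
fmaszeroA endp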
Stephanos,
- If the memory block is 16 bytes or more it works fine.
- If the memory block is less than 16 bytes, it overwrites other memory.
Sorry, I wrote this code at school and I didn't test it, so it has some problems.
In that case, what is the fastest method?
Stephanos,
Don't be sorry, it's a good idea. :U I was just pointing out a problem I saw.
I'll leave the "what is the fastest method" for someone else. It's debatable and varies with the processor.
I tried a speed test with this board's 'ZeroMemory Speed Test' kit.
It shows that my routine is quite slow: Microsoft's version scores about 380, but mine scores about 700.
MOVDQU really is a speed killer, so I tried MOVDQA.
With MOVDQA it scores about 290... (the fastest).
Hmm... so MOVDQU is not good. From now on I may use MOVDQA.
Stephanos,
Aligned is almost always faster, so it's worth the effort to align the data so you can use the faster instruction. If you can organise it, use a non-temporal write, as it is not slowed down by the cache.
I vaguely remember that a 64-bit MMX fill is still faster than a 128-bit SSE version, so it may be worth having a look at that as well.
Hmm. I tried filling memory with MMX instructions, as below.
fmaszero proc pDest:DWORD, pLength:DWORD
emms
; xorps xmm0, xmm0 ; Fill memory as zero
pxor mm0, mm0 ; Fill MM0 with zero (the MMX counterpart of the line above)
mov edi, dword ptr [pDest] ; Put the address of buffer into EDI
mov ecx, dword ptr [pLength] ; Put the length of buffer into ECX
mov edx, ecx ; Copy ECX to EDX
shr ecx, 3d ; Divide by 8 (64-bit processing)
L_1:
; movdqa oword ptr [edi], xmm0 ; Move zero data(null) to EDI's address
movq qword ptr [edi], mm0
; add edi, 16d ; Increase the pointer
add edi, 8d
dec ecx ; Decrease ECX
jnz L_1 ; If not zero, keep looping
mov ecx, edx ; Reload count from EDX register
; and ecx, 15d ; Divide result of 16d
and ecx, 7d
xor al, al ; Fill AL as zero
rep stosb ; Fill memory
ret ; Return
fmaszero endp
It took about 380, so I think it is slower than the SSE2 version.
How can I write an MMX version that has the best possible speed?
Quote from: hutch-- on January 11, 2008, 10:26:46 PM
I vaguely remember that a 64-bit MMX fill is still faster than a 128-bit SSE version, so it may be worth having a look at that as well.
Maybe on some CPUs, but it's about the same on my P3 (330MHz/256KB) and definitely slower on my P4 (2.4GHz/512KB).
I've posted a fast SSE zeromem algo here:
http://www.masm32.com/board/index.php?topic=7458.0
and I'm quite sure an MMX algo can't beat it :toothy
Have a look at the "movntq" instruction for fast 64-bit fills. The action is in the NT part of the instruction: non-temporal means it does not write back through the cache.
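One possible shape for it (untested sketch; assumes EDI holds the destination and ECX a byte count that is a multiple of 8):
    pxor    mm0, mm0                 ; 64-bit zero
NTQ_loop:
    movntq  qword ptr [edi], mm0     ; streaming store, does not go through the cache
    add     edi, 8
    sub     ecx, 8
    jnz     NTQ_loop
    sfence                           ; make the streaming stores visible before reuse
    emms                             ; leave the MMX state clean for the FPU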
:P OK, I'll take a look at this instruction, and I'll run a speed test (even if I'm quite sure of the result). I'll report the result...
??? I was quite sure of the result, but I didn't expect this result!!!
Speed test results for the different macros:
RtlZeroMemory routine, completed in 181 cycles
eax = 0 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4284248
Routine 1, completed in 282 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 2, completed in 136 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 3, completed in 77 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 4, completed in 255 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 5, completed in 142 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 6, completed in 80 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 7, completed in 556 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Routine 8, completed in 3 cycles
eax = 1024 ebx = 197016916/ ecx = 1991541844
edx = 1997672244 esi = 4278576 edi = 4276224
Press ENTER to quit...
For info:
routine 0 = rtlZeroMemory (the one from ntdll.dll)
routine 1 = my ALU zeromem
routine 2 = my MMX zeromem
routine 3 = my SSE zeromem
routine 4 = my unaligned ALU zeromem
routine 5 = my unaligned MMX zeromem
routine 6 = my unaligned SSE zeromem
routine 7 = unaligned MMX movntq zeromem
routine 8 = empty
For this test I used exactly the same instructions for routine 5 and routine 7, except that I used movntq instead of movq...
NightWare, your algorithm's processing speed is pretty good! Hmm... actually, the best solution really comes down to the algorithm...
NightWare-
How about including rep stosd as a baseline?
Depending on how much memory has to be zeroed (will this memory be in the cache or not, will it be larger than the L2 cache, e.g. 2 MB), the best choice using SSE(2) will be either movntdq (movntps, movntpd) or movdqa (movaps, movapd).
Quote from: Jimg on January 12, 2008, 04:18:34 PM
How about including rep stosd as a baseline?
Jimg,
That's why RtlZeroMemory is in the test (it's a rep stosd/stosb algo).
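For reference, the whole of such a baseline is only a few lines (sketch, using the same pDest/pLength parameters as the routines above):
    mov     edi, dword ptr [pDest]
    mov     ecx, dword ptr [pLength]
    xor     eax, eax                 ; value to store
    mov     edx, ecx
    shr     ecx, 2                   ; dword count
    rep     stosd                    ; zero 4 bytes at a time
    mov     ecx, edx
    and     ecx, 3                   ; 0..3 leftover bytes
    rep     stosb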
Well, to state the obvious then..... SSE implementations of zero-memory fills
are significantly faster than anything else, by a ratio of roughly 2:1. :cheekygreen:
That's quite handy to know..... I hope lots more coders will jump on the SSE bandwagon now.
Quote from: Draakie on January 14, 2008, 05:49:12 AM
Well, to state the obvious then..... SSE implementations of zero-memory fills
are significantly faster than anything else, by a ratio of roughly 2:1. :cheekygreen:
Draakie,
Er... in fact that's the case for my code... but maybe someone is able to produce a faster algo (I've sped up too many algos to think I now have the final solution), so you can't say SSE is 2 times faster than anything else... (you can think it, like I do, but don't say it... :wink)
Hutch, asmfan,
After a few more tests (fill/copy/xchg/...) and some docs reading, it seems movntq (in fact, every movntxx instruction) dramatically slows down algos if you need to re-use the data quickly. So optimisations where you use an instruction several times with the same data to increase the speed of the algo are not welcome with those instructions.
Besides, those instructions are not made to speed things up (at least not in the way you think...), but to avoid cache pollution... the NT part of the instruction excludes the store from the cache handling (for when you just need to store data from time to time, and the data isn't recurrently used through the cache). So those instructions are ALWAYS slower. :dance:
Hutch, misinforming is a crime... you are a very, very old coder :toothy, so I will not ban you this time (besides, banning an administrator can generate problems... :lol), but be careful next time....
Now seriously, I haven't wasted my time, because it will certainly be interesting for me to use those instructions later...
Actually, asmfan and hutch-- are correct. With large buffer sizes MOVNT will overtake the plain SSE mov. As an example, I took your code, and then:
1) I used GlobalAlloc() to create an 8 MB buffer, which is bigger than my L2 cache size.
2) I copied your routine and replaced all occurrences of MOVAPS with MOVNTDQ.
Sse_ZeroMem_UnAligned - 10381740
Sse_ZeroMem_UnAligned_NT - 5034231
I ran it on my Core 2 Duo. The speedup at 8MB is a bit over 2x.
The NT instructions will speed up your code if you have something larger than your L2 cache size.
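The changed inner loop boils down to something like this (sketch only; assumes a 16-byte-aligned destination in EDI and a byte count in ECX that is a multiple of 64):
    xorps   xmm0, xmm0
NT_fill:
    movntdq oword ptr [edi], xmm0    ; streaming stores bypass the cache
    movntdq oword ptr [edi+16], xmm0
    movntdq oword ptr [edi+32], xmm0
    movntdq oword ptr [edi+48], xmm0
    add     edi, 64
    sub     ecx, 64
    jnz     NT_fill
    sfence                           ; flush the write-combining buffers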
Hmm... I didn't test it with large buffers; you've given me something to think about... I was pretty sure the movntxx instructions were made to exclude memory movements for low-priority operations, and the docs I read pointed in that direction, not exactly what you explained... besides, the results obtained with a few KB show exactly the contrary...
I didn't expect a difference like that... with a screen of approximately 3 MB, this instruction becomes interesting... very interesting...
I'm going to run some tests... thanks for the info :U
I have a project designed for speed... I'm going to test the instructions in this project... the fps will tell me.
And I'll report the results.
Edit:
Test on a Core 2 Duo T7300, 2*2 GHz, 4 MB L2, under Vista.
For the test, I decided not to run a classic speed test, because the cache could distort the results.
I opted for a WIP 3D engine (no DirectX/OpenGL here). I removed most of the objects but kept a few of them (approx. 25000 polygons) to be sure the cache is used (at least partially) by the main process; filled faces are also removed (WIP), and it's a 1152*720*32 window, approx. 3 MB.
To balance things, I multiplied the following procs by 5 (and alternated them, because they are a bit different), at the beginning of the main loop:
ZeroMem (to Fill the Screen)
FillZBuffer (with fp value)
(I launched a small tool (active and different each time) + compiled + launched the app) * 3
Results (here *5, 5*3*2 = 30 MB):
movaps : 45 to 52, 45 to 52, 45 to 51 fps (~48.33)
movntps : 60 to 74, 61 to 74, 60 to 74 fps (~67.16) +19
Without multiplying the procs (3*2 = 6 MB):
movaps : 103 to 139 fps (~121)
movntps : 121 to 190 fps (~155.5) +34
If we keep in mind that there are some math calculations in there, movntps appears at first sight considerably faster... and not in a speed test, in real use...
But after that, I tried to change some of the movaps used in the math calculations. I started with the 3 owords of the global rotation matrix, and just those 3 owords made me lose 5 fps (on the last test)... so I tried to see whether more changes could slow the app down further. With filled faces (where the Z values are used to decide whether the changes have to be made, and to make them if needed...) it could slow the entire process down terribly.
So instead of only multiplying the procs, I added "zbuffer to screen" and "screen to zbuffer" procs (no NT here).
Results (here *2, 2*3*4 = 24 MB):
movaps : 46 to 54, 47 to 54, 47 to 54 fps (~50.33)
movntps : 52 to 61, 52 to 60, 52 to 60 fps (~56.16) +6
Hmm.., it seems slower; storing data quickly is nice, but not if you then have to wait to use it... However it's still faster, so I tried to increase the NT usage: same as the previous test, but this time with NT in the (z2s and s2z) copy procs.
Results (here *2, 2*3*4 = 24 MB):
movaps : 50 to 60, 51 to 61, 51 to 61 fps (~55.66)
movntps : 57 to 65, 56 to 65, 57 to 64 fps (~60.66) +5
Movntxx can't be used all the time.., the cache waits until the NT movements are done. When that's the case, it seems movntxx speeds up the process for large areas, but multiplying the NT use, with accesses right behind it, slows the process down. I continued by activating NT in just one of the 2 added procs.
Results (here *1, 3*4 = 12 MB):
movaps z2s s2zNT : 74 to 93 fps
movaps z2sNT s2z : 72 to 96 fps
Here I don't understand: I used/drew polygons (without using the zbuffer), so test 1 should be faster than test 2... and it's not... quite mysterious...
For these tests, the 1st value is the fps when I've moved to see all the polygons on the screen, and the 2nd value is when there is nothing (no background, except 0 or the corresponding colour of the ZBuffer fp value).
Conclusion:
When I saw the first results I was very interested, but ZeroMem will be removed later anyway (it's only here temporarily, to see what's being done), or maybe kept for the sky. FillZBuffer will stay, but with the slowdown when modifying the data afterwards, it will have to be tested. I can't even use it to fill faces, because there is lighting afterwards, and later some filters...
Movntps is faster in all the cases tested here (zeromem/fillmem/copymem), and the gain probably increases as the length of the area increases. So the movntxx instructions really are faster on large memory areas, but they have to be used carefully... here I've intentionally tested them on large memory areas. The problem is knowing when to use them or not, and when they become faster than movaps... just to see, I reproduced the first test, but on a 320*200*32 window, approx. 0.25 MB.
Results (here *5, 5*0.25*2 = 2.5 MB):
movaps : 770 to 3340 fps (~2055) + 1212
movntps : 535 to 1150 fps (~842.5)
Mark, thanks for pointing this out; it helped me understand those instructions better. Now I'm going to see if the SSE2 movnti instruction does the same thing for the 32-bit registers...
I shouldn't have said "those instructions are ALWAYS slower"; that was misinformation.
We need to understand the most basic of facts:
The non-temporal writes are NOT faster. Never have been. Never will be.
They potentially make *other* memory accesses faster by avoiding cache misses at those sites. The cost of an L2 cache miss is typically between 30 and 100 CPU cycles, and the NT moves are used to avoid them elsewhere by avoiding the purge of data soon to be needed (the "working set").
Cache contention must be a factor. Your working set plus the buffer being written must be larger than the L2 cache size, or else you are better off leveraging the cache. On systems with a 512K L2 cache, expect no gains when your working set is 64K and the buffer you are filling is 256K. Expect disappointment and potential slowdowns with the NT moves here. The best performance to be expected from the non-temporal writes in terms of bandwidth is fairly obvious: it's (Bus Speed * Bus Width). This is considerably less than the bandwidth of L1 or L2 memory, which could easily have 4 to 32 times as much.
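To put rough, purely illustrative numbers on that: an 800 MHz front-side bus that is 64 bits (8 bytes) wide tops out around 800,000,000 * 8 = 6.4 GB/s, which is well short of what the L1 and L2 caches on the same machine can stream.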
Quote from: Rockoon on February 21, 2008, 07:48:38 AM
We need to understand the most basic of facts:
The non-temporal writes are NOT faster. Never have been. Never will be.
They potentially make *other* memory accesses faster by avoiding cache misses at those sites. The cost of an L2 cache miss is typically between 30 and 100 CPU cycles, and the NT moves are used to avoid them elsewhere by avoiding the purge of data soon to be needed (the "working set").
Cache contention must be a factor. Your working set plus the buffer being written must be larger than the L2 cache size, or else you are better off leveraging the cache. On systems with a 512K L2 cache, expect no gains when your working set is 64K and the buffer you are filling is 256K. Expect disappointment and potential slowdowns with the NT moves here. The best performance to be expected from the non-temporal writes in terms of bandwidth is fairly obvious: it's (Bus Speed * Bus Width). This is considerably less than the bandwidth of L1 or L2 memory, which could easily have 4 to 32 times as much.
You are correct. The actual MOVNT instruction itself runs slow, since it writes directly to memory.
Which reminds me of a trick to speed up your code. If you have to do multiple operations on your data (like in a graphics engine), it is faster to break your data up into chunks that fit into the L1 cache. I am bringing this up since NightWare was talking about a graphics engine. You really want to keep it in the L1 cache if at all possible. L1 cache latency on AMD is 3 cycles and 2 cycles for Intel. The L1 cache is 8 KB on Intel and 64 KB on AMD.
There are two additional tricks you can use to speed up ZeroMem that will also work for the data you are dealing with in a graphics engine: prefetch instructions and TLB priming. TLB priming only helps if you have data larger than the page size of the OS you are running. Prefetch helps if you have more data than your cache size. On Intel processors the effective cache line size has been bigger than 64 bytes ever since the P4. There are multiple prefetch instructions that all do something different. Play around with them. If you search for "prefetch", you should be able to find it being discussed.
I posted an example of doing TLB priming in this thread, in case anyone is curious :)
http://www.masm32.com/board/index.php?topic=6576.msg63693#msg63693
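Roughly, the idea looks like this (a sketch only, not the code from that thread; assumes a 16-byte-aligned EDI and a byte count in ECX that is a multiple of 4096): touch the next 4 KB page with a dummy read so its page-table entry is already in the TLB before the stores reach it.
    xorps   xmm0, xmm0
PageLoop:
    mov     eax, dword ptr [edi+4096] ; TLB priming: dummy read from the next page
                                      ; (note: on the last pass this reads one page
                                      ; past the buffer; a real routine would skip it)
    mov     edx, 4096/16              ; 16-byte stores per page
FillPage:
    movdqa  oword ptr [edi], xmm0
    add     edi, 16
    dec     edx
    jnz     FillPage
    sub     ecx, 4096
    jnz     PageLoop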
Quote from: Mark_Larson on February 21, 2008, 12:06:25 PM
L1 cache latency on AMD is 3 cycles and 2 cycles for Intel. The L1 cache is 8 KB on Intel and 64 KB on AMD.
It should also be pointed out that AMD's L1 caches are 2-way set associative, whereas Intel's are 4-way or 8-way.
The # of ways is basically an indication of how many pointers your working set is allowed to work with before DIRECT set contentions become a possibility, and is unrelated to how much bandwidth you are trying to consume.
Here, DIRECT contention means that it is possible for the pointers to be separated such that they share the same cache set.
In 2-way set caches, it is not possible for a 2 pointer algorithm to produce set contention, but 3 pointer algorithms will suffer if all 3 map to the same set.
In 8-way set caches, it is not possible for an 8 pointer (or less) algorithm to produce set contention.
The tradeoff is in the size of the caches, the cost of misses, and so forth. 2-way caches as implemented in hardware are typically much larger than their 4-way and 8-way brothers, so they have fewer misses in general, but they have more significant worst-case behavior in specific circumstances.
On another forum (VB related) I had asked the resident experts to run a test application I had cooked up which measured the performance of various data strides in a 6-pointer algorithm. The algorithm itself was simply the vertical component of a 5x5 Gaussian blur convolution which I had noticed performed slower on 1024x768 images than it did on 1280x1024 images: 5 vertical pointers into the source image + 1 pointer into an output image.
AMD CPUs (like mine) started feeling the pain of set contention much earlier than Intel CPUs at power-of-two strides between the pointers, and in fact that 1024-pixel (4096-byte) stride was right where it started becoming really pronounced, with an L1 cache miss every ~4 pixels.
The Core2 CPUs performed the best overall when normalized on CPU speed, followed by AMD64s, and then Pentium Ms. Surprisingly, P3s performed better on this normalized scale than the P4s.
My solution to avoiding this "problem" was to turn the 5+1 pointer 1-pass algorithm into a 1+1 pointer 5-pass algorithm. I know that this solution is sub-optimal, but it's better than the original while still using a more or less straightforward algorithm.
Over several days I've played a bit with prefetch (nta, t0, t1, t2), and the results (3*2 = 6 MB):
movaps : 103 to 139 fps (~121)
movaps (prefetchnta) : 103 to 139 fps (~121) =
movaps (prefetcht0/1/2) : 103 to 139 fps (~121) =
movntps : 121 to 190 fps (~155.5) +34
I've tried it several times and read the docs to be sure I did it correctly (nothing in Agner Fog's doc, but it's well explained in Intel's P4 & Xeon optimisation manual), but it seems there is no effect. Of course that doesn't mean there is no effect; maybe the speed-up obtained is balanced by a slowdown in another algo, due to cache modification. Besides, I'm not sure the automatic hardware data prefetcher implemented on P4+ isn't being used in every case here (the data are used/displayed at the end of the loop and re-used quickly at the beginning, so there's not enough time for the prefetch).
The problem here (for zeromem and fillmem algos) is also that you can't place the prefetch instruction where it really should be. When you calculate the PSD (prefetch scheduling distance), the distance creates a problem in an algo with a loop. Still, it's very interesting to know how it works, because for textured face filling it's entirely possible to calculate and use a PSD and place the prefetch instructions where they're needed.
So concerning zeromem and fillmem (in this case), the movnt approach is far better: you clear the screen and the zbuffer with NT at the beginning of the main loop, and while you recalculate the positions of all your points, it's done and becomes accessible for the rest of the code... besides, no need to calculate a PSD here.
Something I should have mentioned with multi-core processors: it's worth trying multiple threads that each zero the memory in separate blocks. On a single-processor machine this would be much slower, as the thread overhead would kill it, but if the thread overhead considerations can be overcome you may get the advantages of parallelism when it's done on a multi-processor machine.
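A rough outline of the idea (untested Win32 sketch, error handling omitted; ZeroMemMT, ZeroHalf and the two globals are invented for the example, and it assumes the usual MASM32 includes and a length that is a multiple of 4): cut the block in two, let a worker thread zero the upper half with rep stosd while the calling thread zeroes the lower half, then wait for the worker.
.data?
MtUpperPtr dd ?                      ; start of the upper half (for the worker)
MtUpperLen dd ?                      ; its length in bytes
.code
ZeroHalf proc lpParam:DWORD          ; worker thread: zeroes the upper half
    mov     edi, MtUpperPtr
    mov     ecx, MtUpperLen
    shr     ecx, 2
    xor     eax, eax
    rep     stosd
    ret                              ; thread exit code = 0 (EAX)
ZeroHalf endp
ZeroMemMT proc pDest:DWORD, pLength:DWORD
    LOCAL hThread:DWORD, lowLen:DWORD
    mov     eax, pLength
    shr     eax, 1
    and     eax, -16                 ; lower-half length, rounded down to 16 bytes
    mov     lowLen, eax
    mov     edx, pDest
    add     edx, eax
    mov     MtUpperPtr, edx          ; the worker gets everything above the split
    mov     ecx, pLength
    sub     ecx, eax
    mov     MtUpperLen, ecx
    invoke  CreateThread, NULL, 0, OFFSET ZeroHalf, NULL, 0, NULL
    mov     hThread, eax
    mov     edi, pDest               ; meanwhile, zero the lower half here
    mov     ecx, lowLen
    shr     ecx, 2
    xor     eax, eax
    rep     stosd
    invoke  WaitForSingleObject, hThread, INFINITE
    invoke  CloseHandle, hThread
    ret
ZeroMemMT endp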
Why don't you try using h/w prefetch rather than s/w? I think it's more efficient for cache movement. What do the pros think of it?
Quote from: asmfan on March 02, 2008, 07:48:34 AM
Why don't you try using h/w prefetch rather than s/w? I think it's more efficient for cache movement. What do the pros think of it?
I don't understand what you are saying. The h/w prefetcher is always used. And you can't use it programmatically like you can with the "prefetch" instruction.
NightWare,
Quote from: NightWare on March 01, 2008, 10:05:29 PM
Over several days I've played a bit with prefetch (nta, t0, t1, t2), and the results (3*2 = 6 MB):
movaps : 103 to 139 fps (~121)
movaps (prefetchnta) : 103 to 139 fps (~121) =
movaps (prefetcht0/1/2) : 103 to 139 fps (~121) =
movntps : 121 to 190 fps (~155.5) +34
I've tried it several times and read the docs to be sure I did it correctly (nothing in Agner Fog's doc, but it's well explained in Intel's P4 & Xeon optimisation manual), but it seems there is no effect. Of course that doesn't mean there is no effect; maybe the speed-up obtained is balanced by a slowdown in another algo, due to cache modification. Besides, I'm not sure the automatic hardware data prefetcher implemented on P4+ isn't being used in every case here (the data are used/displayed at the end of the loop and re-used quickly at the beginning, so there's not enough time for the prefetch).
The problem here (for zeromem and fillmem algos) is also that you can't place the prefetch instruction where it really should be. When you calculate the PSD (prefetch scheduling distance), the distance creates a problem in an algo with a loop. Still, it's very interesting to know how it works, because for textured face filling it's entirely possible to calculate and use a PSD and place the prefetch instructions where they're needed.
So concerning zeromem and fillmem (in this case), the movnt approach is far better: you clear the screen and the zbuffer with NT at the beginning of the main loop, and while you recalculate the positions of all your points, it's done and becomes accessible for the rest of the code... besides, no need to calculate a PSD here.
Actually, I used to get anywhere from a 10% to a 30% speedup from using prefetch. However, on my Core 2 Duo I am not getting any speedup either. So maybe the h/w prefetcher does such a good job that you don't get any speed-up? Do you have a Core 2 Duo, NightWare?
And just in case you aren't aware, the prefetch instruction on Intel processors fetches 2 cache lines (128 bytes).
On my P4, getting it to give you a speedup was a pain in the butt. So I wrote a program that would find the optimum prefetch distance using different locations in the loop and different offsets into memory. The best solution was always very weird and unexpected. I'll see if I can dig up the code.
Have a look at -
AMD_block_prefetch_paper.pdf (http://cdrom.amd.com/devconn/events/AMD_block_prefetch_paper.pdf) (c)2001 Advanced Micro Devices Inc.
Quote: Using Block Prefetch for Optimized Memory Performance, by Advanced Micro Devices
Mike Wall
Member of Technical Staff
Developer Performance Team
And you'll find out what h/w prefetch I'm talking about.
Also, a second interesting file is here (http://cdrom.amd.com/devconn/events/)
Quote from: asmfan on March 02, 2008, 02:44:56 PM
Have a look at - AMD_block_prefetch_paper.pdf (http://cdrom.amd.com/devconn/events/AMD_block_prefetch_paper.pdf) (c)2001 Advanced Micro Devices Inc.
Quote: Using Block Prefetch for Optimized Memory Performance, by Advanced Micro Devices
Mike Wall
Member of Technical Staff
Developer Performance Team
And you'll find out what h/w prefetch I'm talking about.
Also, a second interesting file is here (http://cdrom.amd.com/devconn/events/)
I was aware. Intel works the same way. What I meant was that there is no instruction you can use that was specifically designed to do h/w prefetching. You read ahead a cache line in advance to force the data into the cache.
EDIT: On Intel processors I get better performance from using a prefetchnta than a MOV. But I have to use my program to find the optimum way to use it in the loop.
EDIT2: I found my program, let me play with it a bit and see what it says. I'll post it later. It is written in C.
Quote from: Mark_Larson on March 02, 2008, 01:45:59 PM
Do you have a Core 2 Duo, NightWare?
Yep, but it doesn't matter. :lol There is a constant in all the docs I've read, and I think we've completely forgotten the basics here... :red With prefetch we read data (and then optionally alter it, and write it back).
But here, for a zeromem or fillmem algo, there is no need to READ and alter... we just need to write a value we already have... so PSD/prefetch/etc. are useless here... no? :lol
Quote from: NightWare on March 03, 2008, 03:15:15 AM
Quote from: Mark_Larson on March 02, 2008, 01:45:59 PM
Do you have a Core 2 Duo, NightWare?
Yep, but it doesn't matter. :lol There is a constant in all the docs I've read, and I think we've completely forgotten the basics here... :red With prefetch we read data (and then optionally alter it, and write it back).
But here, for a zeromem or fillmem algo, there is no need to READ and alter... we just need to write a value we already have... so PSD/prefetch/etc. are useless here... no? :lol
Actually, if you want the algorithm to run as fast as possible, the data you want to write to memory needs to be in the cache. And the way you ensure that is with h/w or s/w prefetch.
Quote from: Mark_Larson on March 03, 2008, 04:56:47 PM
the data you want to write to memory needs to be in the cache
It's in the cache; you put it there when you used xor reg,reg or mov reg,imm. So why do you want to copy the block you're going to clear into the cache? That's useless work you're asking of the CPU... :wink
You need to look at the code again. The code is actually pre-reading the NEXT cache line while you are working on the current cache line's data. That way you are guaranteed that, when you start processing the data, it is already in the cache. TLB priming works the same way: you load the next page before you actually need it.
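The pattern, roughly (a sketch, not the exact code from that thread; assumes 16-byte-aligned ESI/EDI and a byte count in ECX that is a multiple of 32):
CopyLoop:
    prefetchnta byte ptr [esi+128]   ; ask for data a couple of cache lines ahead
                                     ; (prefetch is only a hint and never faults,
                                     ; so running past the end here is harmless)
    movdqa  xmm0, oword ptr [esi]
    movdqa  xmm1, oword ptr [esi+16]
    movdqa  oword ptr [edi], xmm0
    movdqa  oword ptr [edi+16], xmm1
    add     esi, 32
    add     edi, 32
    sub     ecx, 32
    jnz     CopyLoop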
Quote from: Mark_Larson on March 04, 2008, 03:22:26 PM
You load the next page before you actually need it.
::) And is there a reason/case where you need that, for a zeromem or fillmem algo?
Quote from: NightWare on March 06, 2008, 02:27:17 AM
Quote from: Mark_Larson on March 04, 2008, 03:22:26 PM
You load the next page before you actually need it.
::) And is there a reason/case where you need that, for a zeromem or fillmem algo?
Yes, so you don't get cache misses on the data. By doing the prefetching you guarantee the data is always in the cache.
As I said somewhere, by dynamically determining the cache line size with CPUID we can make things faster.
I posted a CPUID program on the FASM forum... wait a minute, I'll find it.
Here it is (http://board.flatassembler.net/topic.php?t=7171)
By using the manuals (Intel/AMD, whatever) we can easily find the CPUID input value to determine the cache line size (80000006h -> ECX [0:7]).
I believe that's the way to smart programs, which use CPUID first and then deliver maximum performance according to the CPU's abilities.
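In code that check looks roughly like this (untested sketch; the GetCacheLineSize name is just for the example):
GetCacheLineSize proc                 ; returns the line size in bytes in EAX
    push    ebx                       ; CPUID clobbers EBX
    mov     eax, 80000000h
    cpuid
    cmp     eax, 80000006h            ; is the extended function supported?
    jb      GCL_default
    mov     eax, 80000006h
    cpuid
    movzx   eax, cl                   ; ECX bits 0..7 = cache line size in bytes
    pop     ebx
    ret
GCL_default:
    mov     eax, 64                   ; reasonable fallback on modern CPUs
    pop     ebx
    ret
GetCacheLineSize endp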
Quote from: hutch-- on March 01, 2008, 10:41:34 PM
Something I should have mentioned with multi-core processors: it's worth trying multiple threads that each zero the memory in separate blocks. On a single-processor machine this would be much slower, as the thread overhead would kill it, but if the thread overhead considerations can be overcome you may get the advantages of parallelism when it's done on a multi-processor machine.
A synchronization variable can cause cache pollution, which you don't want because it stops the cache working optimally for the memory fill. An Intel paper I read on this clearly says the variable must sit alone on a 128-byte-aligned cache line; mixed with other data it can cause cache pollution, and having to write out 128 bytes now and then just for that will also cut into the bandwidth needed for the fill. Cache pollution here means a loop of the cache doing write-back to memory and read-back, and performance drops radically.
With non-synchronized threads doing the fills, I have no idea how much wasted zeroing will be caused when one thread's memory addresses get too far away from the other's.
And if you have already reached the memory-speed bottleneck with one thread, what's the point?
But if you have an idea for how to fill memory with multiple threads that could work, you could theoretically get a 2x speedup on a dual-core until memory bandwidth bottlenecks it; let's code it.
Quote from: asmfan on March 06, 2008, 04:03:42 PM
As I said somewhere, by dynamically determining the cache line size with CPUID we can make things faster.
I posted a CPUID program on the FASM forum... wait a minute, I'll find it.
Here it is (http://board.flatassembler.net/topic.php?t=7171)
By using the manuals (Intel/AMD, whatever) we can easily find the CPUID input value to determine the cache line size (80000006h -> ECX [0:7]).
I believe that's the way to smart programs, which use CPUID first and then deliver maximum performance according to the CPU's abilities.
Thanks, nice info. Do you also know how to use CPUID to find out how much texture data I could keep in the cache, i.e. to read the cache's size?
Quote from: daydreamer on March 07, 2008, 08:23:44 AM
Thanks, nice info. Do you also know how to use CPUID to find out how much texture data I could keep in the cache, i.e. to read the cache's size?
I think this question is about the GPU and video card rather than the CPU, but if you do your own processing you can determine how much L1/L2/L3 cache is available and how many cores and threads per core are supported. Dividing the total L2 cache into chunks (texture sizes), you get the total number of textures that fit in the L2 cache. Passing these chunks to an appropriate number of threads will give a performance boost.
Take a look at
the AMD manuals (http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00.html)
[AMD CPUID Specification] 25481.pdf (http://developer.amd.com/specifications.jsp)
[Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A - Instruction Set Reference, A-M] 253666.pdf (http://www.intel.com/products/processor/manuals/index.htm)
Some fields of CPUID are available on both chips (Intel & AMD) but others are vendor-specific. For compatibility, compare the documentation from the different vendors.
From my tests with Memory Filling (or zeroing) I've found the following to be pretty much consistently true:
0 - 4096 bytes (Use RISC moves, 4x unrolled with a negative loop / index value)
4096 bytes - 32kb (Use SSE+TLB Priming+8x unroll)
32kb-3Mb (Use REP STOSD - This is probably dependent on L2 cache size?)
3Mb+ (Use an MMX or SSE non temporal store with movntdq or movntq)
I've tried it on 5 or 6 different processor types and those ranges seem to be pretty close on all of them.
Johnsa,
After your post I tested an old rep stosd/stosb algo (old because it's slower in a speed test), and the result is just a bit under movntps (-4 fps; besides, here the job is done entirely). Thank you for the info :U
I'm going to perform the same tests with mem-copying and try to determine which algo's work best for what size data.
It seems that the best solution for a generic mem-fill or mem-copy would be to determine the data size up front and use whichever of the 4 possible solutions fits best.
It's quite interesting: for example, if you were using the memory fill to clear your screen buffer, the resolution you are running at would determine which approach is better.
(1024*768*32-bit) = 3 MB, which is just on the boundary between REP STOSD and an MMX/SSE NT write. So 1280x1024 would be best filled using NT, and 800x600 would be best with rep stosd.
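So a generic fill could start with a dispatcher along these lines (sketch; FillSmall, FillSseTlb, FillStosd and FillNT are just placeholder labels for the four routines, with the break points from my list above, and EAX holds the byte count):
    cmp     eax, 4096
    jbe     FillSmall               ; up to 4 KB : unrolled dword moves
    cmp     eax, 32*1024
    jbe     FillSseTlb              ; 4 KB .. 32 KB : SSE + TLB priming
    cmp     eax, 3*1024*1024
    jbe     FillStosd               ; 32 KB .. 3 MB : rep stosd
    jmp     FillNT                  ; 3 MB and up : movntdq / movntq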
Hi,
I said somewhere "I'm going to test movnti". I've been busy, but I've tested it now, and it's squarely on the subject of this topic:
ALIGN 16
;
; Syntax:
; mov eax,BlockSize
; mov edx,FillValue
; mov edi,MemBlockPointer
; call Sse2_DwFillMem_NT
;
Sse2_DwFillMem_NT PROC
push ecx
and eax,11111111111111111111111111111100b ; to avoid gpf
; owords
Label1: mov ecx,eax
and ecx,11111111111111111111111111110000b
jz Label3
Label2: movnti DWORD PTR[edi],edx
movnti DWORD PTR[edi+4],edx
movnti DWORD PTR[edi+8],edx
movnti DWORD PTR[edi+12],edx
add edi,DWORD*4
sub ecx,DWORD*4
jnz Label2
; dwords
Label3: mov ecx,eax
and ecx,00000000000000000000000000001100b
jz Label5
add edi,ecx
neg ecx
Label4: movnti DWORD PTR [edi+ecx],edx
add ecx,DWORD
jnz Label4
; end
Label5: sub edi,eax ; restore edi
pop ecx
ret
Sse2_DwFillMem_NT ENDP
And I obtained exactly the same result as with movntps/movntdq (the one I use in my app):
ALIGN 16
;
; Syntax:
; mov eax,BlockSize
; mov edx,FillValue
; mov edi,MemBlockPointer
; call Sse2_DwFillMem_NT
;
Sse2_DwFillMem_NT PROC
push ecx
and eax,11111111111111111111111111111100b ; to avoid gpf
; value to simd register
movd XMM0,edx ; XMM0 = _,_,_,x
pshufd XMM0,XMM0,000h ; XMM0 = x,x,x,x
; owords x4
Label1: mov ecx,eax
and ecx,11111111111111111111111111000000b
jz Label3
Label2: movntdq OWORD PTR[edi],XMM0
movntdq OWORD PTR[edi+16],XMM0
movntdq OWORD PTR[edi+32],XMM0
movntdq OWORD PTR[edi+48],XMM0
add edi,OWORD*4
sub ecx,OWORD*4
jnz Label2
; owords
Label3: mov ecx,eax
and ecx,00000000000000000000000000110000b
jz Label5
add edi,ecx
neg ecx
Label4: movntdq OWORD PTR[edi+ecx],XMM0
add ecx,OWORD
jnz Label4
; dwords
Label5: mov ecx,eax
and ecx,00000000000000000000000000001100b
jz Label7
add edi,ecx
neg ecx
Label6: movnti DWORD PTR [edi+ecx],edx
add ecx,DWORD
jnz Label6
; end
Label7: sub edi,eax ; restore edi
pop ecx
ret
Sse2_DwFillMem_NT ENDP
Now, it's non-temporal, so if we think in terms of speed, one of the two previous algos certainly finishes the job before the other; I don't know which one yet. But movnti is really interesting, even for those who don't want to use SIMD stuff. Of course, you need to understand when to use it, but in some cases it can really speed up your code.
Advantages:
The processing is by dword and not by qword/oword (SIMD register), so it's easier to use.
You can easily replace your mov instructions with it (when you need to store data in your existing algos).
You preserve a SIMD register (and I appreciate that...).
Your memory block doesn't have to be 16-byte aligned.
Limits:
You can't use it for word/byte stores.
Disadvantages:
Even if you use it exclusively on 32-bit registers, it's an SSE2 instruction, so you will have to test the CPU before using it.
You also need to understand why and when to use lfence/sfence/mfence.
Conclusion: VOTE FOR MOVNTI! I'm NightWare and I approve this message... :lol