
ZeroMemory with SSE2

Started by woonsan, January 11, 2008, 07:55:11 PM


asmfan

Depending on how much memory is to be zeroed (whether it will be in the cache or not, whether it is larger than the L2 cache, e.g. 2 MB), the best choice with SSE(2) will be either movntdq (movntps, movntpd) or movdqa (movaps, movapd).
Russia is a weird place
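
In C-with-intrinsics terms (a hedged sketch, not the forum's MASM code — `_mm_store_si128` and `_mm_stream_si128` compile to movdqa and movntdq respectively), asmfan's size-based choice might look like this. The 2 MB threshold, the alignment assumption, and the function name are all illustrative:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Assumed threshold: the 2 MB L2 size mentioned above. */
#define L2_SIZE_GUESS (2u * 1024u * 1024u)

/* Zero `bytes` bytes at `dst`; assumes dst is 16-byte aligned
   and bytes is a multiple of 16. */
static void zero_sse2(void *dst, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    size_t n = bytes / 16;

    if (bytes > L2_SIZE_GUESS) {
        /* Buffer won't fit in L2: movntdq bypasses the cache. */
        for (size_t i = 0; i < n; i++)
            _mm_stream_si128(p + i, zero);
        _mm_sfence();  /* order the non-temporal stores */
    } else {
        /* Small buffer: movdqa keeps the zeroed lines cached. */
        for (size_t i = 0; i < n; i++)
            _mm_store_si128(p + i, zero);
    }
}
```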

NightWare

Quote from: Jimg on January 12, 2008, 04:18:34 PM
How about including  rep stosd  as a baseline?

Jimg,
that's why RtlZeroMemory is in the test (it's a rep stosd/stosb algo).

Draakie

Well, to state the obvious then..... SSE implementations of zero-memory fills
are significantly faster than anything else, by a ratio of roughly 2:1. :cheekygreen:

That's quite handy to know..... I hope lots more coders will jump on the SSE bandwagon now.
Does this code make me look bloated ? (wink)

NightWare

Quote from: Draakie on January 14, 2008, 05:49:12 AM
Well, to state the obvious then..... SSE implementations of zero-memory fills
are significantly faster than anything else, by a ratio of roughly 2:1. :cheekygreen:
Draakie,
heh... in fact that's the case for my code... but maybe someone can produce a faster algo (I've sped up too many algos to think I now have the final solution), so you can't say SSE is 2 times faster than anything else... (you can think it, like I do, but don't say it... :wink)

hutch, asmfan,
after a few more tests (fill/copy/xchg/...) and some doc reading, it seems movntq (in fact, every movntxx instruction) dramatically slows down algos if you need to re-use the data quickly. so optimisations where you use an instruction several times, with the same data, to increase the speed of the algo are not welcome with those instructions.

besides, those instructions are not made to speed things up (at least, not the way you think...), but to avoid cache pollution... the NT part of the instruction excludes the store from the cache treatment (for when you just need to store data from time to time, and the data isn't recurrently used by the cache). so those instructions are ALWAYS slower. :dance:

hutch, misinforming is a crime... you are a very, very old coder :toothy, so I will not ban you this time (besides, banning an administrator can generate problems... :lol) but be careful the next time....

now seriously, I haven't wasted my time, because it will certainly be interesting for me to use those instructions later...

Mark_Larson

Actually, asmfan and hutch- are correct.  With large buffer sizes, MOVNT will overtake the plain SSE mov.  As an example, I took your code, and then:

1) used GlobalAlloc() to create an 8MB buffer, which is bigger than my L2 cache size.
2) I copied your routine and replaced all occurrences of MOVAPS with MOVNTDQ.


Sse_ZeroMem_UnAligned       - 10381740
Sse_ZeroMem_UnAligned_NT -  5034231



I ran it on my Core 2 Duo.  The speedup at 8MB is a bit over 2x.

The NT instructions will speed up your code if you have something larger than your L2 cache size.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

NightWare

hmm... I didn't test it with large buffers; you give me a subject for reflection here... I was pretty sure movntxx were made to exclude the memory movements of low-priority operations — the docs I read pointed in that direction, not exactly what you explained... besides, the results obtained with a few KB show exactly the contrary...
I didn't expect a difference like that... with a screen of approximately 3 MB, this instruction becomes interesting... very interesting...

I'm going to run some tests... thanks for the info  :U
I have a project designed for speed... I'm going to test the instructions in it... the fps will tell me.
and I'll report the results

edit :
test on a Core 2 Duo T7300, 2×2 GHz, 4 MB L2, under Vista

for the test, I decided not to run a classic speed test, because the cache could distort the results.
I opted for a WIP 3d engine (no DirectX/OpenGL here). I removed most of the objects but kept a few (approx. 25000 polygons) to be sure the cache is used (at least partially) by the main process; filled faces also removed (WIP); and a 1152*720*32 window, approx. 3 MB.

to balance things, I multiplied the following procs by 5 (and alternated them, because they are a bit different) at the beginning of the main loop:
ZeroMem (to fill the screen)
FillZBuffer (with an fp value)
(I launched a tiny tool (active and different each time) + compiled + launched the app) * 3

results (here *5, 5*3*2 = 30 MB) :
movaps : 45 to 52, 45 to 52, 45 to 51 fps (~48.33)
movntps : 60 to 74, 61 to 74, 60 to 74 fps (~67.16) +19


without multiplying the procs (3*2 = 6 MB) :
movaps : 103 to 139 fps (~121)
movntps : 121 to 190 fps (~155.5) +34


if we keep in mind that there are some math calculations in there, movntps appears at first sight considerably faster... and not in a synthetic speed test — in real use...


but after that, I tried to change some movaps in the math calculations. I started with the 3 owords of the global rotation matrix, and just those 3 owords made me lose 5 fps (on the last test)... so I tried to see if more changes could slow down the app further. with filled faces (where the z values are used to decide whether the changes have to be made, and make them if needed...) it could terribly slow down the entire process.
so instead of only multiplying the procs, I added "zbuffer to screen" and "screen to zbuffer" procs (no NT here).

results (here *2, 2*3*4 = 24 MB) :
movaps : 46 to 54, 47 to 54, 47 to 54 fps (~50.33)
movntps : 52 to 61, 52 to 60, 52 to 60 fps (~56.16) +6


hmm... it seems slower; copying data quickly is cool, but not if you have to wait to use it... however it's still faster overall, so I tried to multiply the NT usage: same as the previous test, but this time with NT in the (z2s and s2z) copy procs.

results (here *2, 2*3*4 = 24 MB) :
movaps : 50 to 60, 51 to 61, 51 to 61 fps (~55.66)
movntps : 57 to 65, 56 to 65, 57 to 64 fps (~60.66) +5


movntxx can't be used all the time... the cache waits until the NT movements are done; once they are, it seems movntxx speeds up the process for large areas, but multiplying NT use, with accesses right behind, slows the process down. I continued by activating NT in just one of the 2 added procs.

results (here *1, 3*4 = 12 MB) :
movaps z2s s2zNT : 74 to 93 fps
movaps z2sNT s2z : 72 to 96 fps


here I don't understand: I used/drew polygons (without using the zbuffer), so test 1 should be faster than test 2... and it's not... quite mysterious...


for these tests, the 1st value is the fps when I've moved so that all the polygons are on screen, and the 2nd value is when there is nothing (no background, except 0 or the color corresponding to the ZBuffer fp value).

conclusion :
when I saw the 1st results I was very interested, but ZeroMem will be removed later (it's just here temporarily, to see what's being done) — or maybe kept for the sky. FillZBuffer will stay, but given the slowdown when modifying the buffer afterwards, it will have to be tested. I can't even use it to fill faces, because there is lighting afterwards, and later some filters...

movntps is faster in all the cases tested here (zeromem/fillmem/copymem), and the speedup probably increases with the size of the area. so movntxx instructions really are faster on large memory areas. but those instructions have to be used carefully... here I intentionally tested them on large memory areas. the problem is knowing when to use them or not, and when they become faster than movaps... just to see, I reproduced the first test, but on a 320*200*32 window, approx. 0.25 MB.

results (here *5, 5*0.25*2 = 2.5 MB) :
movaps : 770 to 3340 fps (~2055) +1212
movntps : 535 to 1150 fps (~842.5)


mark, thanks for pointing this out, it helped me understand those instructions better. now I'm going to see if the sse2 movnti instruction does the same thing for the 32-bit registers...

I shouldn't have said "those instructions are ALWAYS slower" — that was misinformation.

Rockoon

We need to understand the most basic of facts:

The non-temporal writes are NOT faster. Never have been. Never will be.

They potentially make *other* memory accesses faster by avoiding cache misses at those sites. The cost of an L2 cache miss is typically between 30 and 100 cpu cycles, and the NT moves avoid those misses elsewhere by not purging data soon to be needed (the "working set").

Cache contention must be a factor. Your working set plus the buffer being written must be larger than the L2 cache size, or else you are better off leveraging the cache. On systems with a 512K L2 cache, expect no gains when your working set is 64K and the buffer you are filling is 256K; expect disappointment and potential slowdowns with the NT moves there. The best bandwidth to be expected from the non-temporal writes is fairly obvious: it's (bus speed * bus width). This is considerably less than the bandwidth of the L1 or L2 caches, which could easily have 4 to 32 times as much.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.
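
Rockoon's bandwidth ceiling is just a product. As a hedged sketch with assumed-era figures (a 400 MT/s front-side bus transferring 8 bytes per transfer — these numbers are illustrative, not from the post):

```c
#include <stdint.h>

/* Upper bound on non-temporal store bandwidth: every NT write must go
   over the bus, so the ceiling is transfers/sec times bytes/transfer. */
static uint64_t nt_bandwidth_bytes_per_sec(uint64_t transfers_per_sec,
                                           uint64_t bus_width_bytes)
{
    return transfers_per_sec * bus_width_bytes;
}
```

With the assumed figures this gives 3.2 GB/s, well under what an L1 cache sustaining, say, 16 bytes per cycle at 2 GHz (32 GB/s) can deliver — consistent with the "4 to 32 times" range above.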

Mark_Larson

Quote from: Rockoon on February 21, 2008, 07:48:38 AM
The non-temporal writes are NOT faster. Never have been. Never will be.
[...]
The best bandwidth to be expected from the non-temporal writes is fairly obvious: it's (bus speed * bus width).


You are correct.  The MOVNT instruction itself runs slowly, since it writes directly to memory.

  Which reminds me of a trick to speed up your code.  If you have to do multiple operations on your data (like in a graphics engine), it is faster to break your data up into chunks that fit into the L1 cache.  I am bringing this up since NightWare was talking about a graphics engine.  You really want to keep it in the L1 cache if at all possible.  L1 cache latency is 3 cycles on AMD and 2 cycles on Intel.  The L1 cache is 8k on Intel and 64k on AMD.

  There are two additional tricks you can use to speed up ZeroMem that will also work on data you are handling in a graphics engine: prefetch instructions and TLB priming.  TLB priming only helps if you have data larger than the page size of the OS you are running.  Prefetch helps if you have more data than your cache size.  On Intel processors the cache line size has been 64 bytes ever since the P4.  There are multiple prefetch instructions that all do something different.  Play around with them.  If you search for "prefetch", you should be able to find it being discussed.
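
A hedged C sketch of software prefetch as described here (`_mm_prefetch` with `_MM_HINT_T0` maps to the prefetcht0 instruction; the 256-byte lookahead and the summing loop are illustrative guesses, since the thread stresses that the best distance must be found by experiment):

```c
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Assumed lookahead distance; tune per CPU, as discussed above. */
#define PREFETCH_AHEAD 256

static uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* Once per 64-byte line, hint the line PREFETCH_AHEAD bytes
           ahead into the cache before the loop reaches it. */
        if (i % 64 == 0 && i + PREFETCH_AHEAD < len)
            _mm_prefetch((const char *)buf + i + PREFETCH_AHEAD,
                         _MM_HINT_T0);
        sum += buf[i];
    }
    return sum;
}
```

The hint never changes the result — only, potentially, the timing — which is why NightWare's fps tests below can show "no effect" without anything being wrong.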

Mark_Larson

I posted an example of doing TLB priming in this thread, in case anyone is curious :)

http://www.masm32.com/board/index.php?topic=6576.msg63693#msg63693

Rockoon

Quote from: Mark_Larson on February 21, 2008, 12:06:25 PM
L1 cache latency on AMD is 3 cycles and 2 cycles for Intel.  The L1 cache is 8k on Intel and 64k on AMD.

It should also be pointed out that AMD's L1 caches are 2-way set associative, whereas Intel's are 4-way or 8-way.

The # of ways is basically an indication of how many pointers your working set is allowed to work with before DIRECT set contentions become a possibility, and is unrelated to how much bandwidth you are trying to consume.

Here, DIRECT contention means that it is possible for the pointers to be separated such that they share the same cache set.

In 2-way set caches, it is not possible for a 2 pointer algorithm to produce set contention, but 3 pointer algorithms will suffer if all 3 map to the same set.
In 8-way set caches, it is not possible for an 8 pointer (or less) algorithm to produce set contention.

The tradeoff is in the size of the caches, the cost of misses, and so forth. 2-way caches as implemented in hardware are typically much larger than their 4-way and 8-way brothers, so they have fewer misses in general, but more significant worst-case behavior in specific circumstances.
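
The set-contention point can be sketched as a toy model of set selection (64-byte lines assumed; the 64 KB 2-way figures below match the AMD L1 described earlier — addresses one way-size apart land in the same set):

```c
#include <stdint.h>

/* Which set an address maps to, for a cache of `size` bytes with
   `ways` ways and 64-byte lines. Toy model: real caches may index
   virtually, bank, or hash, so treat this as illustrative only. */
static unsigned cache_set(uintptr_t addr, unsigned size, unsigned ways)
{
    unsigned line = 64;
    unsigned sets = size / (line * ways);  /* 64 KB 2-way: 512 sets */
    return (unsigned)((addr / line) % sets);
}
```

Two pointers 32 KB apart (the way size of a 64 KB 2-way cache) collide in the same set, so a third pointer at that stride already risks evictions; an 8-way cache tolerates up to eight such pointers, which is Rockoon's point about the number of ways.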

On another forum (VB related) I had asked the resident experts to run a test application I had cooked up which measured the performance of various data strides in a 6-pointer algorithm. The algorithm itself was simply the vertical component of a 5x5 Gaussian blur convolution, which I had noticed performed slower on 1024x768 images than it did on 1280x1024 images: 5 vertical pointers into the source image + 1 pointer into an output image.

AMD CPUs (like mine) started feeling the pain of set contention much earlier than Intel CPUs at power-of-two strides between the pointers, and in fact that 1024-pixel (4096-byte) stride was right where it started becoming really pronounced, with an L1 cache miss every ~4 pixels.

The Core2 CPUs performed the best overall when normalized on CPU speed, followed by the AMD64s, and then the Pentium Ms. Surprisingly, P3s performed better on this normalized scale than P4s.

My solution to avoiding this "problem" was to turn the 5+1 pointer 1-pass algorithm into a 1+1 pointer 5-pass algorithm. I know this solution is sub-optimal, but it's better than the original while still using a more or less straightforward algorithm.
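
A hedged C sketch of that 1+1 pointer, 5-pass rewrite (the weights, the border handling, and the function name are illustrative, not Rockoon's actual code): instead of reading five source rows per output pixel in one pass, each pass accumulates one weighted row, so only one source pointer and one destination pointer are live at a time.

```c
#include <string.h>
#include <stddef.h>

/* Vertical 5-tap blur, done as 5 passes of (dst += wt[k] * src_row).
   src and dst are w*h float images; the two border rows at top and
   bottom are left at zero for simplicity. */
static void blur_vert_5pass(const float *src, float *dst, int w, int h,
                            const float wt[5])
{
    memset(dst, 0, (size_t)w * h * sizeof(float));
    for (int k = 0; k < 5; k++) {                    /* one pass per tap */
        for (int y = 2; y < h - 2; y++) {
            const float *s = src + (size_t)(y + k - 2) * w; /* 1 src ptr */
            float *d = dst + (size_t)y * w;                 /* 1 dst ptr */
            for (int x = 0; x < w; x++)
                d[x] += wt[k] * s[x];
        }
    }
}
```

The cost is reading the destination five times, but at any moment only two strided pointers compete for cache sets instead of six — trading bandwidth for freedom from set contention, as described above.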

NightWare

over several days, I've played a bit with prefetch (nta, t0, t1, t2), and the results (3*2 = 6 MB) :
movaps : 103 to 139 fps (~121)
movaps (prefetchnta) : 103 to 139 fps (~121) =
movaps (prefetcht0/1/2) : 103 to 139 fps (~121) =
movntps : 121 to 190 fps (~155.5) +34

I tried it several times and read the docs to be sure I did it correctly (nothing in Agner Fog's doc, but it's well explained in Intel's P4 & Xeon optimisation manual), but it seems there is no effect. of course that doesn't mean there is no effect; maybe the speedup obtained is balanced by a slowdown in another algo, due to cache modification. besides, I'm not sure the automatic hardware data prefetcher implemented on P4+ isn't being used in all cases here (the data are used/displayed at the end of the loop and re-used quickly at the beginning, so there's not enough time for prefetch).

the problem here (for the zeromem and fillmem algos) is also that you can't place the prefetch instruction where it should really go. when you calculate the PSD (prefetch scheduling distance), the distance creates a problem in an algo with a loop. still, it's very interesting to know how it works, because for textured face filling it's totally possible to calculate/use a PSD and place the prefetch instructions where they're needed.

so concerning zeromem and fillmem (in this case), the movnt approach is far better: you clean the screen and the zbuffer with NT at the beginning of the main loop, and while you recalculate the positions of all your points, it finishes and becomes accessible for the rest of the code... besides, no need to calculate a PSD here.

hutch--

Something I should have mentioned: with multicore processors, it's worth trying multiple threads to zero the memory in multiple blocks. On a single-processor machine this would be much slower, as thread overhead would kill it, but if the thread-overhead considerations can be overcome, you may get the advantages of parallelism on a multiple-processor machine.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php
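
A hedged pthreads sketch of that idea (plain memset stands in for the SSE fill; the two-thread split and all names are illustrative): each thread clears its own contiguous block, and the caller joins them all before using the buffer.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

#define NTHREADS 2  /* assumed core count */

struct span { uint8_t *p; size_t n; };

static void *zero_span(void *arg)
{
    struct span *s = (struct span *)arg;
    memset(s->p, 0, s->n);  /* each thread clears its own block */
    return NULL;
}

static void zero_parallel(uint8_t *buf, size_t len)
{
    pthread_t th[NTHREADS];
    struct span sp[NTHREADS];
    size_t chunk = len / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        sp[i].p = buf + (size_t)i * chunk;
        /* last thread takes the remainder */
        sp[i].n = (i == NTHREADS - 1) ? len - (size_t)i * chunk : chunk;
        pthread_create(&th[i], NULL, zero_span, &sp[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
}
```

As the post warns, pthread_create/join overhead is on the order of many thousands of cycles, so this only has a chance of winning on buffers large enough to amortize it across real cores.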

asmfan

Why don't you try using h/w prefetch rather than s/w? I think it's more efficient for cache movements. What do the pros think of it?

Mark_Larson

Quote from: asmfan on March 02, 2008, 07:48:34 AM
Why don't you try using h/w prefetch rather than s/w? I think it more efficient for cache movements. What do profi think of it?

I don't understand what you are saying.  The h/w prefetcher is always used.  And you can't use it programmatically like you can with the "prefetch" instruction.


Quote from: NightWare on March 01, 2008, 10:05:29 PM
during several days, i've played a bit with prefetch (nta,t0,t1,t2), and the results (3*2=6mb) :
movaps : 103 to 139 fps (~121)
movaps (prefetchnta) : 103 to 139 fps (~121) =
movaps (prefetcht0/1/2) : 103 to 139 fps (~121) =
movntps : 121 to 190 fps (~155.5) +34
[...]


Actually, I used to get anywhere from a 10% to a 30% speedup from using prefetch.  However, on my Core 2 Duo I am not getting any speedup either.  So maybe the h/w prefetcher does such a good job that you don't get any speedup?  Do you have a Core 2 Duo, NightWare?

And just in case you aren't aware, the prefetch instruction on Intel processors fetches 2 cache lines (128 bytes).

On my P4, getting it to give a speedup was a pain in the butt.  So I wrote a program that would find the optimum prefetch distance using different locations in the loop and different offsets into memory.  The best solution was always very weird and unexpected.  I'll see if I can dig up the code.

asmfan

Have a look at AMD_block_prefetch_paper.pdf, (c) 2001 Advanced Micro Devices Inc.
Quote: Using Block Prefetch for Optimized Memory Performance
Mike Wall
Member of Technical Staff
Developer Performance Team
Advanced Micro Devices
And you'll find out what h/w prefetch I'm talking about.
Another interesting file is also here.