News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM

Previous topic - Next topic

NightWare

Quote from: lingo on May 03, 2009, 12:49:30 PM
No offence, but who uses .if/.elseif/.else or preserves ecx and edx in speed-critical algos? IMO, idiots in assembly, and this is the result:   :lol

hmm, as one of those idiots, I have just a few things to say (stupid stuff, certainly...):

Preserving registers is a programming convenience: it saves you from losing time debugging, and it lets you USE EVERY register (instead of treating some of them as less usable than the others...).

Besides, since by its very nature the preservation is NOT INSIDE the speed-critical algos (it only wraps them), who cares about extra clock cycles executed just once? (IMO, people who have never understood what coding consists of...). A few clock cycles areN'T measurable by a human (or maybe the algo doesn't need human interaction?).

And if we go down that road, can you explain to me why you preserve ebx/esi/edi? That's handled by the operating system, no? So why do the job a second time? (Especially if speed is critical; IMO some people have a weird logic...). It is quite understandable that the operating system preserves the table/source/destination registers "for you", BUT YOU?

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this: allocate so much memory that it cannot fit into cache

No, because there you will just measure useless stuff (data reads, which are not what we care about). Otherwise you must also flush the code from the cache, and the branch mis/prediction state for each address as well. The result you obtain will consist essentially of data reading (something incompressible), and you will NOT SEE the effect of the algo.

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then copy the test data, in blocks larger than the cache, sequentially into the allocated memory

What's the point? Knowing the cost of reading the data again? UNLESS YOU READ DIFFERENT THINGS, THE RESULT IS THE SAME IN EVERY CASE... (and if you don't read the same things, what on earth are you comparing?)

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then do a random rotation of the bits being read, and you will really see how bad some algos are.

No, random access will not let us know whether it is a memory-aliasing case or not. And here again, measuring data reads is of no interest.
MichaelW's macros measure the work from one point (the beginning) to another (the end) and divide the result by the number of loops, and IMO that is the way to test algos. OS interaction certainly alters the results a bit, but that is ALSO the case in normal use.
So the results ARE consistent (and focused on the algo, because with the cache, the code/data reads + branch mispredictions (things we don't care about, in most cases where speed is critical...) are absorbed by the loop). The "consistency" problem comes from other factors, and also from HOW you use the algo IN your app. Finally, with random access we will not get the normal benefit of hardware prefetching (so the results obtained will be more than debatable... because they don't reflect normal use...).

PS : it was just to join the unhappy club...  :bdg

lingo

I wonder, is there another assembly idiot on this forum who always preserves ecx and edx by default?

hutch--

NightWare,

You appear to have missed why you avoid preloaded data in cache: you get false readings by having the data in cache, and this does not help you with real-world situations, where the data is rarely ever in cache. The method I described brings into play a phenomenon called "page thrashing", which really does slow things down compared with the phony readings of data already in cache.

In a relatively small number of situations (primarily testing instruction sequences) you can benefit from testing on highly localised data that is very small, but in most instances such a test is useless for emulating real-world situations that regularly occur in daily software use.

By allocating a much larger block of memory than will fit into cache, you force the algorithms to read and process the data directly, so all of the processor factors come into play: branch prediction, instruction order, pipeline effects, pairing and so on.

If I have learnt one thing over the years of tweaking C algos in assembler, it is this: reduce the number of memory accesses and you will see real-world speed increases, where the sum total of everything else shows only minor and trivial improvements in timing.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

BlackVortex

Quote from: lingo on May 04, 2009, 02:42:19 AM
I wonder, is there another assembly idiot on this forum who always preserves ecx and edx by default?
Chill out, this is a programming forum; there's no need for the attitude of a 14-year-old MMORPG player. Why do you see everything as a skill contest?

jj2007

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this: allocate so much memory that it cannot fit into cache, then copy the test data, in blocks larger than the cache, sequentially into the allocated memory, then do a random rotation of the bits being read, and you will really see how bad some algos are.

I doubt whether the "random rotation" will change much, but the large data block has already been tested in Timings and Cache (and probably earlier; I doubt that I invented this technique :wink):

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

  Masm32 lib szLen     126 cycles
  crt strlen           101 cycles
  strlen32s             33 cycles
  strlen64LingoB        28 cycles
  _strlen (Agner Fog)   30 cycles


Same but with a scheme that eliminates the influence of the cache:

  Masm32 lib szLen     *** 672 ms for 7078488 loops
  crt strlen           *** 609 ms for 7078488 loops
  strlen32s            *** 328 ms for 7078488 loops
  strlen64LingoB       *** 343 ms for 7078488 loops
  _strlen (Agner Fog)  *** 344 ms for 7078488 loops


Differences between algos remain significant (I see NightWare has doubts about that, although I fully agree with 99% of his post), but they are much smaller than for the same tests with data in the cache. Which is nothing sensational. In most cases we can assume the data is in the cache; but in the case of a virus scanner, for example, terabytes have to be read and scanned with no data in cache, and therefore a "cache-free" timing algo is needed.

hutch--

 :bg

Quote
I doubt whether the "random rotation" will change much, but the large data block has already been tested in Timings and Cache (and probably earlier; I doubt that I invented this technique

Here is a man who has yet to test page thrashing. At its crudest, do the test so that each successive read is further along the buffer than the cache size, and watch the timings crash. To produce a long enough timing, randomly pick addresses that are more than the cache size apart and watch your timings change, not by milliseconds but by seconds.

To complicate the matter, try both temporal and non-temporal reads and writes to see the difference and why non-cached writes are useful.

Most of the testing done with small samples already loaded in cache is a waste of space that does not reflect how the algo performs in real-world situations.

jj2007

Quote from: hutch-- on May 04, 2009, 10:49:05 AM

Most of the testing done with small samples already loaded in cache is a waste of space that does not reflect how the algo performs in real-world situations.

Hutch,
As I mentioned in my post, my test is constructed in such a way that it does work on non-cached data, because the allocated buffer is far beyond cache size. But I am really curious to see a code sample that supports your "seconds instead of milliseconds" statement.
:bg

hutch--

It was a bit big to upload; it was about 1.5 gigabytes of code that built to about a 350 MB test piece. Now, what you are testing is very simple: make EVERY read reload the current memory page, and suddenly it all gets SSSLLLLLOOOOOOOOOWWWWWWWW. If you are not getting this effect, you are doing something wrong.

dedndave

Hutch,
   The program e1 seems to be using is not the one I found in the first thread of The Laboratory.
The one they use IDs the CPU at the beginning. Where is a link to that one?

jj2007

Hutch,
Maybe we misunderstand each other. Here is the (simplified) core piece of my test code:

    mov esi, len(offset arg)    ; the length of the string whose zero delimiter we want to find
    ..
    mov ebx, LoopCt
    invoke GetTickCount
    push eax                    ; save timer
    push ebx                    ; save counter
    .Repeat
        invoke pAlgo, edi       ; pAlgo = strlen, strlen32s, etc.
        add edi, esi            ; move start position higher with EACH loop
        dec ebx
    .Until Zero?
    invoke GetTickCount
    pop ebx                     ; restore loop counter
    pop ecx                     ; start time
    sub eax, ecx                ; elapsed ms
    print str$(eax), " ms for "
    print str$(ebx), " loops", 13, 10


Since the size of FatBuffer is 100 MB, as far as I can see, in 99% of all cases the strlen algo must read from memory rather than from cache. But please correct me if I am wrong; I freely admit that I am uncertain. I'd like to understand it ::)

Output:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated

codesizes: strlen32s=85, strlen64A=120, strlen64B=87

-- test 16k, misaligned 0, 16384 bytes
  Masm32 lib szLen   *** 62 ms for 6096 loops
  crt strlen         *** 32 ms for 6096 loops
strlen32s            *** 31 ms for 6096 loops
strlen64LingoB       *** 47 ms for 6096 loops
_strlen (Agner Fog)  *** 47 ms for 6096 loops

-- test 4k, misaligned 11, 4096 bytes
  Masm32 lib szLen   *** 63 ms for 24304 loops
  crt strlen         *** 32 ms for 24304 loops
strlen32s            *** 31 ms for 24304 loops
strlen64LingoB       *** 32 ms for 24304 loops
_strlen (Agner Fog)  *** 31 ms for 24304 loops

dedndave

JJ - where can I obtain this test program?


jj2007

Quote from: dedndave on May 04, 2009, 01:38:35 PM
JJ - where can I obtain this test program?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated

Here it is. You need to move slenSSE2.inc to \masm32\include\slenSSE2.inc.
Same for include \masm32\macros\timers.asm (not included, to avoid version confusion; see the top post of The Laboratory).

No warranties about mental sanity etc. after reading my code. Look for AlgoTest inside the asm file...

[attachment deleted by admin]

hutch--

JJ,

Your example is a linear read. The way to test this is to allocate 100 MB as in your example, write the same 1 MB of data to each megabyte of the allocated buffer, then read the same piece of data from each of the 1 MB blocks that make up the 100 MB buffer. Do it either randomly or in a preset order to ensure that no read is sequential, and make the piece of data small, 32 bytes or similar.

If you do this, on each read you force the processor to reload the page table, which ensures you have no cache reads. Watch it drop to a snail's pace in comparison with cached reads.

dedndave

Thanks JJ,
   I ran your exe. My CPU takes several more cycles than yours, as it is a dual core.
I thought the program ID'ed dual cores.

              Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
200000000 bytes allocated
codesizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16384 bytes
  Masm32 lib szLen   ** 188 ms for 12193 loops
  crt strlen         ** 93 ms for 12193 loops
strlen32s            ** 47 ms for 12193 loops
strlen64LingoB       ** 47 ms for 12193 loops
NWStrLen             ** 47 ms for 12193 loops
_strlen (Agner Fog)  ** 47 ms for 12193 loops

-- test 4k, misaligned 11, 4096 bytes
  Masm32 lib szLen   ** 172 ms for 48611 loops
  crt strlen         ** 94 ms for 48611 loops
strlen32s            ** 47 ms for 48611 loops
strlen64LingoB       ** 47 ms for 48611 loops
NWStrLen             ** 47 ms for 48611 loops
_strlen (Agner Fog)  ** 62 ms for 48611 loops

jj2007

Quote from: hutch-- on May 04, 2009, 02:28:29 PM
read the same piece of data from each of the 1 MB blocks that make up the 100 MB buffer. Do it either randomly or in a preset order to ensure that no read is sequential, and make the piece of data small, 32 bytes or similar.

Could you give a real-life example of an application that would behave like this? I chose linear reads because I thought of, e.g.,
- reading Windows.inc into a buffer and finding the first occurrence of "Duplicate.inc"
- a virus scanner reading word.exe into a buffer and finding the first occurrence of a pattern
etc.

In any case, I hope we can agree that during the life cycle of the loop, i.e. between the two GetTickCount calls, the code reads memory into the cache, with a buffer size of 200 MB.