Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM


dedndave

Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that

MichaelW

From what little testing I have done on a dual-core system my small, simple test apps seemed to run on one core only. I have yet to see any clear demonstration where having multiple cores provided a performance advantage.
eschew obfuscation

dedndave

it may be that it hasn't been measured yet
until we devise a test that accommodates more than one core, we won't really know

Mark Jones

In my experience with the AMD Athlon dual-core, Windows likes to assign a single-thread process to one CPU core, and it alternates cores for each new process. I.e., if you open up two single-threaded programs, each runs on its own core. So yes, two things can be running at the same time. Of course, they have to share the same buses, so it is not exactly 2x the performance.

Programs like BOINC spawn new worker processes to utilize all available cores. Quick and easy solution. Applications which are multi-threaded utilize the additional cores by creating threads to run on each core. New threads may alternate cores like processes do; I will have to look into that. But thread affinity can also be set, to force the thread to only run on the selected core.

There is performance to be had in utilizing additional cores, but with this comes added complexity. If an app uses two threads on different cores, then the programmer must make provisions for synchronization. If the threads need to communicate with each other, then the EnterCriticalSection and LeaveCriticalSection APIs are helpful to guarantee things don't get desynchronized.

As a side note, a single-CPU system can run a multi-threaded app just fine. Each thread just runs on the same core, and time slices are divided between the threads. There is little overhead in the thread switching, something like a few thousand clocks.

When it comes to implementing multi-threading in general, the best design concept seems to be that of one master-thread which "doles out work units" to n independent worker threads, where n is the number of cores detected at startup. This concept guarantees total processor usage, and is tolerant of thread timing variance. (Sorry if this was a little more info than necessary, lol.)
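
A minimal sketch of the work-unit idea, in portable C rather than MASM so it can be tried anywhere. It is an assumption-laden illustration, not a reference design: POSIX threads and a pthread mutex stand in for Win32 threads and EnterCriticalSection/LeaveCriticalSection, the "master" is reduced to a guarded counter that workers pull the next unit from, and the names `work_pool`, `take_unit` and `run_pool` are made up for this example. The "work" is just squaring an integer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Shared state. The mutex plays the role a critical section would play
   on Windows: only one thread at a time may touch next_unit or total. */
typedef struct {
    pthread_mutex_t lock;
    size_t next_unit;    /* index of the next work unit to hand out */
    size_t unit_count;   /* total number of work units */
    const int *input;    /* one int per unit (illustrative payload) */
    long total;          /* combined result, guarded by lock */
} work_pool;

/* Hand out one unit; returns 0 when no work is left. */
int take_unit(work_pool *p, size_t *unit)
{
    int ok = 0;
    pthread_mutex_lock(&p->lock);
    if (p->next_unit < p->unit_count) {
        *unit = p->next_unit++;
        ok = 1;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/* Worker: keep taking units until none remain, then merge the result. */
void *worker(void *arg)
{
    work_pool *p = arg;
    size_t unit;
    long local = 0;
    while (take_unit(p, &unit))
        local += (long)p->input[unit] * p->input[unit];
    pthread_mutex_lock(&p->lock);
    p->total += local;
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Sum of squares of input[0..count) computed by n_workers threads. */
long run_pool(const int *input, size_t count, int n_workers)
{
    pthread_t tid[MAX_WORKERS];
    work_pool p;
    int started = 0;

    assert(n_workers >= 1 && n_workers <= MAX_WORKERS);
    pthread_mutex_init(&p.lock, NULL);
    p.next_unit = 0;
    p.unit_count = count;
    p.input = input;
    p.total = 0;

    for (int i = 0; i < n_workers; i++)
        if (pthread_create(&tid[started], NULL, worker, &p) == 0)
            started++;
    if (started == 0)
        worker(&p);      /* no thread could be created: do the work here */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&p.lock);
    return p.total;
}
```

Because each worker asks for a new unit only when it has finished the previous one, a slow thread simply takes fewer units, which is the tolerance to timing variance mentioned above. With gcc this typically needs `-pthread`; the worker count would normally come from the detected core count (GetSystemInfo on Windows, for instance).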
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

dedndave

well, that is kind of what I thought, too
but, when I run a simple timing test, the numbers tell me otherwise....

Reference null tests:
Null:               20 clocks.
10x NOP:            25 clocks.

Failure-mode CMP tests:
10x CMP REG,REG:    7 clocks.
10x CMP REG,IMMED:  -210 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Success-mode CMP tests:
10x CMP REG,REG:    1 clocks.
10x CMP REG,IMMED:  554189126 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Failure-mode TEST tests:
10x TEST REG,REG:   1356305252 clocks.
10x TEST REG,IMMED: 554189125 clocks.
10x TEST MEM,REG:   6 clocks.
10x TEST MEM,IMMED: 54 clocks.

Success-mode TEST tests:
10x TEST REG,REG:   18 clocks.
10x TEST REG,IMMED: 15 clocks.
10x TEST MEM,REG:   -330 clocks.
10x TEST MEM,IMMED: 47 clocks.

it seems obvious that the counters are coming from the 2 cores
that kind of implies that this single process is running on both cores, no ?
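
A small sketch of the arithmetic behind that guess, in C for brevity. It is not the test program that produced the numbers above: it only assumes that the report is (end - start) minus the empty-loop overhead, done in signed arithmetic, and that each core's counter carries its own offset. `net_clocks`, `read_counter` and the offsets in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Net cost of a test: elapsed counter ticks minus the cost of the empty
   timing loop. The result is signed, so it can come out negative. */
int64_t net_clocks(uint64_t start, uint64_t end, uint64_t overhead)
{
    assert(overhead <= (uint64_t)INT64_MAX);
    return (int64_t)(end - start) - (int64_t)overhead;
}

/* Hypothetical model of one core's counter: a shared "true" time plus a
   fixed per-core offset. Real hardware need not be this simple. */
uint64_t read_counter(uint64_t true_time, int64_t core_offset)
{
    return true_time + (uint64_t)core_offset;
}
```

With both readings taken on the same core (offset 0), a 30-tick interval and an overhead of 20 give 10. If the end reading comes from a core whose counter is 250 ticks behind, the same interval reports -240; if that core is far ahead instead, the result is a huge positive number. Both patterns resemble the output above, but that alone does not prove the process migrated between cores.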

btw - i am using XP
- this could well be OS dependent

MichaelW

Quote: it seems obvious that the counters are coming from the 2 cores

If you think that is so, then try restricting the process to the first core by adding these statements to your source somewhere above the tests:

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1


And you might also want to try the second core, specified with an affinity mask value of 2.

hutch--

JJ,

> Could you give a real life example of an application that would behave like this?

I just don't have time to write a test piece for you, but I wonder what the problem is. The bottom line is to ensure that each read is not in cache and that the size of each read is small; a sample of less than 32 bytes comes to mind.

Real time examples are things like a small in memory database under 2 gig in size, a very large table of preset data, anything that is large enough to be useful that is loaded directly into memory and accessed in a random manner.

To simulate conditions of this type, ensure the reads are NOT linear and not in cache. An algorithm is as good as it performs under conditions of this type, and small test pieces that repeatedly bash the same address in cache almost never emulate these conditions effectively.
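
A rough sketch of such a test piece, in C. Everything in it is an assumption made for illustration: the 24-byte `record` layout, the function names, and the use of a simple linear congruential step as a deterministic stand-in for random access. It only sets up the two access patterns; the timing itself is left to whatever timer you prefer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A small fixed-length record, under the 32 bytes mentioned above. */
typedef struct {
    uint32_t key;
    uint32_t payload[5];
} record;

/* Allocate a table and give every record a known key. */
record *make_table(size_t count)
{
    record *tab = calloc(count, sizeof *tab);
    if (tab)
        for (size_t i = 0; i < count; i++)
            tab[i].key = (uint32_t)i;
    return tab;
}

/* Read the keys in address order: friendly to cache and prefetcher. */
uint64_t sum_linear(const record *tab, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += tab[i].key;
    return sum;
}

/* Read every key exactly once, but in scattered order. count must be a
   power of two: with a multiplier of the form 4k+1 and an odd increment,
   the step below then visits every index once before repeating. */
uint64_t sum_scattered(const record *tab, size_t count)
{
    uint64_t sum = 0;
    size_t idx = 0;
    assert(count != 0 && (count & (count - 1)) == 0);
    for (size_t i = 0; i < count; i++) {
        sum += tab[idx].key;
        idx = (idx * 1664525u + 1013904223u) & (count - 1);
    }
    return sum;
}
```

Both functions return the same sum, so any difference in run time comes from the access pattern alone. For the comparison to say anything about uncached reads, the table has to be far larger than the last-level cache, for example `(size_t)1 << 24` records (384 MiB with this layout).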
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"


NightWare,
It would be better to teach your lovely kleptomaniac how to preserve ecx and edx faster  :lol
For your info, kleptomania is an inability, or great difficulty, in resisting impulses to steal.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder.
Kleptomania can occur after traumatic brain injury... etc.
Example:
What does this mean for the kleptomaniac?
"inner loop inspired by Lingo, with adaptions"
For the kleptomaniac, that means copy and paste....  :lol
As an idiot, he preserved ecx and edx again (because NightWare preserved registers), and his program will become 'faster' on his 'special' CPUs.
From another point of view, it is not a big deal for anyone on this forum to beat the kleptomaniac's code. Just take a look:  :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
codesizes: strlen32=80, strlen64A=93, _strlen=66

-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       11096 cycles
strlen32      :       1577 cycles
strlen64LingoA :      1511 cycles
_strlen (Agner Fog):  2761 cycles

-- test 4k            return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       2727 cycles
strlen32      :       416 cycles
strlen64LingoA :      395 cycles
_strlen (Agner Fog):  707 cycles

-- test 1k            return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       726 cycles
strlen32      :       97 cycles
strlen64LingoA :      77 cycles
_strlen (Agner Fog):  192 cycles

-- test 0             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       148 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  59 cycles

-- test 1             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       152 cycles
strlen32      :       38 cycles
strlen64LingoA :      33 cycles
_strlen (Agner Fog):  40 cycles

-- test 4             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       147 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  42 cycles

-- test 7             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       150 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  40 cycles

Press any key to exit...






jj2007

Quote from: lingo on May 05, 2009, 03:22:06 AM
"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"


NightWare,
It would be better to teach your lovely kleptomaniac how to preserve ecx and edx faster  :lol
For your info, kleptomania is an inability, or great difficulty, in resisting impulses to steal.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder.
Kleptomania can occur after traumatic brain injury... etc.


Lingo, please seek professional advice, at least on the definition of kleptomania and its application in code development.

A propos code: Compliments, it seems your sense of competition is still working fine. Your code beats mine in most cases, except Test 1 (on my archaic Celeron M). Will you make it public domain, or will you sue thieves?

hutch--

 :bg

I have worked out what this antagonism is at last: it must be something in the water. Has there been a reactor leak recently in the EU, or perhaps a chemical spill? Or even worse, is the EU water supply fed directly from the GRAY Danube (used to be blue)? Perhaps Berlusconi washed his socks in it, or even worse, Tony Blair gave a speech nearby and it ended up full of sewage.

Now I think there is only one solution, force the contestants to drink bottled water from African water supplies or perhaps Indian ones so that they end up with such a severe case of the trots that they don't have time to throw the surplus medium at each other.  :clap:

dedndave

everyone knows - if you go to Mexico - don't drink the water - just tequila

hutch--

 :bg

Dave,

We don't want them to drink Mexican water, they may kiss a pig later.  :P

jj2007

Quote from: hutch-- on May 04, 2009, 11:52:50 PM
JJ,
...  ensure the reads are NOT linear and not in cache. An algorithm is as good as it performs under conditions of this type

If and only if you have that type of application - a database in memory that needs many thousand random accesses per second. How realistic is that? Maybe Google needs it that way ::)

As I mentioned earlier, a virus scanner, or a "find RtlZeroMemory in all *.asm files" algo would behave in the way my test was designed.

Quote from: hutch-- on May 05, 2009, 11:46:04 AM
it must be something in the water. Has there been a reactor leak recently in the EU ...

Very funny, Sir Hutch. What is your official policy in this forum regarding calling other members (NightWare, myself) idiots? Do you officially recommend it nowadays? Do you prefer other labels, can you make suggestions? I have been tempted many times, but until now my good education stopped me from answering in the same language. However, you seem to like this style. What do other members of the forum think about it?

lingo

"Your code beats mine in most cases.."

Let's see what is "your" and what is "mine"

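strlen64B     proc szBuffer : dword
   pop        ecx                  ; return address
   pop        eax                  ; szBuffer (pointer to the string)
   movdqu     xmm2, [eax]          ; first 16 bytes, unaligned
   pxor       xmm0, xmm0           ; zero for comparison
   pcmpeqb    xmm2, xmm0           ; FF in every byte position that holds a null
   pxor       xmm1, xmm1
   pmovmskb   edx, xmm2            ; byte mask to edx
   test       edx, edx
   jz         @f                   ; no null in the first 16 bytes
   bsf        eax, edx             ; index of the first null = length
   jmp        ecx                  ; return (argument already popped)
@@:
   lea        ecx, [eax+16]        ; original pointer + 16
   and        eax, -16             ; align pointer down to 16 bytes
@@:
   pcmpeqb    xmm0, [eax+16]
   pcmpeqb    xmm1, [eax+32]
   por        xmm1, xmm0           ; null in either 16-byte block?
   add        eax, 32
   pmovmskb   edx, xmm1
   test       edx, edx
   jz         @B
   shl        edx, 16              ; make room for the first block's mask in bits 0..15
   sub        eax, ecx
   pmovmskb   ecx, xmm0            ; first block's mask
   or         edx, ecx             ; combine both masks
   mov        ecx, [esp-8]         ; return address (still in memory below esp)
   bsf        edx, edx             ; index of the first null within the 32 bytes
   add        eax, edx
   jmp        ecx
strlen64B       endp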

strlen32s    proc      src:DWORD   ; with lots of inspiration from Lingo, NightWare and Agner Fog
      pop       eax         ; trash the return address
      pop       eax         ; the src pointer
      pxor       xmm0, xmm0   ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
      movups    xmm1, [eax]    ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
      pcmpeqb    xmm1, xmm0   ; set bytes in xmm1 to FF if nullbytes found in xmm1
      mov       edx,     eax      ; save pointer to string
      pmovmskb    eax,     xmm1   ; set byte mask in eax
      bsf       eax,     eax      ; bit scan forward
      jne       Lt16         ; less than 16 bytes, we are done
      mov       MbGlobRet, edx   ; edx preserved because Masm32 szLen preserves it
      and       edx,      -16      ; align initial pointer to 16-byte boundary
      lea       eax,      [edx+16]    ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@:   
      pcmpeqb    xmm0, [eax]    ; ---- inner loop inspired by Lingo, with adaptions -----
      pcmpeqb   xmm1, [eax+16]    ; compare packed bytes in [m128] and xmm1 for equality
      lea       eax,      [eax+32]    ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
      por       xmm1, xmm0   ; or them: one of the mem locations may contain a nullbyte
      pmovmskb    edx,      xmm1   ; set byte mask in edx
      test       edx,      edx
      jz      @B
@@:
      sub       eax,   [esp-4]    ; subtract original src pointer
      shl       edx,    16      ; create space for the ecx bytes
      push       ecx         ; all registers preserved, except edx and eax = return value
      pmovmskb    ecx,    xmm0   ; set byte mask in ecx (has to be repeated, sorry)
      or       edx,    ecx      ; combine xmm0 and xmm1 results
      bsf       edx,    edx      ; bit scan for the index
      pop       ecx
      lea       eax,    [eax+edx-32] ; add scan index
      mov       edx,    MbGlobRet
Lt16:      
      jmp       dword ptr [esp-4-4] ; ret address, one arg - the Lingo style equivalent to ret 4 ;-)
strlen32s    endp
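
For anyone who wants to try the core trick shared by both routines (pcmpeqb against zero, pmovmskb, then a bit scan) without an assembler, here is a simplified sketch in C with SSE2 intrinsics. It is not a translation of either routine: it checks one aligned 16-byte block per iteration instead of two, and the name `strlen_sse2` is made up. It assumes an x86 target with SSE2 and a GCC/Clang compiler for `__builtin_ctz`. Like the assembly above, it reads whole 16-byte blocks, so it touches bytes just outside the string inside the same aligned block; that cannot cross a page boundary, but strict C tooling may still object.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Mask with bit i set when byte i of the aligned block at p is zero
   (the pcmpeqb + pmovmskb pair). */
unsigned null_mask16(const char *p)
{
    __m128i block = _mm_load_si128((const __m128i *)p);
    __m128i hits  = _mm_cmpeq_epi8(block, _mm_setzero_si128());
    return (unsigned)_mm_movemask_epi8(hits);
}

size_t strlen_sse2(const char *s)
{
    /* Round down to a 16-byte boundary so every load is aligned. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned lead = (unsigned)(s - p);   /* bytes in the block before s */
    unsigned mask;

    assert(s != NULL);
    mask = null_mask16(p) >> lead;       /* ignore bytes before s */
    if (mask)
        return (size_t)__builtin_ctz(mask);          /* the bsf step */
    for (;;) {
        p += 16;
        mask = null_mask16(p);
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```

No timings are claimed for this version; it is only meant to make the mask-and-scan idea easier to follow.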

Hutch,
I appreciate your knowledge about water (old link), but it would be better to see your opinion again about ecx and edx preservation (old link).
Everyone (including sick people) can do what they want with my code, but when someone tolerates idiotic behavior
such as useless register preservation, I can't stay quiet.


hutch--

 :bg

> How realistic is that?

Extremely, I used to write them. They are called fixed length records and will generally rip the titz off a relational database.

RE: various forms of name calling, I have explained that admin has enough trouble making mountains into molehills, but there is an easy way that we try to avoid: the bulldozer approach is to shut the topic and move it to the trash can, which turns mountains into flat plains very quickly.  :bdg