Test Or Compare?

Started by Neil, May 04, 2009, 01:39:05 PM

dedndave

Quote from: dedndave on Today at 11:38:00 AM
that means that the advantage of having 2 cores (or more) is negated by the test

Not really. It is quite unlikely that you can convince the OS to run your fast inner loop in both cores simultaneously. In practice, your code runs on core 1, while Microsoft Word runs more or less independently on core 2. Now if you have three loop variants in "your" core 1, it is still very meaningful to compare their relative performance.

JJ ! - lol
i am not even using MS word
please don't tell me MS has reserved half my brain for MS word - lol

i absolutely say that if you have a dual core processor, and you confine a test to one core, you will not see the advantage of having two cores - this seems pretty basic

jj2007

Quote from: dedndave on May 04, 2009, 07:41:41 PM

i absolutely say that if you have a dual core processor, and you confine a test to one core, you will not see the advantage of having two cores - this seems pretty basic


Correct but not very meaningful. You do not see the absolute advantage, but you can still compare algo A with algo B on a relative basis - and you want to know which one is faster, right?
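A rough way to try this at home, sketched in Python rather than assembler: pin the process to one core where the OS allows it, time each variant a few times, and report everything relative to the fastest one. The names `pin_to_one_core`, `relative_times`, `algo_a` and `algo_b` are made up for illustration. `os.sched_setaffinity` exists only on Linux (on Windows, `SetProcessAffinityMask` would play the same role), so the pinning step is best-effort.

```python
import os
import timeit


def pin_to_one_core():
    """Try to confine this process to a single core. Returns True on success."""
    if not hasattr(os, "sched_setaffinity"):  # Linux only
        return False
    try:
        core = min(os.sched_getaffinity(0))   # lowest core we are allowed to use
        os.sched_setaffinity(0, {core})
        return True
    except OSError:
        return False


def relative_times(variants, number=10000, repeat=5):
    """Best-of-`repeat` time for each variant, divided by the fastest variant."""
    best = {name: min(timeit.repeat(fn, number=number, repeat=repeat))
            for name, fn in variants.items()}
    fastest = min(best.values())
    return {name: t / fastest for name, t in best.items()}


# Two hypothetical variants that compute the same value.
def algo_a():
    return sum(range(100))


def algo_b():
    return 100 * 99 // 2


print("pinned to one core:", pin_to_one_core())
for name, ratio in relative_times({"A": algo_a, "B": algo_b}).items():
    print(f"algo {name}: {ratio:.2f}x the fastest")
```

The absolute numbers depend on the machine and on whatever else is running, but the ratio between A and B is what the comparison is about.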

jj2007

Quote from: Mark Jones on May 04, 2009, 05:34:38 PM

Are your results repeatable JJ?


One more on Celeron M:

Reference null tests:
Null:               12 clocks.
1000x NOP:            516 clocks.

Failure-mode CMP tests:
1000x CMP REG,REG:    1002 clocks.
1000x CMP REG,IMMED:  492 clocks.
1000x CMP MEM,REG:    1008 clocks.
1000x CMP MEM,IMMED:  1008 clocks.

Success-mode CMP tests:
1000x CMP REG,REG:    996 clocks.
1000x CMP REG,IMMED:  504 clocks.
1000x CMP MEM,REG:    1008 clocks.
1000x CMP MEM,IMMED:  1008 clocks.

Failure-mode TEST tests:
1000x TEST REG,REG:   996 clocks.
1000x TEST REG,IMMED: 492 clocks.
1000x TEST MEM,REG:   1008 clocks.
1000x TEST MEM,IMMED: 1008 clocks.

Success-mode TEST tests:
1000x TEST REG,REG:   996 clocks.
1000x TEST REG,IMMED: 504 clocks.
1000x TEST MEM,REG:   1008 clocks.
1000x TEST MEM,IMMED: 1008 clocks.


Now the picture is getting clearer...

Bloated exe attached :bg

[attachment deleted by admin]

dedndave

ok JJ, my friend, you have me challenged, now - lol
let me see if i can figure out how to do this....

jj2007

Quote from: dedndave on May 04, 2009, 08:11:44 PM
ok JJ, my friend, you have me challenged, now - lol

:bg

What makes me furious, though, is that a 6-byte test ecx, -1 is twice as fast as a 2-byte test ecx, ecx - that's just not fair! :8)
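The arithmetic behind that complaint, using only the Celeron M failure-mode TEST numbers posted above. This is a sketch that takes the printed clock counts at face value: the thread does not say whether the timing macro already subtracts the empty-loop overhead, so no correction is applied here.

```python
# Clock counts copied from the Celeron M run (failure-mode TEST tests).
REPEATS = 1000
clocks = {
    "TEST REG,REG": 996,
    "TEST REG,IMMED": 492,
    "TEST MEM,REG": 1008,
    "TEST MEM,IMMED": 1008,
}


def clocks_per_instruction(total_clocks, repeats=REPEATS):
    """Average clocks for one instruction in a run of `repeats` instructions."""
    return total_clocks / repeats


def speed_ratio(slow, fast):
    """How many times longer `slow` took than `fast`."""
    return clocks[slow] / clocks[fast]


for name, total in clocks.items():
    print(f"{name:<15} {clocks_per_instruction(total):.3f} clocks each")

print(f"REG,REG vs REG,IMMED: {speed_ratio('TEST REG,REG', 'TEST REG,IMMED'):.2f}x")
```

So on that machine the register form averages about one clock per instruction and the immediate form about half a clock, which is where the "twice as fast" comes from.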

Mark Jones

Highly repeatable results here.


Reference null tests:
Null:               0 clocks.
1000x NOP:            333 clocks.

Failure-mode CMP tests:
1000x CMP REG,REG:    333 clocks.
1000x CMP REG,IMMED:  363 clocks.
1000x CMP MEM,REG:    505 clocks.
1000x CMP MEM,IMMED:  625 clocks.

Success-mode CMP tests:
1000x CMP REG,REG:    333 clocks.
1000x CMP REG,IMMED:  363 clocks.
1000x CMP MEM,REG:    505 clocks.
1000x CMP MEM,IMMED:  625 clocks.

Failure-mode TEST tests:
1000x TEST REG,REG:   333 clocks.
1000x TEST REG,IMMED: 356 clocks.
1000x TEST MEM,REG:   505 clocks.
1000x TEST MEM,IMMED: 625 clocks.

Success-mode TEST tests:
1000x TEST REG,REG:   333 clocks.
1000x TEST REG,IMMED: 356 clocks.
1000x TEST MEM,REG:   505 clocks.
1000x TEST MEM,IMMED: 625 clocks.


It may be worth noting that I have BOINC running in the background, which pretty much keeps both cores fully utilized in all available idle time (thus preventing the CPU from entering any power-saving or clock-reducing state, but yielding to any higher-priority thread.) However, the timing is exactly the same whether the BOINC worker threads are running or not. (Back to the drawing board, lol.)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

dedndave

yes - it seems logical (and fair - lol) that reg,reg is faster than reg,immed
of course, i have not learned these new CPUs well enough to say
certainly, on an 8088, reg,reg was always faster, and usually fewer bytes
it sounds to me like something is wrong with the measurement, somehow

btw JJ - i am looking into writing a simple test that reads RDTSC from each core, then runs the test sequence with both cores running
it may take me a while - and i may need some help, as i am already lost with thread limitation and access - lol
let me see how far i can get on my own before i ask for help

hutch--

Steve,

Quote
For those that have fun debugging their code after seemingly
minor edits, note that:

Code:
       TEST    AX,AX
       JZ      @F

       CMP     AX,AX
       JZ      @F

have different results.  Almost as much fun as mixing up signed
and unsigned conditional jumps.


The comment appears to be misleading. I posted a well-known technique to test for zero, so you have code like this:


test eax, eax
jz label

CMP code would be like this.

cmp eax, 0
je label
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php
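For anyone who wants to check the zero flag without an assembler handy, here is a small Python model of the three forms discussed above. It is only a sketch of the ZF rule for 32-bit operands (TEST sets ZF when the AND of the operands is zero, CMP sets it when the difference is zero); the function names are invented for this example.

```python
MASK32 = 0xFFFFFFFF


def zf_after_test(a, b):
    """ZF after TEST a, b: set when (a AND b) is zero."""
    return ((a & b) & MASK32) == 0


def zf_after_cmp(a, b):
    """ZF after CMP a, b: set when (a - b) is zero, i.e. when a equals b."""
    return ((a - b) & MASK32) == 0


print("eax         test eax,eax  cmp eax,0  cmp eax,eax")
for eax in (0, 1, 0x80000000, 0xFFFFFFFF):
    print(f"{eax:#010x}  {zf_after_test(eax, eax)!s:<12}  "
          f"{zf_after_cmp(eax, 0)!s:<9}  {zf_after_cmp(eax, eax)}")
```

In this model `test eax, eax` and `cmp eax, 0` always agree on ZF, while `cmp eax, eax` sets ZF for every value, so a JZ after it is always taken.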

daydreamer

in newer cores, cmp/Jcc is fused together into a single macro-op
so what is the point of measuring cmp/test by itself, when branch misprediction costs the most cycles and it is more interesting to know how much I lose there?

jj2007

Quote from: dedndave on May 04, 2009, 09:13:00 PM
btw JJ - i am looking into writing a simple test that reads RDTSC from each core, then runs the test sequence with both cores running
it may take me a while - and i may need some help, as i am already lost with thread limitation and access - lol
let me see how far i can get on my own before i ask for help


Just found an excellent source: Performance measurements with RDTSC by Peter Kankowski.

dedndave

great reading JJ - thanks
poking around, i am learning things i hadn't intended to - lol
on the bright side, this doesn't look as hard as i thought

FORTRANS

Quote from: hutch-- on May 05, 2009, 01:52:52 AM
Steve,

The comment appears to be misleading,

Hutch,

   Sorry, my point was intended to show that if you use CMP AX,AX,
or its ilk, you are probably misusing the instruction.  CMP AX,AX can
only have one possible result, which negates any use outside clearing
flags to a known state or a NOP with side effects.  Use of TEST and
CMP can sometimes confuse beginners, or (in my case, the last time it
bit) sleepy programmers.  You (I) can stare at the code for a long time
before you notice that perfectly legal code is inappropriate.

   Your code shows a much more reasonable use of CMP.

Regards,

Steve N.

hutch--

The trick with TEST is to treat it like an AND that does not store its result but still sets the flags according to that result. The real problem for learners is not understanding the state of the flags after an instruction has run. It's just part of the learning curve of assembler, but I agree with you that TEST AX, AX can be confusing to those who are not familiar with how flags work.
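To make the "AND with no store" idea concrete, here is a minimal Python sketch. `logic_flags`, `x86_and` and `x86_test` are hypothetical helper names for this example, not anything from MASM32. It models only the documented flag results of AND and TEST (ZF, SF and PF from the result, CF and OF cleared) and leaves out AF, which is undefined after these instructions.

```python
def logic_flags(result, bits=32):
    """Flags that AND and TEST derive from their result (AF is undefined)."""
    result &= (1 << bits) - 1
    return {
        "ZF": result == 0,                             # result is zero
        "SF": bool(result >> (bits - 1)),              # top bit of the result
        "PF": bin(result & 0xFF).count("1") % 2 == 0,  # even parity of low byte
        "CF": False,                                   # always cleared
        "OF": False,                                   # always cleared
    }


def x86_and(dest, src, bits=32):
    """AND dest, src: the result replaces dest and the flags are set."""
    result = dest & src & ((1 << bits) - 1)
    return result, logic_flags(result, bits)


def x86_test(dest, src, bits=32):
    """TEST dest, src: same flags as AND, but dest is left unchanged."""
    _, flags = x86_and(dest, src, bits)
    return dest, flags


reg = 0b1010
print("and :", x86_and(reg, 0b0100))   # register becomes 0
print("test:", x86_test(reg, 0b0100))  # register keeps 0b1010
```

Both calls report the same flags; the only difference is what ends up in the destination, which is why TEST is the usual choice for checking bits without destroying the register.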