Which is best ?

dsouza123 · May 15, 2006, 04:26:04 AM

In code that compares a register with 0
which of the following is best ?

The meaning of best includes fastest, smallest,
works best with the conditional jump jz following
or any other reason that makes one preferable.

Code:


cmp  ecx,0
test ecx,ecx   ; test against itself; test ecx,0 would always give ZF = 1
or   ecx, ecx

xor  ebx,ebx   ; ebx = 0, used with the following two options
              ; so this adds time, size and ties up ebx
cmp  ecx,ebx
test ecx,ebx   ; caution: with ebx = 0 the AND result is always 0, so ZF is always set

jz past   ; the conditional jump following the evaluation of ecx

The issue came up in comparing these two (hopefully) equivalent versions of code, high and low level.
The comparison of two equal size arrays of dwords, NR and CR, index range 0 to 47.
If NR is less than CR, ans = 0
If NR is equal to CR, ans = 1
If NR is greater than CR, ans = 2

A side benefit of the 0,1,2 convention is to check for greater than or equal, ans != 0.

For an example with less than,
the first time the dword pair aren't equal,
if the current dword of NR is less than its counterpart in CR
then ans = 0.

Code:


                          mov ecx, 48
                          mov ans, 1
                          .repeat
                             dec ecx
                             mov eax, [NR+ ecx*4]
                             mov edx, [CR+ ecx*4]
                             .if eax > edx
                                mov ans,2
                             .elseif eax < edx
                                mov ans,0
                             .endif
                          .until (ecx == 0) || (ans != 1)

                          mov ecx, 48
                          mov ans, 1    ; equal
                        again:
                          cmp ecx, 0    ; spot for the versions above
                          jz  past
                          dec ecx
                          mov eax, [NR+ ecx*4]
                          mov edx, [CR+ ecx*4]
                          cmp eax, edx
                          je  again
                          ja  plus
                          dec ans       ; less than
                          jmp past
                        plus:
                          inc ans       ; greater than
                        past:

Any better/other ways to do the comparison ?
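
One possible variation on the low level version, as a minimal sketch: count the index down from 47 and let DEC set the flags, so no separate compare with 0 is needed. The procedure name CompareNRCR and the placeholder array contents are assumptions for illustration; NR, CR and ans are taken from the code above. It keeps the unsigned comparison (JA) of the original.

```asm
.386
.model flat, stdcall

.data
NR  dd 48 dup(0)              ; placeholder contents
CR  dd 48 dup(0)              ; placeholder contents
ans dd ?

.code
CompareNRCR proc              ; hypothetical name
    mov  ecx, 47              ; start at the highest index
    mov  ans, 1               ; assume equal
again:
    mov  eax, [NR + ecx*4]
    cmp  eax, [CR + ecx*4]
    jne  differ               ; first unequal pair decides the result
    dec  ecx                  ; SF is set when ecx goes from 0 to -1
    jns  again                ; loop while ecx >= 0
    ret                       ; all 48 pairs equal, ans stays 1
differ:
    ja   plus                 ; flags from the cmp are still intact
    dec  ans                  ; NR < CR, ans = 0
    ret
plus:
    inc  ans                  ; NR > CR, ans = 2
    ret
CompareNRCR endp
end
```

Whether this is actually faster than the version with an explicit zero test would have to be measured on the target processor.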

asmfan · May 15, 2006, 08:32:21 AM

The best way (my point of view :) of comparing a register to zero is to use TEST/OR (OR is more preferred because TEST can lead to partial flags register stall); AND can also be used.

ChrisLeslie · May 15, 2006, 09:14:23 AM

dsouza123,

I must have too much spare time lately! I put each of your versions in a 10,000,000-iteration loop with timers and repeated the runs several times (XP/P4).
Your high level version gave 735 to 750 ms.
Your low level version gave 500-515 ms.
Low level wins on that little test, for what it's worth.
I personally put a lot of value on readability and maintainability, which for me is the HL version. Some others however may find the LL version more maintainable!!!

Chris

EduardoS · May 15, 2006, 11:52:00 AM

Quote from: asmfan on May 15, 2006, 08:32:21 AM
The best way (my point of view :) of comparing a register to zero is to use TEST/OR (OR is more preferred because TEST can lead to partial flags register stall); AND can also be used.

TEST affects the same flags as AND.

I prefer TEST eax, eax: it is only 2 bytes and doesn't generate dependencies.
In the future cmp eax, 0 may be better, since Intel macro-fusion will make cmp/jcc a single instruction.
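
For reference, a sketch that lines up the candidates mentioned so far with their sizes. The helper name IsEcxZero is hypothetical, and any one of the four flag-setting lines is enough on its own; they are listed together only for comparison. The byte values show one valid encoding; another assembler may pick the equivalent alternate opcode, but the lengths are the same.

```asm
.386
.model flat, stdcall
.code
; Hypothetical helper: returns eax = 1 if ecx is zero, else 0.
IsEcxZero proc
    xor  eax, eax      ; clear the result first, since xor changes the flags
    cmp  ecx, 0        ; 83 F9 00  3 bytes, writes flags only
    test ecx, ecx      ; 85 C9     2 bytes, writes flags only
    or   ecx, ecx      ; 0B C9     2 bytes, writes flags and rewrites ecx with the same value
    and  ecx, ecx      ; 23 C9     2 bytes, writes flags and rewrites ecx with the same value
    jnz  done          ; ZF = 0 means ecx was not zero
    inc  eax
done:
    ret
IsEcxZero endp
end
```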

Ratch · May 15, 2006, 02:21:04 PM

asmfan,

Quote
...(OR is more preferred because TEST can lead to partial flags register stall)...

Do you have any documentation to back up that statement? An OR/AND EAX,EAX instruction implies an unneeded write to a register, unless the CPU is smart enough to know that this particular instruction/operand will not change EAX. Is it that smart?

dsouza123,
Don't forget the CMOVXX instruction series if you can use 'em.

Quote
...I personally put a lot of value on readability...

Code is read a relatively few times. Code is executed a gazillion times. Wanna change your mind about that? :bdg Ratch
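
The CMOVcc suggestion above could look roughly like this for a single dword pair. This is only a sketch: the procedure name ThreeWay and its register-based calling convention are assumptions, and nothing here shows it to be faster than CMP / Jcc, so it would need timing. MOV is used to load the constants after the CMP because MOV leaves the flags untouched, while XOR would destroy them.

```asm
.686                          ; CMOVcc needs .686 or later in MASM
.model flat, stdcall
.code
; Hypothetical helper.
; in:  eax = dword from NR, edx = dword from CR
; out: eax = 0 (below), 1 (equal), 2 (above), unsigned comparison
ThreeWay proc
    cmp   eax, edx
    mov   eax, 1              ; result for "equal"
    mov   ecx, 2              ; result for "above"
    mov   edx, 0              ; result for "below" (not xor, to keep the flags)
    cmova eax, ecx            ; CF = 0 and ZF = 0
    cmovb eax, edx            ; CF = 1
    ret
ThreeWay endp
end
```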

dsouza123 · May 15, 2006, 03:09:11 PM

Thank you everyone for the very good information.

Asmfan and EduardoS
Two comments that have me perplexed.
1. The TEST leads to partial flags register stalls.
2. TEST eax, eax .. and don't generate dependencies.
They seem to conflict or maybe I am missing the distinction.

ChrisLeslie
It is very interesting that the low level version takes 2/3 the time of the high level one.
Could you post the testing code(s) ?
Testing like that is quite worthwhile, it didn't occur to me to do it.
I'm curious how you accomplished it.

Ratch
Which processors support CMOVXX ?
Run time is very important.
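
On the availability question: CMOVcc was introduced with the Pentium Pro, and support is reported by bit 15 of EDX after CPUID with EAX = 1. A minimal sketch of a run time check follows; the name HasCmov is hypothetical, and the code assumes the processor supports the CPUID instruction itself.

```asm
.586                          ; CPUID needs .586 or later in MASM
.model flat, stdcall
.code
; Hypothetical helper: returns eax = 1 if CMOVcc is supported, else 0.
HasCmov proc uses ebx         ; CPUID overwrites eax, ebx, ecx, edx
    mov  eax, 1
    cpuid                     ; feature flags come back in edx
    xor  eax, eax             ; clear the result before testing the bit
    bt   edx, 15              ; CF = CMOV feature bit
    setc al
    ret
HasCmov endp
end
```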

asmfan · May 15, 2006, 05:11:19 PM

I don't remember exactly where I heard of the TEST and OR comparison: either from Agner Fog's optimization tuts or (test :) from Mark Larson's tuts? Or maybe it was a matter of pairing/unpairing in the pipes... I need to look through the documentation...

hutch-- · May 15, 2006, 06:29:14 PM

On any of the later Intel hardware a test for zero is usually faster using TEST REG, REG than CMP REG, 0 and Intel document this preference as being simpler and faster than the CMP version on PIV hardware. I have yet to find a context where a CMOVxx is as fast as a CMP / Jxx so I would not hold out on that one as it appears to be an interim idea from Intel that was a waste of silicon.

I go in the direction of readability where it matters and the direction of bare mnemonic coding where speed is the main consideration and the only real action is to know the difference of where to apply which idea. Having a MessageBox call manually coded in DB notation would not give you a nanosecond speed increase and generally where speed does not matter but where complexity comes into play, clear coding can be read, written and maintained much faster than bare mnemonic code.

Where you have speed issues that affect how well code works, you hammer both the algo design and the fastest code you can get going and while this does not preclude clear coding, it takes priority to bring the task up to pace.

MichaelW · May 15, 2006, 07:21:54 PM

I cannot find any statements in Agner Fog's Pentium Optimization document or Mark Larson's Assembly Optimization Tips that indicate a preference for TEST over AND (even though I seem to recall reading such a statement), but on a P3 TEST r32,r32 appears to execute 2x faster. Both instructions are the same length. According to Agner Fog's document, for a PPro, P2, or P3 both instructions generate one micro-op that can go to port 0 or 1, and for a P4 all of the timings shown for the two instructions are identical.

Agner Fog's document clearly states that a TEST followed by a LAHF, PUSHF, or PUSHFD will cause a partial flags stall, where an AND followed by a LAHF, PUSHF, or PUSHFD will not, even though TEST and AND do exactly the same thing to the flags.

P3:

Code:


1000 cycles, AND only
498 cycles, TEST only

1202 cycles, AND followed by LAHF
1202 cycles, TEST followed by LAHF

498 cycles, CMP r32, 0

[attachment deleted by admin]

EduardoS · May 15, 2006, 09:49:27 PM

A64:

Code:


995 cycles, AND only
328 cycles, TEST only

3001 cycles, AND followed by LAHF
3003 cycles, TEST followed by LAHF

330 cycles, CMP r32, 0

ChrisLeslie · May 16, 2006, 12:21:43 AM

Ratch,

Quote
Code is read a relatively few times. Code is executed a gazillion times. Wanna change your mind about that? Ratch

No. When code is read it can be very important to easily re-read it. For example, as I turn up for work today, I may have to perform more maintenance on various Java applications. These are 24-7 applications (counting gamma ray emissions and processing activities for CSIRO) where introducing bugs is not an option. Significant modifications average about one a month. The code has to be readable and modifiable without spending time later on reverting to an earlier version because of a slimy bug. Fortunately, Java, despite its many shortcomings, is good for maintainability and stability. Speed is definitely less of an issue in this circumstance. When I go home and want to write some very repetitive and speed critical code then I will think about the low level approach.

Regards

Chris

Jackal · May 16, 2006, 12:27:24 AM

P4

501 cycles, AND only
500 cycles, TEST only

4057 cycles, AND followed by LAHF
4029 cycles, TEST followed by LAHF

322 cycles, CMP r32, 0

hutch-- · May 16, 2006, 02:08:58 AM

Chris,

Unrelated to the topic, I am glad to see the CSIRO still up and running as it has produced some genuine genius over many years. I thought most of it had been sold off or farmed out to the private sector.

ChrisLeslie · May 16, 2006, 09:54:37 AM

dsouza123,

Quote
Testing like that is quite worthwhile, it didn't occur to me to do it.
I'm curious how you accomplished it.

I used this code for timing your examples:

Code:

.data
    time1 dd ?
    howLong db 80 dup(0)

.code
    invoke GetTickCount  ; get tick counts at start and store in memory
    mov time1,eax  
    
    ; code block to time goes here
    
    invoke GetTickCount  ; get tick counts at end and subtract the start tick counts
    sub eax,time1           ; the answer goes to eax
    
    invoke dwtoa,eax,ADDR howLong  ; convert to string and display the result
    invoke StdOut,ADDR howLong 
    ; etc ,etc

But don't get hooked on silly benchmarks!!

Chris
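
A sketch of how the 10,000,000-iteration loop mentioned earlier in the thread could wrap the block under test. It is a fragment for the same MASM32 setup as the code above (time1 and the GetTickCount import come from there); the label timing_loop and the choice of esi as the counter are assumptions. GetTickCount typically has a resolution of only about 10 to 16 ms, so the loop count has to be large enough for the total run time to dwarf that.

```asm
    invoke GetTickCount
    mov  time1, eax           ; start tick count

    mov  esi, 10000000        ; repeat count
timing_loop:
    ; code block to time goes here (must not modify esi)
    dec  esi
    jnz  timing_loop

    invoke GetTickCount
    sub  eax, time1           ; elapsed milliseconds for all iterations
```

The loop overhead (dec / jnz) is included in the result, so timing an empty loop once and subtracting it gives a fairer comparison between versions.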
