News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Optimization manual updated again. Now covers Core 2

Started by agner, August 14, 2006, 07:40:58 AM

Previous topic - Next topic

clive

Quote from: jj2007
Imagine this (rather useless) sequence:

mov eax, esi
REPEAT 100
imul eax, eax, 10
ENDM
mov esi, eax

You do understand that your whole REPEAT 100 construct is totally contrived, right? Unless the fragment you test is self contained and dependent, all you end up doing is slip streaming the whole repeated sequence into the pipeline, the timing of the actual sequence gets lost. You'd get a better real world timing using a single iteration. In any situation where your timing suggests an execution speed of a sequence containing an IMUL is faster than the demonstrated speed of the instruction, you either a) do not have a dependency on the result, or b) you are not getting an accurate timing.

The idea of the 100 iterations is to push the looping instruction times into the ~1% of the total measurement, but unless the fragments are carefully crafted this is not what is happening.

I code in C because it's convenient, I've been coding in about a dozen assembly languages for as long as you have.

QuoteOverall speed depends, as you rightly assume, on latency

I haven't assumed, I know this, and don't need to benchmark everything to recognize when it might occur. Many algorithms in C and ASM can be discarded with static analysis.

-Clive
It could be a random act of randomness. Those happen a lot as well.

jj2007

Quote from: clive on April 16, 2010, 02:56:25 AM
This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf
And it confirms that for a modern CPU, e.g. a Core 2, the latency is 3 cycles, while throughput is 1 cycle for imul.

Quote
"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx
That was written 8 years ago...

Quote from: clive on April 16, 2010, 04:39:50 AM

QuoteOverall speed depends, as you rightly assume, on latency

I haven't assumed, I know this, and don't need to benchmark everything

You quote rather selectively. My "overall" related to your isolated 100*imul eax, eax, 1 example, and the text continues as follows:

QuoteOverall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel

Which implies that latency is pretty irrelevant in an optimised Masm app. Throughput is the interesting variable.

hutch--

Pick your Intel core, pick a result.



PIV 3.8 gig.

220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...

Core2 Quad 3.0 gig

220
220
220
265 MS leamul10
235 MS imul10
437 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
437 MS shl10
235 MS add10
265 MS leamul10
235 MS imul10
437 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
437 MS shl10
236 MS add10
Press any key to continue ...

2.8 i7 860 quad

220
220
234 MS leamul10
249 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
250 MS imul10
390 MS shl10
296 MS add10
250 MS leamul10
234 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
218 MS leamul10
218 MS imul10
390 MS shl10
281 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
296 MS add10
Average algo timing
234 MS leamul10
235 MS imul10
382 MS shl10
298 MS add10
Press any key to continue ...


This is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    leamul10 PROTO :DWORD
    imul10   PROTO :DWORD
    shl10    PROTO :DWORD
    add10    PROTO :DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tim1  :DWORD
    LOCAL tim2  :DWORD
    LOCAL tim3  :DWORD
    LOCAL tim4  :DWORD

    mov tim1, 0
    mov tim2, 0
    mov tim3, 0
    mov tim4, 0

    iteration equ <100000000>

    push esi

    mov eax, 22
    invoke leamul10,eax
    print sstr$(eax),13,10


    mov eax, 22
    invoke imul10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke shl10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke add10,eax
    print sstr$(eax),13,10

    REPEAT 8

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke leamul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim1, eax
    print str$(eax)," MS leamul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke imul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim2, eax
    print str$(eax)," MS imul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke shl10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim3, eax
    print str$(eax)," MS shl10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke add10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim4, eax
    print str$(eax)," MS add10",13,10

; ----------------------------------------------

    ENDM


    print "Average algo timing",13,10

    shr tim1, 3
    print str$(tim1)," MS leamul10",13,10

    shr tim2, 3
    print str$(tim2)," MS imul10",13,10

    shr tim3, 3
    print str$(tim3)," MS shl10",13,10

    shr tim4, 3
    print str$(tim4)," MS add10",13,10

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

leamul10 proc num:DWORD

    mov eax, [esp+4]
    lea eax, [eax+eax*4]
    add eax, eax

    ret 4

leamul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

imul10 proc num:DWORD

    mov eax, [esp+4]
    imul eax, 10

    ret 4

imul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

shl10 proc num:DWORD

    mov eax, [esp+4]
    shl DWORD PTR [esp+4], 3
    add eax, eax
    add eax, [esp+4]

    ret 4

shl10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

add10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, [esp+4]

    add ecx, ecx
    add eax, eax
    add eax, eax
    add eax, eax
    add eax, ecx

    ret 4

add10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on April 16, 2010, 12:07:39 PM
Pick your Intel core, pick a result.

Hutch, the ABI!!!! :naughty:

    mov eax, [esp+4]
    shl DWORD PTR [esp+4], 3  ; <<<<<<
    add eax, eax
    add eax, [esp+4]

:bg

Prescott:
Average algo timing
545 MS leamul10
568 MS imul10
767 MS shl10
599 MS add10


(now let's do the same for mul123 :toothy)

hutch--

JJ,

> Hutch, the ABI!!!!  ???? [esp] is a memory operand.  :bg

You were too fast finding the test, it had 2 blunders in it, 2 memory reads for the shift and add algos, now it is one and more closely matches the first two. Mainly this demo shows core differences.



PIV Prescott 3.8 gig.

220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...

Core2 Quad 3.0 gig

220
220
220
220
265 MS leamul10
235 MS imul10
265 MS shl10
266 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
235 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
265 MS shl10
250 MS add10
Press any key to continue ...

Nehalem core i7 quad 2.8 gig

220
220
220
220
250 MS leamul10
234 MS imul10
281 MS shl10
327 MS add10
234 MS leamul10
250 MS imul10
250 MS shl10
327 MS add10
265 MS leamul10
234 MS imul10
266 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
249 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
250 MS shl10
328 MS add10
218 MS leamul10
218 MS imul10
250 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
328 MS add10
Average algo timing
237 MS leamul10
234 MS imul10
251 MS shl10
323 MS add10
Press any key to continue ...


The test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    leamul10 PROTO :DWORD
    imul10   PROTO :DWORD
    shl10    PROTO :DWORD
    add10    PROTO :DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tim1  :DWORD
    LOCAL tim2  :DWORD
    LOCAL tim3  :DWORD
    LOCAL tim4  :DWORD

    mov tim1, 0
    mov tim2, 0
    mov tim3, 0
    mov tim4, 0

    iteration equ <100000000>

    push esi

    mov eax, 22
    invoke leamul10,eax
    print sstr$(eax),13,10


    mov eax, 22
    invoke imul10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke shl10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke add10,eax
    print sstr$(eax),13,10

    REPEAT 8

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke leamul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim1, eax
    print str$(eax)," MS leamul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke imul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim2, eax
    print str$(eax)," MS imul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke shl10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim3, eax
    print str$(eax)," MS shl10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke add10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim4, eax
    print str$(eax)," MS add10",13,10

; ----------------------------------------------

    ENDM


    print "Average algo timing",13,10

    shr tim1, 3
    print str$(tim1)," MS leamul10",13,10

    shr tim2, 3
    print str$(tim2)," MS imul10",13,10

    shr tim3, 3
    print str$(tim3)," MS shl10",13,10

    shr tim4, 3
    print str$(tim4)," MS add10",13,10

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

leamul10 proc num:DWORD

    mov eax, [esp+4]
    lea eax, [eax+eax*4]
    add eax, eax

    ret 4

leamul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

imul10 proc num:DWORD

    mov eax, [esp+4]
    imul eax, 10

    ret 4

imul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

shl10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, eax
    shl eax, 3
    add eax, ecx
    add eax, ecx

    ret 4

shl10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

add10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, eax

    add ecx, ecx
    add eax, eax
    add eax, eax
    add eax, eax
    add eax, ecx

    ret 4

add10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

#35
Yep, for a P4 and mul10, the imul is not the best solution, it's 25% slower. Sizewise it wins for everything different from powers of 2, of course.

Average algo timing

Prescott
365 MS leamul10
449 MS imul10
361 MS shl10
361 MS add10

Celeron M
536 MS leamul10
542 MS imul10
576 MS shl10
839 MS add10

Code size, mul10 only:
12      bytes for add10
9       bytes for shl10
5       bytes for lea10
3       bytes for imul10

clive

Let's examine the benchmarking of code fragments.

Method A

L = Loop time in cycles
T = Test unit in cycles

t = measured cycles per unit

Testing 100 units

  |  T  |  T  |  T  | .. |  T  |  T  | L |

  |  t  |

t = ((T * 100) + L) / 100

If T = 10, L = 2, then t = 10.02 cycles, error from looping is 0.2%

Method B

L = Loop time in cycles
X = Test unit in cycles billed
Y = Test unit in cycles deferred (ie pipelined)

T = X + Y Test unit in cycles actual

t = measure cycles per unit

Testing 100 units

  |    T    |
  | X |  Y  |
      | X |  Y  |
          | X |  Y  |
              | X |  Y  |
                ..
                             | X |  Y  |
                                 | X |  Y  | L |

  | t |

t = ((X * 100) + Y + L) / 100

If X = 4, Y = 6, L = 2, then T = 10, yet t = 4.08 cycles, error 59.2% (nominally (Y / T) * 100)


Method B is not how to benchmark a code fragment, you'd get better results by doing 1 iteration. It is only useful for loop unwinding, much code that processes streams of data has dependencies.

The purpose of doing 100 iterations is to drop the looping overhead to 1% or below.

Latency is important, dependent things don't really happen before the answer is provided. Throughput only defines how fast you can stuff something into a pipeline, you can usually stuff a new thing in the pipeline every cycle (with multiple execution units, more than one operation can start per cycle), but it will take multiple cycles for the answer to appear.

-Clive
It could be a random act of randomness. Those happen a lot as well.

hutch--

Clive,

I find the idea interesting as i understand the difference between throughput and latency but I don't know a good way to test it. The benchmark I posted above puts each method in a seperate procedure and they all have the same stack and data load overhead but its a test of throughput, not instruction latency.

Michael has developed the set of timing macros that address small instruction counts and within the limits of ring3 access it works reasonably well but the only method i know that will handle small instruction counts with high precision is the one Agner Fog developed a few years ago that was done on a boot disk after switching the processor into protected mode.

I would certainly be interested in a technique that can accurately time instruction latency in small instruction counts, where I have had to deal with this in code I usually set up a test piece as close to the real world requirement as I can get then time it against something else, usually an original algo then a modified one so I can time any diference with the modified one.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

clive

Quote from: clive
I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.

Ok, so here with a Willamette P4 (Celeron 1.8 GHz), I'll note the 14 cycle IMUL performance, and the 4 cycle shifts, and the totally egregious processor stalling XCHG on memory. AMD made a lot of money on the back of this brain numbingly defective P4 design.

Empty
      1596 Ticks (min),  0.053 cycles

IMUL eax, 1
    418528 Ticks (min), 13.951 cycles

IDIV eax, 1
   1708472 Ticks (min), 56.949 cycles

SHL eax, 1 / SHR eax, 1
    118532 Ticks (min),  3.951 cycles

SHL eax, 8 / SHR eax, 8
    118532 Ticks (min),  3.951 cycles

XCHG eax, [mem]
   3727636 Ticks (min), 124.255 cycles


Intel(R) Celeron(R) CPU 1.80GHz (SSE2)
464     cycles for 100*imul 8
1011    cycles for 100*mul 8
99      cycles for 100*shl 3
93      cycles for 100*shl 1
464     cycles for 100*imul 10
1081    cycles for 100*mul 10
202     cycles for 100*lea mul10
57      cycles for 100*or
1482    cycles for 100*imul 1 - dependent
1616    cycles for 100*mul 1 - dependent
5860    cycles for 100*idiv 1 - dependent
37      cycles for 100*or - dependent
403     cycles for 100*rol 1 - dependent
394     cycles for 100*rol 8 - dependent

419     ticks for 100*imul 8
986     ticks for 100*mul 8
82      ticks for 100*shl 3
79      ticks for 100*shl 1
413     ticks for 100*imul 10
911     ticks for 100*mul 10
170     ticks for 100*lea mul10
49      ticks for 100*or
1255    ticks for 100*imul 1 - dependent
1455    ticks for 100*mul 1 - dependent
5224    ticks for 100*idiv 1 - dependent
33      ticks for 100*or - dependent
356     ticks for 100*rol 1 - dependent
355     ticks for 100*rol 8 - dependent
It could be a random act of randomness. Those happen a lot as well.

MichaelW

QuoteAMD made a lot of money on the back of this brain numbingly defective P4 design.

It surprised me that Intel failed to anticipate the "wall" they hit on clock speed increases, and that they stuck with the plan until AMD was kicking their butt in almost every benchmark.

eschew obfuscation