News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Any way to simplify this loop?

Started by Astro, August 15, 2009, 12:46:19 PM

Previous topic - Next topic

jj2007

Quote from: Rockoon on August 16, 2009, 08:26:49 PM
One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?

My Celeron M is a Yonah, pretty young. I had thought about the 64-bit result, too - but more than twice the cycle count is quite a bit.
When using e.g. imul edx, edx, 10, one must indeed keep in mind that in contrast to mul the result is only 32-bit. On the positive side, edx is not being trashed...

NightWare

qWord, i haven't made speed test (or even tested them), but have you tried those combinations ?

for 7 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 11 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 13 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 15 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 17 :
   lea TmpReg,[DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 19 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 21 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 22 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   add DestSrcReg,DestSrcReg
for 25 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 26 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
   add DestSrcReg,DestSrcReg
for 27 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 37 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 38 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   shl DestSrcReg,5
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 39 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 42 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
   add DestSrcReg,DestSrcReg
for 44 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   shl DestSrcReg,2
for 45 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 49 :
   mov TmpReg,DestSrcReg
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[TmpReg+DestSrcReg*4]
for 50 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   add DestSrcReg,DestSrcReg


Astro, a little exercise for you, what are the multiplier here ? (a little trap here...)
1)
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   shr DestSrcReg,2
2)
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   shr DestSrcReg,1
3)
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   shr DestSrcReg,2

qWord

Quote from: NightWare on August 17, 2009, 08:21:58 PM
have you tried those combinations ?
maybe - I've written this macro some time back. I'm surly,that I doesn't get idea of all combination you post here, but if i remember rightly, sometimes, it was faster to replace  two lea's with one lea and add/sub/shl.

regards, qWord
FPU in a trice: SmplMath
It's that simple!

dedndave

yes - Jochen ran those tests the other day and i must say i was disappointed in lea, and a little surprised
i guess if you are looking for small code, it is nice, but speed is more important these days
i had written a times 10 table generator and used lea eax,[eax+4*eax] to replace a mov/shift/add
i am sure the mov edx,eax - shl eax,2 - add eax,edx was faster to begin with, although it saved a couple bytes

qWord

here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub       : 23
lea+lea+lea       : 41



    print "multiple by 39",10,13
    print "---------------------------",10,13
        counter_begin 1000000, REALTIME_PRIORITY_CLASS
            mov edi,12345
            REPEAT 16
                mov edx,edi
                mov eax,edx
                lea edx,[edx+edx*4]
                shl edx,3
                sub edx,eax
            ENDM
        counter_end
        ;shr eax,4
        push eax
        print "lea+shl+sub       : "
        pop eax
        print udword$(eax),10,13
       
        counter_begin 1000000, REALTIME_PRIORITY_CLASS
            mov edi,12345
            REPEAT 16
               mov edx,edi
               lea eax,[edx+edx*8]
               lea edx,[edx+edx*2]
               lea edx,[edx+eax*4]
            ENDM
        counter_end
        ;shr eax,4
        push eax
        print "lea+lea+lea       : "
        pop eax
        print udword$(eax),10,13
FPU in a trice: SmplMath
It's that simple!

hutch--

Generally on later hardware you avoid MUL if you can, shifts unless you have to and use ADD instead of LEA if you can. It depends on the miltiplication you need to do.

1st. Use ADD if you can.
2nd. Use LEA if you get a better encoding.
3rd. Use a shift if it gets you the range you need.
4th. Only use MUL if you have to as it has always been slow.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: qWord on August 17, 2009, 11:44:11 PM
here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub       : 23
lea+lea+lea       : 41


I get 38 for lea+shl+sub and 17 for imul.

Rockoon

"Bad" performance on multiplication hasn't really been true since the Pentium4.

The following assumes you are targeting AMD64/Core2/I7...

If you can do it with a single shift or lea, then sure.. and you can also possibly get away with 2 leas/shifts and beat out imul on Core2 (but not AMD64), but once you are at 3 leas/shifts, imul is going to be no worse unless you can somehow pair up those lea's and hide their latencies while not being able to do the same for the imul.

The imul latency is 3 cycles on Core2's and AMD64's ..
The lea latency is 1 cycle on Core2, and 2 cycles on AMD64

a dependency chain of 3 lea's thus takes 3 cycles on a core2 (same as imul), and *6* cycles on an AMD64 (twice as slow as an imul)


(Latency figures taken from Agner Fog's excellent research)
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Astro

 :dazzled:

I need a week to read and consider this thread! Awesome!  :eek

Best regards,
Astro.

NightWare

imm_mul MACRO Constant:REQ,DestSrcReg:REQ,TmpReg
IF Constant EQ 0
xor DestSrcReg,DestSrcReg
ELSEIF Constant EQ 1
;do nothing ;)
ELSEIF Constant EQ 2
add DestSrcReg,DestSrcReg
ELSEIF Constant EQ 3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
ELSEIF Constant EQ 4
shl DestSrcReg,2
ELSEIF Constant EQ 5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
ELSEIF Constant EQ 6
add DestSrcReg,DestSrcReg ;2*3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*2
; shl DestSrcReg,1
ELSEIF Constant EQ 7
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;8-1
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 8
shl DestSrcReg,3
ELSEIF Constant EQ 9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
ELSEIF Constant EQ 10
add DestSrcReg,DestSrcReg ;2*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*2
; shl DestSrcReg,1
ELSEIF Constant EQ 11
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;9+2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 12
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*4
shl DestSrcReg,2
ELSEIF Constant EQ 13
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16-3
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 14
add DestSrcReg,DestSrcReg ;2*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;16-2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 15
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;16-1
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 16
shl DestSrcReg,4
ELSEIF Constant EQ 17
lea TmpReg,[DestSrcReg*8] ;(8*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;16+1
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 18
add DestSrcReg,DestSrcReg ;2*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*2
; shl DestSrcReg,1
ELSEIF Constant EQ 19
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;18+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16+3
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 20
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*4
shl DestSrcReg,2
ELSEIF Constant EQ 21
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;16+5
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 22
add DestSrcReg,DestSrcReg ;2*((5*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 23
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*8)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;32-9
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 24
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*8
shl DestSrcReg,3
ELSEIF Constant EQ 25
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;16+9
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 26
add DestSrcReg,DestSrcReg ; 2*((3*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 27
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;32-5
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 28
shl DestSrcReg,2 ;4*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*4]
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 29
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32-3
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 30
add DestSrcReg,DestSrcReg ; 2*(3*5)
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;32-2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 31
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*9)+4
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;32-1
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 32
shl DestSrcReg,5
ELSEIF Constant EQ 33
lea TmpReg,[DestSrcReg*4] ;(4*8)+1
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; mov TmpReg,DestSrcReg ;32+1
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 34
add DestSrcReg,DestSrcReg ;2*((8*2)+1)
lea TmpReg,[DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*2] ;32+2
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 35
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32+3
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 36
lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*4
shl DestSrcReg,2
ELSEIF Constant EQ 37
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; mov TmpReg,DestSrcReg ;(9*4)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 38
add DestSrcReg,DestSrcReg ;2*((9*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 39
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;((3*4)+1)*3
lea DestSrcReg,[DestSrcReg+TmpReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(5*8)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 40
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*8
shl DestSrcReg,3
ELSEIF Constant EQ 41
mov TmpReg,DestSrcReg ;(5*8)+1
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; mov TmpReg,DestSrcReg ;(5*8)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 42
add DestSrcReg,DestSrcReg ;2*((5*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(16+5)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 43
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(5*8)+3
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+2*DestSrcReg] ;(5*8)+3
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 44
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;((5*2)+1)*4
lea DestSrcReg,[DestSrcReg+TmpReg*2]
shl DestSrcReg,2
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*4
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,2
ELSEIF Constant EQ 45
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+9
; mov TmpReg,DestSrcReg
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 46
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+1
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(32-9)*2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 47
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+2
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(3*16)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 48
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*16
shl DestSrcReg,4
ELSEIF Constant EQ 49
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*16)+1
shl TmpReg,4
lea DestSrcReg,[DestSrcReg+TmpReg]
; mov TmpReg,DestSrcReg ;(3*16)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 50
add DestSrcReg,DestSrcReg ;(2*5)*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+16)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSE
%echo err: constant value too large! (>50)
ENDIF
ENDM



here all the combinations i've found, there is certainly others/better, but i will not pass my life to search
i've limited myself to 3 instructions, and avoided the sub instruction (unfortunatelly my replacement are generally slower...).
you to test if it's interesting or not, personally i've made speed test (basically the same as qword, except i've put "mov DestSrcReg,DestSrcReg" before and after the algo, to see the effect with the possibles stalls).