The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: Astro on August 15, 2009, 12:46:19 PM

Title: Any way to simplify this loop?
Post by: Astro on August 15, 2009, 12:46:19 PM
mov ebx,0h
mov ecx,numdev
MEM_ALLOC:
mov ebx,[ebx]+64h
dec ecx
cmp ecx,0h
jnz MEM_ALLOC


* sigh *

Any way to simplify this? I tried:

mov ebx,64h
mov ebx,[ebx]*numdev ; error here

...but it complains "constant expected".

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: redskull on August 15, 2009, 01:19:29 PM
Are those brackets in the wrong place?  I'm not sure I see what this does, or why you can't just use a MUL:

mov eax, 64
mul numdev, eax
mov ebx, eax

-r
Title: Re: Any way to simplify this loop?
Post by: qWord on August 15, 2009, 01:29:31 PM
mov ebx,0h ;xor ebx,ebx
...
mov ebx,[ebx]+64h

you are accessing a DWORD located at 0+64h

Quote...but it complains "constant expected".
an Address can be specified through Scale, Index and Base (SIB)
Scale: multiply a register (32bit) by 1,2,4 or 8
Index: register (32bit)
Base: an offset

e.g.: mov eax,DWORD ptr [eax+ecx*4+1234]

EDIT: "dec ecx" sets the zero flag - there is no need for cmp (BTW: test reg,reg is better for such an task).

Title: Re: Any way to simplify this loop?
Post by: Astro on August 15, 2009, 01:45:02 PM
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?

Quote"dec ecx" sets the zero flag - there is no need for cmp
OK!

Quote(BTW: test reg,reg is better for such an task).
According to the MASM docs, both take the same time to execute. What is the advantage of one vs. the other?

Quoteor why you can't just use a MUL:
haha - it seems to hate me and generates an error.

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: Astro on August 15, 2009, 01:54:01 PM
Wow - it worked.

I used the mul option - two lines.

   mov eax,64h
   mul numdev


:U

Thanks!

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 15, 2009, 02:08:12 PM
 :bg


mov ebx,0h
mov ecx,numdev
MEM_ALLOC:
mov ebx,[ebx]+64h
dec ecx
cmp ecx,0h
jnz MEM_ALLOC


You don't need to do the compare as DEC and SUB ECX, 1 set the zero flag for the following JNZ.


@@:
  mov ebx, [ebx+64h]
  sub ecx, 1
  jnz @B
Title: Re: Any way to simplify this loop?
Post by: qWord on August 15, 2009, 02:17:07 PM
Quote from: Astro on August 15, 2009, 01:45:02 PM
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?
Don't know what do explain here - first the Addr. is calculated and then the memory is accessed.

Quote from: Astro on August 15, 2009, 01:45:02 PM
What is the advantage of one vs. the other?
Intel suggest to use it for such task :bg

EDIT: from Intel's Optimization Reference:
QuoteUse a TEST if a register with itself instead of a CMP of the register
to zero, this saves the need to encode the zero and saves encoding space

regards, qWord
Title: Re: Any way to simplify this loop?
Post by: Astro on August 15, 2009, 03:51:27 PM
In what order is:

eax+ecx*4+1234
executed?

Does it follow BODMAS, or some other convention?

(4+1234) * ecx

is not the same as

(4*ecx) + 1234

QuoteIntel suggest to use it for such task  :bg
QuoteUse a TEST if a register with itself instead of a CMP of the register
to zero, this saves the need to encode the zero and saves encoding space
Thanks!

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: dedndave on August 15, 2009, 06:22:42 PM
the mul is performed first
it is a shortcut, is all
"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5

"test eax,eax" is also a shortcut
register to register instructions are faster and sometimes shorter than register to immediate
"or eax,eax" and "and eax,eax" do the same thing
which is fastest is a matter of some discussion - lol
out of habit, i use "or" - it was best back in the days of the 8088 - lol

it is good to learn which flags are examined for the different conditional jumps
it is also good to learn how the simplest instructions affect the flags
shifts and rotates are interesting, of course
arithmetic instructions affect the flags according to the operation
logic instructions are similar, except they always clear the carry flag (except "not", which alters no flags at all)
"not" (logic inst) and "neg" (arith inst) are special they do not always leave the flags as expected
suppose i had the following code...

        add     eax,dwordA
        adc     edx,dwordB

after the add instructions, edx is 0, but eax is non-zero - will the zero flag be set or cleared ?
http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-1.html
Title: Re: Any way to simplify this loop?
Post by: Astro on August 15, 2009, 06:46:28 PM
Quotewill the zero flag be set or cleared ?
I shall take a look. I have a 50/50 chance of being right.  :bdg

Where do you find these links?  :dazzled:  I must be using the wrong search terms.

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: MichaelW on August 15, 2009, 10:04:18 PM
Quote from: Astro on August 15, 2009, 01:45:02 PM
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?

The value in the [] brackets is an indirect memory operand that specifies the address of the data to operate on. The address is calculated at run time from the contents of one or two registers one of which can have a scale factor, and any number of displacements. In the example "4" is a scale factor, and "1234" is a displacement. The address is calculated using the standard order of operations, the same order that you would use in solving an algebraic equation.
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 15, 2009, 11:16:26 PM
The details of the "complex addressing mode" are worth the effort to learn as they give you the capacity to write very compact code for memory access. The conventions are as follows,

BASE ADDRESS which is a register.
INDEX which is a register
SCALE is required for the data size of the data being indexed. 1 2 4 8.
DISPLACEMENT additional address change with an immediate number.

There have been a few variations on the notation and MASM supports some of them but mainly the complete complex address is contained within a pair of square brackets.

The assembler can tolerate some order variation and will still parse the address correctly but it also allows an "append" type notation of one piar of square brackets followed by another pair of brackets. It acts like the add "+" operator.

What is important to understand is that the notation is a method of specifying a single opcode in the processor which is why you have "mnemonics" which are family names for similar capacities. Just for example the MOV mnemonic has multiple opcodes depending on the operant types and order, instead of having to remember all of them, you use a mnemonic with a well understood notation.

Its worth noting that there are mnemonics that vary from the above order, one mnemonic allows a displacement as the base address which can be very useful in certain contexts where you do not have to use a register. Its often used for array addresses.
Title: Re: Any way to simplify this loop?
Post by: Astro on August 16, 2009, 01:03:25 AM
A great post, Hutch!  :thumbu

Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
:dazzled:  A few hours later and I'm no wiser.  :eek

QuoteThe address is calculated at run time from the contents of one or two registers one of which can have a scale factor, and any number of displacements. In the example "4" is a scale factor, and "1234" is a displacement. The address is calculated using the standard order of operations, the same order that you would use in solving an algebraic equation.
OK! It is rather obvious I know about 1% of what I need to.

Thanks for the replies - I need to do a lot of reading.  :eek

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: NightWare on August 16, 2009, 01:29:38 AM
here, *4 corespond to "shl ecx,2"
Title: Re: Any way to simplify this loop?
Post by: dedndave on August 16, 2009, 02:17:29 AM
QuoteThanks for the replies - I need to do a lot of reading.
when you first start out, you may find it handy to play with some little pieces of code in the debugger
that way, you can watch the registers and flags change as each instruction is executed
gawd - i spent hours with SymDeb - the old symbolic debugger - lol
besides, it's fun to watch all the lights go blinkin   :P
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 16, 2009, 02:30:57 AM
Astro,

Give this a blast.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      array dd 0,1,2,3,4

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi
    push edi

    mov esi, OFFSET array           ; ESI holds the BASE ADDRESS

  ; changing displacement
  ; ---------------------
    print str$( [esi] ),13,10
    print str$( [esi+4] ),13,10
    print str$( [esi+8] ),13,10
    print str$( [esi+12] ),13,10
    print str$( [esi+16] ),13,10,13,10


  ; changing address with INDEX
  ; ---------------------------
    mov edi, 0                      ; EDI is INDEX

    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10,13,10
    add edi, 4


  ; changing address with INDEX times MULTIPLIER
  ; ------------------------------
    mov edi, 0
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10,13,10
    add edi, 1

  ; BASE address (ESI) + INDEX (EDI) times MULTIPLIER + 12 bytes displacement
    mov edi, 0
    print str$( [esi+edi*4+12] ),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Title: Re: Any way to simplify this loop?
Post by: Rockoon on August 16, 2009, 12:14:58 PM
As someone else had noted, the LEA instruction is very powerfull. It has evolved into quite the workhorse. It isnt always the optimal solutions, but has capabilities that simply do not exist otherwise.

It isnt always the most efficient solution, but it is always quite efficient none-the-less because the AGU (Address Generation Unit) gets quite a bit of loving from both Intel and AMD due to its overall importance. I think P4's are the "worst off" in regards to the AGU because there is only one of them, but even there its often the best solution to particular problems (such as when you need a 3-operand addition instruction)
Title: Re: Any way to simplify this loop?
Post by: FORTRANS on August 16, 2009, 01:08:33 PM
Quote from: Astro on August 16, 2009, 01:03:25 AM
Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
A few hours later and I'm no wiser.

Hi,

   Well, a multiply is (was) regarded as an expensive
(time consuming) instruction.  Adds and shifts are
much faster and require less setup.  So:

5(ecx) = 1(ecx) + 4(ecx)

Is better than

5(ecx) = 5*ecx

Especially as

Quotehere, *4 corespond to "shl ecx,2"

So, 5(ecx) = ecx + (SHL ecx,2), an add and a shift
instead of an actual multiply (which requires eax and
a real live 5 as well).  So, a bit complex to see for the
first time, but a good tool once understood.

HTH,

Steve N.
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 16, 2009, 01:28:23 PM
The performance drop with LEA in the PIV family of processors was very unfortunate as it was generally a very fast instruction that could also do compact calculations but with a PIV you only use it on odd occasions.

I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.
Title: Re: Any way to simplify this loop?
Post by: jj2007 on August 16, 2009, 01:58:38 PM
Quote from: hutch-- on August 16, 2009, 01:28:23 PM
I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.

Test it :bg


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
LOOP_COUNT = 10000

.code
start:
print "Comparing mul, imul and lea for performing 'mov eax, 10*eax' :", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1
mov ecx, 10 ; 6 bytes for mov ecx, 10 plus mul ecx
REPEAT 8
mul ecx
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for mul ecx", 13, 10

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1 ; 3 bytes
REPEAT 8
imul eax, eax, 10
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for imul eax, eax, 10", 13, 10

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1 ; 5 bytes
REPEAT 8
lea eax, [eax+4*eax]
add eax, eax
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for lea", 13, 10


inkey chr$(13, 10, "--- ok ---", 13)
exit
end start



Celeron M:

Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2657    millicycles for mul ecx
1122    millicycles for imul eax, eax, 10
1214    millicycles for lea
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 16, 2009, 02:03:39 PM
No use, I have 3 PIVs to play with but no later stuff yet. I want a Nehalem double quad running at 4 gig next.  :P
Title: Re: Any way to simplify this loop?
Post by: Rockoon on August 16, 2009, 02:35:29 PM
For the record, all AMD64's can execute all of the "fastpath" ALU instructions, other than multiplication, on each of its ALU's. The ALU's and AGU's are all symmetric except for multiplication, which can only be handled on one of the ALU's.

This fact throttles the AMD64 to a maximum of 1 integer multiplication per clock cycle, so in practice is can sometimes be advantageous to use LEA instead of MUL/IMUL. In practice you will find that it isnt usualy advantageous because it is quite easy to have 3 or more AGU instructions in the pipeline at any given time (all memory reads and writes require the AGU's.) When you unroll that loop with the single memory read and single memory write, its now a loop with 4 or more AGU instructions... and that means that the AGU's are often one of your bottlenecks points.

All of the integer ALU operations that read from or write to memory directly require BOTH an ALU and and AGU to accomplish the work.

The AMD64 has 9 execution units.

3 ALU's (Arithmetic and Logic Unit)
3 AGU's (Address Generation Unit)
1 FADD unit (additions, subtractions, and "any")
1 FMUL unit (multiplication, division, and "any")
1 FMISC unit (float<->int, load and store, and "any")

The "any" includes most of the macro-operations, which sometimes require all 3 units simultaneously.

Also of some note is the fact that an AMD64 can only perform two memory reads per clock cycle even though it has 3 AGU's. You can still perform an LEA on that 3rd AGU but there are only 2 paths from the CPU to the L1 cache, so only 2 AGU's can actualy access memory simultaneously.

(yes, its true.. I know an insane amount of AMD64-specific stuff)

The AMD64, for the most part, is a much more pleasant optimization experience than the P4 is. The P4 has an insanely long pipeline and some crazy assymetric execution units.. the situation exists that some instructions are PORT 1 or 2, while others are only PORT 1, and others are just PORT 2. This makes it very hard to schedule properly so that you don't have one of those "1 or 2" instructions sitting in 2 while you also have some 2-only work to do. This has lead to generalized optimization strategies such as pairing up two single-uops or two double-uops (never having an odd number of either next to each other, unless you have a tripple-uop as well, in which case 3-1-1-1 .. the P4 is crazy stuff.. really.. its horrible from an optimizer standpoint)
Title: Re: Any way to simplify this loop?
Post by: jj2007 on August 16, 2009, 03:11:44 PM
Rockoon,
Can you test my snippet above on your AMD? I am curious how imul performs relative to lea. If you have an idea why the simple mul is so slow in comparison, I'd like to know...
Title: Re: Any way to simplify this loop?
Post by: qWord on August 16, 2009, 03:20:14 PM
The advantage on lea is that you can choose the registers you want to use. For this purpose I've written an macro that let me multiply values  up to 50 - maybe some is interested in ...

regards, qWord
Title: Re: Any way to simplify this loop?
Post by: dedndave on August 16, 2009, 04:13:04 PM
thanks qWord   :U
dang - i wanted to use 51 - lol
Title: Re: Any way to simplify this loop?
Post by: Rockoon on August 16, 2009, 05:45:19 PM

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1603    millicycles for mul ecx
1409    millicycles for imul eax, eax, 10
1162    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1610    millicycles for mul ecx
2714    millicycles for imul eax, eax, 10
1128    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1614    millicycles for mul ecx
1382    millicycles for imul eax, eax, 10
1160    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2944    millicycles for mul ecx
1344    millicycles for imul eax, eax, 10
1143    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1593    millicycles for mul ecx
1378    millicycles for imul eax, eax, 10
1133    millicycles for lea

--- ok ---

C:\asm\masm_forums>


There seems to be quite a bit of deviation in the tests as-is, and the test isnt "real world" which is something I dont spend time on anymore (its futile, as you don't actualy learn anything usefull.. such as how well it pairs with other relevant instructions in a real situation)
Title: Re: Any way to simplify this loop?
Post by: jj2007 on August 16, 2009, 05:54:21 PM
Quote from: qWord on August 16, 2009, 03:20:14 PM
The advantage on lea is that you can choose the registers you want to use. For this purpose I've written an macro that let me multiply values  up to 50 - maybe some is interested in ...

regards, qWord

Cute. I added the imul variant which, by the way, allows to choose the registers.

lea vs. mul
const   IMUL    MUL     LEA             faster?
0       17      41      12              lea or macro
1       18      41      3               lea or macro
2       17      41      12              lea or macro
3       17      41      12              lea or macro
4       17      41      12              lea or macro
5       17      41      13              lea or macro


My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
         REPEAT 16
            mov eax,esi
            mov ecx, cntr   ; should be loaded somewhere
            mul ecx
         ENDM
Title: Re: Any way to simplify this loop?
Post by: dedndave on August 16, 2009, 06:01:08 PM
millicycles ?
i knew a gal named Milli, once
Title: Re: Any way to simplify this loop?
Post by: qWord on August 16, 2009, 08:16:29 PM
Quote from: jj2007 on August 16, 2009, 05:54:21 PM
My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
it should be "mov ecx,cntr"  :red

Quote..imul variant which, by the way, allows to choose the registers
I've never thought that imul is such fast... You live and learn  :U

regards, qWord
Title: Re: Any way to simplify this loop?
Post by: Rockoon on August 16, 2009, 08:26:49 PM
One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?
Title: Re: Any way to simplify this loop?
Post by: jj2007 on August 16, 2009, 09:26:23 PM
Quote from: Rockoon on August 16, 2009, 08:26:49 PM
One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?

My Celeron M is a Yonah, pretty young. I had thought about the 64-bit result, too - but more than twice the cycle count is quite a bit.
When using e.g. imul edx, edx, 10, one must indeed keep in mind that in contrast to mul the result is only 32-bit. On the positive side, edx is not being trashed...
Title: Re: Any way to simplify this loop?
Post by: NightWare on August 17, 2009, 08:21:58 PM
qWord, i haven't made speed test (or even tested them), but have you tried those combinations ?

for 7 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 11 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 13 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 15 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 17 :
   lea TmpReg,[DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 19 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 21 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 22 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   add DestSrcReg,DestSrcReg
for 25 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 26 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
   add DestSrcReg,DestSrcReg
for 27 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 37 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 38 :
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   shl DestSrcReg,5
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 39 :
   lea TmpReg,[DestSrcReg+DestSrcReg*8]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 42 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*4]
   add DestSrcReg,DestSrcReg
for 44 :
   lea TmpReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   shl DestSrcReg,2
for 45 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 49 :
   mov TmpReg,DestSrcReg
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[TmpReg+DestSrcReg*4]
for 50 :
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   add DestSrcReg,DestSrcReg


Astro, a little exercise for you, what are the multiplier here ? (a little trap here...)
1)
   lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
   shr DestSrcReg,2
2)
   lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
   shr DestSrcReg,1
3)
   lea TmpReg,[DestSrcReg+DestSrcReg*2]
   lea DestSrcReg,[DestSrcReg+TmpReg*2]
   shr DestSrcReg,2
Title: Re: Any way to simplify this loop?
Post by: qWord on August 17, 2009, 10:43:14 PM
Quote from: NightWare on August 17, 2009, 08:21:58 PM
have you tried those combinations ?
maybe - I've written this macro some time back. I'm surly,that I doesn't get idea of all combination you post here, but if i remember rightly, sometimes, it was faster to replace  two lea's with one lea and add/sub/shl.

regards, qWord
Title: Re: Any way to simplify this loop?
Post by: dedndave on August 17, 2009, 11:01:17 PM
yes - Jochen ran those tests the other day and i must say i was disappointed in lea, and a little surprised
i guess if you are looking for small code, it is nice, but speed is more important these days
i had written a times 10 table generator and used lea eax,[eax+4*eax] to replace a mov/shift/add
i am sure the mov edx,eax - shl eax,2 - add eax,edx was faster to begin with, although it saved a couple bytes
Title: Re: Any way to simplify this loop?
Post by: qWord on August 17, 2009, 11:44:11 PM
here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub       : 23
lea+lea+lea       : 41



    print "multiple by 39",10,13
    print "---------------------------",10,13
        counter_begin 1000000, REALTIME_PRIORITY_CLASS
            mov edi,12345
            REPEAT 16
                mov edx,edi
                mov eax,edx
                lea edx,[edx+edx*4]
                shl edx,3
                sub edx,eax
            ENDM
        counter_end
        ;shr eax,4
        push eax
        print "lea+shl+sub       : "
        pop eax
        print udword$(eax),10,13
       
        counter_begin 1000000, REALTIME_PRIORITY_CLASS
            mov edi,12345
            REPEAT 16
               mov edx,edi
               lea eax,[edx+edx*8]
               lea edx,[edx+edx*2]
               lea edx,[edx+eax*4]
            ENDM
        counter_end
        ;shr eax,4
        push eax
        print "lea+lea+lea       : "
        pop eax
        print udword$(eax),10,13
Title: Re: Any way to simplify this loop?
Post by: hutch-- on August 17, 2009, 11:51:25 PM
Generally on later hardware you avoid MUL if you can, shifts unless you have to and use ADD instead of LEA if you can. It depends on the miltiplication you need to do.

1st. Use ADD if you can.
2nd. Use LEA if you get a better encoding.
3rd. Use a shift if it gets you the range you need.
4th. Only use MUL if you have to as it has always been slow.
Title: Re: Any way to simplify this loop?
Post by: jj2007 on August 18, 2009, 06:32:55 AM
Quote from: qWord on August 17, 2009, 11:44:11 PM
here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub       : 23
lea+lea+lea       : 41


I get 38 for lea+shl+sub and 17 for imul.
Title: Re: Any way to simplify this loop?
Post by: Rockoon on August 18, 2009, 10:40:30 AM
"Bad" performance on multiplication hasn't really been true since the Pentium4.

The following assumes you are targeting AMD64/Core2/I7...

If you can do it with a single shift or lea, then sure.. and you can also possibly get away with 2 leas/shifts and beat out imul on Core2 (but not AMD64), but once you are at 3 leas/shifts, imul is going to be no worse unless you can somehow pair up those lea's and hide their latencies while not being able to do the same for the imul.

The imul latency is 3 cycles on Core2's and AMD64's ..
The lea latency is 1 cycle on Core2, and 2 cycles on AMD64

a dependency chain of 3 lea's thus takes 3 cycles on a core2 (same as imul), and *6* cycles on an AMD64 (twice as slow as an imul)


(Latency figures taken from Agner Fog's excellent research)
Title: Re: Any way to simplify this loop?
Post by: Astro on August 18, 2009, 10:35:35 PM
 :dazzled:

I need a week to read and consider this thread! Awesome!  :eek

Best regards,
Astro.
Title: Re: Any way to simplify this loop?
Post by: NightWare on August 19, 2009, 10:15:11 PM
imm_mul MACRO Constant:REQ,DestSrcReg:REQ,TmpReg
IF Constant EQ 0
xor DestSrcReg,DestSrcReg
ELSEIF Constant EQ 1
;do nothing ;)
ELSEIF Constant EQ 2
add DestSrcReg,DestSrcReg
ELSEIF Constant EQ 3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
ELSEIF Constant EQ 4
shl DestSrcReg,2
ELSEIF Constant EQ 5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
ELSEIF Constant EQ 6
add DestSrcReg,DestSrcReg ;2*3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*2
; shl DestSrcReg,1
ELSEIF Constant EQ 7
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;8-1
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 8
shl DestSrcReg,3
ELSEIF Constant EQ 9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
ELSEIF Constant EQ 10
add DestSrcReg,DestSrcReg ;2*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*2
; shl DestSrcReg,1
ELSEIF Constant EQ 11
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;9+2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 12
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*4
shl DestSrcReg,2
ELSEIF Constant EQ 13
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16-3
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 14
add DestSrcReg,DestSrcReg ;2*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;16-2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 15
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;16-1
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 16
shl DestSrcReg,4
ELSEIF Constant EQ 17
lea TmpReg,[DestSrcReg*8] ;(8*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;16+1
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 18
add DestSrcReg,DestSrcReg ;2*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*2
; shl DestSrcReg,1
ELSEIF Constant EQ 19
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;18+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16+3
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 20
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*4
shl DestSrcReg,2
ELSEIF Constant EQ 21
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;16+5
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 22
add DestSrcReg,DestSrcReg ;2*((5*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 23
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*8)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;32-9
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 24
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*8
shl DestSrcReg,3
ELSEIF Constant EQ 25
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;16+9
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 26
add DestSrcReg,DestSrcReg ; 2*((3*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 27
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;32-5
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 28
shl DestSrcReg,2 ;4*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*4]
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 29
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32-3
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 30
add DestSrcReg,DestSrcReg ; 2*(3*5)
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;32-2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 31
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*9)+4
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;32-1
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 32
shl DestSrcReg,5
ELSEIF Constant EQ 33
lea TmpReg,[DestSrcReg*4] ;(4*8)+1
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; mov TmpReg,DestSrcReg ;32+1
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 34
add DestSrcReg,DestSrcReg ;2*((8*2)+1)
lea TmpReg,[DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*2] ;32+2
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 35
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32+3
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 36
lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*4
shl DestSrcReg,2
ELSEIF Constant EQ 37
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; mov TmpReg,DestSrcReg ;(9*4)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 38
add DestSrcReg,DestSrcReg ;2*((9*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 39
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;((3*4)+1)*3
lea DestSrcReg,[DestSrcReg+TmpReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(5*8)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 40
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*8
shl DestSrcReg,3
ELSEIF Constant EQ 41
mov TmpReg,DestSrcReg ;(5*8)+1
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; mov TmpReg,DestSrcReg ;(5*8)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 42
add DestSrcReg,DestSrcReg ;2*((5*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(16+5)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 43
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(5*8)+3
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+2*DestSrcReg] ;(5*8)+3
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 44
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;((5*2)+1)*4
lea DestSrcReg,[DestSrcReg+TmpReg*2]
shl DestSrcReg,2
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*4
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,2
ELSEIF Constant EQ 45
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+9
; mov TmpReg,DestSrcReg
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 46
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+1
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(32-9)*2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 47
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+2
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(3*16)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 48
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*16
shl DestSrcReg,4
ELSEIF Constant EQ 49
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*16)+1
shl TmpReg,4
lea DestSrcReg,[DestSrcReg+TmpReg]
; mov TmpReg,DestSrcReg ;(3*16)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 50
add DestSrcReg,DestSrcReg ;(2*5)*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+16)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSE
%echo err: constant value too large! (>50)
ENDIF
ENDM



here all the combinations i've found, there is certainly others/better, but i will not pass my life to search
i've limited myself to 3 instructions, and avoided the sub instruction (unfortunatelly my replacement are generally slower...).
you to test if it's interesting or not, personally i've made speed test (basically the same as qword, except i've put "mov DestSrcReg,DestSrcReg" before and after the algo, to see the effect with the possibles stalls).