mov ebx,0h
mov ecx,numdev
MEM_ALLOC:
mov ebx,[ebx]+64h
dec ecx
cmp ecx,0h
jnz MEM_ALLOC
* sigh *
Any way to simplify this? I tried:
mov ebx,64h
mov ebx,[ebx]*numdev ; error here
...but it complains "constant expected".
Best regards,
Astro.
Are those brackets in the wrong place? I'm not sure I see what this does, or why you can't just use a MUL:
mov eax, 64
mul numdev, eax
mov ebx, eax
-r
mov ebx,0h ;xor ebx,ebx
...
mov ebx,[ebx]+64h
you are accessing a DWORD located at 0+64h
Quote...but it complains "constant expected".
an Address can be specified through Scale, Index and Base (SIB)
Scale: multiply a register (32bit) by 1,2,4 or 8
Index: register (32bit)
Base: an offset
e.g.: mov eax,DWORD ptr [eax+ecx*4+1234]
EDIT: "dec ecx" sets the zero flag - there is no need for cmp (BTW: test reg,reg is better for such an task).
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?
Quote"dec ecx" sets the zero flag - there is no need for cmp
OK!
Quote(BTW: test reg,reg is better for such an task).
According to the MASM docs, both take the same time to execute. What is the advantage of one vs. the other?
Quoteor why you can't just use a MUL:
haha - it seems to hate me and generates an error.
Best regards,
Astro.
Wow - it worked.
I used the mul option - two lines.
mov eax,64h
mul numdev
:U
Thanks!
Best regards,
Astro.
:bg
mov ebx,0h
mov ecx,numdev
MEM_ALLOC:
mov ebx,[ebx]+64h
dec ecx
cmp ecx,0h
jnz MEM_ALLOC
You don't need to do the compare as DEC and SUB ECX, 1 set the zero flag for the following JNZ.
@@:
mov ebx, [ebx+64h]
sub ecx, 1
jnz @B
Quote from: Astro on August 15, 2009, 01:45:02 PM
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?
Don't know what do explain here - first the Addr. is calculated and then the memory is accessed.
Quote from: Astro on August 15, 2009, 01:45:02 PM
What is the advantage of one vs. the other?
Intel suggest to use it for such task :bg
EDIT: from Intel's Optimization Reference:
QuoteUse a TEST if a register with itself instead of a CMP of the register
to zero, this saves the need to encode the zero and saves encoding space
regards, qWord
In what order is:
eax+ecx*4+1234
executed?
Does it follow BODMAS, or some other convention?
(4+1234) * ecx
is not the same as
(4*ecx) + 1234
QuoteIntel suggest to use it for such task :bg
QuoteUse a TEST if a register with itself instead of a CMP of the register
to zero, this saves the need to encode the zero and saves encoding space
Thanks!
Best regards,
Astro.
the mul is performed first
it is a shortcut, is all
"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
"test eax,eax" is also a shortcut
register to register instructions are faster and sometimes shorter than register to immediate
"or eax,eax" and "and eax,eax" do the same thing
which is fastest is a matter of some discussion - lol
out of habit, i use "or" - it was best back in the days of the 8088 - lol
it is good to learn which flags are examined for the different conditional jumps
it is also good to learn how the simplest instructions affect the flags
shifts and rotates are interesting, of course
arithmetic instructions affect the flags according to the operation
logic instructions are similar, except they always clear the carry flag (except "not", which alters no flags at all)
"not" (logic inst) and "neg" (arith inst) are special they do not always leave the flags as expected
suppose i had the following code...
add eax,dwordA
adc edx,dwordB
after the add instructions, edx is 0, but eax is non-zero - will the zero flag be set or cleared ?
http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-1.html
Quotewill the zero flag be set or cleared ?
I shall take a look. I have a 50/50 chance of being right. :bdg
Where do you find these links? :dazzled: I must be using the wrong search terms.
Best regards,
Astro.
Quote from: Astro on August 15, 2009, 01:45:02 PM
mov eax,DWORD ptr [eax+ecx*4+1234]
Can you explain exactly how this gets executed please?
The value in the [] brackets is an indirect memory operand that specifies the address of the data to operate on. The address is calculated at run time from the contents of one or two registers one of which can have a scale factor, and any number of displacements. In the example "4" is a scale factor, and "1234" is a displacement. The address is calculated using the standard order of operations, the same order that you would use in solving an algebraic equation.
The details of the "complex addressing mode" are worth the effort to learn as they give you the capacity to write very compact code for memory access. The conventions are as follows,
BASE ADDRESS which is a register.
INDEX which is a register
SCALE is required for the data size of the data being indexed. 1 2 4 8.
DISPLACEMENT additional address change with an immediate number.
There have been a few variations on the notation and MASM supports some of them but mainly the complete complex address is contained within a pair of square brackets.
The assembler can tolerate some order variation and will still parse the address correctly but it also allows an "append" type notation of one piar of square brackets followed by another pair of brackets. It acts like the add "+" operator.
What is important to understand is that the notation is a method of specifying a single opcode in the processor which is why you have "mnemonics" which are family names for similar capacities. Just for example the MOV mnemonic has multiple opcodes depending on the operant types and order, instead of having to remember all of them, you use a mnemonic with a well understood notation.
Its worth noting that there are mnemonics that vary from the above order, one mnemonic allows a displacement as the base address which can be very useful in certain contexts where you do not have to use a register. Its often used for array addresses.
A great post, Hutch! :thumbu
Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
:dazzled: A few hours later and I'm no wiser. :eek
QuoteThe address is calculated at run time from the contents of one or two registers one of which can have a scale factor, and any number of displacements. In the example "4" is a scale factor, and "1234" is a displacement. The address is calculated using the standard order of operations, the same order that you would use in solving an algebraic equation.
OK! It is rather obvious I know about 1% of what I need to.
Thanks for the replies - I need to do a lot of reading. :eek
Best regards,
Astro.
here, *4 corespond to "shl ecx,2"
QuoteThanks for the replies - I need to do a lot of reading.
when you first start out, you may find it handy to play with some little pieces of code in the debugger
that way, you can watch the registers and flags change as each instruction is executed
gawd - i spent hours with SymDeb - the old symbolic debugger - lol
besides, it's fun to watch all the lights go blinkin :P
Astro,
Give this a blast.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
.data?
value dd ?
.data
array dd 0,1,2,3,4
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
push esi
push edi
mov esi, OFFSET array ; ESI holds the BASE ADDRESS
; changing displacement
; ---------------------
print str$( [esi] ),13,10
print str$( [esi+4] ),13,10
print str$( [esi+8] ),13,10
print str$( [esi+12] ),13,10
print str$( [esi+16] ),13,10,13,10
; changing address with INDEX
; ---------------------------
mov edi, 0 ; EDI is INDEX
print str$( [esi+edi] ),13,10
add edi, 4
print str$( [esi+edi] ),13,10
add edi, 4
print str$( [esi+edi] ),13,10
add edi, 4
print str$( [esi+edi] ),13,10
add edi, 4
print str$( [esi+edi] ),13,10,13,10
add edi, 4
; changing address with INDEX times MULTIPLIER
; ------------------------------
mov edi, 0
print str$( [esi+edi*4] ),13,10
add edi, 1
print str$( [esi+edi*4] ),13,10
add edi, 1
print str$( [esi+edi*4] ),13,10
add edi, 1
print str$( [esi+edi*4] ),13,10
add edi, 1
print str$( [esi+edi*4] ),13,10,13,10
add edi, 1
; BASE address (ESI) + INDEX (EDI) times MULTIPLIER + 12 bytes displacement
mov edi, 0
print str$( [esi+edi*4+12] ),13,10
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
As someone else had noted, the LEA instruction is very powerfull. It has evolved into quite the workhorse. It isnt always the optimal solutions, but has capabilities that simply do not exist otherwise.
It isnt always the most efficient solution, but it is always quite efficient none-the-less because the AGU (Address Generation Unit) gets quite a bit of loving from both Intel and AMD due to its overall importance. I think P4's are the "worst off" in regards to the AGU because there is only one of them, but even there its often the best solution to particular problems (such as when you need a 3-operand addition instruction)
Quote from: Astro on August 16, 2009, 01:03:25 AM
Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
A few hours later and I'm no wiser.
Hi,
Well, a multiply is (was) regarded as an expensive
(time consuming) instruction. Adds and shifts are
much faster and require less setup. So:
5(ecx) = 1(ecx) + 4(ecx)
Is better than
5(ecx) = 5*ecx
Especially as
Quotehere, *4 corespond to "shl ecx,2"
So, 5(ecx) = ecx + (SHL ecx,2), an add and a shift
instead of an actual multiply (which requires eax and
a real live 5 as well). So, a bit complex to see for the
first time, but a good tool once understood.
HTH,
Steve N.
The performance drop with LEA in the PIV family of processors was very unfortunate as it was generally a very fast instruction that could also do compact calculations but with a PIV you only use it on odd occasions.
I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.
Quote from: hutch-- on August 16, 2009, 01:28:23 PM
I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.
Test it :bg
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
LOOP_COUNT = 10000
.code
start:
print "Comparing mul, imul and lea for performing 'mov eax, 10*eax' :", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1
mov ecx, 10 ; 6 bytes for mov ecx, 10 plus mul ecx
REPEAT 8
mul ecx
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for mul ecx", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1 ; 3 bytes
REPEAT 8
imul eax, eax, 10
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for imul eax, eax, 10", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 125
mov eax, 1 ; 5 bytes
REPEAT 8
lea eax, [eax+4*eax]
add eax, eax
ENDM
ENDM
counter_end
print str$(eax), 9, "millicycles for lea", 13, 10
inkey chr$(13, 10, "--- ok ---", 13)
exit
end start
Celeron M:
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2657 millicycles for mul ecx
1122 millicycles for imul eax, eax, 10
1214 millicycles for lea
No use, I have 3 PIVs to play with but no later stuff yet. I want a Nehalem double quad running at 4 gig next. :P
For the record, all AMD64's can execute all of the "fastpath" ALU instructions, other than multiplication, on each of its ALU's. The ALU's and AGU's are all symmetric except for multiplication, which can only be handled on one of the ALU's.
This fact throttles the AMD64 to a maximum of 1 integer multiplication per clock cycle, so in practice is can sometimes be advantageous to use LEA instead of MUL/IMUL. In practice you will find that it isnt usualy advantageous because it is quite easy to have 3 or more AGU instructions in the pipeline at any given time (all memory reads and writes require the AGU's.) When you unroll that loop with the single memory read and single memory write, its now a loop with 4 or more AGU instructions... and that means that the AGU's are often one of your bottlenecks points.
All of the integer ALU operations that read from or write to memory directly require BOTH an ALU and and AGU to accomplish the work.
The AMD64 has 9 execution units.
3 ALU's (Arithmetic and Logic Unit)
3 AGU's (Address Generation Unit)
1 FADD unit (additions, subtractions, and "any")
1 FMUL unit (multiplication, division, and "any")
1 FMISC unit (float<->int, load and store, and "any")
The "any" includes most of the macro-operations, which sometimes require all 3 units simultaneously.
Also of some note is the fact that an AMD64 can only perform two memory reads per clock cycle even though it has 3 AGU's. You can still perform an LEA on that 3rd AGU but there are only 2 paths from the CPU to the L1 cache, so only 2 AGU's can actualy access memory simultaneously.
(yes, its true.. I know an insane amount of AMD64-specific stuff)
The AMD64, for the most part, is a much more pleasant optimization experience than the P4 is. The P4 has an insanely long pipeline and some crazy assymetric execution units.. the situation exists that some instructions are PORT 1 or 2, while others are only PORT 1, and others are just PORT 2. This makes it very hard to schedule properly so that you don't have one of those "1 or 2" instructions sitting in 2 while you also have some 2-only work to do. This has lead to generalized optimization strategies such as pairing up two single-uops or two double-uops (never having an odd number of either next to each other, unless you have a tripple-uop as well, in which case 3-1-1-1 .. the P4 is crazy stuff.. really.. its horrible from an optimizer standpoint)
Rockoon,
Can you test my snippet above on your AMD? I am curious how imul performs relative to lea. If you have an idea why the simple mul is so slow in comparison, I'd like to know...
The advantage on lea is that you can choose the registers you want to use. For this purpose I've written an macro that let me multiply values up to 50 - maybe some is interested in ...
regards, qWord
thanks qWord :U
dang - i wanted to use 51 - lol
C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1603 millicycles for mul ecx
1409 millicycles for imul eax, eax, 10
1162 millicycles for lea
--- ok ---
C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1610 millicycles for mul ecx
2714 millicycles for imul eax, eax, 10
1128 millicycles for lea
--- ok ---
C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1614 millicycles for mul ecx
1382 millicycles for imul eax, eax, 10
1160 millicycles for lea
--- ok ---
C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2944 millicycles for mul ecx
1344 millicycles for imul eax, eax, 10
1143 millicycles for lea
--- ok ---
C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1593 millicycles for mul ecx
1378 millicycles for imul eax, eax, 10
1133 millicycles for lea
--- ok ---
C:\asm\masm_forums>
There seems to be quite a bit of deviation in the tests as-is, and the test isnt "real world" which is something I dont spend time on anymore (its futile, as you don't actualy learn anything usefull.. such as how well it pairs with other relevant instructions in a real situation)
Quote from: qWord on August 16, 2009, 03:20:14 PM
The advantage on lea is that you can choose the registers you want to use. For this purpose I've written an macro that let me multiply values up to 50 - maybe some is interested in ...
regards, qWord
Cute. I added the imul variant which, by the way, allows to choose the registers.
lea vs. mul
const IMUL MUL LEA faster?
0 17 41 12 lea or macro
1 18 41 3 lea or macro
2 17 41 12 lea or macro
3 17 41 12 lea or macro
4 17 41 12 lea or macro
5 17 41 13 lea or macro
My timings for the mul variant are higher than yours because I added one line:
mov edi,cntr
mov esi,num
REPEAT 16
mov eax,esi
mov ecx, cntr ; should be loaded somewhere
mul ecx
ENDM
millicycles ?
i knew a gal named Milli, once
Quote from: jj2007 on August 16, 2009, 05:54:21 PM
My timings for the mul variant are higher than yours because I added one line:
mov edi,cntr
mov esi,num
it should be "mov ecx,cntr" :red
Quote..imul variant which, by the way, allows to choose the registers
I've never thought that imul is such fast... You live and learn :U
regards, qWord
One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?
Quote from: Rockoon on August 16, 2009, 08:26:49 PM
One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?
My Celeron M is a Yonah, pretty young. I had thought about the 64-bit result, too - but more than twice the cycle count is quite a bit.
When using e.g.
imul edx, edx, 10, one must indeed keep in mind that in contrast to
mul the result is only 32-bit. On the positive side, edx is not being trashed...
qWord, i haven't made speed test (or even tested them), but have you tried those combinations ?
for 7 :
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 11 :
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 13 :
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 15 :
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 17 :
lea TmpReg,[DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 19 :
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 21 :
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 22 :
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
add DestSrcReg,DestSrcReg
for 25 :
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
for 26 :
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
add DestSrcReg,DestSrcReg
for 27 :
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 37 :
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 38 :
lea TmpReg,[DestSrcReg+DestSrcReg*2]
shl DestSrcReg,5
lea DestSrcReg,[DestSrcReg+TmpReg*2]
for 39 :
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
for 42 :
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
add DestSrcReg,DestSrcReg
for 44 :
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
shl DestSrcReg,2
for 45 :
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
for 49 :
mov TmpReg,DestSrcReg
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[TmpReg+DestSrcReg*4]
for 50 :
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
add DestSrcReg,DestSrcReg
Astro, a little exercise for you, what are the multiplier here ? (a little trap here...)
1)
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
shr DestSrcReg,2
2)
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
shr DestSrcReg,1
3)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
shr DestSrcReg,2
Quote from: NightWare on August 17, 2009, 08:21:58 PM
have you tried those combinations ?
maybe - I've written this macro some time back. I'm surly,that I doesn't get idea of all combination you post here, but if i remember rightly, sometimes, it was faster to replace two lea's with one lea and add/sub/shl.
regards, qWord
yes - Jochen ran those tests the other day and i must say i was disappointed in lea, and a little surprised
i guess if you are looking for small code, it is nice, but speed is more important these days
i had written a times 10 table generator and used lea eax,[eax+4*eax] to replace a mov/shift/add
i am sure the mov edx,eax - shl eax,2 - add eax,edx was faster to begin with, although it saved a couple bytes
here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub : 23
lea+lea+lea : 41
print "multiple by 39",10,13
print "---------------------------",10,13
counter_begin 1000000, REALTIME_PRIORITY_CLASS
mov edi,12345
REPEAT 16
mov edx,edi
mov eax,edx
lea edx,[edx+edx*4]
shl edx,3
sub edx,eax
ENDM
counter_end
;shr eax,4
push eax
print "lea+shl+sub : "
pop eax
print udword$(eax),10,13
counter_begin 1000000, REALTIME_PRIORITY_CLASS
mov edi,12345
REPEAT 16
mov edx,edi
lea eax,[edx+edx*8]
lea edx,[edx+edx*2]
lea edx,[edx+eax*4]
ENDM
counter_end
;shr eax,4
push eax
print "lea+lea+lea : "
pop eax
print udword$(eax),10,13
Generally on later hardware you avoid MUL if you can, shifts unless you have to and use ADD instead of LEA if you can. It depends on the miltiplication you need to do.
1st. Use ADD if you can.
2nd. Use LEA if you get a better encoding.
3rd. Use a shift if it gets you the range you need.
4th. Only use MUL if you have to as it has always been slow.
Quote from: qWord on August 17, 2009, 11:44:11 PM
here is a short test to show what I've mean:
(on an c2d)
multiple by 39
---------------------------
lea+shl+sub : 23
lea+lea+lea : 41
I get 38 for lea+shl+sub and 17 for imul.
"Bad" performance on multiplication hasn't really been true since the Pentium4.
The following assumes you are targeting AMD64/Core2/I7...
If you can do it with a single shift or lea, then sure.. and you can also possibly get away with 2 leas/shifts and beat out imul on Core2 (but not AMD64), but once you are at 3 leas/shifts, imul is going to be no worse unless you can somehow pair up those lea's and hide their latencies while not being able to do the same for the imul.
The imul latency is 3 cycles on Core2's and AMD64's ..
The lea latency is 1 cycle on Core2, and 2 cycles on AMD64
a dependency chain of 3 lea's thus takes 3 cycles on a core2 (same as imul), and *6* cycles on an AMD64 (twice as slow as an imul)
(Latency figures taken from Agner Fog's excellent research)
:dazzled:
I need a week to read and consider this thread! Awesome! :eek
Best regards,
Astro.
imm_mul MACRO Constant:REQ,DestSrcReg:REQ,TmpReg
IF Constant EQ 0
xor DestSrcReg,DestSrcReg
ELSEIF Constant EQ 1
;do nothing ;)
ELSEIF Constant EQ 2
add DestSrcReg,DestSrcReg
ELSEIF Constant EQ 3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
ELSEIF Constant EQ 4
shl DestSrcReg,2
ELSEIF Constant EQ 5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
ELSEIF Constant EQ 6
add DestSrcReg,DestSrcReg ;2*3
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*2
; shl DestSrcReg,1
ELSEIF Constant EQ 7
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;8-1
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 8
shl DestSrcReg,3
ELSEIF Constant EQ 9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
ELSEIF Constant EQ 10
add DestSrcReg,DestSrcReg ;2*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*2
; shl DestSrcReg,1
ELSEIF Constant EQ 11
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;9+2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 12
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*4
shl DestSrcReg,2
ELSEIF Constant EQ 13
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16-3
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 14
add DestSrcReg,DestSrcReg ;2*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;16-2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 15
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;16-1
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 16
shl DestSrcReg,4
ELSEIF Constant EQ 17
lea TmpReg,[DestSrcReg*8] ;(8*2)+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; mov TmpReg,DestSrcReg ;16+1
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 18
add DestSrcReg,DestSrcReg ;2*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*2
; shl DestSrcReg,1
ELSEIF Constant EQ 19
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;18+1
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;16+3
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 20
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*4
shl DestSrcReg,2
ELSEIF Constant EQ 21
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;16+5
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 22
add DestSrcReg,DestSrcReg ;2*((5*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*2
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 23
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*8)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;32-9
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 24
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*8
shl DestSrcReg,3
ELSEIF Constant EQ 25
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;16+9
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 26
add DestSrcReg,DestSrcReg ; 2*((3*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 27
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;32-5
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 28
shl DestSrcReg,2 ;4*((3*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*4]
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 29
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*4)+9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32-3
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 30
add DestSrcReg,DestSrcReg ; 2*(3*5)
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg] ;32-2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 31
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*9)+4
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*4]
; mov TmpReg,DestSrcReg ;32-1
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 32
shl DestSrcReg,5
ELSEIF Constant EQ 33
lea TmpReg,[DestSrcReg*4] ;(4*8)+1
lea DestSrcReg,[DestSrcReg+TmpReg*8]
; mov TmpReg,DestSrcReg ;32+1
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 34
add DestSrcReg,DestSrcReg ;2*((8*2)+1)
lea TmpReg,[DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg*2] ;32+2
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 35
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)-1
neg DestSrcReg
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;32+3
; shl DestSrcReg,5
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 36
lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;9*4
shl DestSrcReg,2
ELSEIF Constant EQ 37
lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+1
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; mov TmpReg,DestSrcReg ;(9*4)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 38
add DestSrcReg,DestSrcReg ;2*((9*2)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg*2]
; lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(16+3)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 39
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;((3*4)+1)*3
lea DestSrcReg,[DestSrcReg+TmpReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(5*8)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 40
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*8
shl DestSrcReg,3
ELSEIF Constant EQ 41
mov TmpReg,DestSrcReg ;(5*8)+1
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; mov TmpReg,DestSrcReg ;(5*8)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 42
add DestSrcReg,DestSrcReg ;2*((5*4)+1)
lea TmpReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+TmpReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(16+5)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 43
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(5*8)+3
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[TmpReg+DestSrcReg*8]
; lea TmpReg,[DestSrcReg+2*DestSrcReg] ;(5*8)+3
; lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; shl DestSrcReg,3
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 44
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;((5*2)+1)*4
lea DestSrcReg,[DestSrcReg+TmpReg*2]
shl DestSrcReg,2
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+2)*4
; shl DestSrcReg,1
; add DestSrcReg,TmpReg
; shl DestSrcReg,2
ELSEIF Constant EQ 45
lea DestSrcReg,[DestSrcReg+DestSrcReg*4] ;5*9
lea DestSrcReg,[DestSrcReg+DestSrcReg*8]
; lea DestSrcReg,[DestSrcReg+DestSrcReg*8] ;(9*4)+9
; mov TmpReg,DestSrcReg
; shl DestSrcReg,2
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 46
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+1
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[DestSrcReg+TmpReg]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(32-9)*2
; shl DestSrcReg,5
; sub DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSEIF Constant EQ 47
lea TmpReg,[DestSrcReg+DestSrcReg*4] ;(5*9)+2
lea TmpReg,[TmpReg+TmpReg*8]
lea DestSrcReg,[TmpReg+DestSrcReg*2]
; mov TmpReg,DestSrcReg ;(3*16)-1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; sub DestSrcReg,TmpReg
ELSEIF Constant EQ 48
lea DestSrcReg,[DestSrcReg+DestSrcReg*2] ;3*16
shl DestSrcReg,4
ELSEIF Constant EQ 49
lea TmpReg,[DestSrcReg+DestSrcReg*2] ;(3*16)+1
shl TmpReg,4
lea DestSrcReg,[DestSrcReg+TmpReg]
; mov TmpReg,DestSrcReg ;(3*16)+1
; lea DestSrcReg,[DestSrcReg+DestSrcReg*2]
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
ELSEIF Constant EQ 50
add DestSrcReg,DestSrcReg ;(2*5)*5
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
lea DestSrcReg,[DestSrcReg+DestSrcReg*4]
; lea TmpReg,[DestSrcReg+DestSrcReg*8] ;(9+16)*2
; shl DestSrcReg,4
; add DestSrcReg,TmpReg
; shl DestSrcReg,1
ELSE
%echo err: constant value too large! (>50)
ENDIF
ENDM
here all the combinations i've found, there is certainly others/better, but i will not pass my life to search
i've limited myself to 3 instructions, and avoided the sub instruction (unfortunatelly my replacement are generally slower...).
you to test if it's interesting or not, personally i've made speed test (basically the same as qword, except i've put "mov DestSrcReg,DestSrcReg" before and after the algo, to see the effect with the possibles stalls).