Print Page - best way of multiplying

Title: best way of multiplying
Post by: juliomacedo on January 18, 2006, 01:55:15 PM

Instead of executing:


mul(19, eax);

I tried to decompose it to shifts and adds, this way:

Code Select


mov(eax, ebx);
shl(3, ebx);
add(eax, ebx);
shl(1, ebx);
add(eax, ebx);

Is it the best way of doing it?

Title: Re: best way of multiplying
Post by: QvasiModo on January 18, 2006, 03:31:58 PM

My attempt... (not benchmarked)

Code Select


mov(eax, ecx);  // ecx := eax
mov(eax, edx);  // edx := eax
shl(4, ecx);    // ecx := ecx * 16
shl(1, edx);    // edx := edx * 2
add(ecx, eax);  // eax := eax + ecx
add(edx, eax);  // eax := eax + edx    16x + 2x + x = 19x

This should break the dependency chain, faster for Pentium 2 and above I think.

Title: Re: best way of multiplying
Post by: hutch-- on January 18, 2006, 03:39:34 PM

Give this a blast. The LEA may be a bit slow on a PIV but its still probably faster than using a shift.

Code Select


    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19

Title: Re: best way of multiplying
Post by: daydreamer on January 18, 2006, 05:56:38 PM

I was gonna ask this one
mov eax,050009h
mov ebx, 0190019h
mul ebx
should eax contain 9x19h and edx 5x19h ?

Title: Re: best way of multiplying
Post by: Petroizki on January 18, 2006, 07:30:49 PM

Code Select

lea ecx, [eax*2 + eax] ; ecx = eax * 3
shl eax, 4             ; eax = eax * 16
add eax, ecx

Or:

Code Select

lea ecx, [eax*8 + eax]
lea eax, [ecx*2 + eax]

I would use imul, rather than playing with these.. :eek

Title: Re: best way of multiplying
Post by: Ian_B on January 18, 2006, 11:18:55 PM

Quote from: Petroizki on January 18, 2006, 07:30:49 PM
I would use imul, rather than playing with these.. :eek

Depends what he's targetting his code for, and how critical the speed requirement is. If it's P4, then IMUL runs with minimum latency of 14 clocks, whereas 2 LEAs cost 4 clocks each, which already almost halves the "cost". It also depends what julio meant by "best" in his initial post, which he didn't clarify. Shortest code, or shortest time to process, or least number of extra registers used? It's always a trade-off.

If you have code space and registers to waste, using the simplest possible and shortest latency instructions is best for P4, as these are not only 1/2 clock cycle latency, but they can "quad-pump" through both arithmetic units, so you can process 4 at once per clock cycle if the results do not depend on one of the simultaneous calculations. That means code such as this, where register dependence (ie. using EAX after you've just modified it) is spaced as far as possible, preferably by 4 instructions, should finish in about one to one and a half times the cycles needed by a single LEA instruction, which as seen in the examples above can only do half the work.

Code Select


MOV EBX, EAX
MOV ECX, EAX
MOV EDX, EAX
ADD EAX, EAX
ADD EBX, EBX
ADD ECX, ECX
ADD EDX, EAX   <-  3*EAX
ADD EAX, EAX
ADD EBX, EBX
ADD ECX, ECX
ADD EAX, EAX   <-  8*EAX
ADD EDX, EBX
ADD EAX, ECX   <-  12*EAX
ADD EAX, EDX   <-  19*EAX

This also adds more irony to the "Intel is NOT RISC" debate, as here the P4 is doing something very RISC-like, processing simple, small instructions very fast and beating the CISC alternatives in its instruction set hands down. I see that as a strength where it allows me as a programmer to make the choice of tradeoff.

IanB

Title: Re: best way of multiplying
Post by: hutch-- on January 19, 2006, 01:57:53 AM

Fine but get it to run on a PIII. It is a specific to a PIV that LEA is no longer fast and the recommended replacement by Intel is to use up to 3 adds instead. Shifts got slower as well and both are to be avoided where possible on a PIV. Below is a clocked comparison of the two methods.

1235 LEA version
2469 ADD version
1219 LEA version
2484 ADD version
1172 LEA version
2500 ADD version
1250 LEA version
2469 ADD version
1234 LEA version
2485 ADD version
1187 LEA version
2500 ADD version
1297 LEA version
2484 ADD version
1250 LEA version
2516 ADD version

On the 2.8 gig PIV Prescott I develop on, the shorter CISC version is more than twice as fast as the longer RISC version.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push ebx
    push esi
    push edi

    ;; mul(19, eax)

    REPEAT 8

    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," LEA version",13,10

    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    MOV EBX, EAX
    MOV ECX, EAX
    MOV EDX, EAX
    ADD EAX, EAX
    ADD EBX, EBX
    ADD ECX, ECX
    ADD EDX, EAX   ; <-  3*EAX
    ADD EAX, EAX
    ADD EBX, EBX
    ADD ECX, ECX
    ADD EAX, EAX   ; <-  8*EAX
    ADD EDX, EBX
    ADD EAX, ECX   ; <-  12*EAX
    ADD EAX, EDX   ; <-  19*EAX
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," ADD version",13,10

    ENDM

    pop edi
    pop esi
    pop ebx

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

LATER:

Here is the olde skoole approach which is still about twice as fast on this Prescott PIV.

Code Select


    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    mov ecx, eax
    add ecx, eax
    mov edx, eax
    shl eax, 4
    add eax, ecx
    add eax, edx
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," SHL version",13,10

The trick is when you have a slow instruction you need to use is to interleave it with shorter faster instructions so the following instruction falls into the hole left by the slow one. You pay a speed penalty by using two consecutive slow instructions as the second one is not absorbed by the hole left by the first.

Title: Re: best way of multiplying
Post by: Ian_B on January 19, 2006, 07:59:30 AM

I agree with your timings, hutch. I really don't understand why that should be the case, though. It's a pity Mark Larson's not around much, I'm sure he'd be able to explain why 4 dependent instructions including a slow LEA are faster than my "RISC-like" code suggestion.

However, I'd point out that I was originally questioning the use of IMUL or MUL, which works out 4 times slower. And I qualified myself explicitly by pointing out that julio gave us no baseline to work from, as he didn't specify what processor he wanted to target or what aspect of "best" he considered was most important.

IanB

Title: Re: best way of multiplying
Post by: Mark_Larson on January 19, 2006, 07:48:52 PM

Quote from: Ian_B on January 19, 2006, 07:59:30 AM
I agree with your timings, hutch. I really don't understand why that should be the case, though. It's a pity Mark Larson's not around much, I'm sure he'd be able to explain why 4 dependent instructions including a slow LEA are faster than my "RISC-like" code suggestion.

I read all the posts, I just don't have a lot of time to post. You had a good idea to make it RISC-like. Usually that works great in these cases. I am a bit concerned about using GetTickCount, since it returns times in milliseconds, and not clock cycles. So you loose some accuracy. And since both your code snippets are short, and execute in under 20 cycles, it might be better to use the more accurate timings ( RDTSC). But let's look at Hutch's timings. Choosing the first 2 values for your ADD and his LEA.

1235 for his LEA. He has a 2.8 GHz Prescott. GetTickCount returns times in milliseconds. There are 1000000000 loops. So doing a little math we can convert to cycle count. Keep in mind this isn't very accurate since GetTickCount has a resolution measured in milliseconds. So if we divide by the number of loops, we will get the total number of milliseconds from one loop. Multiplying by 1000 converts to microseconds, and multiplying by 2800 converts to cycles.

Code Select


1235 / 1000000000 = 0.000001235 milliseconds.   0.000001235 * 1000 * 2800 = 3.458 cycles.

Eyeballing his code, I would have guessed it ran maybe in about 6-7 cycles ( 4 cycles for the LEA) due to all the stalls. So this might be off some what from using GetTickCount()

Code Select


    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19

Your code runs in 2469 milliseconds for 1000000000 loops, converting to cycles we get = 6.9132 cycles. Your idea for a RISC-like solution and the possibility of the P4 executing 4 ALU instructions a cycle if there aren't any dependencies are all spot on. But you have a lot of dependencies in your code that keep it from running that fast. You have to watch read/after-write depedencies. You have to keep in mind that the 4 first instructions all execute in parallel and in 1 cycle. So to the processor you have modified, EAX, EBX, ECX, and EDX all on the previous clock cycle, and now you want to do "ADD EBX, EBX" on the next line. That introduces a depedency.

Code Select


	MOV EBX, EAX \
	MOV ECX, EAX  \
	MOV EDX, EAX  /  ----- these 4 instructions all execute in 1 clock cycle.  You modified EBX on the previous cycle, and now you
 are doing an "ADD EBX,EBX", which has a depedency with the previous 4 instructions since they execute in parallel, and modify EBX.
	ADD EAX, EAX /

	ADD EBX, EBX
	ADD ECX, ECX
	ADD EDX, EAX   <-  3*EAX
	ADD EAX, EAX
	ADD EBX, EBX
	ADD ECX, ECX
	ADD EAX, EAX   <-  8*EAX
	ADD EDX, EBX
	ADD EAX, ECX   <-  12*EAX
	ADD EAX, EDX   <-  19*EAX

EDIT: I have a Linux system at work, or I'd test all this myself.

Mark

Title: Re: best way of multiplying
Post by: Ian_B on January 19, 2006, 09:08:35 PM

Thanks, Mark. Good to see you're still around!

I'd assumed that since the instructions were completing 4 in one cycle, and the latency of these small ops is 0.5 cycles, then the latency/dependency would already be resolved by the next quad-pump cycle. Clearly that's not the case. :'(

IanB

Title: Re: best way of multiplying
Post by: EduardoS on January 19, 2006, 09:27:21 PM

according to juliomacedo post he want get the value on eax and return the result on ebx, so:
lea ebx, [eax+eax*8]
lea ebx, [ebx+ebx*2]

They generate a dependency and lea is slow on PIV, but in world there aren't only PIV, you can't make this code only with independent instructions, and here there is only 2 instructions,
Making the code bigger may cause other parts to cross memory pages and slow down the final program, if this multiplication is very comon maybe it is a good idea, but how many times you need to multiply something by 19?
And last, no program execute only multiplications by 19, he probably want do something more, and this code allow him to do 2 independt instructions before and 2 after in the same clock cycles.

BTW, In Agner's pentium optimisation manual he say the PIV can only send 3 uops from trace cache to processor per clock,

looking at PIV i have the impression when it decode the instruction to put them in trace cache it group them in blocks of 3 independent instructions, and don't look for dependency when executing them.

Title: Re: best way of multiplying
Post by: Mark_Larson on January 19, 2006, 09:56:31 PM

Quote from: EduardoS on January 19, 2006, 09:27:21 PM
BTW, In Agner's pentium optimisation manual he say the PIV can only send 3 uops from trace cache to processor per clock,

looking at PIV i have the impression when it decode the instruction to put them in trace cache it group them in blocks of 3 independent instructions, and don't look for dependency when executing them.

That is correct. That is one of the drawbacks to being able to issue 4 ALU instructions a cycle. It only works if it isn't in the trace cache. This tidbit of information is also talked about in the P4 Optimization Manual ( along with the part of how it issues 4 ALU instructions per cycle for certain instructions). (I use the P4 Optimization Manual as my primary reference for code optimzation, since it has so much more detail). The 4 ALU instructions in parallel only works with some ALU instructions, and not all. If you have the P4 optimization manual, all the ALU instructions that have 0.5 cycles, can be issued at 4 a cycle when not in the trace cache. The rest cannot. The trace cache ends up hurting the ability of the processor to issue 4 ALU instructions in parallel, but it helps greatly with how fast MMX/SSE/SSE2 runs on the P4. The number of uops they take up is small. So you can actually write some SSE2 single precision floating point code that runs at about 1 cycle per instruction due to the tracecache and interleaving use of the proper execution units ( no dependencies, and interleaving MULs and ADDS).

Mark

Title: Re: best way of multiplying
Post by: hutch-- on January 20, 2006, 12:21:33 AM

Here are two variations. The first being close to identical to the LEA version I posted above. The next is a shorter RISC version that deliberately ignores the read after write stalls. It is faster than the long version by about the reduction in the instruction count.

If I have learnt this much over time, lower instruction counts with similar instruction type usually end up faster so I tend to avoid the known pitfalls with repeated slow instructions and interleave the odd slow one with the preferred shorter instructions like add, sub etc ...

I have yet to see much evidence that smaller as in byte count code goes any faster in most instances.

Code Select


    mov ecx, eax
    lea eax, [eax+eax*8]
    shl eax, 1
    add eax, ecx

Code Select


    mov ecx, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, ecx
    add eax, ecx
    add eax, ecx

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 04:24:21 AM

1. I was looking for the best solution in terms of time taken to execute...

2. I'm not worried about code size but using all registers doesn't seem a good solution since I need them for other tasks (store some other values).

3. I didn't mention processor because I expected a solution that works well in most of processors not in a specific processor... Anyway, it's very good to talk about optimizations for each processor.

4. I used 19 just as an example (I should have said this before). I'm looking for good solutions (in terms of time) that apply to all multiplications of a register for an interger constant, not specifically 19.

Thanks :U

Julio

Title: Re: best way of multiplying
Post by: hutch-- on January 20, 2006, 04:46:18 AM

julio,

For general purpose multiplication, your choices are MUL, IMUL, the floating point instructions and MMX/SSE(2) range. Fixed multiplys are faster but this is not much use to you if you need the normal range.

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 06:17:15 AM

My question was based on the following:

"Intel recommends that any fixed integer with fewer than 8 '1' bits set should be decomposed to shifts and adds"

It seems to mean that I shouldn't simply use MUL or IMUL when performing integer multiplications since MUL and IMUL are very slow instructions...

The article suggests "do the obvious 'horner's rule' method of just shifting and adding" and what I wanted to know when I created this post was if HORNER'S RULE (exemplified by the code that I wrote with first code) is the best (in terms of time) solution...

As I see, there's other suggestions as LEA solution, although I don't know if it is a good solution for all integer constants or only for a few cases such as 19.

We need more performance tests using more-reliable RDSTC and other constants besides 19...

Title: Re: best way of multiplying
Post by: gabor on January 20, 2006, 07:15:41 AM

Hello!

Quote from: hutch-- on January 20, 2006, 12:21:33 AM
I have yet to see much evidence that smaller as in byte count code goes any faster in most instances.

I am sure that longer code is faster in some cases! Example: you have an instruction that is NOT 4 or 8 bytes long. AFAIK it is recommended to insert nops or to rearrange the instruction flow so that it becomes dword aligned. Am I right, is this so? Some time ago I tried to optimize a code of mine and experienced about 7x acceleration when alignement was adjusted. I hope I'm not messing up my memories and I remember right?

There turn out to be so many condition to be taken in count that it is high time to have a tester tool that gives the opportunity to test and compare codes. I beliece Intel created such tool, but I don't remember its name. And it was not freeware... Do you know any other tools for this purpose?

btw, I always like to read about such topics: they seem a bit nonsense (mul by 19 is rather special case), and yet I have learnt from post like this really a lot! :bg

Greets, Gábor

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 01:55:31 PM

Gabor,

The purpose is not finding a way of multiplying for 19, it was just a arbitrary constant that I chose to demonstrate the method of decomposing in shifts and adds that I wanted to dicuss...

What I want to discuss is the faster way of making any integer multiplication...

Thanks

Julio

Title: Re: best way of multiplying
Post by: Tedd on January 20, 2006, 02:05:37 PM

Quote from: juliomacedo on January 20, 2006, 01:55:31 PM
What I want to discuss is the faster way of making any integer multiplication...

Then you should probably use the shift & add method.
You could include within this special cases for particular numbers, using lea and other tricks, where it would be faster than shifting and adding. However, it depends how you want to use it, if it's to generate code for performing the muliplications, then you can go to some expense at finding the best combination, but if you want to perform the multiplications then the code required to catch the special cases will probably remove the benefit of using the special cases anyway.

Title: Re: best way of multiplying
Post by: Mark_Larson on January 20, 2006, 04:17:24 PM

Quote from: hutch-- on January 20, 2006, 12:21:33 AM
...I tend to avoid the known pitfalls with repeated slow instructions and interleave the odd slow one with the preferred shorter instructions like add, sub etc ...

That's a recommended optimization technique on the P3 and earlier processors from the P3 and earlier optimization manuals. They describe it as interleaving a multi-uop instruction followed by one to three 1 uop instructions. In the P4 optimization manual they say don't do this anymore. However I have found it still works good on P4 ( at least the one I have).

Mark

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 07:10:02 PM

Quote from: Tedd on January 20, 2006, 02:05:37 PM

if it's to generate code for performing the muliplications, then you can go to some expense at finding the best combination, but if you want to perform the multiplications then the code required to catch the special cases will probably remove the benefit of using the special cases anyway.

My purpose is generate code to perform the multiplications so the code to catch the special cases will run at compile-time and won't affect run-time performance...

Title: Re: best way of multiplying
Post by: daydreamer on January 20, 2006, 07:15:08 PM

Quote from: !Czealot on January 18, 2006, 05:56:38 PM
I was gonna ask this one
mov eax,050009h
mov ebx, 0190019h
mul ebx
should eax contain 9x19h and edx 5x19h ?

ok my idea was to use it for byte multiplications, for example blend operations, using full range of one 32bit mul and first result in eax and second in edx
I have a vague memory of Mark somewhere wrote MMX/SSE has some execution units and ALU some, so execute this together with some MMX/SSE2 muls? should that be working in parallel?

Title: Re: best way of multiplying
Post by: EduardoS on January 20, 2006, 09:15:29 PM

A generic mul for PIV,
You want multiply eax by a constant,
for each bit in constant from right to left do
add ebx, eax
add eax, eax
if the bit is seted and
add eax, eax
if the bit is claered
both code should run in 1 clock cycle
with 19 (for example)

mov ebx, eax ; ebx should start with 0, so mov instead of add, the first bit is set
add eax, eax
add ebx, eax ; the second bit is seted
add eax, eax
; the third bit is cleared
add eax, eax
; the forth bit is cleared
add eax, eax
add ebx, eax ; the fifth bit is seted
;dont need final add eax, eax, i already have the result

this code should run within 5 clocks, if the tables on pentium otimisation manuals are right its not possible to be faster, and this is a general form.

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 09:55:01 PM

Quote from: EduardoS on January 20, 2006, 09:15:29 PM
if the tables on pentium otimisation manuals are right

Where are these tables?

Title: Re: best way of multiplying
Post by: juliomacedo on January 20, 2006, 10:04:19 PM

Quote from: EduardoS on January 20, 2006, 09:15:29 PM
A generic mul for PIV,
You want multiply eax by a constant,
for each bit in constant from right to left do
add ebx, eax
add eax, eax
if the bit is seted and
add eax, eax
if the bit is claered
both code should run in 1 clock cycle

Another interesting generic solution... Now we have Horner and Eduardo... Maybe LEA solution that I don't know if it is generic or specific...

What's is the faster solution (in average, for all integer constants)??? Mathematicians are welcome to answer this question :toothy

Title: Re: best way of multiplying
Post by: EduardoS on January 20, 2006, 10:55:03 PM

try www.agner.org

lea isn't so generic, it works well for 19, but not for 23 for example, and also lea is slow on PIV (still fast on AMD processors)...

Title: Re: best way of multiplying
Post by: Snouphruh on February 08, 2006, 10:17:07 AM

Quote from: Mark_Larson on January 19, 2006, 07:48:52 PM

EDIT: I have a Linux system at work, or I'd test all this myself.

Mark

What kind of Linux? Fedora Core, Suse, Mandriva, Red Hat, Alt Linux, Linux XP, ASP Linux, Debian?

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: juliomacedo on January 18, 2006, 01:55:15 PM