best way of multiplying

juliomacedo · January 18, 2006, 01:55:15 PM

Instead of executing:


mul(19, eax);

I tried to decompose it to shifts and adds, this way:

Code Select


mov(eax, ebx);
shl(3, ebx);
add(eax, ebx);
shl(1, ebx);
add(eax, ebx);

Is it the best way of doing it?

QvasiModo · January 18, 2006, 03:31:58 PM

My attempt... (not benchmarked)

Code Select


mov(eax, ecx);  // ecx := eax
mov(eax, edx);  // edx := eax
shl(4, ecx);    // ecx := ecx * 16
shl(1, edx);    // edx := edx * 2
add(ecx, eax);  // eax := eax + ecx
add(edx, eax);  // eax := eax + edx    16x + 2x + x = 19x

This should break the dependency chain, faster for Pentium 2 and above I think.

hutch-- · January 18, 2006, 03:39:34 PM

Give this a blast. The LEA may be a bit slow on a PIV but its still probably faster than using a shift.

Code Select


    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19

daydreamer · January 18, 2006, 05:56:38 PM

I was gonna ask this one
mov eax,050009h
mov ebx, 0190019h
mul ebx
should eax contain 9x19h and edx 5x19h ?

Petroizki · January 18, 2006, 07:30:49 PM

Code Select

lea ecx, [eax*2 + eax] ; ecx = eax * 3
shl eax, 4             ; eax = eax * 16
add eax, ecx

Or:

Code Select

lea ecx, [eax*8 + eax]
lea eax, [ecx*2 + eax]

I would use imul, rather than playing with these.. :eek

Ian_B · January 18, 2006, 11:18:55 PM

Quote from: Petroizki on January 18, 2006, 07:30:49 PM
I would use imul, rather than playing with these.. :eek

Depends what he's targetting his code for, and how critical the speed requirement is. If it's P4, then IMUL runs with minimum latency of 14 clocks, whereas 2 LEAs cost 4 clocks each, which already almost halves the "cost". It also depends what julio meant by "best" in his initial post, which he didn't clarify. Shortest code, or shortest time to process, or least number of extra registers used? It's always a trade-off.

If you have code space and registers to waste, using the simplest possible and shortest latency instructions is best for P4, as these are not only 1/2 clock cycle latency, but they can "quad-pump" through both arithmetic units, so you can process 4 at once per clock cycle if the results do not depend on one of the simultaneous calculations. That means code such as this, where register dependence (ie. using EAX after you've just modified it) is spaced as far as possible, preferably by 4 instructions, should finish in about one to one and a half times the cycles needed by a single LEA instruction, which as seen in the examples above can only do half the work.

Code Select


MOV EBX, EAX
MOV ECX, EAX
MOV EDX, EAX
ADD EAX, EAX
ADD EBX, EBX
ADD ECX, ECX
ADD EDX, EAX   <-  3*EAX
ADD EAX, EAX
ADD EBX, EBX
ADD ECX, ECX
ADD EAX, EAX   <-  8*EAX
ADD EDX, EBX
ADD EAX, ECX   <-  12*EAX
ADD EAX, EDX   <-  19*EAX

This also adds more irony to the "Intel is NOT RISC" debate, as here the P4 is doing something very RISC-like, processing simple, small instructions very fast and beating the CISC alternatives in its instruction set hands down. I see that as a strength where it allows me as a programmer to make the choice of tradeoff.

IanB

hutch-- · January 19, 2006, 01:57:53 AM

Fine but get it to run on a PIII. It is a specific to a PIV that LEA is no longer fast and the recommended replacement by Intel is to use up to 3 adds instead. Shifts got slower as well and both are to be avoided where possible on a PIV. Below is a clocked comparison of the two methods.

1235 LEA version
2469 ADD version
1219 LEA version
2484 ADD version
1172 LEA version
2500 ADD version
1250 LEA version
2469 ADD version
1234 LEA version
2485 ADD version
1187 LEA version
2500 ADD version
1297 LEA version
2484 ADD version
1250 LEA version
2516 ADD version

On the 2.8 gig PIV Prescott I develop on, the shorter CISC version is more than twice as fast as the longer RISC version.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push ebx
    push esi
    push edi

    ;; mul(19, eax)

    REPEAT 8

    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," LEA version",13,10

    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    MOV EBX, EAX
    MOV ECX, EAX
    MOV EDX, EAX
    ADD EAX, EAX
    ADD EBX, EBX
    ADD ECX, ECX
    ADD EDX, EAX   ; <-  3*EAX
    ADD EAX, EAX
    ADD EBX, EBX
    ADD ECX, ECX
    ADD EAX, EAX   ; <-  8*EAX
    ADD EDX, EBX
    ADD EAX, ECX   ; <-  12*EAX
    ADD EAX, EDX   ; <-  19*EAX
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," ADD version",13,10

    ENDM

    pop edi
    pop esi
    pop ebx

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

LATER:

Here is the olde skoole approach which is still about twice as fast on this Prescott PIV.

Code Select


    invoke GetTickCount
    push eax

    mov esi, 1000000000
  @@:
    mov eax, 10
  ; ----------------------
    mov ecx, eax
    add ecx, eax
    mov edx, eax
    shl eax, 4
    add eax, ecx
    add eax, edx
  ; ----------------------
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," SHL version",13,10

The trick is when you have a slow instruction you need to use is to interleave it with shorter faster instructions so the following instruction falls into the hole left by the slow one. You pay a speed penalty by using two consecutive slow instructions as the second one is not absorbed by the hole left by the first.

Ian_B · January 19, 2006, 07:59:30 AM

I agree with your timings, hutch. I really don't understand why that should be the case, though. It's a pity Mark Larson's not around much, I'm sure he'd be able to explain why 4 dependent instructions including a slow LEA are faster than my "RISC-like" code suggestion.

However, I'd point out that I was originally questioning the use of IMUL or MUL, which works out 4 times slower. And I qualified myself explicitly by pointing out that julio gave us no baseline to work from, as he didn't specify what processor he wanted to target or what aspect of "best" he considered was most important.

IanB

Mark_Larson · January 19, 2006, 07:48:52 PM

Quote from: Ian_B on January 19, 2006, 07:59:30 AM
I agree with your timings, hutch. I really don't understand why that should be the case, though. It's a pity Mark Larson's not around much, I'm sure he'd be able to explain why 4 dependent instructions including a slow LEA are faster than my "RISC-like" code suggestion.

I read all the posts, I just don't have a lot of time to post. You had a good idea to make it RISC-like. Usually that works great in these cases. I am a bit concerned about using GetTickCount, since it returns times in milliseconds, and not clock cycles. So you loose some accuracy. And since both your code snippets are short, and execute in under 20 cycles, it might be better to use the more accurate timings ( RDTSC). But let's look at Hutch's timings. Choosing the first 2 values for your ADD and his LEA.

1235 for his LEA. He has a 2.8 GHz Prescott. GetTickCount returns times in milliseconds. There are 1000000000 loops. So doing a little math we can convert to cycle count. Keep in mind this isn't very accurate since GetTickCount has a resolution measured in milliseconds. So if we divide by the number of loops, we will get the total number of milliseconds from one loop. Multiplying by 1000 converts to microseconds, and multiplying by 2800 converts to cycles.

Code Select


1235 / 1000000000 = 0.000001235 milliseconds.   0.000001235 * 1000 * 2800 = 3.458 cycles.

Eyeballing his code, I would have guessed it ran maybe in about 6-7 cycles ( 4 cycles for the LEA) due to all the stalls. So this might be off some what from using GetTickCount()

Code Select


    mov ecx, eax            ; store eax in ecx
    lea eax, [eax+eax*8]    ; multiply by 9
    add eax, eax            ; double it for 18
    add eax, ecx            ; add stored EAX for 19

Your code runs in 2469 milliseconds for 1000000000 loops, converting to cycles we get = 6.9132 cycles. Your idea for a RISC-like solution and the possibility of the P4 executing 4 ALU instructions a cycle if there aren't any dependencies are all spot on. But you have a lot of dependencies in your code that keep it from running that fast. You have to watch read/after-write depedencies. You have to keep in mind that the 4 first instructions all execute in parallel and in 1 cycle. So to the processor you have modified, EAX, EBX, ECX, and EDX all on the previous clock cycle, and now you want to do "ADD EBX, EBX" on the next line. That introduces a depedency.

Code Select


	MOV EBX, EAX \
	MOV ECX, EAX  \
	MOV EDX, EAX  /  ----- these 4 instructions all execute in 1 clock cycle.  You modified EBX on the previous cycle, and now you
 are doing an "ADD EBX,EBX", which has a depedency with the previous 4 instructions since they execute in parallel, and modify EBX.
	ADD EAX, EAX /

	ADD EBX, EBX
	ADD ECX, ECX
	ADD EDX, EAX   <-  3*EAX
	ADD EAX, EAX
	ADD EBX, EBX
	ADD ECX, ECX
	ADD EAX, EAX   <-  8*EAX
	ADD EDX, EBX
	ADD EAX, ECX   <-  12*EAX
	ADD EAX, EDX   <-  19*EAX

EDIT: I have a Linux system at work, or I'd test all this myself.

Mark

Ian_B · January 19, 2006, 09:08:35 PM

Thanks, Mark. Good to see you're still around!

I'd assumed that since the instructions were completing 4 in one cycle, and the latency of these small ops is 0.5 cycles, then the latency/dependency would already be resolved by the next quad-pump cycle. Clearly that's not the case. :'(

IanB

EduardoS · January 19, 2006, 09:27:21 PM

according to juliomacedo post he want get the value on eax and return the result on ebx, so:
lea ebx, [eax+eax*8]
lea ebx, [ebx+ebx*2]

They generate a dependency and lea is slow on PIV, but in world there aren't only PIV, you can't make this code only with independent instructions, and here there is only 2 instructions,
Making the code bigger may cause other parts to cross memory pages and slow down the final program, if this multiplication is very comon maybe it is a good idea, but how many times you need to multiply something by 19?
And last, no program execute only multiplications by 19, he probably want do something more, and this code allow him to do 2 independt instructions before and 2 after in the same clock cycles.

BTW, In Agner's pentium optimisation manual he say the PIV can only send 3 uops from trace cache to processor per clock,

looking at PIV i have the impression when it decode the instruction to put them in trace cache it group them in blocks of 3 independent instructions, and don't look for dependency when executing them.

Mark_Larson · January 19, 2006, 09:56:31 PM

Quote from: EduardoS on January 19, 2006, 09:27:21 PM
BTW, In Agner's pentium optimisation manual he say the PIV can only send 3 uops from trace cache to processor per clock,

looking at PIV i have the impression when it decode the instruction to put them in trace cache it group them in blocks of 3 independent instructions, and don't look for dependency when executing them.

That is correct. That is one of the drawbacks to being able to issue 4 ALU instructions a cycle. It only works if it isn't in the trace cache. This tidbit of information is also talked about in the P4 Optimization Manual ( along with the part of how it issues 4 ALU instructions per cycle for certain instructions). (I use the P4 Optimization Manual as my primary reference for code optimzation, since it has so much more detail). The 4 ALU instructions in parallel only works with some ALU instructions, and not all. If you have the P4 optimization manual, all the ALU instructions that have 0.5 cycles, can be issued at 4 a cycle when not in the trace cache. The rest cannot. The trace cache ends up hurting the ability of the processor to issue 4 ALU instructions in parallel, but it helps greatly with how fast MMX/SSE/SSE2 runs on the P4. The number of uops they take up is small. So you can actually write some SSE2 single precision floating point code that runs at about 1 cycle per instruction due to the tracecache and interleaving use of the proper execution units ( no dependencies, and interleaving MULs and ADDS).

Mark

hutch-- · January 20, 2006, 12:21:33 AM

Here are two variations. The first being close to identical to the LEA version I posted above. The next is a shorter RISC version that deliberately ignores the read after write stalls. It is faster than the long version by about the reduction in the instruction count.

If I have learnt this much over time, lower instruction counts with similar instruction type usually end up faster so I tend to avoid the known pitfalls with repeated slow instructions and interleave the odd slow one with the preferred shorter instructions like add, sub etc ...

I have yet to see much evidence that smaller as in byte count code goes any faster in most instances.

Code Select


    mov ecx, eax
    lea eax, [eax+eax*8]
    shl eax, 1
    add eax, ecx

Code Select


    mov ecx, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, eax    ; read after write stall
    add eax, ecx
    add eax, ecx
    add eax, ecx

juliomacedo · January 20, 2006, 04:24:21 AM

1. I was looking for the best solution in terms of time taken to execute...

2. I'm not worried about code size but using all registers doesn't seem a good solution since I need them for other tasks (store some other values).

3. I didn't mention processor because I expected a solution that works well in most of processors not in a specific processor... Anyway, it's very good to talk about optimizations for each processor.

4. I used 19 just as an example (I should have said this before). I'm looking for good solutions (in terms of time) that apply to all multiplications of a register for an interger constant, not specifically 19.

Thanks :U

Julio

hutch-- · January 20, 2006, 04:46:18 AM

julio,

For general purpose multiplication, your choices are MUL, IMUL, the floating point instructions and MMX/SSE(2) range. Fixed multiplys are faster but this is not much use to you if you need the normal range.

News:

best way of multiplying

juliomacedo

QvasiModo

Ian_B

Ian_B

Ian_B

EduardoS

juliomacedo