Any way to simplify this loop?

Started by Astro, August 15, 2009, 12:46:19 PM


hutch--

Astro,

Give this a blast.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      array dd 0,1,2,3,4

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi
    push edi

    mov esi, OFFSET array           ; ESI holds the BASE ADDRESS

  ; changing displacement
  ; ---------------------
    print str$( [esi] ),13,10
    print str$( [esi+4] ),13,10
    print str$( [esi+8] ),13,10
    print str$( [esi+12] ),13,10
    print str$( [esi+16] ),13,10,13,10


  ; changing address with INDEX
  ; ---------------------------
    mov edi, 0                      ; EDI is INDEX

    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10,13,10
    add edi, 4


  ; changing address with INDEX times MULTIPLIER
  ; ------------------------------
    mov edi, 0
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10,13,10
    add edi, 1

  ; BASE address (ESI) + INDEX (EDI) times MULTIPLIER + 12 bytes displacement
    mov edi, 0
    print str$( [esi+edi*4+12] ),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start



Rockoon

As someone else had noted, the LEA instruction is very powerful. It has evolved into quite the workhorse. It isn't always the optimal solution, but it has capabilities that simply do not exist otherwise.

It isn't always the most efficient solution, but it is always quite efficient nonetheless, because the AGU (Address Generation Unit) gets quite a bit of loving from both Intel and AMD due to its overall importance. I think P4s are the "worst off" in regards to the AGU because there is only one of them, but even there it's often the best solution to particular problems (such as when you need a 3-operand addition instruction), as the example below shows.
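
As a minimal illustration of that "3-operand add" use (register names are arbitrary, just a sketch):

    ; ECX = EAX + EBX in one instruction, without touching EAX, EBX or the flags
    lea ecx, [eax+ebx]

    ; with a scaled index and displacement thrown in: EDX = EAX + 8*EBX + 4
    lea edx, [eax+8*ebx+4]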

FORTRANS

Quote from: Astro on August 16, 2009, 01:03:25 AM
Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
A few hours later and I'm no wiser.

Hi,

   Well, a multiply is (was) regarded as an expensive
(time consuming) instruction.  Adds and shifts are
much faster and require less setup.  So:

5(ecx) = 1(ecx) + 4(ecx)

Is better than

5(ecx) = 5*ecx

Especially as

Quote: here, *4 corresponds to "shl ecx,2"

So, 5(ecx) = ecx + (SHL ecx,2), an add and a shift
instead of an actual multiply (which requires eax and
a real live 5 as well).  A bit complex to see the
first time, but a good tool once understood.
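
To make that concrete, roughly what the two approaches look like side by side (a sketch only, register choice is arbitrary):

    ; multiply by 5 with MUL: needs EAX, needs the 5 in a register, clobbers EDX
    mov eax, ecx
    mov edx, 5
    mul edx                 ; EDX:EAX = EAX * 5

    ; same result with LEA: ECX = ECX + 4*ECX = 5*ECX, nothing else touched
    lea ecx, [ecx+4*ecx]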

HTH,

Steve N.

hutch--

The performance drop with LEA in the PIV family of processors was very unfortunate, as it was generally a very fast instruction that could also do compact calculations, but with a PIV you only use it on odd occasions.

I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.

jj2007

Quote from: hutch-- on August 16, 2009, 01:28:23 PM
I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.

Test it :bg


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 10000

.code
start:
    print "Comparing mul, imul and lea for performing 'mov eax, 10*eax' :", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1
        mov ecx, 10                 ; 6 bytes for mov ecx, 10 plus mul ecx
        REPEAT 8
          mul ecx
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for mul ecx", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1                  ; 3 bytes
        REPEAT 8
          imul eax, eax, 10
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for imul eax, eax, 10", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1                  ; 5 bytes
        REPEAT 8
          lea eax, [eax+4*eax]
          add eax, eax
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for lea", 13, 10

    inkey chr$(13, 10, "--- ok ---", 13)
    exit
end start



Celeron M:

Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2657    millicycles for mul ecx
1122    millicycles for imul eax, eax, 10
1214    millicycles for lea

hutch--

No use; I have 3 PIVs to play with but no later stuff yet. I want a Nehalem double quad running at 4 gig next.  :P

Rockoon

For the record, every AMD64 can execute all of the "fastpath" ALU instructions, other than multiplication, on each of its ALUs. The ALUs and AGUs are all symmetric except for multiplication, which can only be handled on one of the ALUs.

This fact throttles the AMD64 to a maximum of 1 integer multiplication per clock cycle, so in principle it can sometimes be advantageous to use LEA instead of MUL/IMUL. In practice you will find that it isn't usually advantageous, because it is quite easy to have 3 or more AGU instructions in the pipeline at any given time (all memory reads and writes require the AGUs). When you unroll that loop with the single memory read and single memory write, it's now a loop with 4 or more AGU instructions... and that means that the AGUs are often one of your bottleneck points.

All of the integer ALU operations that read from or write to memory directly require BOTH an ALU and an AGU to accomplish the work.
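
A rough illustration of which units get involved, going by the description above (not taken from any manual):

    add dword ptr [esi], eax    ; memory operand: an AGU for the address plus an ALU for the add
    imul ecx, edx               ; integer multiply: only one of the three ALUs can take it
    add ebx, 1                  ; plain register op: any of the three ALUs will do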

The AMD64 has 9 execution units.

3 ALUs (Arithmetic and Logic Unit)
3 AGUs (Address Generation Unit)
1 FADD unit (additions, subtractions, and "any")
1 FMUL unit (multiplication, division, and "any")
1 FMISC unit (float<->int, load and store, and "any")

The "any" includes most of the macro-operations, which sometimes require all 3 units simultaneously.

Also of some note is the fact that an AMD64 can only perform two memory reads per clock cycle even though it has 3 AGUs. You can still perform an LEA on that 3rd AGU, but there are only 2 paths from the CPU to the L1 cache, so only 2 AGUs can actually access memory simultaneously.

(yes, it's true.. I know an insane amount of AMD64-specific stuff)
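
So in the best case something like the following could, in principle, keep all three AGUs busy in the same cycle (a hypothetical grouping, assuming the scheduler cooperates):

    mov eax, [esi]              ; load 1 - an AGU plus one of the two L1 read paths
    mov ebx, [esi+4]            ; load 2 - an AGU plus the other read path
    lea ecx, [edx+8*edi]        ; pure address arithmetic - the third AGU, no memory access needed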

The AMD64, for the most part, is a much more pleasant optimization experience than the P4 is. The P4 has an insanely long pipeline and some crazy asymmetric execution units.. the situation exists that some instructions are PORT 1 or 2, while others are only PORT 1, and others are just PORT 2. This makes it very hard to schedule properly so that you don't have one of those "1 or 2" instructions sitting in 2 while you also have some 2-only work to do. This has led to generalized optimization strategies such as pairing up two single-uops or two double-uops (never having an odd number of either next to each other, unless you have a triple-uop as well, in which case 3-1-1-1 .. the P4 is crazy stuff.. really.. it's horrible from an optimizer standpoint)

jj2007

Rockoon,
Can you test my snippet above on your AMD? I am curious how imul performs relative to lea. If you have an idea why the simple mul is so slow in comparison, I'd like to know...

qWord

The advantage of lea is that you can choose the registers you want to use. For this purpose I've written a macro that lets me multiply by values up to 50 - maybe someone is interested ...
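
(The macro itself isn't reproduced in this thread excerpt; the sketch below only gives the general idea, with a hypothetical MulLea name and just a handful of constants handled, not the actual attached macro.)

MulLea MACRO reg:REQ, const:REQ
    IF const EQ 2
        add reg, reg                ; 2x = x + x
    ELSEIF const EQ 3
        lea reg, [reg+2*reg]        ; 3x = x + 2x
    ELSEIF const EQ 5
        lea reg, [reg+4*reg]        ; 5x = x + 4x
    ELSEIF const EQ 9
        lea reg, [reg+8*reg]        ; 9x = x + 8x
    ELSEIF const EQ 10
        lea reg, [reg+4*reg]        ; 10x = 2*(x + 4x)
        add reg, reg
    ELSE
        echo MulLea: constant not handled in this sketch
        .err
    ENDIF
ENDM

Used as, e.g., MulLea ecx, 10 to get ecx = ecx*10 without involving eax or edx.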

regards, qWord

dedndave

thanks qWord   :U
dang - i wanted to use 51 - lol

Rockoon


C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1603    millicycles for mul ecx
1409    millicycles for imul eax, eax, 10
1162    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1610    millicycles for mul ecx
2714    millicycles for imul eax, eax, 10
1128    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1614    millicycles for mul ecx
1382    millicycles for imul eax, eax, 10
1160    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2944    millicycles for mul ecx
1344    millicycles for imul eax, eax, 10
1143    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1593    millicycles for mul ecx
1378    millicycles for imul eax, eax, 10
1133    millicycles for lea

--- ok ---

C:\asm\masm_forums>


There seems to be quite a bit of deviation in the tests as-is, and the test isn't "real world", which is something I don't spend time on anymore (it's futile, as you don't actually learn anything useful.. such as how well it pairs with other relevant instructions in a real situation)

jj2007

Quote from: qWord on August 16, 2009, 03:20:14 PM
The advantage of lea is that you can choose the registers you want to use. For this purpose I've written a macro that lets me multiply by values up to 50 - maybe someone is interested ...

regards, qWord

Cute. I added the imul variant which, by the way, also lets you choose the registers.

lea vs. mul
const   IMUL    MUL     LEA             faster?
0       17      41      12              lea or macro
1       18      41      3               lea or macro
2       17      41      12              lea or macro
3       17      41      12              lea or macro
4       17      41      12              lea or macro
5       17      41      13              lea or macro


My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
         REPEAT 16
            mov eax,esi
            mov ecx, cntr   ; should be loaded somewhere
            mul ecx
         ENDM
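
For reference, the imul variant would look something along these lines (a sketch with 5 as an example constant, not the actual test code):
         mov esi, num
         REPEAT 16
            mov eax, esi
            imul eax, eax, 5   ; destination register is free to choose, constant is an immediate
         ENDM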

dedndave

millicycles ?
i knew a gal named Milli, once

qWord

Quote from: jj2007 on August 16, 2009, 05:54:21 PM
My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
it should be "mov ecx,cntr"  :red

Quote: ..imul variant which, by the way, also lets you choose the registers
I never thought that imul was that fast... You live and learn  :U

regards, qWord

Rockoon

One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?
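
For reference, the difference in what the two forms write back (which may be where the extra cost comes from):

    mov ecx, 10
    mul ecx              ; one-operand form: EDX:EAX = EAX * ECX - full 64-bit result, EDX clobbered
    imul eax, eax, 10    ; three-operand form: EAX = EAX * 10 - 32-bit result only, EDX untouched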