Any way to simplify this loop?

Started by Astro, August 15, 2009, 12:46:19 PM


hutch--

Astro,

Give this a blast.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      array dd 0,1,2,3,4

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi
    push edi

    mov esi, OFFSET array           ; ESI holds the BASE ADDRESS

  ; changing displacement
  ; ---------------------
    print str$( [esi] ),13,10
    print str$( [esi+4] ),13,10
    print str$( [esi+8] ),13,10
    print str$( [esi+12] ),13,10
    print str$( [esi+16] ),13,10,13,10


  ; changing address with INDEX
  ; ---------------------------
    mov edi, 0                      ; EDI is INDEX

    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10
    add edi, 4
    print str$( [esi+edi] ),13,10,13,10
    add edi, 4


  ; changing address with INDEX times MULTIPLIER
  ; ------------------------------
    mov edi, 0
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10
    add edi, 1
    print str$( [esi+edi*4] ),13,10,13,10
    add edi, 1

  ; BASE address (ESI) + INDEX (EDI) times MULTIPLIER + 12 bytes displacement
    mov edi, 0
    print str$( [esi+edi*4+12] ),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start



Rockoon

As someone else had noted, the LEA instruction is very powerful. It has evolved into quite the workhorse. It isn't always the optimal solution, but it has capabilities that simply do not exist otherwise.

It isn't always the most efficient solution, but it is always quite efficient nonetheless, because the AGU (Address Generation Unit) gets quite a bit of loving from both Intel and AMD due to its overall importance. I think P4s are the "worst off" in regards to the AGU because there is only one of them, but even there it's often the best solution to particular problems (such as when you need a 3-operand addition instruction), as the example below shows.
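
As a minimal illustration of that "3-operand add" use (register names are arbitrary, just a sketch):

    ; ECX = EAX + EBX in one instruction, without touching EAX, EBX or the flags
    lea ecx, [eax+ebx]

    ; with a scaled index and displacement thrown in: EDX = EAX + 8*EBX + 4
    lea edx, [eax+8*ebx+4]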

FORTRANS

Quote from: Astro on August 16, 2009, 01:03:25 AM
Quote"lea ecx,[ecx+4*ecx]" is a shortcut that multiplies ecx by 5
much faster than mul by 5
A few hours later and I'm no wiser.

Hi,

   Well, a multiply is (was) regarded as an expensive
(time consuming) instruction.  Adds and shifts are
much faster and require less setup.  So:

5(ecx) = 1(ecx) + 4(ecx)

Is better than

5(ecx) = 5*ecx

Especially as

Quote: here, *4 corresponds to "shl ecx,2"

So, 5(ecx) = ecx + (SHL ecx,2), an add and a shift
instead of an actual multiply (which requires eax and
a real live 5 as well).  A bit complex to see the
first time, but a good tool once understood.
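
To make that concrete, roughly what the two approaches look like side by side (a sketch only, register choice is arbitrary):

    ; multiply by 5 with MUL: needs EAX, needs the 5 in a register, clobbers EDX
    mov eax, ecx
    mov edx, 5
    mul edx                 ; EDX:EAX = EAX * 5

    ; same result with LEA: ECX = ECX + 4*ECX = 5*ECX, nothing else touched
    lea ecx, [ecx+4*ecx]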

HTH,

Steve N.

hutch--

The performance drop with LEA in the PIV family of processors was very unfortunate, as it was generally a very fast instruction that could also do compact calculations, but with a PIV you only use it on odd occasions.

I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.

jj2007

Quote from: hutch-- on August 16, 2009, 01:28:23 PM
I would be interested to see if it has come back up to speed on the later CORE and i7 range of Intel hardware.

Test it :bg


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 10000

.code
start:
    print "Comparing mul, imul and lea for performing 'mov eax, 10*eax' :", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1
        mov ecx, 10                 ; 6 bytes for mov ecx, 10 plus mul ecx
        REPEAT 8
          mul ecx
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for mul ecx", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1                  ; 3 bytes
        REPEAT 8
          imul eax, eax, 10
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for imul eax, eax, 10", 13, 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 125
        mov eax, 1                  ; 5 bytes
        REPEAT 8
          lea eax, [eax+4*eax]
          add eax, eax
        ENDM
      ENDM
    counter_end
    print str$(eax), 9, "millicycles for lea", 13, 10

    inkey chr$(13, 10, "--- ok ---", 13)
    exit
end start



Celeron M:

Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2657    millicycles for mul ecx
1122    millicycles for imul eax, eax, 10
1214    millicycles for lea

hutch--

No use; I have 3 PIVs to play with but no later stuff yet. I want a Nehalem double quad running at 4 gig next.  :P

Rockoon

For the record, every AMD64 can execute all of the "fastpath" ALU instructions, other than multiplication, on each of its ALUs. The ALUs and AGUs are all symmetric except for multiplication, which can only be handled on one of the ALUs.

This fact throttles the AMD64 to a maximum of 1 integer multiplication per clock cycle, so in principle it can sometimes be advantageous to use LEA instead of MUL/IMUL. In practice you will find that it isn't usually advantageous, because it is quite easy to have 3 or more AGU instructions in the pipeline at any given time (all memory reads and writes require the AGUs). When you unroll that loop with the single memory read and single memory write, it's now a loop with 4 or more AGU instructions... and that means that the AGUs are often one of your bottleneck points.

All of the integer ALU operations that read from or write to memory directly require BOTH an ALU and an AGU to accomplish the work.
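
A rough illustration of which units get involved, going by the description above (not taken from any manual):

    add dword ptr [esi], eax    ; memory operand: an AGU for the address plus an ALU for the add
    imul ecx, edx               ; integer multiply: only one of the three ALUs can take it
    add ebx, 1                  ; plain register op: any of the three ALUs will do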

The AMD64 has 9 execution units.

3 ALUs (Arithmetic and Logic Unit)
3 AGUs (Address Generation Unit)
1 FADD unit (additions, subtractions, and "any")
1 FMUL unit (multiplication, division, and "any")
1 FMISC unit (float<->int, load and store, and "any")

The "any" includes most of the macro-operations, which sometimes require all 3 units simultaneously.

Also of some note is the fact that an AMD64 can only perform two memory reads per clock cycle even though it has 3 AGUs. You can still perform an LEA on that 3rd AGU, but there are only 2 paths from the CPU to the L1 cache, so only 2 AGUs can actually access memory simultaneously.

(yes, it's true.. I know an insane amount of AMD64-specific stuff)
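
So in the best case something like the following could, in principle, keep all three AGUs busy in the same cycle (a hypothetical grouping, assuming the scheduler cooperates):

    mov eax, [esi]              ; load 1 - an AGU plus one of the two L1 read paths
    mov ebx, [esi+4]            ; load 2 - an AGU plus the other read path
    lea ecx, [edx+8*edi]        ; pure address arithmetic - the third AGU, no memory access needed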

The AMD64, for the most part, is a much more pleasant optimization experience than the P4 is. The P4 has an insanely long pipeline and some crazy asymmetric execution units.. the situation exists that some instructions are PORT 1 or 2, while others are only PORT 1, and others are just PORT 2. This makes it very hard to schedule properly so that you don't have one of those "1 or 2" instructions sitting in 2 while you also have some 2-only work to do. This has led to generalized optimization strategies such as pairing up two single-uops or two double-uops (never having an odd number of either next to each other, unless you have a triple-uop as well, in which case 3-1-1-1 .. the P4 is crazy stuff.. really.. it's horrible from an optimizer standpoint)

jj2007

Rockoon,
Can you test my snippet above on your AMD? I am curious how imul performs relative to lea. If you have an idea why the simple mul is so slow in comparison, I'd like to know...

qWord

The advantage of lea is that you can choose the registers you want to use. For this purpose I've written a macro that lets me multiply by values up to 50 - maybe someone is interested ...
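
(The macro itself isn't reproduced in this thread excerpt; the sketch below only gives the general idea, with a hypothetical MulLea name and just a handful of constants handled, not the actual attached macro.)

MulLea MACRO reg:REQ, const:REQ
    IF const EQ 2
        add reg, reg                ; 2x = x + x
    ELSEIF const EQ 3
        lea reg, [reg+2*reg]        ; 3x = x + 2x
    ELSEIF const EQ 5
        lea reg, [reg+4*reg]        ; 5x = x + 4x
    ELSEIF const EQ 9
        lea reg, [reg+8*reg]        ; 9x = x + 8x
    ELSEIF const EQ 10
        lea reg, [reg+4*reg]        ; 10x = 2*(x + 4x)
        add reg, reg
    ELSE
        echo MulLea: constant not handled in this sketch
        .err
    ENDIF
ENDM

Used as, e.g., MulLea ecx, 10 to get ecx = ecx*10 without involving eax or edx.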

regards, qWord

dedndave

thanks qWord   :U
dang - i wanted to use 51 - lol

Rockoon


C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1603    millicycles for mul ecx
1409    millicycles for imul eax, eax, 10
1162    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1610    millicycles for mul ecx
2714    millicycles for imul eax, eax, 10
1128    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1614    millicycles for mul ecx
1382    millicycles for imul eax, eax, 10
1160    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
2944    millicycles for mul ecx
1344    millicycles for imul eax, eax, 10
1143    millicycles for lea

--- ok ---

C:\asm\masm_forums>jjmuls
Comparing mul, imul and lea for performing 'mov eax, 10*eax' :
1593    millicycles for mul ecx
1378    millicycles for imul eax, eax, 10
1133    millicycles for lea

--- ok ---

C:\asm\masm_forums>


There seems to be quite a bit of deviation in the tests as-is, and the test isn't "real world", which is something I don't spend time on anymore (it's futile, as you don't actually learn anything useful.. such as how well it pairs with other relevant instructions in a real situation)

jj2007

Quote from: qWord on August 16, 2009, 03:20:14 PM
The advantage of lea is that you can choose the registers you want to use. For this purpose I've written a macro that lets me multiply by values up to 50 - maybe someone is interested ...

regards, qWord

Cute. I added the imul variant which, by the way, also lets you choose the registers.

lea vs. mul
const   IMUL    MUL     LEA             faster?
0       17      41      12              lea or macro
1       18      41      3               lea or macro
2       17      41      12              lea or macro
3       17      41      12              lea or macro
4       17      41      12              lea or macro
5       17      41      13              lea or macro


My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
         REPEAT 16
            mov eax,esi
            mov ecx, cntr   ; should be loaded somewhere
            mul ecx
         ENDM
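
For reference, the imul variant would look something along these lines (a sketch with 5 as an example constant, not the actual test code):
         mov esi, num
         REPEAT 16
            mov eax, esi
            imul eax, eax, 5   ; destination register is free to choose, constant is an immediate
         ENDM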

dedndave

millicycles ?
i knew a gal named Milli, once

qWord

Quote from: jj2007 on August 16, 2009, 05:54:21 PM
My timings for the mul variant are higher than yours because I added one line:
         mov edi,cntr
         mov esi,num
it should be "mov ecx,cntr"  :red

Quote: ..imul variant which, by the way, also lets you choose the registers
I never thought that imul was that fast... You live and learn  :U

regards, qWord

Rockoon

One of the thoughts I had about why the P4 has such trouble with the 1-op mul is that it produces a 64-bit result. Is the system in question an older 32-bit P4 or one of the later 64-bit ones?
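
For reference, the difference in what the two forms write back (which may be where the extra cost comes from):

    mov ecx, 10
    mul ecx              ; one-operand form: EDX:EAX = EAX * ECX - full 64-bit result, EDX clobbered
    imul eax, eax, 10    ; three-operand form: EAX = EAX * 10 - 32-bit result only, EDX untouched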