Optimisation list.

hutch-- · November 12, 2006, 05:30:38 AM

I am playing with a tool design at the moment for doing a number of well known optimisations on existing assembler code. I have had a dabble at it before in one of my other tools and it worked OK but I am playing with the idea of a far more powerful one. What I am after is block improvements of more or less common code. The list I have put together so far is as follows.

Any suggestions here would be appreciated.

Code Select


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
;   source                  ; replacement           ; reason
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    or reg, -1              ; mov reg, -1           ; smaller
; -------------------------------------------------------------------------
    mov reg, 0              ; xor reg, reg          ; smaller and faster
                            ; sub reg, reg          ; faster
; -------------------------------------------------------------------------
    cmp reg, 0              ; test reg, reg         ; faster
; -------------------------------------------------------------------------
    mov reg, mem
    add reg, im/reg         ; add mem, im/reg       ; smaller and does not use register
    mov mem, reg                                    ; reduces 1 memory access
; -------------------------------------------------------------------------
    mov reg, mem
    sub reg, im/reg         ; sub mem, im/reg       ; smaller and does not use register
    mov mem, reg                                    ; reduces 1 memory access
; -------------------------------------------------------------------------
    sub reg, im/reg/mem     ; sub reg, im/reg/mem   ; 1 instruction shorter and faster
    cmp reg, 0              ; jz target
    je target
; -------------------------------------------------------------------------
    add reg, im/reg/mem     ; add reg, im/reg/mem   ; 1 instruction shorter and faster
    cmp reg, 0              ; jz target
    je target
; -------------------------------------------------------------------------
    sub mem, im/reg         ; sub mem, im/reg       ; 1 instruction shorter and faster
    cmp mem, 0              ; jz target
    je target
; -------------------------------------------------------------------------
    add mem, im/reg         ; add mem, im/reg       ; 1 instruction shorter and faster
    cmp mem, 0              ; jz target
    je target
; -------------------------------------------------------------------------
    mov reg1, mem           ; mov reg2, mem         ; remove redundant load / store
    mov reg2, reg1
; -------------------------------------------------------------------------
    shl reg, 1              ; add reg, reg          ; faster
; -------------------------------------------------------------------------
    shl reg, 2              ; add reg, reg          ; faster on PIV
                            ; add reg, reg          ; and most later hardware
; -------------------------------------------------------------------------
    or reg, reg             ; test reg, reg         ; test does not write to reg
; -------------------------------------------------------------------------
    inc reg                 ; add reg, 1            ; faster on PIV but larger
; -------------------------------------------------------------------------
    dec reg                 ; sub reg, 1            ; faster on PIV but larger
; -------------------------------------------------------------------------

pro3carp3 · November 12, 2006, 06:17:06 AM

Changing INC/DEC to ADD/SUB could break code as it affects the flags.

asmrixstar · November 12, 2006, 07:21:53 AM

Whats this for hutch? I only ask because i wrote a generic exe optimiser a while ago and ,well to be blunt, it foobared pretty bad on all but the smallest exe's.
:)
Just asking ...

hutch-- · November 12, 2006, 11:15:27 AM

asmrixstar,

I have played with these in the past and what I have done has worked OK but I will be working purely on source code if it looks like it is viable to do. Most of this stuff is fussy parsing of source code and I have most of the toys to do this. What I am looking for at the moment is code blocks of the type in the list, transformations are a different animal which would take a far more complex heuristic to get going properly.

pro3carp3,

You are correct but the flag that differs is rarely used so it should be easy enough to test if the code depends on that flag.

zooba · November 12, 2006, 11:34:47 AM

I find the carry flag is quite often used, and having it be preserved by inc/dec instructions is useful. Then again, if you're basing your code on that fact, you probably don't want a machine optimising your code for you :bg

Have you gone through Mark Larson's list of optimisations? There's probably heaps of things in there you can use.

Cheers,

Zooba :U

Mark_Larson · November 12, 2006, 01:35:53 PM

Quote from: zooba on November 12, 2006, 11:34:47 AM
I find the carry flag is quite often used, and having it be preserved by inc/dec instructions is useful. Then again, if you're basing your code on that fact, you probably don't want a machine optimising your code for you :bg

I also do that. Sometimes it helps having INC/DEC not affects CF.

I think Hutch is trying to find a way to better optimize larger chunks of code, kind of like how people now adays use a C compiler to optimize the bulk of their project, and then do the rest in ASM for speed.

I was interested in a slightly different code optimizer. I wanted one that swapped lines around. I haven't tried this much on my new AMD, but on my Intel I found that it does a less than perfect job of optimizing the instruction re-ordering. I can get good speed ups from swapping instructions around. So I had planned on working on a dependency list. Line A has to run before Line B, etc.

EDIT: I also did something slightly different with the "prefetch" instruction. Both Intel and AMD recommend putting it at the top of your loop. I found out that it varies. I can put it in different places in the loop and get a speed up. So my code tried different places in the loop AND different offsets into memory to fetch.

Mark

EduardoS · November 12, 2006, 04:37:37 PM

Some of those optimizations are processor depent, or have some exceptions:

Code Select


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
;   source                  ; replacement           ; reason
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Code Select


    or reg, -1              ; mov reg, -1           ; smaller

Exception: sometimes a bigger version is usefull for code alignment.

Code Select


    mov reg, 0              ; xor reg, reg          ; smaller and faster
                            ; sub reg, reg          ; faster

Same as above, and faster only on P4.

Code Select


    cmp reg, 0              ; test reg, reg         ; faster

Shouldn't also be smaller? ::)
And also, on Intel Core 2 processors cmp can't be fused with jg/jl, but test can.

Code Select


    mov reg, mem
    add reg, im/reg         ; add mem, im/reg       ; smaller and does not use register
    mov mem, reg                                    ; reduces 1 memory access

Athlon CPUs don't reorder L/S, moving the "mov mem, reg" some instructions forward can speed up the code more than the replacement;
Intel "Core" CPUs can decode "op mem, reg" in one decoder, using too many "op mem, reg" in a sequence can limit the decoding througput (remember, these CPUs have only 1 complex decoder and 3 simple decoders).

Code Select


    mov reg1, mem           ; mov reg2, mem         ; remove redundant load / store
    mov reg2, reg1

Sometimes this redundancy exists because you need two copies of mem :wink

Code Select


    shl reg, 2              ; add reg, reg          ; faster on PIV
                            ; add reg, reg          ; and most later hardware

shl is as fast as add on AMD CPUs, on Intel "Core" CPUs shl have a lower throughput than add (2 vs 3), but same latency, so one shl is faster than two add.

Code Select


    inc reg                 ; add reg, 1            ; faster on PIV but larger

Same speed on AMD CPUs, due to flags stall, slower on Intel CPUs.

dsouza123 · November 12, 2006, 07:56:56 PM

Shouldn't or reg, -1 and mov reg, -1 be swapped
or reg, -1 is smaller than mov reg, -1

Code Select


00401000 33C0                   xor     eax,eax
00401002 F7D0                   not     eax
00401004 83C8FF                 or      eax,0FFFFFFFFh   ; or  eax, -1
00401007 B8FFFFFFFF             mov     eax,0FFFFFFFFh   ; mov eax, -1

there is also

xor reg, reg
not reg

but other than it is two instructions and in between the other two in size
don't know what other advantages/disadvantages it has.

hutch-- · November 12, 2006, 10:29:23 PM

Yes,

I put them in the wrong way around, the OR version is the replacement.

EduardoS,

I know all of this stuff but what I need is block replacements that work on both forms of hardware, particularly as Intel chips are still the vast majority of computers around the world.

Just for example I know that MOVZX can be used profitably in a context like,

Code Select


    mov al, [esi]
    cmp al, [edi]

by doing this replacement

Code Select


    movzx eax, BYTE PTR [esi]
    cmp al, [edi]

It is clearly faster on late Intel but no faster on AMD but to get code that works on most late hardware, I need stuff like this for the list I am working to get.

EduardoS · November 12, 2006, 11:16:19 PM

I unterstood that hutch, but a porblem persists,
I will never know what's the tomorrow latest hardware, a too specific optimization sometimes is risky, as an actual example, Pentium 4 and Core 2, Pentium 4 have fast ALUs and a slow shift barrel (can i call it shift barrel?), but this is a specific characteristic, and some guys replaced a simple shl eax, 3 for three add eax, eax, wich was faster on first Pentium 4, but now Intel introduce a new processor and want's to phase out Pentium 4, the new processor, wich should be faster, will run this example a bit slower.
There are some more cases like the above and may happen to any hardware, i think it's an important note on optimizations like this.

hutch-- · November 12, 2006, 11:56:02 PM

EduardoS,

I think we are all stuck with that one, the changes from PIII to PIV drove me nuts, especially the drop in performance with LEA which is a very useful instruction. I don't think the core 2 duo's are stable enough yet to buy one and I know the 4 core versions are coming so i am in no hurry to do so.

What I really need are any other well known optimisations that occur on a block replacement level, I already know most of the consierations of optimising for different hardware.

Human · November 21, 2006, 08:47:49 PM

dsouza123:
lol not eax
not isnt pairable, will stall pipelines.
only way is xor eax,eax & dec eax also or eax,-1 as 3 bytes.

hutch:
have you tried VTune 5.0, its last one where you have assembly couch, and in mmx cpu mode shows ticks of code and suggests changes.
so is there any reason to write similiar tool? too bad 6.0+ doesnt have it anymore. its .net c++ java tools, but no more assembly.
bad thing you cant profile by time or msr regs on amd cpu's.
to quad core, do you need it for omitizations? its way of kernel to decide which core takes thread. and that decision doesnt belong to us anymore.

News:

Optimisation list.

hutch--

pro3carp3

asmrixstar

hutch--

zooba

Mark_Larson

EduardoS

dsouza123

hutch--

EduardoS

hutch--

Human