News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

LZW encoder to optimize

Started by gabor, April 25, 2005, 10:03:22 PM

Previous topic - Next topic

gabor

Hi!

I'd like to learn more about optimization and I was told to post some code and see what the opt. talents can make of it.
I have here an LZW encoder proc, I optimized it as much as I could, but there must be a lot more, I suspect.

Greets




[attachment deleted by admin]

Tedd

algorithm optimization > instruction tweaking

(..in general)
No snowflake in an avalanche feels responsible.

Mark_Larson

  You should start off with something a bit smaller.  Starting off with larger programs means that there are usually a larger number of tricks you can do to optimize it.  And it's a lot harder remembering and understanding a large number of optimization tricks when you are new to optimizing, than it is to remember a few.  So try something that's under 20 or 30 lines in code size.  Also try posting the chunk of code you are trying to optimize ( I am assuming it's the encode procedure in your code), so that people can view it without having to download it.  You can still attach the ZIP file, but displaying the procedure that you want to optimize makes it easier on the people trying to help.  In addition run it with the timing macros, and tell us what kind of processor you have, and how long it took to execute.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;
; Encodes compressed data according to the LZW method
;
; -------------------------------------------------------------------------
;  IN:    src - pointer to input data
;         dst - pointer to output data
;       bytes - count of input bytes
; -------------------------------------------------------------------------
LZW_encode      PROC USES ebx ecx edx esi edi,src:DWORD,dst:DWORD,bytes:DWORD

;PrintText '==ENCODE=============='
                mov     esi,src
                mov     edi,dst

                add     bytes,esi

                xor     eax,eax
                mov     al,[esi]                ; Read first char
; -------------------------------------------------------------------------
; Main loop
; ---------------
@_1:
                mov     ecx,[LZW_code_size]
                add     esi,1
               
                shl     eax,cl

                mov     ecx,[LZW_entry_count]
                mov     al,[esi]
;PrintHex eax,'read'
; -------------------------------------------------------------------------
; LZW_lookup
; ---------------
;PrintHex eax,'lookup'
                push    esi
                mov     esi,[LZW_dictionary]

                sub     ecx,1
                jc  @_10
@_LU1:
                cmp     eax,[esi+ecx*4]
                jz  @_2

                sub     ecx,1
                jnc @_LU1
; -------------------------------------------------------------------------
; Not found
; ---------------
@_10:
;PrintHex eax,'not found'
; -------------------------------------------------------------------------
; LZW_add_entry
; ---------------
                mov     ebx,[LZW_entry_count]
                mov     esi,[LZW_dictionary]

                cmp     ebx,[LZW_max_entry]
                jnc  @_11

;PrintHex eax,'new entry'
                mov     [esi+ebx*4],eax
                add     [LZW_entry_count],1
; ---------------
@_11:
; -------------------------------------------------------------------------
; LZW_write_code
; ---------------
                pop     esi
                mov     ecx,[LZW_code_size]

                shr     eax,cl

                mov     ebx,[edi]
                mov     ecx,[LZW_stream_offs]

                shl     eax,cl                       ; shift the code

                add     ecx,[LZW_code_size]          ; offset increment in bits
                or      eax,ebx

                mov     bl,cl
                mov     [edi],eax                   ; write the code, buffer must originally be filled with zero

                shr     cl,3                        ; /8
                and     ebx,07h                     ; new offset in bits

                add     edi,ecx                     ; increment in bytes
                mov     [LZW_stream_offs],ebx
; -------------------------------------------------------------------------
                xor     eax,eax
                mov     al,[esi]

;PrintHex eax,'code'
                jmp @_3
; -------------------------------------------------------------------------
; Found:
; ---------------
@_2:
;PrintHex eax,'found'
                add     ecx,0100h
                pop     esi

                mov     eax,ecx
; ---------------
@_3:
;PrintText '--------'
                cmp     esi,bytes
                jc     @_1
; -------------------------------------------------------------------------
                or      bl,bl
                jz  @_31

                add     edi,1
@_31:
                mov     eax,edi
                sub     eax,dst

                ret
               
LZW_encode      ENDP
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

gabor

Thanks Mark!

I have an AMD Athlon XP 2,5+. As it can be seen from the main source, test.asm the LZW_encode proc is timed. It ran for 364-368 ms. I managed to determine the weak spot (was not hard  :toothy ):


; -------------------------------------------------------------------------
; LZW_lookup
; ---------------
;PrintHex eax,'lookup'
                ...


When I commented out this part the measured time was around 5 ms!!! So if Tedd thought of this problem (algorithm optimization > instruction tweaking) then he was right. From this on the point is to optimize that small lookup loop, but first in my algo...I think of some hashing. Anyway, thanks for the comments!

Here is something more relevant to optimization:


Old version

                push    esi
                mov     esi,[LZW_dictionary]

                sub     ecx,1
                jc  @_10
@_LU1:
                cmp     eax,[esi+ecx*4]
                jz  @_2

                sub     ecx,1
                jnc @_LU1


New version

                push    esi
                mov     esi,[LZW_dictionary]

                sub     ecx,1
                jc  @_10
@_LU1:
               cmp     eax,[esi]
                jz  @_2

                add     esi,4
                sub     ecx,1

                jnc @_LU1


On the left is the old version, this was in the original zip file. The timing results were between 365-368. On the right is another version that does not use displacement with ecx*4, but increments esi. This version gave results between 336-33D. So this is a bit faster...

Greets,gábor

Mirno

Instead of "mov al, [esi]", replace them with "movzx eax, BYTE PTR [esi]". this should avoid the partial register stall when you then later shift eax.

Also find out which is the hottest part of the loop, and make sure it's aligned.

Mirno

roticv

Hello Mirno,

I tried your recommendation, but it does not really make much of a difference. Maybe it is due to the fact that partial register stall was already prevented when he used "xor eax,eax" before the "mov al,[esi]".

Mirno

Yes, you're right. It serves me right for not properly looking at the code.

I think I'd probably replace the pairs of xor / mov, with just the one movzx and try it though.

Where the code states:

mov     ecx,[LZW_code_size]
add     esi,1

shl     eax,cl

mov     ecx,[LZW_entry_count]

Replace the first mov ecx with "mov cl, BYTE PTR [LZW_code_size]" as this will have the same effect.

Also try and find a use for edx if possible (maybe keep a value in a register, so you don't go referencing memory). This could free up cl for shifting, rather than swapping values in and out of ecx.

One thing I find when trying to optimise stuff, I prefer to get rid of the comments. If you read comments when trying to optimise, you can fall into the mindset you were in when you tried to program it (you do A, then B, then C), rather than shuffle things around to better positions.

Mirno

lingo

#7
Gabor,
Pls test and time it for me... :bg

Add some global variables in your .data or .data? segment

SaveESP dd ?
SaveEBP dd ?
SaveEBX dd ?
SaveESI dd ?
SaveEDI dd ?
CurrentRetAddress dd ?
LZW_code_size     dd ?
LZW_entry_count   dd ?
LZW_dictionary    dd ?
LZW_stream_offs   dd ?
LZW_max_entry   dd ?
Memory1           dd ?
Memory2           dd ?
tmp           dd ?

How to call:

@_Read_OK1:
            invoke CloseHandle,hf

            timer_begin1, HIGH_PRIORITY_CLASS

           ;invoke LZW_encode, Memory1, Memory2, tmp
            mov CurrentRetAddress, offset @_ReturnHere1
            jmp LZW_encode
@_ReturnHere1:
            mov tmp,eax


;Encodes compressed data according to the LZW method
OPTION PROLOGUE:NONE                         ;turn it off
OPTION EPILOGUE:NONE
Align 16
db 8Dh,9Bh,0,0,0,0
LZW_encode PROC
           mov   SaveESP, esp                ; save register ESP in global variable SaveESP
           mov   SaveEBP, ebp                ; save register EBP in global variable SaveEBP
           mov   SaveEBX, ebx                ; save register EBX in global variable SaveEBX
           mov   SaveESI, esi                ; save register ESI in global variable SaveESI
           mov   SaveEDI, edi                ; save register EDI in global variable SaveEDI
           mov   esi, Memory1
           mov   edi, Memory2
           add   tmp, esi
           mov   ecx, LZW_code_size
           mov   ebx, LZW_entry_count
           mov   ebp, LZW_dictionary
           mov   esp, LZW_stream_offs
           jmp   StartIt
LZW_add_entry:
           cmp   ebx,LZW_max_entry            ;ebx->LZW_entry_count
           jnl   LZW_write_code
           mov   [ebp+ebx*4], eax             ;ebp->LZW_dictionary
           db    3Eh                          ;DS:segment prefix as a nop
           add   ebx, 1                       ;ebx->LZW_entry_count
LZW_write_code:
           shr   eax, cl                      ;ecx->LZW_code_size
           xchg  esp, ecx                     ;esp->LZW_stream_offs
           shl   eax, cl
           xadd  esp, ecx
            or   [edi], eax
           mov   edx, esp
           shr   edx, 3                       ; /8
           and   esp,07h                      ;new offset in bits
           add   edi,edx                      ;increment in bytes
StartIt:
           cmp   esi, tmp
           movzx eax, byte ptr [esi]
           jnl   Enda
Main_loop:
           shl   eax, cl
           movzx edx, byte ptr [esi+1]
           db    81h, 0C6h,01,0,0,0            ;long form of add esi,1
           add   eax, edx
           mov   edx, ebx                      ;ebx->LZW_entry_count
LZW_lookup:
           sub   edx, 1
            jl   LZW_add_entry
           cmp   eax, [ebp+edx*4]
           jne   LZW_lookup
           cmp   esi, tmp
           lea   eax, [edx+100h]
            jl   Main_loop
Enda:
           xor   eax, eax
           test  esp, esp
           mov   ecx, CurrentRetAddress         ;get return address from
           setne al
           sub   edi, Memory2
           mov   esp, SaveESP                   ;restore register ESP from global variable SaveESP
           mov   ebp, SaveEBP                   ;restore register EBP from global variable SaveEBP
           mov   ebx, SaveEBX                   ;restore register EBX from global variable SaveEBX
           mov   esi, SaveESI                   ;restore register ESI from global variable SaveESI
           add   eax, edi
           mov   edi, SaveEDI                   ;restore register EDI from global variable SaveEDI
           jmp   ecx                            ;Return
LZW_encode ENDP
OPTION PROLOGUE:PROLOGUEDEF                     ;turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF


Regards,
Lingo

gabor

Hi lingo!

I promise I'll do it this week...

Greets, gábor


Mark_Larson

  Some general tips ( I've been super busy).  Keep in mind that sometimes you will get slow downs from trying speed ups.  So time your code before making any changes.  Use that as a baseline.  Make one optimization change ( not multiple) and re-time to see if you get a speed up.  MichaelW posted some great timing macros as a sticky in the Laboratory.


1)  Look at aligning your code and data.  For code, inner loops tend to be the best thing to optimize.  For data you basically align on the size of the data.  2 bytes for word ( and as a general rule, don't use word data or word registers on P4, it runs really slow), 4 bytes for dword, 8 bytes for qword, and 16 bytes for dqword.

2)  You can you pick different single instructions that do the same thing.  For instance let's say you need to zero a register.  There are lots of ways to do that.  However some ways are better than others. If you use XOR to do it, it will break dependency chains and allow your code to run faster.  Heck inserting random XORs in your code for temporary registers can give you speed ups, because it breaks dependencies.  I was helping someone a while back with some GCC code that they had an assemble file for ( .S extension).  I put XORs of the temporary registers in various spots and got a speed up.

3) Sometimes you have several instructions that do something.  There might be more than one way to do that something.  So try different ways of doing it.

4)  Unrolling loops usually gives a good speed up.  Thankfully MASM gave us a great way to do that automatically without having to cut and paste.

5)  If you are converting C code to assembler code look at the variables that get used the most.  Keep them in CPU registers.  That way you reduce memory accesses.

6)  Make your loop variables count down.  Saves an instruction because you can do a JNZ and avoid the CMP instruction.

7)  You can use LEA to increment or decrement a variable and it doesn't affect flags. So you can use it between a ADD/SUB or other ALU instruction, and a CMP ( like if you have a loop counter).  On the downside LEA is a lot slower on the P4, so you are less likely to get a speed up from this on a P4.


   dec eax,1
    lea esi,[esi+4]
    jnz top_of_loop


8)  MOVZX a lot of times can be faster than using XOR and a byte or word move to a dword register.

9)  A lot of times lookup tables can be used to speed up code.  It's a great way to pre-compute things and then just get it from the table, saving time later.

10)  As a general rule for large programs local variables are going to be faster than global variables.  Local variables are created on the fly, but they are created on the stack , which is usually in the cache.  Global variables often cause cache misses, if they haven't been accessed in a while.

11)  For data, put stuff you access together close together in memory.  For instance if you are doing a bunch of operations on some vertex data, keep temporary variables as part of the structure of data would give speed ups, because the data will be less likely to cause cache misses.

12)  For MASM and invoke, you can force MASM to skip the prologue and epilogue code generation. That usually shaves a few cycles off, at the cost of more harder to understand code.

13)  Memory accesses are expensive, so try to avoid pushes/pops and moves from memory to registers.  Once you get a value from memory into a register, try to keep it there.

14) For algorithms there are generally multiple ways to do anything.  So always do searches on the net or ask in these forums about alternate algorithms to do something. 
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm