News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

LUT Optimization

Started by Jimbo, January 26, 2009, 04:01:35 PM

Previous topic - Next topic

Jimbo

Hi everyone,

I've written a procedure which applies a look-up table to an array.  I was just wondering if it can be made any faster.
The main loop is below,
    esi -> source array (bytes)
    edi -> destination array (bytes)
    edx -> look-up table (256 byte elements)
    ecx -> length of array (in bytes)


loopc:

        mov     al, [esi]
        mov     al, [edx+eax]
        mov     [edi], al

        add     esi, 1h
        add     edi, 1h

        sub     ecx, 1h
        jnz     loopc


I'm really struggling to optimize it as it doesn't lend itself to MMX or SSE very well.

Anyone know a better way?

I look forward to your ideas,

Thanks!

drizz

Hi,

precalculate and use movzx (avoid using partial registers as much as you can) .

add esi,ecx
add edi,ecx
neg ecx
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
The truth cannot be learned ... it can only be recognized.

Jimbo

Hi,

That was a quick reply, and impressive too (2x faster on my Core 2 Duo).
Do you think this is about as fast as it gets?  It's running at ~6 clocks/element which isn't bad considering what it's doing.  I've messed around with your code but i can't beat it.

Thanks.

drizz

Well, maybe you could unroll it once or more, but thats about it i think...
To bad that there is no instruction that could lookup 4 or more elements at once. like xlatb but with all 4 bytes not just "al".

add esi,ecx
add edi,ecx
neg ecx
test ecx,1
jnz @F
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
@@: movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
The truth cannot be learned ... it can only be recognized.

Jimg

If it's really really important to be the fastest, and you are processing enough data to make up the initial overhead of table creation, perhaps you could create a lookup table with64K words, one word for each possible two byte combination in, and then do a word at a time.  It may just create a caching situation that is slower, you'd have to try it.

lingo

"I've messed around with your code but.."
it is easier:
loopc:
movzx         eax, byte ptr [esi+ecx-1]
sub ecx, 1
mov al, [edx+eax]
mov [edi+ecx],al
jne loopc

BlackVortex

I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

donkey

Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

On Pentiums I to III INC/DEC is faster, on IV ADD/SUB is faster. Not sure for the x86-64 CPUs or AMD.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Jimbo

Thanks everyone, some interesting ideas.

Jimg, it doesn't need to be the fastest, i just thought seeing as it's a common technique there might be a better way, and there is.


Mark Jones

Quote from: donkey on January 27, 2009, 04:44:37 AM
Not sure for the x86-64 CPUs or AMD.

The newer AMD's seem to favor INC/DEC substantially.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

chrisw

it's sometimes (not always, in my experience) advantageous to combine small stores to a larger one, so on my system, the following code performs best:


sub    ecx,    3
add    esi,    ecx
add    edi,    ecx
neg    ecx
jns    loopc_entry

loopc4:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    ebx,    byte ptr [edx+eax]

    movzx    eax,    byte ptr [esi+ecx+1]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    8
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+2]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    16
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+3]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    24
    or       ebx,    eax

    mov      [edi+ecx],    ebx
    add      ecx,          4

js loopc4

loopc:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    eax,    byte ptr [edx+eax]

    mov      [edi+ecx],  al
    add      ecx,        1

loopc_entry:
    cmp      ecx,    3

jl loopc


With an additional register (e.g. ebp) each two lookups could be performed and combined without any RAW dependency, but i did not test it and do not expect a substantial performance improvement in that case.

Jimg

#11
Not bad Chris.
To test, I modified my old test for using table lookup to convert strings to upper case.

Results:
AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

Method: Min
Iterations: 100
        Clocks   Description
Test 1  616573   Original Jimbo routine
Test 2  614149   drizz routine
Test 3  615432   drizz routine 2
Test 4  611650   lingo
Test 5  346073   chrisw routine
Test 12 611642   using table lookup
Test 13 367586   using table and dword pickup, byte store
Test 14 304890   using table, dword pickup and store, shrshl
Test 15 231323   using table, dword pickup and store, shrshl
Test 18 244933   using table, dword pickup and store, ebp
Test 20 142774   using word size table, dword pickup and store, adjust to even dword

The last one uses the word size table I spoke about above.

Results will vary greatly on a pentium, as my routines were developed on an AMD.

Edit:  added new test below.  Deleted old upload.

chrisw

btw,

typing the program code into my previous post, i missed a jump to the loopc_entry at the start of loopc:   :(



    ...
    add      ecx,          4

js loopc4

jmp loopc_entry
loopc:

    movzx    eax,    byte ptr [esi+ecx]
    ...


old programmer

Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
And I see that you are not using the string operations, either, ...?

Jimg

Updated chris's fix.

Added test 16, avoided small stall by using ebp.

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
Iterations: 100
        Clocks   Description
Test 1  623774   Original Jimbo routine
Test 2  620497   drizz routine
Test 3  622401   drizz routine 2
Test 4  619057   lingo
Test 5  350532   chrisw routine
Test 12 619346   using table lookup
Test 13 373252   using table and dword pickup, byte store
Test 14 300690   using table, dword pickup and store, shrshl
Test 15 235278   using table, dword pickup and store, shrshl
Test 16 173673   using table, dword pickup and store, shrshl, ebp
Test 18 248373   using table, dword pickup and store, ebp
Test 20 146488   using word size table, dword pickup and store, adjust to even dword

[attachment deleted by admin]