LUT Optimization

Jimbo · January 26, 2009, 04:01:35 PM

Hi everyone,

I've written a procedure which applies a look-up table to an array. I was just wondering if it can be made any faster.
The main loop is below,
esi -> source array (bytes)
edi -> destination array (bytes)
edx -> look-up table (256 byte elements)
ecx -> length of array (in bytes)

Code Select


loopc:

        mov     al, [esi]
        mov     al, [edx+eax]
        mov     [edi], al

        add     esi, 1h
        add     edi, 1h

        sub     ecx, 1h
        jnz     loopc

I'm really struggling to optimize it as it doesn't lend itself to MMX or SSE very well.

Anyone know a better way?

I look forward to your ideas,

Thanks!

drizz · January 26, 2009, 04:58:47 PM

Hi,

precalculate and use movzx (avoid using partial registers as much as you can) .

Code Select

	add esi,ecx
	add edi,ecx
	neg ecx
loopc:
	movzx eax,byte ptr [esi+ecx]
	movzx eax,byte ptr [edx+eax]
	mov [edi+ecx],al
	add ecx,1
	jnz loopc

Jimbo · January 26, 2009, 09:18:05 PM

Hi,

That was a quick reply, and impressive too (2x faster on my Core 2 Duo).
Do you think this is about as fast as it gets? It's running at ~6 clocks/element which isn't bad considering what it's doing. I've messed around with your code but i can't beat it.

Thanks.

drizz · January 27, 2009, 01:41:11 AM

Well, maybe you could unroll it once or more, but thats about it i think...
To bad that there is no instruction that could lookup 4 or more elements at once. like xlatb but with all 4 bytes not just "al".

Code Select

	add esi,ecx
	add edi,ecx
	neg ecx
	test ecx,1
	jnz @F
loopc:
	movzx eax,byte ptr [esi+ecx]
	movzx eax,byte ptr [edx+eax]
	mov [edi+ecx],al
	add ecx,1
@@:	movzx eax,byte ptr [esi+ecx]
	movzx eax,byte ptr [edx+eax]
	mov [edi+ecx],al
	add ecx,1
	jnz loopc

Jimg · January 27, 2009, 02:01:59 AM

If it's really really important to be the fastest, and you are processing enough data to make up the initial overhead of table creation, perhaps you could create a lookup table with64K words, one word for each possible two byte combination in, and then do a word at a time. It may just create a caching situation that is slower, you'd have to try it.

lingo · January 27, 2009, 02:14:12 AM

"I've messed around with your code but.."
it is easier:

Code Select


loopc:
	movzx 	        eax, byte ptr [esi+ecx-1]
	sub		ecx, 1
	mov		al, [edx+eax]
	mov		[edi+ecx],al
	jne		loopc

BlackVortex · January 27, 2009, 03:17:26 AM

I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

donkey · January 27, 2009, 04:44:37 AM

Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

On Pentiums I to III INC/DEC is faster, on IV ADD/SUB is faster. Not sure for the x86-64 CPUs or AMD.

Jimbo · January 27, 2009, 03:26:59 PM

Thanks everyone, some interesting ideas.

Jimg, it doesn't need to be the fastest, i just thought seeing as it's a common technique there might be a better way, and there is.

Mark Jones · January 27, 2009, 03:51:38 PM

Quote from: donkey on January 27, 2009, 04:44:37 AM
Not sure for the x86-64 CPUs or AMD.

The newer AMD's seem to favor INC/DEC substantially.

chrisw · January 30, 2009, 01:24:48 PM

it's sometimes (not always, in my experience) advantageous to combine small stores to a larger one, so on my system, the following code performs best:

Code Select


sub    ecx,    3
add    esi,    ecx
add    edi,    ecx
neg    ecx
jns    loopc_entry

loopc4:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    ebx,    byte ptr [edx+eax]

    movzx    eax,    byte ptr [esi+ecx+1]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    8
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+2]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    16
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+3]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    24
    or       ebx,    eax

    mov      [edi+ecx],    ebx
    add      ecx,          4

js loopc4

loopc:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    eax,    byte ptr [edx+eax]
 
    mov      [edi+ecx],  al
    add      ecx,        1

loopc_entry:
    cmp      ecx,    3

jl loopc

With an additional register (e.g. ebp) each two lookups could be performed and combined without any RAW dependency, but i did not test it and do not expect a substantial performance improvement in that case.

Jimg · January 30, 2009, 03:32:26 PM

Not bad Chris.
To test, I modified my old test for using table lookup to convert strings to upper case.

Results:

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE 

Method: Min
Iterations: 100
        Clocks   Description
Test 1  616573   Original Jimbo routine
Test 2  614149   drizz routine
Test 3  615432   drizz routine 2
Test 4  611650   lingo
Test 5  346073   chrisw routine
Test 12 611642   using table lookup
Test 13 367586   using table and dword pickup, byte store
Test 14 304890   using table, dword pickup and store, shrshl
Test 15 231323   using table, dword pickup and store, shrshl
Test 18 244933   using table, dword pickup and store, ebp
Test 20 142774   using word size table, dword pickup and store, adjust to even dword

The last one uses the word size table I spoke about above.

Results will vary greatly on a pentium, as my routines were developed on an AMD.

Edit: added new test below. Deleted old upload.

chrisw · January 30, 2009, 04:19:30 PM

btw,

typing the program code into my previous post, i missed a jump to the loopc_entry at the start of loopc: :(

Code Select



    ...
    add      ecx,          4

js loopc4

jmp loopc_entry
loopc:

    movzx    eax,    byte ptr [esi+ecx]
    ...

old programmer · January 30, 2009, 05:16:17 PM

Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

And I see that you are not using the string operations, either, ...?

Jimg · January 31, 2009, 12:21:15 AM

Updated chris's fix.

Added test 16, avoided small stall by using ebp.

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE 
Iterations: 100
        Clocks   Description
Test 1  623774   Original Jimbo routine
Test 2  620497   drizz routine
Test 3  622401   drizz routine 2
Test 4  619057   lingo
Test 5  350532   chrisw routine
Test 12 619346   using table lookup
Test 13 373252   using table and dword pickup, byte store
Test 14 300690   using table, dword pickup and store, shrshl
Test 15 235278   using table, dword pickup and store, shrshl
Test 16 173673   using table, dword pickup and store, shrshl, ebp
Test 18 248373   using table, dword pickup and store, ebp
Test 20 146488   using word size table, dword pickup and store, adjust to even dword

[attachment deleted by admin]

News:

LUT Optimization

Jimbo

Jimbo

Jimbo

old programmer