The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Jimbo on January 26, 2009, 04:01:35 PM

Title: LUT Optimization
Post by: Jimbo on January 26, 2009, 04:01:35 PM
Hi everyone,

I've written a procedure which applies a look-up table to an array.  I was just wondering if it can be made any faster.
The main loop is below,
    esi -> source array (bytes)
    edi -> destination array (bytes)
    edx -> look-up table (256 byte elements)
    ecx -> length of array (in bytes)


loopc:

        mov     al, [esi]
        mov     al, [edx+eax]
        mov     [edi], al

        add     esi, 1h
        add     edi, 1h

        sub     ecx, 1h
        jnz     loopc


I'm really struggling to optimize it as it doesn't lend itself to MMX or SSE very well.

Anyone know a better way?

I look forward to your ideas,

Thanks!
Title: Re: LUT Optimization
Post by: drizz on January 26, 2009, 04:58:47 PM
Hi,

precalculate and use movzx (avoid using partial registers as much as you can) .

add esi,ecx
add edi,ecx
neg ecx
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
Title: Re: LUT Optimization
Post by: Jimbo on January 26, 2009, 09:18:05 PM
Hi,

That was a quick reply, and impressive too (2x faster on my Core 2 Duo).
Do you think this is about as fast as it gets?  It's running at ~6 clocks/element which isn't bad considering what it's doing.  I've messed around with your code but i can't beat it.

Thanks.
Title: Re: LUT Optimization
Post by: drizz on January 27, 2009, 01:41:11 AM
Well, maybe you could unroll it once or more, but thats about it i think...
To bad that there is no instruction that could lookup 4 or more elements at once. like xlatb but with all 4 bytes not just "al".

add esi,ecx
add edi,ecx
neg ecx
test ecx,1
jnz @F
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
@@: movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
Title: Re: LUT Optimization
Post by: Jimg on January 27, 2009, 02:01:59 AM
If it's really really important to be the fastest, and you are processing enough data to make up the initial overhead of table creation, perhaps you could create a lookup table with64K words, one word for each possible two byte combination in, and then do a word at a time.  It may just create a caching situation that is slower, you'd have to try it.
Title: Re: LUT Optimization
Post by: lingo on January 27, 2009, 02:14:12 AM
"I've messed around with your code but.."
it is easier:
loopc:
movzx         eax, byte ptr [esi+ecx-1]
sub ecx, 1
mov al, [edx+eax]
mov [edi+ecx],al
jne loopc
Title: Re: LUT Optimization
Post by: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
Title: Re: LUT Optimization
Post by: donkey on January 27, 2009, 04:44:37 AM
Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?

On Pentiums I to III INC/DEC is faster, on IV ADD/SUB is faster. Not sure for the x86-64 CPUs or AMD.
Title: Re: LUT Optimization
Post by: Jimbo on January 27, 2009, 03:26:59 PM
Thanks everyone, some interesting ideas.

Jimg, it doesn't need to be the fastest, i just thought seeing as it's a common technique there might be a better way, and there is.

Title: Re: LUT Optimization
Post by: Mark Jones on January 27, 2009, 03:51:38 PM
Quote from: donkey on January 27, 2009, 04:44:37 AM
Not sure for the x86-64 CPUs or AMD.

The newer AMD's seem to favor INC/DEC substantially.
Title: Re: LUT Optimization
Post by: chrisw on January 30, 2009, 01:24:48 PM
it's sometimes (not always, in my experience) advantageous to combine small stores to a larger one, so on my system, the following code performs best:


sub    ecx,    3
add    esi,    ecx
add    edi,    ecx
neg    ecx
jns    loopc_entry

loopc4:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    ebx,    byte ptr [edx+eax]

    movzx    eax,    byte ptr [esi+ecx+1]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    8
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+2]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    16
    or       ebx,    eax

    movzx    eax,    byte ptr [esi+ecx+3]
    movzx    eax,    byte ptr [edx+eax]
    shl      eax,    24
    or       ebx,    eax

    mov      [edi+ecx],    ebx
    add      ecx,          4

js loopc4

loopc:

    movzx    eax,    byte ptr [esi+ecx]
    movzx    eax,    byte ptr [edx+eax]

    mov      [edi+ecx],  al
    add      ecx,        1

loopc_entry:
    cmp      ecx,    3

jl loopc


With an additional register (e.g. ebp) each two lookups could be performed and combined without any RAW dependency, but i did not test it and do not expect a substantial performance improvement in that case.
Title: Re: LUT Optimization
Post by: Jimg on January 30, 2009, 03:32:26 PM
Not bad Chris.
To test, I modified my old test for using table lookup to convert strings to upper case.

Results:
AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

Method: Min
Iterations: 100
        Clocks   Description
Test 1  616573   Original Jimbo routine
Test 2  614149   drizz routine
Test 3  615432   drizz routine 2
Test 4  611650   lingo
Test 5  346073   chrisw routine
Test 12 611642   using table lookup
Test 13 367586   using table and dword pickup, byte store
Test 14 304890   using table, dword pickup and store, shrshl
Test 15 231323   using table, dword pickup and store, shrshl
Test 18 244933   using table, dword pickup and store, ebp
Test 20 142774   using word size table, dword pickup and store, adjust to even dword

The last one uses the word size table I spoke about above.

Results will vary greatly on a pentium, as my routines were developed on an AMD.

Edit:  added new test below.  Deleted old upload.
Title: Re: LUT Optimization
Post by: chrisw on January 30, 2009, 04:19:30 PM
btw,

typing the program code into my previous post, i missed a jump to the loopc_entry at the start of loopc:   :(



    ...
    add      ecx,          4

js loopc4

jmp loopc_entry
loopc:

    movzx    eax,    byte ptr [esi+ecx]
    ...

Title: Re: LUT Optimization
Post by: old programmer on January 30, 2009, 05:16:17 PM
Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
And I see that you are not using the string operations, either, ...?
Title: Re: LUT Optimization
Post by: Jimg on January 31, 2009, 12:21:15 AM
Updated chris's fix.

Added test 16, avoided small stall by using ebp.

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
Iterations: 100
        Clocks   Description
Test 1  623774   Original Jimbo routine
Test 2  620497   drizz routine
Test 3  622401   drizz routine 2
Test 4  619057   lingo
Test 5  350532   chrisw routine
Test 12 619346   using table lookup
Test 13 373252   using table and dword pickup, byte store
Test 14 300690   using table, dword pickup and store, shrshl
Test 15 235278   using table, dword pickup and store, shrshl
Test 16 173673   using table, dword pickup and store, shrshl, ebp
Test 18 248373   using table, dword pickup and store, ebp
Test 20 146488   using word size table, dword pickup and store, adjust to even dword

[attachment deleted by admin]
Title: Re: LUT Optimization
Post by: cmpxchg on January 31, 2009, 04:59:38 AM
Quote from: drizz on January 26, 2009, 04:58:47 PM
Hi,

precalculate and use movzx (avoid using partial registers as much as you can) .

add esi,ecx
add edi,ecx
neg ecx
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc


I'd go with this version on Intel:

add esi,ecx
add edi,ecx
xor eax, eax
neg ecx
loopc:
mov al,byte ptr [esi+ecx]
mov al,byte ptr [edx+eax]
mov [edi+ecx], al
inc ecx
jnz loopc

mov to partial register instead of movzx is perfectly fine if 32bit reg zerod and if you are not moving constant larger that 127

"inc ecx" - small stall due to partial modification of EFLAGS on 1st loop pass on latest CPUs but no stall on all other passes - jump is considered taken and some kind of mechanism doesn't need EFLAGS at the moment of the jump
Title: Re: LUT Optimization
Post by: sinsi on January 31, 2009, 05:07:01 AM
Hmm, from my q6600/win7

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
        Clocks   Description
Test 1  138533   Original Jimbo routine
Test 2  136400   drizz routine
Test 3  136081   drizz routine 2
Test 4  144961   lingo
Test 5  136271   chrisw routine
Test 12 141354   using table lookup
Test 13 270815   using table and dword pickup, byte store
Test 14 415238   using table, dword pickup and store, shrshl
Test 15 408414   using table, dword pickup and store, shrshl
Test 16 314306   using table, dword pickup and store, shrshl, ebp
Test 18 406859   using table, dword pickup and store, ebp
Test 20 78024    using word size table, dword pickup and store, adjust to even dword

So how many lamps is that  :bg
Title: Re: LUT Optimization
Post by: Jimg on January 31, 2009, 06:14:37 AM
Now that is strange.  I normally wouldn't be surprised that an intel is different than amd, but I ran it on my old celeron and got similar numbers to the amd so I have no idea what's going on.  Please change line 8 in TimeTest.asm to read tsttype=1 and let me know if you get similar answers.  Also run one of the tests and look at stats.txt and see if there are just a few runs that are blowing up the average. The first column is the clocks, second column the number of times for that clock amount, and the last a running total of times the test was run.
Title: Re: LUT Optimization
Post by: sinsi on January 31, 2009, 06:44:24 AM
results.txt

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 1  135684   Original Jimbo routine
Test 2  135648   drizz routine
Test 3  135639   drizz routine 2
Test 4  141093   lingo
Test 5  135540   chrisw routine
Test 12 141354   using table lookup
Test 13 270351   using table and dword pickup, byte store
Test 14 413712   using table, dword pickup and store, shrshl
Test 15 405522   using table, dword pickup and store, shrshl
Test 16 307224   using table, dword pickup and store, shrshl, ebp
Test 18 405531   using table, dword pickup and store, ebp
Test 20 77940    using word size table, dword pickup and store, adjust to even dword

stats.txt

77940 1 1
77958 1 2
77967 2 4
77976 1 5
77985 69 74
77994 23 97
78012 1 98
78111 1 99
137511 1 100

Sorry for the delay, but I had a few problems assembling it via qeditor...?weird

I'll reboot into XP home and see what that says.
Title: Re: LUT Optimization
Post by: sinsi on January 31, 2009, 06:51:43 AM
here you go, from xp home
results.txt

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 1  135693   Original Jimbo routine
Test 2  135639   drizz routine
Test 3  135639   drizz routine 2
Test 4  141264   lingo
Test 5  135486   chrisw routine
Test 12 140517   using table lookup
Test 13 270351   using table and dword pickup, byte store
Test 14 413712   using table, dword pickup and store, shrshl
Test 15 405522   using table, dword pickup and store, shrshl
Test 16 307224   using table, dword pickup and store, shrshl, ebp
Test 18 405531   using table, dword pickup and store, ebp
Test 20 77985    using word size table, dword pickup and store, adjust to even dword

stats.txt

77985 97 97
77994 3 100


wow xp is so slow...
Title: Re: LUT Optimization
Post by: Jimg on January 31, 2009, 01:58:15 PM
My last post was at 10pm my time, so no problem. :wink

Something is really weird here.  If you get a chance, please just click on the button for test 16 and take a look at the stats file.  The stats are only for the last routine run. Unless there is some really strange cacheing going on, I have no idea what the problem is.  Here is the results on my celeron laptop-
GenuineIntel  Family 15  Model 2  Stepping 7
Intel Brand String: Mobile Intel(R) Celeron(R) processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2

Method: Min
Iterations: 100
        Clocks   Description
Test 1  998412   Original Jimbo routine
Test 2  828560   drizz routine
Test 3  811436   drizz routine 2
Test 4  821916   lingo
Test 5  524968   chrisw routine
Test 12 905772   using table lookup
Test 13 604740   using table and dword pickup, byte store
Test 14 501088   using table, dword pickup and store, shrshl
Test 15 444300   using table, dword pickup and store, shrshl
Test 16 368240   using table, dword pickup and store, shrshl, ebp
Test 18 457372   using table, dword pickup and store, ebp
Test 20 280684   using word size table, dword pickup and store, adjust to even dword



Also if you get a chance, please try the attached which is only two tests, the fastest on your machine and mine (discarding the huge table test).
edit: removed short test to save space.
Title: Re: LUT Optimization
Post by: Mark Jones on January 31, 2009, 03:52:03 PM

AuthenticAMD  Family 15  Model 17  Stepping 1
AMD Name String: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
        Clocks   Description
Test 1  634218   Original Jimbo routine
Test 2  624923   drizz routine
Test 3  626890   drizz routine 2
Test 4  615348   lingo
Test 5  375798   chrisw routine
Test 12 624631   using table lookup
Test 13 369817   using table and dword pickup, byte store
Test 14 298597   using table, dword pickup and store, shrshl
Test 15 244773   using table, dword pickup and store, shrshl
Test 16 157065   using table, dword pickup and store, shrshl, ebp
Test 18 241636   using table, dword pickup and store, ebp
Test 20 157318   using word size table, dword pickup and store, adjust to even dword


It would be nice if the header also included the major Windows data for completeness - this was WinXP x32 SP3 (still trying to decide if I'm ever going to switch to x64 since there are drawbacks...)
Title: Re: LUT Optimization
Post by: Jimg on January 31, 2009, 04:44:32 PM
Thanks Mark.  At least your timings confirm mine.  I thought I was going more crazy there for a while.  I'll add version info next time.
Title: Re: LUT Optimization
Post by: sinsi on January 31, 2009, 10:10:52 PM

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 16 307233   dword pickup and store, shrshl, ebp

307233 85 85
307242 10 95
338859 1 96
342108 1 97
356796 1 98
366417 1 99
376740 1 100


This is a bit strange though

Intel Brand String: NA, processor is Pentium III

Pentium 3?
Title: Re: LUT Optimization
Post by: Jimg on January 31, 2009, 11:37:27 PM
Clearly I need a later version of MichaelW's routine :red
Title: Re: LUT Optimization
Post by: sinsi on February 01, 2009, 12:19:19 AM
Using CPUID with EAX=80000002,80000003,80000004 gives me a string "Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz" but seems to be for P4 or better.
Title: Re: LUT Optimization
Post by: Jimg on February 01, 2009, 12:39:29 AM
Okay, I downloaded the latest cpu id info from intel and amd. 
Amd's document is 25 pages long, Intel's is 100 pages.  It may take me awhile to get everything updated. ::)