Hi everyone,
I've written a procedure which applies a look-up table to an array. I was just wondering if it can be made any faster.
The main loop is below,
esi -> source array (bytes)
edi -> destination array (bytes)
edx -> look-up table (256 byte elements)
ecx -> length of array (in bytes)
loopc:
mov al, [esi]
mov al, [edx+eax]
mov [edi], al
add esi, 1h
add edi, 1h
sub ecx, 1h
jnz loopc
I'm really struggling to optimize it as it doesn't lend itself to MMX or SSE very well.
Anyone know a better way?
I look forward to your ideas,
Thanks!
Hi,
precalculate and use movzx (avoid using partial registers as much as you can) .
add esi,ecx
add edi,ecx
neg ecx
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
Hi,
That was a quick reply, and impressive too (2x faster on my Core 2 Duo).
Do you think this is about as fast as it gets? It's running at ~6 clocks/element which isn't bad considering what it's doing. I've messed around with your code but i can't beat it.
Thanks.
Well, maybe you could unroll it once or more, but thats about it i think...
To bad that there is no instruction that could lookup 4 or more elements at once. like xlatb but with all 4 bytes not just "al".
add esi,ecx
add edi,ecx
neg ecx
test ecx,1
jnz @F
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
@@: movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
If it's really really important to be the fastest, and you are processing enough data to make up the initial overhead of table creation, perhaps you could create a lookup table with64K words, one word for each possible two byte combination in, and then do a word at a time. It may just create a caching situation that is slower, you'd have to try it.
"I've messed around with your code but.."
it is easier:
loopc:
movzx eax, byte ptr [esi+ecx-1]
sub ecx, 1
mov al, [edx+eax]
mov [edi+ecx],al
jne loopc
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
On Pentiums I to III INC/DEC is faster, on IV ADD/SUB is faster. Not sure for the x86-64 CPUs or AMD.
Thanks everyone, some interesting ideas.
Jimg, it doesn't need to be the fastest, i just thought seeing as it's a common technique there might be a better way, and there is.
Quote from: donkey on January 27, 2009, 04:44:37 AM
Not sure for the x86-64 CPUs or AMD.
The newer AMD's seem to favor INC/DEC substantially.
it's sometimes (not always, in my experience) advantageous to combine small stores to a larger one, so on my system, the following code performs best:
sub ecx, 3
add esi, ecx
add edi, ecx
neg ecx
jns loopc_entry
loopc4:
movzx eax, byte ptr [esi+ecx]
movzx ebx, byte ptr [edx+eax]
movzx eax, byte ptr [esi+ecx+1]
movzx eax, byte ptr [edx+eax]
shl eax, 8
or ebx, eax
movzx eax, byte ptr [esi+ecx+2]
movzx eax, byte ptr [edx+eax]
shl eax, 16
or ebx, eax
movzx eax, byte ptr [esi+ecx+3]
movzx eax, byte ptr [edx+eax]
shl eax, 24
or ebx, eax
mov [edi+ecx], ebx
add ecx, 4
js loopc4
loopc:
movzx eax, byte ptr [esi+ecx]
movzx eax, byte ptr [edx+eax]
mov [edi+ecx], al
add ecx, 1
loopc_entry:
cmp ecx, 3
jl loopc
With an additional register (e.g. ebp) each two lookups could be performed and combined without any RAW dependency, but i did not test it and do not expect a substantial performance improvement in that case.
Not bad Chris.
To test, I modified my old test for using table lookup to convert strings to upper case.
Results:
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
Method: Min
Iterations: 100
Clocks Description
Test 1 616573 Original Jimbo routine
Test 2 614149 drizz routine
Test 3 615432 drizz routine 2
Test 4 611650 lingo
Test 5 346073 chrisw routine
Test 12 611642 using table lookup
Test 13 367586 using table and dword pickup, byte store
Test 14 304890 using table, dword pickup and store, shrshl
Test 15 231323 using table, dword pickup and store, shrshl
Test 18 244933 using table, dword pickup and store, ebp
Test 20 142774 using word size table, dword pickup and store, adjust to even dword
The last one uses the word size table I spoke about above.
Results will vary greatly on a pentium, as my routines were developed on an AMD.
Edit: added new test below. Deleted old upload.
btw,
typing the program code into my previous post, i missed a jump to the loopc_entry at the start of loopc: :(
...
add ecx, 4
js loopc4
jmp loopc_entry
loopc:
movzx eax, byte ptr [esi+ecx]
...
Quote from: BlackVortex on January 27, 2009, 03:17:26 AM
I see you guys using add and sub to increment and decrement values. Are inc/dec slower ?
And I see that you are not using the string operations, either, ...?
Updated chris's fix.
Added test 16, avoided small stall by using ebp.
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
Iterations: 100
Clocks Description
Test 1 623774 Original Jimbo routine
Test 2 620497 drizz routine
Test 3 622401 drizz routine 2
Test 4 619057 lingo
Test 5 350532 chrisw routine
Test 12 619346 using table lookup
Test 13 373252 using table and dword pickup, byte store
Test 14 300690 using table, dword pickup and store, shrshl
Test 15 235278 using table, dword pickup and store, shrshl
Test 16 173673 using table, dword pickup and store, shrshl, ebp
Test 18 248373 using table, dword pickup and store, ebp
Test 20 146488 using word size table, dword pickup and store, adjust to even dword
[attachment deleted by admin]
Quote from: drizz on January 26, 2009, 04:58:47 PM
Hi,
precalculate and use movzx (avoid using partial registers as much as you can) .
add esi,ecx
add edi,ecx
neg ecx
loopc:
movzx eax,byte ptr [esi+ecx]
movzx eax,byte ptr [edx+eax]
mov [edi+ecx],al
add ecx,1
jnz loopc
I'd go with this version on Intel:
add esi,ecx
add edi,ecx
xor eax, eax
neg ecx
loopc:
mov al,byte ptr [esi+ecx]
mov al,byte ptr [edx+eax]
mov [edi+ecx], al
inc ecx
jnz loopc
mov to partial register instead of movzx is perfectly fine if 32bit reg zerod and if you are not moving constant larger that 127
"inc ecx" - small stall due to partial modification of EFLAGS on 1st loop pass on latest CPUs but no stall on all other passes - jump is considered taken and some kind of mechanism doesn't need EFLAGS at the moment of the jump
Hmm, from my q6600/win7
GenuineIntel Family 6 Model 15 Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
Clocks Description
Test 1 138533 Original Jimbo routine
Test 2 136400 drizz routine
Test 3 136081 drizz routine 2
Test 4 144961 lingo
Test 5 136271 chrisw routine
Test 12 141354 using table lookup
Test 13 270815 using table and dword pickup, byte store
Test 14 415238 using table, dword pickup and store, shrshl
Test 15 408414 using table, dword pickup and store, shrshl
Test 16 314306 using table, dword pickup and store, shrshl, ebp
Test 18 406859 using table, dword pickup and store, ebp
Test 20 78024 using word size table, dword pickup and store, adjust to even dword
So how many lamps is that :bg
Now that is strange. I normally wouldn't be surprised that an intel is different than amd, but I ran it on my old celeron and got similar numbers to the amd so I have no idea what's going on. Please change line 8 in TimeTest.asm to read tsttype=1 and let me know if you get similar answers. Also run one of the tests and look at stats.txt and see if there are just a few runs that are blowing up the average. The first column is the clocks, second column the number of times for that clock amount, and the last a running total of times the test was run.
results.txt
GenuineIntel Family 6 Model 15 Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Method: Min
Iterations: 100
Clocks Description
Test 1 135684 Original Jimbo routine
Test 2 135648 drizz routine
Test 3 135639 drizz routine 2
Test 4 141093 lingo
Test 5 135540 chrisw routine
Test 12 141354 using table lookup
Test 13 270351 using table and dword pickup, byte store
Test 14 413712 using table, dword pickup and store, shrshl
Test 15 405522 using table, dword pickup and store, shrshl
Test 16 307224 using table, dword pickup and store, shrshl, ebp
Test 18 405531 using table, dword pickup and store, ebp
Test 20 77940 using word size table, dword pickup and store, adjust to even dword
stats.txt
77940 1 1
77958 1 2
77967 2 4
77976 1 5
77985 69 74
77994 23 97
78012 1 98
78111 1 99
137511 1 100
Sorry for the delay, but I had a few problems assembling it via qeditor...?weird
I'll reboot into XP home and see what that says.
here you go, from xp home
results.txt
GenuineIntel Family 6 Model 15 Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Method: Min
Iterations: 100
Clocks Description
Test 1 135693 Original Jimbo routine
Test 2 135639 drizz routine
Test 3 135639 drizz routine 2
Test 4 141264 lingo
Test 5 135486 chrisw routine
Test 12 140517 using table lookup
Test 13 270351 using table and dword pickup, byte store
Test 14 413712 using table, dword pickup and store, shrshl
Test 15 405522 using table, dword pickup and store, shrshl
Test 16 307224 using table, dword pickup and store, shrshl, ebp
Test 18 405531 using table, dword pickup and store, ebp
Test 20 77985 using word size table, dword pickup and store, adjust to even dword
stats.txt
77985 97 97
77994 3 100
wow xp is so slow...
My last post was at 10pm my time, so no problem. :wink
Something is really weird here. If you get a chance, please just click on the button for test 16 and take a look at the stats file. The stats are only for the last routine run. Unless there is some really strange cacheing going on, I have no idea what the problem is. Here is the results on my celeron laptop-
GenuineIntel Family 15 Model 2 Stepping 7
Intel Brand String: Mobile Intel(R) Celeron(R) processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2
Method: Min
Iterations: 100
Clocks Description
Test 1 998412 Original Jimbo routine
Test 2 828560 drizz routine
Test 3 811436 drizz routine 2
Test 4 821916 lingo
Test 5 524968 chrisw routine
Test 12 905772 using table lookup
Test 13 604740 using table and dword pickup, byte store
Test 14 501088 using table, dword pickup and store, shrshl
Test 15 444300 using table, dword pickup and store, shrshl
Test 16 368240 using table, dword pickup and store, shrshl, ebp
Test 18 457372 using table, dword pickup and store, ebp
Test 20 280684 using word size table, dword pickup and store, adjust to even dword
Also if you get a chance, please try the attached which is only two tests, the fastest on your machine and mine (discarding the huge table test).
edit: removed short test to save space.
AuthenticAMD Family 15 Model 17 Stepping 1
AMD Name String: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
Clocks Description
Test 1 634218 Original Jimbo routine
Test 2 624923 drizz routine
Test 3 626890 drizz routine 2
Test 4 615348 lingo
Test 5 375798 chrisw routine
Test 12 624631 using table lookup
Test 13 369817 using table and dword pickup, byte store
Test 14 298597 using table, dword pickup and store, shrshl
Test 15 244773 using table, dword pickup and store, shrshl
Test 16 157065 using table, dword pickup and store, shrshl, ebp
Test 18 241636 using table, dword pickup and store, ebp
Test 20 157318 using word size table, dword pickup and store, adjust to even dword
It would be nice if the header also included the major Windows data for completeness - this was WinXP x32 SP3 (still trying to decide if I'm ever going to switch to x64 since there are drawbacks...)
Thanks Mark. At least your timings confirm mine. I thought I was going more crazy there for a while. I'll add version info next time.
GenuineIntel Family 6 Model 15 Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Method: Min
Iterations: 100
Clocks Description
Test 16 307233 dword pickup and store, shrshl, ebp
307233 85 85
307242 10 95
338859 1 96
342108 1 97
356796 1 98
366417 1 99
376740 1 100
This is a bit strange though
Intel Brand String: NA, processor is Pentium III
Pentium 3?
Clearly I need a later version of MichaelW's routine :red
Using CPUID with EAX=80000002,80000003,80000004 gives me a string "Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz" but seems to be for P4 or better.
Okay, I downloaded the latest cpu id info from intel and amd.
Amd's document is 25 pages long, Intel's is 100 pages. It may take me awhile to get everything updated. ::)