LUT Optimization

Started by Jimbo, January 26, 2009, 04:01:35 PM

cmpxchg

Quote from: drizz on January 26, 2009, 04:58:47 PM
Hi,

precalculate and use movzx (avoid using partial registers as much as you can) .

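    add esi,ecx                     ; esi -> end of source buffer
    add edi,ecx                     ; edi -> end of destination buffer
    neg ecx                         ; index runs from -count up to 0
loopc:
    movzx eax,byte ptr [esi+ecx]    ; load source byte, zero-extended
    movzx eax,byte ptr [edx+eax]    ; look it up in the table at edx
    mov [edi+ecx],al                ; store translated byte
    add ecx,1
    jnz loopc                       ; loop until the index reaches 0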


I'd go with this version on Intel:

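    add esi,ecx                     ; esi -> end of source buffer
    add edi,ecx                     ; edi -> end of destination buffer
    xor eax, eax                    ; zero eax once so the upper 24 bits stay 0
    neg ecx                         ; index runs from -count up to 0
loopc:
    mov al,byte ptr [esi+ecx]       ; load source byte into al
    mov al,byte ptr [edx+eax]       ; look it up in the table at edx
    mov [edi+ecx], al               ; store translated byte
    inc ecx
    jnz loopc                       ; loop until the index reaches 0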

mov to a partial register instead of movzx is perfectly fine if the 32-bit reg is zeroed and if you are not moving a constant larger than 127

"inc ecx" - small stall due to partial modification of EFLAGS on the 1st loop pass on the latest CPUs, but no stall on all other passes - the jump is considered taken and some kind of mechanism doesn't need EFLAGS at the moment of the jump
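
For readers less used to the negative-index idiom in the two loops above, here is a rough C equivalent. It is only a sketch: the function name lut_translate and its parameter names are made up for illustration, and it assumes esi = source, edi = destination, edx = 256-byte table and ecx = byte count, as the quoted code implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Translate n bytes from src to dst through a 256-entry table.
   Both pointers are moved to the END of their buffers and a negative
   index counts up to zero, so a single increment both advances the
   index and sets the loop-exit condition. */
void lut_translate(const uint8_t *src, uint8_t *dst,
                   const uint8_t table[256], size_t n)
{
    if (n == 0)
        return;                          /* the asm loop assumes a nonzero count */

    const uint8_t *src_end = src + n;    /* add esi,ecx */
    uint8_t *dst_end = dst + n;          /* add edi,ecx */
    ptrdiff_t i = -(ptrdiff_t)n;         /* neg ecx     */

    do {
        unsigned idx = src_end[i];       /* load byte, zero-extended     */
        dst_end[i] = table[idx];         /* table lookup, then byte store */
    } while (++i != 0);                  /* add ecx,1 / jnz loopc        */
}
```

A compiler is free to pick movzx or a partial-register mov for the load, so this shows the loop shape only, not the timing difference being discussed.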

sinsi

Hmm, from my q6600/win7

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
        Clocks   Description
Test 1  138533   Original Jimbo routine
Test 2  136400   drizz routine
Test 3  136081   drizz routine 2
Test 4  144961   lingo
Test 5  136271   chrisw routine
Test 12 141354   using table lookup
Test 13 270815   using table and dword pickup, byte store
Test 14 415238   using table, dword pickup and store, shrshl
Test 15 408414   using table, dword pickup and store, shrshl
Test 16 314306   using table, dword pickup and store, shrshl, ebp
Test 18 406859   using table, dword pickup and store, ebp
Test 20 78024    using word size table, dword pickup and store, adjust to even dword

So how many lamps is that  :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

Jimg

Now that is strange.  I normally wouldn't be surprised that an Intel is different from an AMD, but I ran it on my old Celeron and got numbers similar to the AMD, so I have no idea what's going on.  Please change line 8 in TimeTest.asm to read tsttype=1 and let me know if you get similar answers.  Also run one of the tests and look at stats.txt to see if there are just a few runs that are blowing up the average. The first column is the clocks, the second column the number of times that clock count occurred, and the last a running total of times the test was run.
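
To make the stats.txt layout concrete, here is a small C sketch that builds the same three columns (clocks, occurrences, running total) from a list of per-run clock counts. It is not the code in TimeTest.asm; StatRow, cmp_u64 and build_stats are hypothetical names used only for this illustration.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t clocks;    /* column 1: clock count                   */
    unsigned count;     /* column 2: runs that took exactly that   */
    unsigned running;   /* column 3: running total of runs so far  */
} StatRow;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place, fills at most max_rows rows in ascending
   clock order, and returns the number of rows written. */
size_t build_stats(uint64_t *samples, size_t n, StatRow *rows, size_t max_rows)
{
    size_t r = 0;
    unsigned running = 0;

    qsort(samples, n, sizeof samples[0], cmp_u64);

    for (size_t i = 0; i < n && r < max_rows; ) {
        size_t j = i;
        while (j < n && samples[j] == samples[i])
            j++;                                  /* skip over equal clock counts */
        running += (unsigned)(j - i);
        rows[r].clocks = samples[i];
        rows[r].count = (unsigned)(j - i);
        rows[r].running = running;
        r++;
        i = j;
    }
    return r;
}
```

With the rows sorted this way, the first row is the minimum, and a few large outliers at the bottom are easy to spot as the runs that would inflate an average.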

sinsi

results.txt

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 1  135684   Original Jimbo routine
Test 2  135648   drizz routine
Test 3  135639   drizz routine 2
Test 4  141093   lingo
Test 5  135540   chrisw routine
Test 12 141354   using table lookup
Test 13 270351   using table and dword pickup, byte store
Test 14 413712   using table, dword pickup and store, shrshl
Test 15 405522   using table, dword pickup and store, shrshl
Test 16 307224   using table, dword pickup and store, shrshl, ebp
Test 18 405531   using table, dword pickup and store, ebp
Test 20 77940    using word size table, dword pickup and store, adjust to even dword

stats.txt

77940 1 1
77958 1 2
77967 2 4
77976 1 5
77985 69 74
77994 23 97
78012 1 98
78111 1 99
137511 1 100

Sorry for the delay, but I had a few problems assembling it via qeditor...?weird

I'll reboot into XP home and see what that says.

sinsi

here you go, from xp home
results.txt

GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 1  135693   Original Jimbo routine
Test 2  135639   drizz routine
Test 3  135639   drizz routine 2
Test 4  141264   lingo
Test 5  135486   chrisw routine
Test 12 140517   using table lookup
Test 13 270351   using table and dword pickup, byte store
Test 14 413712   using table, dword pickup and store, shrshl
Test 15 405522   using table, dword pickup and store, shrshl
Test 16 307224   using table, dword pickup and store, shrshl, ebp
Test 18 405531   using table, dword pickup and store, ebp
Test 20 77985    using word size table, dword pickup and store, adjust to even dword

stats.txt

77985 97 97
77994 3 100


wow xp is so slow...

Jimg

My last post was at 10pm my time, so no problem. :wink

Something is really weird here.  If you get a chance, please just click on the button for test 16 and take a look at the stats file.  The stats are only for the last routine run. Unless there is some really strange caching going on, I have no idea what the problem is.  Here are the results on my Celeron laptop:
GenuineIntel  Family 15  Model 2  Stepping 7
Intel Brand String: Mobile Intel(R) Celeron(R) processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2

Method: Min
Iterations: 100
        Clocks   Description
Test 1  998412   Original Jimbo routine
Test 2  828560   drizz routine
Test 3  811436   drizz routine 2
Test 4  821916   lingo
Test 5  524968   chrisw routine
Test 12 905772   using table lookup
Test 13 604740   using table and dword pickup, byte store
Test 14 501088   using table, dword pickup and store, shrshl
Test 15 444300   using table, dword pickup and store, shrshl
Test 16 368240   using table, dword pickup and store, shrshl, ebp
Test 18 457372   using table, dword pickup and store, ebp
Test 20 280684   using word size table, dword pickup and store, adjust to even dword



Also if you get a chance, please try the attached which is only two tests, the fastest on your machine and mine (discarding the huge table test).
edit: removed short test to save space.

Mark Jones


AuthenticAMD  Family 15  Model 17  Stepping 1
AMD Name String: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Iterations: 100
        Clocks   Description
Test 1  634218   Original Jimbo routine
Test 2  624923   drizz routine
Test 3  626890   drizz routine 2
Test 4  615348   lingo
Test 5  375798   chrisw routine
Test 12 624631   using table lookup
Test 13 369817   using table and dword pickup, byte store
Test 14 298597   using table, dword pickup and store, shrshl
Test 15 244773   using table, dword pickup and store, shrshl
Test 16 157065   using table, dword pickup and store, shrshl, ebp
Test 18 241636   using table, dword pickup and store, ebp
Test 20 157318   using word size table, dword pickup and store, adjust to even dword


It would be nice if the header also included the major Windows data for completeness - this was WinXP x32 SP3 (still trying to decide if I'm ever going to switch to x64 since there are drawbacks...)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Jimg

Thanks Mark.  At least your timings confirm mine.  I thought I was going crazy there for a while.  I'll add version info next time.

sinsi


GenuineIntel  Family 6  Model 15  Stepping 11
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Method: Min
Iterations: 100
        Clocks   Description
Test 16 307233   dword pickup and store, shrshl, ebp

307233 85 85
307242 10 95
338859 1 96
342108 1 97
356796 1 98
366417 1 99
376740 1 100


This is a bit strange though

Intel Brand String: NA, processor is Pentium III

Pentium 3?

Jimg

Clearly I need a later version of MichaelW's routine :red

sinsi

Using CPUID with EAX=80000002,80000003,80000004 gives me a string "Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz" but seems to be for P4 or better.
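
For reference, a hedged C sketch of that brand-string read. Each of the three leaves returns 16 bytes in EAX, EBX, ECX, EDX, and the 48 bytes are concatenated in that order. The names brand_from_regs and read_brand_string are invented for this example; the hardware read assumes GCC or Clang on x86 and uses __get_cpuid from <cpuid.h>, which returns 0 when the leaf is not supported.

```c
#include <stdint.h>

/* regs holds EAX,EBX,ECX,EDX for leaves 0x80000002, 0x80000003 and
   0x80000004, in that order. Bytes are taken low byte first, which is
   the order the characters are stored in. */
void brand_from_regs(const uint32_t regs[12], char out[49])
{
    for (int i = 0; i < 12; i++)
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (char)((regs[i] >> (8 * b)) & 0xFFu);
    out[48] = '\0';
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

/* Returns 1 and fills out on success, 0 if the CPU does not report
   the extended brand-string leaves. */
int read_brand_string(char out[49])
{
    uint32_t regs[12];
    for (unsigned leaf = 0; leaf < 3; leaf++) {
        unsigned a, b, c, d;
        if (!__get_cpuid(0x80000002u + leaf, &a, &b, &c, &d))
            return 0;                 /* leaf unsupported on this CPU */
        regs[4 * leaf + 0] = a;
        regs[4 * leaf + 1] = b;
        regs[4 * leaf + 2] = c;
        regs[4 * leaf + 3] = d;
    }
    brand_from_regs(regs, out);
    return 1;
}
#endif
```

A routine that falls back to the older brand-ID lookup when these leaves are missing would explain a "NA" result on CPUs that do support them, but that is a guess about the test program, not something confirmed here.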

Jimg

Okay, I downloaded the latest CPU ID info from Intel and AMD.
AMD's document is 25 pages long, Intel's is 100 pages.  It may take me a while to get everything updated. ::)