I have the need to find a specific slot in an array quickly.
I could just iterate from one end to the other, but I'm wondering if some of you know a faster way, possibly even using MMX etc.
So given a sequence of 32bit values, how can I find the first zero the fastest ?
Rgs, James.
James,
I don't have the time right now, but the answer is in that inner loop coming from the strlen32s proc in the strlen thread. You will need the pcmpeqd variants.
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B
@@: sub eax, [esp-4] ; subtract original src pointer
Or, just keep track of where the next empty slot is.
Initially it will be the first, then second, and so on as the array fills up.
When you remove an item, it then becomes the next empty slot. The next slot after that will require some search, but only from that point onwards.
(Or even make it a linked list of empty slot pointers - each empty slot points to the next empty slot.)
Thanks Ted & JJ,
I should have mentioned that the array of 32 bit values can have random slots made empty, so keeping a track of
next free isnt possible. However, Ill think about the free slot pointing to next free slot approach.
Rgs, James.
Good to see you're still alive James. :bg
I know it's not what you're looking for, but my vote is for the linked-list. It is powerful and fast. One modification could be made, in that since you'd only be concerned with whether or not a DWORD contains a value or not, then one bit could represent a DWORD in the array. Thus, the entire array "full flags" could fit in 2n bits, where n is the size of the array. (Thus the "list" could be very small in relation to the array.) Of course, this would take a little thinking to implement effectively - just an idea.
REP SCASD, Baby! It's probably not the fastest, but you don't get smaller than a single instruction (some of us still irresponsibly optimize for size)
-ac
redskull,
QuoteREP SCASD, Baby! It's probably not the fastest, but you don't get smaller than a single instruction (some of us still irresponsibly optimize for size)
All depends on whether the array is an array of DWORDS, and if the indexed entry just happens to be an actual value of 0. If the array is an array of QWORDS, a short (<32 bit) QWORD entry would give you a false stop on the second DWORD of the QWORD.
Use a linked list! It only needs a Root DWORD (initially 0). Adding an entry amounts to putting the Root value in the new entry to add, and saving the OFFSET of the new entry in Root. Removing an entry simply gets the Root value and uses it as the entry to use, then moves the first DWORD of that new entry to the Root.
Dave.
Quote from: KeepingRealBusy on April 10, 2009, 03:26:14 AM
a short (<32 bit) QWORD entry would give you a false stop on the second DWORD of the QWORD.
:P you know someone who will read the seconds dwords of a dword*2 list ? scasd should be avoided here, yes, but a dword read is still interesting...
you might keep a long binary value and set a bit whenever a slot is filled
use the low order bit for the first slot
SHR on the binary until the CF is clear
Have fun :bg
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
findslotSSE2 45095 kilocycles, result: slot 10000000
findslotScasd 47280 kilocycles, result: slot 10000000
findslotSSE2 4536 kilocycles, result: slot 1000000
findslotScasd 4712 kilocycles, result: slot 1000000
findslotSSE2 162 kilocycles, result: slot 100000
findslotScasd 395 kilocycles, result: slot 100000
[attachment deleted by admin]
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
findslotSSE2 32452 kilocycles, result: slot 10000000
findslotScasd 40024 kilocycles, result: slot 10000000
findslotSSE2 3155 kilocycles, result: slot 1000000
findslotScasd 3905 kilocycles, result: slot 1000000
findslotSSE2 208 kilocycles, result: slot 100000
findslotScasd 300 kilocycles, result: slot 100000
So far the biggest gain for large slot tables:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
findslotSSE2 22805 kilocycles, result: slot 10000000
findslotScasd 41255 kilocycles, result: slot 10000000
findslotSSE2 2264 kilocycles, result: slot 1000000
findslotScasd 4127 kilocycles, result: slot 1000000
findslotSSE2 88 kilocycles, result: slot 100000
findslotScasd 388 kilocycles, result: slot 100000
This is the best I could do without SSE2 capability on my P3:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
FindSlot1 proc base:DWORD
mov edx, [esp+4]
mov eax, -2
jmp @F
align 16
@@:
add eax, 2
mov ecx, -1
and ecx, [edx+eax*4]
test ecx, [edx+eax*4+4]
jnz @B
test DWORD PTR [edx+eax*4], -1
jz @F
add eax, 1
@@:
ret 4
FindSlot1 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
BTW, this code depends on a zeroed sentinel at the end, and returns a slot index instead of a byte index.
EDIT: To handle even as well as odd length arrays, I think there should be two zeroed sentinels at the end.
Quote from: MichaelW on April 11, 2009, 02:04:08 AM
This is the best I could do without SSE2 capability on my P3:
Pretty fast, a big improvement over rep scasd:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Code sizes:
findslotSSE2 66
findslotMW 48
findslotScasd 26
Timings:
findslotSSE2 22036 kilocycles, result: slot 10000000
findslotMW 25308 kilocycles, result: slot 10000000
findslotScasd 40990 kilocycles, result: slot 10000000
findslotSSE2 2194 kilocycles, result: slot 1000000
findslotMW 2534 kilocycles, result: slot 1000000
findslotScasd 4130 kilocycles, result: slot 1000000
findslotSSE2 88 kilocycles, result: slot 100000
findslotMW 152 kilocycles, result: slot 100000
findslotScasd 388 kilocycles, result: slot 100000
Quote
BTW, this code depends on a zeroed sentinel at the end, and returns a slot index instead of a byte index.
Same for the others in the new version.
[attachment deleted by admin]
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Code sizes:
findslotSSE2 66
findslotMW 48
findslotScasd 26
Timings:
findslotSSE2 33843 kilocycles, result: slot 10000000
findslotMW 42792 kilocycles, result: slot 10000000
findslotScasd 42301 kilocycles, result: slot 10000000
findslotSSE2 3278 kilocycles, result: slot 1000000
findslotMW 4190 kilocycles, result: slot 1000000
findslotScasd 4097 kilocycles, result: slot 1000000
findslotSSE2 138 kilocycles, result: slot 100000
findslotMW 272 kilocycles, result: slot 100000
findslotScasd 258 kilocycles, result: slot 100000
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
Code sizes:
findslotSSE2 66
findslotMW 48
findslotScasd 26
Timings:
findslotSSE2 19536 kilocycles, result: slot 10000000
findslotMW 22199 kilocycles, result: slot 10000000
findslotScasd 44976 kilocycles, result: slot 10000000
findslotSSE2 1968 kilocycles, result: slot 1000000
findslotMW 2220 kilocycles, result: slot 1000000
findslotScasd 4247 kilocycles, result: slot 1000000
findslotSSE2 59 kilocycles, result: slot 100000
findslotMW 116 kilocycles, result: slot 100000
findslotScasd 391 kilocycles, result: slot 100000
I think it is awesome that someone can post a simple question and get such great responses !
Thanks all.
Quote from: redskull on April 09, 2009, 10:53:12 PM
REP SCASD, Baby! It's probably not the fastest, but you don't get smaller than a single instruction (some of us still irresponsibly optimize for size)
Hard to believe, but you can get faster and smaller:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Code sizes:
findslotSSE2 65
findslotMW 48
findslotJJ 18
findslotScasd 26
Timings:
findslotSSE2 86 cycles/100 slots, slot found at byte 500000
findslotMW 157 cycles/100 slots, slot found at byte 500000
findslotJJ 181 cycles/100 slots, slot found at byte 500000
findslotScasd 398 cycles/100 slots, slot found at byte 500000
James, perhaps it's time to reveal how big your array is... the timings show a strong dependency on whether the array fits into the cache ;-)
EDIT: A bugfix in my non-SSE algo, and 3 cycles less for the SSE2 version.
[attachment deleted by admin]
The easiest way to find an empty slot in an array is via a mathematical progression. You continually add the current slot to the next slot and then compare the result to the first slot. If they are equal, then the next slot is empty. If not equal you update the array pointer and add again. Do this until the entire array is tested. You can also build an array of pointers to empty slots and when you use one of them you remove it from the second array. This will work for any type of array, ASCII or otherwise.
Paul