News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Find zero/empty slot quickly ...

Started by James Ladd, April 05, 2009, 06:34:51 AM

Previous topic - Next topic

James Ladd

I have the need to find a specific slot in an array quickly.

I could just iterate from one end to the other, but I'm wondering if some of you know a faster way, possibly even using MMX etc.

So given a sequence of 32bit values, how can I find the first zero the fastest ?

Rgs, James.


jj2007

James,
I don't have the time right now, but the answer is in that inner loop coming from the strlen32s proc in the strlen thread. You will need the  pcmpeqd variants.
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B
@@: sub eax, [esp-4] ; subtract original src pointer

Tedd

Or, just keep track of where the next empty slot is.
Initially it will be the first, then second, and so on as the array fills up.
When you remove an item, it then becomes the next empty slot. The next slot after that will require some search, but only from that point onwards.
(Or even make it a linked list of empty slot pointers - each empty slot points to the next empty slot.)
No snowflake in an avalanche feels responsible.

James Ladd

Thanks Ted & JJ,

I should have mentioned that the array of 32 bit values can have random slots made empty, so keeping a track of
next free isnt possible. However, Ill think about the free slot pointing to next free slot approach.

Rgs, James.

Mark Jones

Good to see you're still alive James. :bg

I know it's not what you're looking for, but my vote is for the linked-list. It is powerful and fast. One modification could be made, in that since you'd only be concerned with whether or not a DWORD contains a value or not, then one bit could represent a DWORD in the array. Thus, the entire array "full flags" could fit in 2n bits, where n is the size of the array. (Thus the "list" could be very small in relation to the array.) Of course, this would take a little thinking to implement effectively - just an idea.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

redskull

REP SCASD, Baby!  It's probably not the fastest, but you don't get smaller than a single instruction (some of us still irresponsibly optimize for size)

-ac
Strange women, lying in ponds, distributing swords, is no basis for a system of government

KeepingRealBusy

redskull,

QuoteREP SCASD, Baby!  It's probably not the fastest, but you don't get smaller than a single instruction (some of us still irresponsibly optimize for size)

All depends on whether the array is an array of DWORDS, and if the indexed entry just happens to be an actual value of 0. If the array is an array of QWORDS, a short (<32 bit) QWORD entry would give you a false stop on the second DWORD of the QWORD.

Use a linked list! It only needs a Root DWORD (initially 0). Adding an entry amounts to putting the Root value in the new entry to add, and saving the OFFSET of the new entry in Root. Removing an entry simply gets the Root value and uses it as the entry to use, then moves the first DWORD of that new entry to the Root.

Dave.

NightWare

Quote from: KeepingRealBusy on April 10, 2009, 03:26:14 AM
a short (<32 bit) QWORD entry would give you a false stop on the second DWORD of the QWORD.
:P you know someone who will read the seconds dwords of a dword*2 list ? scasd should be avoided here, yes, but a dword read is still interesting...

dedndave

you might keep a long binary value and set a bit whenever a slot is filled
use the low order bit for the first slot
SHR on the binary until the CF is clear

jj2007

Have fun :bg

              Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)


findslotSSE2   45095 kilocycles, result: slot 10000000
findslotScasd  47280 kilocycles, result: slot 10000000

findslotSSE2   4536 kilocycles, result: slot 1000000
findslotScasd  4712 kilocycles, result: slot 1000000

findslotSSE2   162 kilocycles, result: slot 100000
findslotScasd  395 kilocycles, result: slot 100000



[attachment deleted by admin]

Mark Jones

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
findslotSSE2   32452 kilocycles, result: slot 10000000
findslotScasd  40024 kilocycles, result: slot 10000000

findslotSSE2   3155 kilocycles, result: slot 1000000
findslotScasd  3905 kilocycles, result: slot 1000000

findslotSSE2   208 kilocycles, result: slot 100000
findslotScasd  300 kilocycles, result: slot 100000
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

So far the biggest gain for large slot tables:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

findslotSSE2   22805 kilocycles, result: slot 10000000
findslotScasd  41255 kilocycles, result: slot 10000000

findslotSSE2   2264 kilocycles, result: slot 1000000
findslotScasd  4127 kilocycles, result: slot 1000000

findslotSSE2   88 kilocycles, result: slot 100000
findslotScasd  388 kilocycles, result: slot 100000

MichaelW

#12
This is the best I could do without SSE2 capability on my P3:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16
FindSlot1 proc base:DWORD

    mov edx, [esp+4]
    mov eax, -2
    jmp @F
    align 16
  @@:
    add eax, 2
    mov ecx, -1
    and ecx, [edx+eax*4]
    test ecx, [edx+eax*4+4]
    jnz @B
    test DWORD PTR [edx+eax*4], -1
    jz  @F
    add eax, 1
  @@:
    ret 4

FindSlot1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


BTW, this code depends on a zeroed sentinel at the end, and returns a slot index instead of a byte index.

EDIT: To handle even as well as odd length arrays, I think there should be two zeroed sentinels at the end.
eschew obfuscation

jj2007

Quote from: MichaelW on April 11, 2009, 02:04:08 AM
This is the best I could do without SSE2 capability on my P3:

Pretty fast, a big improvement over rep scasd:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Code sizes:
findslotSSE2   66
findslotMW     48
findslotScasd  26

Timings:
findslotSSE2   22036   kilocycles, result: slot 10000000
findslotMW     25308   kilocycles, result: slot 10000000
findslotScasd  40990   kilocycles, result: slot 10000000

findslotSSE2   2194    kilocycles, result: slot 1000000
findslotMW     2534    kilocycles, result: slot 1000000
findslotScasd  4130    kilocycles, result: slot 1000000

findslotSSE2   88      kilocycles, result: slot 100000
findslotMW     152     kilocycles, result: slot 100000
findslotScasd  388     kilocycles, result: slot 100000

Quote

BTW, this code depends on a zeroed sentinel at the end, and returns a slot index instead of a byte index.


Same for the others in the new version.

[attachment deleted by admin]

Mark Jones

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Code sizes:
findslotSSE2   66
findslotMW     48
findslotScasd  26

Timings:
findslotSSE2   33843   kilocycles, result: slot 10000000
findslotMW     42792   kilocycles, result: slot 10000000
findslotScasd  42301   kilocycles, result: slot 10000000

findslotSSE2   3278    kilocycles, result: slot 1000000
findslotMW     4190    kilocycles, result: slot 1000000
findslotScasd  4097    kilocycles, result: slot 1000000

findslotSSE2   138     kilocycles, result: slot 100000
findslotMW     272     kilocycles, result: slot 100000
findslotScasd  258     kilocycles, result: slot 100000
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08