News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

repne scasb

Started by BlackVortex, October 15, 2009, 11:31:08 AM

Previous topic - Next topic

BlackVortex

I'm making an algorithm to search for some bytes in an unknown size chunk of memory.  I'm using "repne scasb" to find the first byte and then "repe cmpsb" to see if it's all the bytes I want, or else I continue from the next byte from where "repne scasb" left off

Reading Agner's optimization manuals it seems that these commands are't really fast. Is it worth the trouble to replace them with the usual instructions ? I mean, "repne scasb" does it all, the comparison, decrementing the ecx counter, incrementing the target "pointer" and repeats itself so no need for jumping, will I really save cycles ?

Will there be a noticeable speed improvement, or is the main bottleneck the"scanning" through memory ? On my PC it goes through 870mb memory in 1.3 seconds by the way   :U

P.S.: It's easy to time it and see any speed differences, but isn't there an emulator or something that provides a profiling of a code snippet ?

dedndave

i was surprised to find that scanning with a loop was much faster
the trick is to always access memory as dwords on 4-aligned addresses
after you have found the "key" byte, using repz scasb might be good
(actually - if the string is 5 bytes or longer, a dword CMP or a single CMPSD might be good to eliminate boggies)
so, you want to load aligned dwords with MOV EAX,[EDI], then make the 4 tests for the key byte in register
it will make for bigger code, getting aligned and all that - but should be worth the effort
let us know how it turns out

dedndave

#2
in my application, i wanted to scan an input BigNum integer for unused bytes (from the top down)
the original code used STD | REPZ SCASB | CLD - i scanned for 0 in unsigned or positives or FF in signed negatives
i found STD to be very slow, so i wanted to come up with something different
even without the STD problem, this code is still faster
i checked it over a wide range of lengths and found it to be ~3 times faster on a P4
as you can see, it handles up to 3 single bytes at the beginning and end, but aligned dwords otherwise
i know - lol - it looks clumsy - but it hauls ass
your application is a little different
you will want to get the dword into eax and compare the 4 bytes against the key
if the key is found, you have to adjust the address according to mod 4 position and compare the next 4 bytes
you might be able to improve it by comparing n bytes already in register

;reduce input length to ignore unused high-order bytes

;eax = sign XOR mask (0 or -1)
;ebx = 3
;ecx = input size in bytes
;edi = offset of last input byte + 1

scan00: test    edi,ebx                 ;
        jz      scan01                  ;

        dec     edi                     ;
        cmp     al,[edi]                ;
        jnz     scan06                  ;

        dec     ecx                     ;
        jnz     scan00                  ;

        jmp short scan07                ;

scan01: inc     ebx                     ;

scan02: sub     ecx,ebx                 ;
        jb      scan04                  ;

        sub     edi,ebx                 ;
        cmp     eax,[edi]               ;
        jnz     scan03                  ;

        jecxz   scan07                  ;

        jmp     scan02                  ;

scan03: add     edi,ebx                 ;

scan04: add     ecx,ebx                 ;

scan05: dec     edi                     ;
        cmp     al,[edi]                ;
        jnz     scan06                  ;

        dec     ecx                     ;
        jnz     scan05                  ;

        inc     ecx                     ;

scan06: dec     ecx                     ;

scan07: mov     ebx,eax                 ;

i have since modified my code to remove the count in ecx
Japheth reminded me that wasn't necessary in another topic
this stuff i know - just somewhat forgotten - lol - old age, ya know

hutch--

BlackVortex,

On most hardware, treat REP[E] ANYTHING like the plague, there are a couple of exceptions, REP MOVSD for block memory copy, a straight linear scanner using less registers generally is a lot faster. If you are doing binary search, try a boyer moore algo, if the binary pattern you are searching for is over about 6 bytes in length and you are scanning data over a couple of K, it really is a lot faster.

LATER: I just had a quick play on this quad that runs at about the same speed as your duo and the byte scanner is not that much faster than the time you mentioned, I get about 890ms just on the bare linear scan for 1 character. Unrolled it by 8 and it did not make any difference so that many memory reads seems to take a finite amount of time, no matter what. None the less its about 1.1 gig a second, how much faster do you need it ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

I can lend you the MasmBasic Len() algo. A very fast byte scanner, but you have to replace the pxor xmm0, xmm0 in line 3 with the byte you want to find. Apart from that, there is a thread treating the alternatives here.

MbStrLen proc ; src:DWORD (pseudo arg; uncommenting adds frame!) - with lots of inspiration from Lingo, NightWare and Agner Fog
pop eax ; trash the return address (still on stack, though)
pop eax ; get the src pointer
pxor xmm0, xmm0 ; zero for comparison
mov [esp-12], edx ; edx preserved because Masm32 szLen preserves it
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
mov edx, eax ; save pointer to string
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we are done
and edx, -16 ; align initial pointer to 16-byte boundary
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B

sub eax, [esp-4] ; subtract original src pointer
shl edx, 16 ; create space for the ecx bytes
push ecx ; this push trashes the src pointer (all registers preserved, except eax = return value)
pmovmskb ecx, xmm0 ; set byte mask in ecx (has to be repeated, sorry)
or edx, ecx ; combine xmm0 and xmm1 results
bsf edx, edx ; bit scan for the index
pop ecx
lea eax, [eax+edx-32] ; add scan index
Lt16: mov edx, [esp-12] ; for saving edx, we use a free stack slot; -8 is the ret address
jmp dword ptr [esp-2*4] ; ret address, one arg
MbStrLen endp

BlackVortex

Sorry jj, I can't understand mmx instructions   :(

I made an alternative algo and replaced the rep command but I've screwed up it seems, it's much slower and it doesn't find the pattern. But I don't want to fix it because the "first byte scanning" too slow, so ...

My original (100% working) proc takes 4 parameters :
starting offset
search length
offset to the bytes to search for
number of the bytes to search for (can also be 1, algo still works)

And it returns the offset of the first match or -1 if not found

I've made it into a dll and call it from another language, is there some equivalent library I can test against ? Is there an equivalent procedure I can quickly use in its place and compare ?

EDIT: The Boyer–Moore algorithm seems great, is it implemented somewhere so I can test ?

hutch--

3 versions in the MASM32 library and from memory a member IanB coded and improvement on the fastest of them.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php