Finding a character in a string - strchr.

Started by KeepingRealBusy, June 24, 2010, 04:25:24 AM

KeepingRealBusy

I want to start another thread with this instead of complicating the "Compare
Two Strings" thread. With my success in speeding up string compare, I wanted to
see if I could speed up a strchr using SSE. I use a character search in the
process of splitting up a huge buffer of variable length text strings so I can
do a string sort. I came up with an SSE method that worked, and it saved me
about a minute total time in the sort.

I thought I would see what the other experts could do to improve on that even further.

Here are the ground rules:

    The test mimics a "C" strchr call thus should have a null check.

    The caller supplies the string pointer as the first argument, and a
    character (as an int) as the second argument.

    The function returns a pointer to the first matching character in the string
    or a null if there was no match.

Pretty simple, really.
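
For reference, the ground rules above amount to the following byte-at-a-time C routine. This is only a sketch of the required behaviour; the name ref_strchr is made up for illustration, and it is not the KRBOld code.

```c
#include <stddef.h>

/* Hypothetical reference routine: returns a pointer to the first byte equal
   to c in the null-terminated string s, or NULL if c does not occur. */
char *ref_strchr(const char *s, int c)
{
    const char ch = (char)c;      /* strchr compares the value as a char */
    for (;; ++s) {
        if (*s == ch)             /* tested first, so c == 0 finds the terminator */
            return (char *)s;
        if (*s == '\0')           /* end of string reached without a match */
            return NULL;
    }
}
```

Searching for the character 0 returns a pointer to the terminator rather than a null, because the match test comes before the end-of-string test.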

I supplied a KRBOld version which I had been using before, and a KRBNew version
that uses SSE. I supplied two test cases for each version, one expecting a match
at the end of a 5000 byte string, and the other expecting no match. I also
supplied the timing for both conditions using crt_strchr as a reference point.

The code is basically the strcomp code with a partial replacement of the data,
new testing calls for the new functions replacing the old testing calls,
modification of the length reporting code to reference the new functions that
are called, and all of the strcomp test code.

Here are my timings:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
7549    cycles for crt_strchr, match long string
10357   cycles for crt_strchr, no match long string
17374   cycles for KRBOld, match long string
3166    cycles for KRBNew, match long string
18711   cycles for KRBOld, no match long string
3169    cycles for KRBNew, no match long string
8079    cycles for crt_strchr, match long string
10829   cycles for crt_strchr, no match long string
20738   cycles for KRBOld, match long string
3211    cycles for KRBNew, match long string
20033   cycles for KRBOld, no match long string
3172    cycles for KRBNew, no match long string
Codesizes:
KRBOld: 33
KRBNew: 97
--- ok ---


Dave.

jj2007

Note that in KRBNew you might get a wrong position if the matching char comes after the null byte. It is a rare case, but the bug potential is high.

KeepingRealBusy

JJ,

Thank you very much for finding this. This would be one of those hard to find bugs. I will rework this and re-post.

My actual code for splitting a huge buffer always worked because I back scanned the buffer looking for the last CRLF, and only included that length in my forward scan (and I set the processed length such that I would seek and re-read the data following that CRLF for the next buffer full). It is only my short test case in these timing tests that will fail.
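
The back scan described above might look like this in C. It is an illustrative sketch only; the function name complete_lines_len and its interface are assumptions, not the actual splitter code.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of buf end with the
   last complete CR LF pair, or 0 if the buffer contains no CR LF at all.
   Only that many bytes would be handed to the forward character scan. */
size_t complete_lines_len(const char *buf, size_t len)
{
    size_t i = len;
    while (i >= 2) {
        if (buf[i - 2] == '\r' && buf[i - 1] == '\n')
            return i;             /* bytes [0, i) end exactly at a CR LF */
        --i;
    }
    return 0;
}
```

The bytes after the returned length belong to a partial line, which is why the next read has to start again from that point.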

The solution is to do both the character test and the null test and save both positions, then return the character position if it is less than the null position, else return a null. If either test finds no match, use -1 as its position (compared as an unsigned value, so it is larger than any real position); then if the other test does find a match, the positions still compare correctly for determining what to return. You must also be careful to check whether the supplied character is itself a null, in which case you want to return the actual position of the null character and not a null return.
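
The position comparison described above can be sketched in C with SSE2 intrinsics for a single 16-byte block. The name scan_block and its done flag are hypothetical, __builtin_ctz is a GCC/Clang builtin, and this is not the KRBNew2 source, just one way to express the same idea.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Examines the 16 bytes at p (all 16 must be readable).
   *done = 1: this block decided the search; the result is the match or NULL.
   *done = 0: neither c nor a null byte is in this block; keep scanning. */
const char *scan_block(const char *p, int c, int *done)
{
    const __m128i block = _mm_loadu_si128((const __m128i *)p);
    const __m128i want  = _mm_set1_epi8((char)c);
    const unsigned cmask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, want));
    const unsigned zmask = (unsigned)_mm_movemask_epi8(
                               _mm_cmpeq_epi8(block, _mm_setzero_si128()));

    /* Position of the first hit, or -1 (as unsigned: the largest value) if none. */
    const unsigned cpos = cmask ? (unsigned)__builtin_ctz(cmask) : (unsigned)-1;
    const unsigned zpos = zmask ? (unsigned)__builtin_ctz(zmask) : (unsigned)-1;

    if (cmask == 0 && zmask == 0) {
        *done = 0;
        return NULL;
    }
    *done = 1;
    /* <= rather than < so that searching for c == 0 returns the terminator. */
    return (cpos <= zpos) ? p + cpos : NULL;
}
```

With the string "abc" followed by a null and then an '@', a search for '@' gives a character position of 5 and a null position of 3, so the sketch returns a null instead of the stray match.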

Dave.

KeepingRealBusy

I fixed my code to correctly detect a found character (match not allowed
following the null). I added test code to verify correct match/nomatch results
under conditional assembly with /DDEBUGHALT. However, the following source code
does not assemble correctly; it drops the invoke:


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
lea esi, Src3
lea edi, Src6
invoke crt_strchr, esi, edi
ifdef DEBUGHALT
cmp eax,offset Match
jnz Bad1
endif
    counter_end


This is the .lst file content for that:


0000009F  C7 05 0000141C R  2         mov   __counter__loop__counter__, LOOP_COUNT
   00002710
000000A9  33 C0      2         xor   eax, eax
000000AB  0F A2      2         cpuid                 
000000B0      2       ??001E:                 
000000B0  8D 35 00000001 R  1 lea esi, Src3
000000B6  8D 3D 000013C0 R  1 lea edi, Src6
000000C7  3D 00001387 R     1 cmp eax,offset Match
000000CC  0F 85 00000F5D    1 jnz Bad1
000000D2  83 2D 0000141C R  2         sub   __counter__loop__counter__, 1
   01
000000D9  75 D5      2         jnz   __counter__loop__label__


I guess I need to inform Microsoft of another bug I have found, for whatever
good that will do. They will just tell me to buy VS15 or whatever when they get
around to fixing it.

To get around this, I will move the invoke of crt_strchr into a proc as was done
for the other tests which correctly assemble the call:


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
push offset Src3
        mov  bl,'!'
push ebx
call KRBOld
ifdef DEBUGHALT
cmp eax,offset Match
jnz Bad3
endif
    counter_end


Giving:


0000027F  C7 05 0000141C R  2         mov   __counter__loop__counter__, LOOP_COUNT
   00002710
00000289  33 C0      2         xor   eax, eax
0000028B  0F A2      2         cpuid                 
00000290      2       ??0026:                 
00000290  68 00000001 R     1 push offset Src3
00000295  B3 21      1         mov  bl,'!'
00000297  53      1 push ebx
00000298  E8 00000E2F      1 call KRBOld
0000029D  3D 00001387 R     1 cmp eax,offset Match
000002A2  0F 85 00000D89    1 jnz Bad3
000002A8  83 2D 0000141C R  2         sub   __counter__loop__counter__, 1
   01
000002AF  75 DF      2         jnz   __counter__loop__label__


I found where the above assembly goes wrong by trying to reproduce the error to
send an error report to Microsoft. It is not the code that is bad, but it is the
.lst file that is bad. The code actually exists in the .obj and .exe, it just
doesn't show up in the .lst. For the test for Microsoft, I left the invoke
crt_strchr in the main code and did not call a proc containing the invoke. I
executed the .exe where the .lst showed no code for the invoke of crt_strchr and
the execution shows that crt_strchr is called and timed.

I also noticed that the original assembly was done with the masm32 version of ml:

    Microsoft (R) Macro Assembler Version 8.00.50727.104       06/24/10 11:25:27

so I deleted the path and assembled with masm from Visual Studio 2008:

    Microsoft (R) Macro Assembler Version 9.00.21022.08       06/24/10 11:23:32

This made no difference in the .lst file, the invoke is still missing.

The following is the timing with these corrections:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
8277    cycles for crt_strchr, match long string
9251    cycles for crt_strchr, no match long string
20395   cycles for KRBOld, match long string
3188    cycles for KRBNew, match long string
2895    cycles for KRBNew2, match long string
20096   cycles for KRBOld, no match long string
3172    cycles for KRBNew, no match long string
2914    cycles for KRBNew2, no match long string
9085    cycles for crt_strchr, match long string
7969    cycles for crt_strchr, no match long string
22138   cycles for KRBOld, match long string
3328    cycles for KRBNew, match long string
3042    cycles for KRBNew2, match long string
20061   cycles for KRBOld, no match long string
3177    cycles for KRBNew, no match long string
2913    cycles for KRBNew2, no match long string
Codesizes:
dostrchr:       12
KRBOld: 27
KRBNew: 97
KRBNew2:        141
--- ok ---


Dave.

frktons

On my Core 2 Duo the timings are:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
Find character in string: long string 5000 bytes.
7126    cycles for crt_strchr, match long string
7149    cycles for crt_strchr, no match long string
10052   cycles for KRBOld, match long string
2817    cycles for KRBNew, match long string
3172    cycles for KRBNew2, match long string
10071   cycles for KRBOld, no match long string
2935    cycles for KRBNew, no match long string
3176    cycles for KRBNew2, no match long string
7131    cycles for crt_strchr, match long string
7104    cycles for crt_strchr, no match long string
10076   cycles for KRBOld, match long string
2819    cycles for KRBNew, match long string
3204    cycles for KRBNew2, match long string
10051   cycles for KRBOld, no match long string
2817    cycles for KRBNew, no match long string
3221    cycles for KRBNew2, no match long string
Codesizes:
dostrchr:       12
KRBOld: 27
KRBNew: 97
KRBNew2:        141
--- ok ---


AMD and Intel probably perform differently because of their internal
architectures, but here are the numbers in case you want to compare.  :P

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9730    cycles for crt_strchr, match long string
9699    cycles for crt_strchr, no match long string
13352   cycles for KRBOld, match long string
3375    cycles for KRBNew, match long string
3459    cycles for KRBNew2, match long string
13084   cycles for KRBOld, no match long string
3378    cycles for KRBNew, no match long string
3508    cycles for KRBNew2, no match long string
9698    cycles for crt_strchr, match long string
9697    cycles for crt_strchr, no match long string
13360   cycles for KRBOld, match long string
3384    cycles for KRBNew, match long string
3473    cycles for KRBNew2, match long string
13082   cycles for KRBOld, no match long string
3374    cycles for KRBNew, no match long string
3507    cycles for KRBNew2, no match long string


Much faster KRBNew2:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9688    cycles for crt_strchr, match long string
9704    cycles for crt_strchr, no match long string
13356   cycles for KRBOld, match long string
2696    cycles for KRBNew, match long string
2496    cycles for KRBNew2, match long string
13107   cycles for KRBOld, no match long string
2706    cycles for KRBNew, no match long string
2494    cycles for KRBNew2, no match long string
9719    cycles for crt_strchr, match long string
9702    cycles for crt_strchr, no match long string
13363   cycles for KRBOld, match long string
2709    cycles for KRBNew, match long string
2507    cycles for KRBNew2, match long string
13126   cycles for KRBOld, no match long string
2705    cycles for KRBNew, no match long string
2490    cycles for KRBNew2, no match long string


I cheated, though: Src3 is 16-byte aligned, and movups became movaps. The proper way to do it is to run the first iteration with movups, then align the source and enter the main loop. Excerpt from MasmBasic's StrLen routine:


movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
mov edx, eax ; save pointer to string
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we are done
and edx, -16 ; align initial pointer to 16-byte boundary
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@: pcmpeqb xmm0, [eax] ; ---- inner loop ----
....
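
The same idea (one unaligned load, then aligned loads starting at the next 16-byte boundary) can be written in C with SSE2 intrinsics. This is a sketch under assumptions, not the MasmBasic routine: sse2_strlen is a made-up name, __builtin_ctz is a GCC/Clang builtin, and like the assembly it may read up to 15 bytes beyond the terminator.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strlen: unaligned first block, aligned blocks afterwards. */
size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* First 16 bytes with an unaligned load (the movups step). */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)s), zero));
    if (mask)
        return (size_t)__builtin_ctz(mask);      /* null within bytes 0..15 */

    /* Round the pointer down to a 16-byte boundary and step to the next
       block; a few bytes of the first block may be scanned a second time. */
    const char *p = (const char *)(((uintptr_t)s & ~(uintptr_t)15) + 16);
    for (;;) {
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

The re-scanned bytes are harmless because the first block was already known to contain no null.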

KeepingRealBusy

JJ,

Aha! From one of the experts. Thank you for the hint. I will add that to my version 3 and see what relative timing differences it makes.

Overall, what do you think of my approach to the problem, both for strcomp and strchar?

Dave.

Queue

On a 1.3 GHz Athlon:
AMD Athlon(tm) 4 Processor (SSE1)
Find character in string: long string 5000 bytes.
9122    cycles for crt_strchr, match long string
9061    cycles for crt_strchr, no match long string
20704   cycles for KRBOld, match long string
3664    cycles for KRBNew, match long string
4330    cycles for KRBNew2, match long string
15481   cycles for KRBOld, no match long string
3800    cycles for KRBNew, no match long string
4328    cycles for KRBNew2, no match long string
9056    cycles for crt_strchr, match long string
9095    cycles for crt_strchr, no match long string
20721   cycles for KRBOld, match long string
3657    cycles for KRBNew, match long string
4346    cycles for KRBNew2, match long string
15528   cycles for KRBOld, no match long string
3819    cycles for KRBNew, no match long string
4323    cycles for KRBNew2, no match long string

Queue

KeepingRealBusy

The call to the first version (KRBNew) was corrected to eliminate the
verification test because it fails - it finds the @ after the null and this is
wrong. The version was retained only for the timing comparison.

Version 1 (KRBNew1) was only used to create the report for the .lst assembly
error for Microsoft.

The second version (KRBNew2) is valid, but uses all unaligned xmm loads for
scanning and thus is slower than it could be.

This third version (KRBNew3) adds a lot of code (almost duplicates the code) but
executes quicker. It forces 16-byte alignment of the scan register after the
first 16 bytes are processed (a few bytes are re-processed because the pointer
is backed up to the alignment boundary the first time, but the aligned loads
that follow are quicker). All of the other
logic remains the same, so the speed improvement is due only to the aligned
loads. Thank you for the tip JJ.

An extra check had been added to KRBNew2 to allow a search for a null at the end
of the string instead of searching for any other character, thus completely
paralleling the crt_strchr functionality. A test and a verification were added to
this version to verify that this functionality actually works correctly.

The following is the timing with this new version:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
7550    cycles for crt_strchr, match long string
7552    cycles for crt_strchr, no match long string
20435   cycles for KRBOld, match long string
3124    cycles for KRBNew, match long string
2895    cycles for KRBNew2, match long string
2389    cycles for KRBNew3, match long string
1291    cycles for KRBNew3, match null in long string
17252   cycles for KRBOld, no match long string
3271    cycles for KRBNew, no match long string
3113    cycles for KRBNew2, no match long string
2560    cycles for KRBNew3, no match long string
7664    cycles for crt_strchr, match long string
7580    cycles for crt_strchr, no match long string
20035   cycles for KRBOld, match long string
3168    cycles for KRBNew, match long string
2759    cycles for KRBNew2, match long string
2379    cycles for KRBNew3, match long string
1287    cycles for KRBNew3, match null in long string
19655   cycles for KRBOld, no match long string
3166    cycles for KRBNew, no match long string
2898    cycles for KRBNew2, no match long string
2255    cycles for KRBNew3, no match long string
Codesizes:
dostrchr:       12
KRBOld: 30
KRBNew: 97
KRBNew2:        141
KRBNew3:        219
--- ok ---


Dave.

Rockoon


AMD Phenom(tm) II X6 1055T Processor (SSE3)
Find character in string: long string 5000 bytes.
7543    cycles for crt_strchr, match long string
7552    cycles for crt_strchr, no match long string
20041   cycles for KRBOld, match long string
2224    cycles for KRBNew, match long string
1939    cycles for KRBNew2, match long string
1751    cycles for KRBNew3, match long string
858     cycles for KRBNew3, match null in long string
10865   cycles for KRBOld, no match long string
2254    cycles for KRBNew, no match long string
1953    cycles for KRBNew2, no match long string
1754    cycles for KRBNew3, no match long string
7548    cycles for crt_strchr, match long string
7544    cycles for crt_strchr, no match long string
20042   cycles for KRBOld, match long string
2257    cycles for KRBNew, match long string
1938    cycles for KRBNew2, match long string
1747    cycles for KRBNew3, match long string
858     cycles for KRBNew3, match null in long string
20049   cycles for KRBOld, no match long string
2266    cycles for KRBNew, no match long string
1964    cycles for KRBNew2, no match long string
1761    cycles for KRBNew3, no match long string
Codesizes:
dostrchr:       12
KRBOld: 30
KRBNew: 97
KRBNew2:        141
KRBNew3:        219
--- ok ---


jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9705    cycles for crt_strchr, match long string
3460    cycles for KRBNew2, match long string
2489    cycles for KRBNew3, match long string
1148    cycles for KRBNew3, match null in long string
3515    cycles for KRBNew2, no match long string
2530    cycles for KRBNew3, no match long string

lingo

"I cheated, though: Src3 is 16-byte aligned, and movups became movaps. The proper way to do it is to run the first iteration with movups, then align the source and enter the main loop. Excerpt from MasmBasic's StrLen routine:"
and
"Thank you for the tip JJ."

Dave,
JJ is just a THIEF of my code here: :lol

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
Find character in string: long string 5000 bytes.
7144    cycles for crt_strchr, match long string
7215    cycles for crt_strchr, no match long string
10213   cycles for KRBOld, match long string
2873    cycles for KRBNew, match long string
3076    cycles for KRBNew2, match long string
1505    cycles for KRBNew3, match long string
729     cycles for KRBNew3, match null in long string
1057    cycles for KRBLingo, match long string
680     cycles for KRBLingo, match null in long string
10225   cycles for KRBOld, no match long string
2858    cycles for KRBNew, no match long string
3077    cycles for KRBNew2, no match long string
1509    cycles for KRBNew3, no match long string
1056    cycles for KRBLingo, no match long string
7183    cycles for crt_strchr, match long string
7256    cycles for crt_strchr, no match long string
10204   cycles for KRBOld, match long string
2959    cycles for KRBNew, match long string
3119    cycles for KRBNew2, match long string
1481    cycles for KRBNew3, match long string
751     cycles for KRBNew3, match null in long string
1054    cycles for KRBLingo, match long string
724     cycles for KRBLingo, match null in long string
10103   cycles for KRBOld, no match long string
2967    cycles for KRBNew, no match long string
3118    cycles for KRBNew2, no match long string
1474    cycles for KRBNew3, no match long string
1074    cycles for KRBLingo, no match long string
Codesizes:
dostrchr:       12
KRBOld: 30
KRBNew: 97
KRBNew2:        141
KRBNew3:        219
KRBLingo:       154
--- ok ---

and my code:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Align 16
KRBNew4     proc
                  pop        ecx
                  pop        edx   
                  pop        eax
                  movdqu     xmm2, [eax]
                  test       edx,  edx
                  je         mystrl     
                  movd       xmm0, edx
                  punpcklbw  xmm0, xmm0 
                  pxor       xmm1, xmm1
                  pshuflw    xmm0, xmm0, 0
                  pcmpeqb    xmm1, xmm2
                  pshufd     xmm0, xmm0, 0
                  pcmpeqb    xmm2, xmm0
                  por        xmm2, xmm1
                  pmovmskb   edx,  xmm2
                  test       edx,  edx
                  jne        @f    +29
                  and        eax,  -16
@@:
                  pcmpeqb  xmm1, [eax+16]
                  movdqa   xmm2, xmm0
                  pcmpeqb  xmm2, [eax+16]
                  por      xmm1, xmm2
                  add      eax,  16   
                  pmovmskb edx,  xmm1
                  test     edx,  edx
                  je       @b
                  bsf      edx, edx
                  add      eax, edx
                  xor      edx, edx   
                  cmp      [eax],dl
                  cmove    eax, edx   
                  jmp      ecx
align 16
mystrl:
                  pxor      xmm0, xmm0
                  pcmpeqb   xmm2, xmm0
                  pmovmskb  edx,  xmm2
                  test      edx,  edx
                  jne       @f    +15
                  and       eax,  -16
@@:
                   pcmpeqb  xmm0, [eax+16]
                   add      eax,  16
                   pmovmskb edx,  xmm0
                   test     edx,  edx
                   je       @b
                   bsf      edx,  edx
                   add      eax,  edx
                   jmp      ecx
KRBNew4     endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
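
For readers who prefer C, the overall structure (combine the character mask and the null mask, loop over aligned blocks, then look at the byte that stopped the loop) can be sketched with SSE2 intrinsics. This is not a translation of the routine above and has not been timed; sse2_strchr is a hypothetical name, __builtin_ctz is a GCC/Clang builtin, and the sketch handles c == 0 in the common path instead of a separate mystrl branch.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Mask with one bit per byte that equals the wanted character or is null. */
static unsigned hit_mask(__m128i block, __m128i want)
{
    const __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(block, want),
                                      _mm_cmpeq_epi8(block, _mm_setzero_si128()));
    return (unsigned)_mm_movemask_epi8(hits);
}

/* Hypothetical strchr; may read up to 15 bytes beyond the terminator. */
char *sse2_strchr(const char *s, int c)
{
    const __m128i want = _mm_set1_epi8((char)c);   /* broadcast the character */
    const char *p = s;

    /* First block unaligned, all later blocks aligned. */
    unsigned mask = hit_mask(_mm_loadu_si128((const __m128i *)p), want);
    if (!mask) {
        p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        do {
            p += 16;
            mask = hit_mask(_mm_load_si128((const __m128i *)p), want);
        } while (!mask);
    }

    /* The first set bit is either the wanted character or the terminator. */
    p += __builtin_ctz(mask);
    return (*p == (char)c) ? (char *)p : NULL;
}
```

Checking the byte that ended the loop replaces the comparison of two separate positions: if that byte is the terminator and c is not 0, any later match lies beyond the end of the string and is ignored.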

KeepingRealBusy

Lingo,

Had I known of your code or had you posted the hint, I would have given you credit.

Remember my comment when I posted my solution to "Compare Two Strings"?


I have not analyzed Lingo's method completely, but what can I say. His code is
good, and fast, and good and fast. He even returns the result of the comparison
(I think it does, I did not spend too much time verifying all of the possible
conditions).


Your times are impressive. I will have to examine your code to learn new ways to beat the clock.

Dave.



jj2007

Quote from: lingo on June 26, 2010, 12:37:05 AM
Dave,
JJ is just a THIEF of my code here: :lol

Lingo, I can only repeat what I wrote in replies #83 and #90. Or try to get a patent for all code that contains pcmpeqb.

KeepingRealBusy

Lingo,

"I'M Shocked! I'M Shocked to know that there's been gambling going on in Rick's
place!" A paraphrased quote from "Casablanca".

Lingo code that does not work? What is happening here?

I modified your code (strchar3a.zip) as follows. I added a short string to the
data because it appeared from the code that you would walk off of the end of a
short string without detecting the null (the appearance was my error, I took the
offset addition as a comment and thought you were jumping to the following @@:
symbol with @F and not (@F + 15)).


ALIGN 16
        db 1
Src8    db "Shortstr"
Src9    db 16 dup(0)


First of all, I just added a comment to use the new short string but kept the
original code and assembled using Masm 9.0 from Visual Studio 2008:


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
push offset Src3
; push offset Src8
        xor  eax, eax
push eax
int 3
call KRBNew4
ifdef DEBUGHALT
cmp eax,offset Src9
jnz Bad7
endif
    counter_end

print str$(eax), 9, "cycles for KRBLingo, match null in long string", 13, 10


I executed the code and stepped into KRBLingo (AKA KRBNew4) and copied the
disassembly code (this is using Visual Studio 2008).


00402EB0  pop         ecx 
00402EB1  pop         edx 
00402EB2  pop         eax 
00402EB3  movdqu      xmm2,xmmword ptr [eax]
00402EB7  test        edx,edx
00402EB9  je          00402F20
00402EBB  movd        xmm0,edx
00402EBF  and         eax,0FFFFFFF0h
00402EC2  punpcklbw   xmm0,xmm0
00402EC6  pxor        xmm1,xmm1
00402ECA  pshuflw     xmm0,xmm0,0
00402ECF  pcmpeqb     xmm1,xmm2
00402ED3  pshufd      xmm0,xmm0,0
00402ED8  pcmpeqb     xmm2,xmm0
00402EDC  por         xmm2,xmm1
00402EE0  pmovmskb    edx,xmm2
00402EE4  test        edx,edx
00402EE6  jne         00402F05
00402EE8  pcmpeqb     xmm1,xmmword ptr [eax+10h]
00402EED  movdqa      xmm2,xmm0
00402EF1  pcmpeqb     xmm2,xmmword ptr [eax+10h]
00402EF6  por         xmm1,xmm2
00402EFA  add         eax,10h
00402EFD  pmovmskb    edx,xmm1
00402F01  test        edx,edx
00402F03  je          00402EE8
00402F05  bsf         edx,edx
00402F08  add         eax,edx
00402F0A  xor         edx,edx
00402F0C  cmp         byte ptr [eax],dl
00402F0E  cmove       eax,edx
00402F11  jmp         ecx 
00402F13  lea         esp,[esp]
00402F1A  lea         ebx,[ebx]
00402F20  pxor        xmm0,xmm0
00402F24  pcmpeqb     xmm2,xmm0
00402F28  pmovmskb    edx,xmm2
00402F2C  test        edx,edx
00402F2E  jne         00402F42
00402F30  and         eax,0FFFFFFF0h
00402F33  pcmpeqb     xmm0,xmmword ptr [eax+10h]
00402F38  add         eax,10h
00402F3B  pmovmskb    edx,xmm0
00402F3F  test        edx,edx
00402F41  je          00402F33
00402F43  bsf         edx,edx
00402F46  add         eax,edx
00402F48  jmp         ecx 


I started to execute the code by stepping and it appeared to be correctly
executing for the long string. Then I stopped execution, and modified the code
to use the short string:


ALIGN 16
        db 1
Src8    db "Shortstr"
Src9    db 16 dup(0)

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
; push offset Src3
push offset Src8
        xor  eax, eax
push eax
int 3
call KRBNew4
ifdef DEBUGHALT
cmp eax,offset Src9
jnz Bad7
endif
    counter_end

print str$(eax), 9, "cycles for KRBLingo, match null in long string", 13, 10


I executed the code and stepped into KRBLingo (AKA KRBNew4) and copied the
disassembly code (this is using Visual Studio 2008). I have put a copy of the
first disassembly data on the same lines as the second disassembly data
(everything matched):


With short string                                  With long string from above

00402EB0  pop         ecx                          00402EB0  pop         ecx
00402EB1  pop         edx                          00402EB1  pop         edx
00402EB2  pop         eax                          00402EB2  pop         eax
00402EB3  movdqu      xmm2,xmmword ptr [eax]       00402EB3  movdqu      xmm2,xmmword ptr [eax]
00402EB7  test        edx,edx                      00402EB7  test        edx,edx
00402EB9  je          00402F20                     00402EB9  je          00402F20
00402EBB  movd        xmm0,edx                     00402EBB  movd        xmm0,edx
00402EBF  and         eax,0FFFFFFF0h               00402EBF  and         eax,0FFFFFFF0h
00402EC2  punpcklbw   xmm0,xmm0                    00402EC2  punpcklbw   xmm0,xmm0
00402EC6  pxor        xmm1,xmm1                    00402EC6  pxor        xmm1,xmm1
00402ECA  pshuflw     xmm0,xmm0,0                  00402ECA  pshuflw     xmm0,xmm0,0
00402ECF  pcmpeqb     xmm1,xmm2                    00402ECF  pcmpeqb     xmm1,xmm2
00402ED3  pshufd      xmm0,xmm0,0                  00402ED3  pshufd      xmm0,xmm0,0
00402ED8  pcmpeqb     xmm2,xmm0                    00402ED8  pcmpeqb     xmm2,xmm0
00402EDC  por         xmm2,xmm1                    00402EDC  por         xmm2,xmm1
00402EE0  pmovmskb    edx,xmm2                     00402EE0  pmovmskb    edx,xmm2
00402EE4  test        edx,edx                      00402EE4  test        edx,edx
00402EE6  jne         00402F05                     00402EE6  jne         00402F05
00402EE8  pcmpeqb     xmm1,xmmword ptr [eax+10h]   00402EE8  pcmpeqb     xmm1,xmmword ptr [eax+10h]
00402EED  movdqa      xmm2,xmm0                    00402EED  movdqa      xmm2,xmm0
00402EF1  pcmpeqb     xmm2,xmmword ptr [eax+10h]   00402EF1  pcmpeqb     xmm2,xmmword ptr [eax+10h]
00402EF6  por         xmm1,xmm2                    00402EF6  por         xmm1,xmm2
00402EFA  add         eax,10h                      00402EFA  add         eax,10h
00402EFD  pmovmskb    edx,xmm1                     00402EFD  pmovmskb    edx,xmm1
00402F01  test        edx,edx                      00402F01  test        edx,edx
00402F03  je          00402EE8                     00402F03  je          00402EE8
00402F05  bsf         edx,edx                      00402F05  bsf         edx,edx
00402F08  add         eax,edx                      00402F08  add         eax,edx
00402F0A  xor         edx,edx                      00402F0A  xor         edx,edx
00402F0C  cmp         byte ptr [eax],dl            00402F0C  cmp         byte ptr [eax],dl
00402F0E  cmove       eax,edx                      00402F0E  cmove       eax,edx
00402F11  jmp         ecx                          00402F11  jmp         ecx
00402F13  lea         esp,[esp]                    00402F13  lea         esp,[esp]
00402F1A  lea         ebx,[ebx]                    00402F1A  lea         ebx,[ebx]


00402F20  pxor        xmm0,xmm0                    00402F20  pxor        xmm0,xmm0
00402F24  pcmpeqb     xmm2,xmm0                    00402F24  pcmpeqb     xmm2,xmm0
00402F28  pmovmskb    edx,xmm2                     00402F28  pmovmskb    edx,xmm2
00402F2C  test        edx,edx                      00402F2C  test        edx,edx
00402F2E  jne         00402F42                     00402F2E  jne         00402F42
00402F30  and         eax,0FFFFFFF0h               00402F30  and         eax,0FFFFFFF0h
00402F33  pcmpeqb     xmm0,xmmword ptr [eax+10h]   00402F33  pcmpeqb     xmm0,xmmword ptr [eax+10h]
00402F38  add         eax,10h                      00402F38  add         eax,10h
00402F3B  pmovmskb    edx,xmm0                     00402F3B  pmovmskb    edx,xmm0
00402F3F  test        edx,edx                      00402F3F  test        edx,edx
00402F41  je          00402F33                     00402F41  je          00402F33
00402F43  bsf         edx,edx                      00402F43  bsf         edx,edx
00402F46  add         eax,edx                      00402F46  add         eax,edx
00402F48  jmp         ecx                          00402F48  jmp         ecx


So far, so good! (That is what was heard on the 83rd floor of the Empire State
Building as the man that just jumped off of the top of the building passed by).
I stepped through the entry code and got to mystrl:. The next 4 instructions
execute as expected, and the null at the end of the short string was detected.

Then the fun begins. I am sitting on the jne and notice that the displayed code
does not look the same as before, instead it looks like this:


What it is now displayed as:                    What it originally was displayed as:

00402F20  pxor        xmm0,xmm0
00402F24  pcmpeqb     xmm2,xmm0
00402F28  pmovmskb    edx,xmm2
00402F2C  test        edx,edx
00402F2E  jne         00402F42
00402F30  and         eax,0FFFFFFF0h
00402F33  pcmpeqb     xmm0,xmmword ptr [eax+10h]
00402F38  add         eax,10h
00402F3B  pmovmskb    edx,xmm0
00402F3F  test        edx,edx
00402F41  db          74h                          00402F41  je          00402F33
00402F42  lock bsf    edx,edx                      00402F43  bsf         edx,edx
00402F46  add         eax,edx                      00402F46  add         eax,edx
00402F48  jmp         ecx                          00402F48  jmp         ecx


As soon as I step the jne I get the following in a message box:


Unhandled exception at 0x00402f42 in strchar4.exe: 0xC000001E:
An attempt was made to execute an invalid lock sequence.


I do not know why Visual Studio displayed that code as it did (well, I think the
initial display is "as coded and never executed", but as soon as you hit the
jne, Visual Studio calculates the target address and reinterprets the data
around that point). However, the following source code also seems wrong (the
reason I was testing with a short string):


mystrl:
                  pxor      xmm0, xmm0
                  pcmpeqb   xmm2, xmm0
                  pmovmskb  edx,  xmm2
                  test      edx,  edx
                  jne       @f    +15    ;   I read this as @f with a comment of +15
                                         ;   or jump to @@:
                                         ;   but it really was @f+15
                  and       eax,  -16
@@:
                  pcmpeqb   xmm0, [eax+16]
                  add       eax,  16
                  pmovmskb  edx,  xmm0
                  test      edx,  edx
                  je        @b
                  bsf       edx,  edx
                  add       eax,  edx
                  jmp       ecx


The problem is the +15. It should really be:


mystrl:
                  pxor      xmm0, xmm0
                  pcmpeqb   xmm2, xmm0
                  pmovmskb  edx,  xmm2
                  test      edx,  edx
                  jne       @f    +16    ; <--------------------
                  and       eax,  -16
@@:
                  pcmpeqb   xmm0, [eax+16]
                  add       eax,  16
                  pmovmskb  edx,  xmm0
                  test      edx,  edx
                  je        @b
                  bsf       edx,  edx
                  add       eax,  edx
                  jmp       ecx


Actually what it should really be is:


mystrl:
                  pxor      xmm0, xmm0
                  pcmpeqb   xmm2, xmm0
                  pmovmskb  edx,  xmm2
                  test      edx,  edx
                  jne       Found        ; <-----------------------
                  and       eax,  -16
@@:
                  pcmpeqb   xmm0, [eax+16]
                  add       eax,  16
                  pmovmskb  edx,  xmm0
                  test      edx,  edx
                  je        @b
Found:                                   ; <-----------------------
                  bsf       edx,  edx
                  add       eax,  edx
                  jmp       ecx


What may have happened is that you used a different assembler, and it encoded
one of the instructions after the @@: label with one byte fewer than my masm 9.0
does (so for my assembly, the offset needs to be +16). That is exactly why I
would never jump to a label +/- an offset; give the target a new label.
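
The +15 versus +16 arithmetic can be checked against the addresses in the disassembly listing above. The small C sketch below only restates that arithmetic; the helper names are hypothetical, and the one fact it adds is the standard encoding of a short jump (opcode byte, then a signed 8-bit displacement counted from the end of the instruction).

```c
#include <stdint.h>

/* Addresses copied from the disassembly listing in this post. */
enum {
    ADDR_LOOP_LABEL = 0x00402F33,  /* the @@: label, first pcmpeqb of the loop  */
    ADDR_JE         = 0x00402F41,  /* je 00402F33, a two-byte short jump        */
    ADDR_BSF        = 0x00402F43   /* bsf edx,edx, the intended jump target     */
};

/* Offset that has to be added to @f to reach a given address. */
int offset_from_label(uint32_t target)
{
    return (int)(target - ADDR_LOOP_LABEL);
}

/* Displacement byte of a two-byte short jump located at `at`. */
uint8_t rel8(uint32_t at, uint32_t target)
{
    return (uint8_t)(target - (at + 2));
}
```

offset_from_label(ADDR_BSF) is 16, so @f+15 lands on 00402F42, the second byte of the je instruction. That byte is rel8(ADDR_JE, ADDR_LOOP_LABEL), which is F0h, and F0h is the LOCK prefix. Decoding from there gives lock bsf edx,edx, which matches both the changed disassembly and the invalid lock sequence exception.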

This new version works for the short string and for the long string.

I changed all the code to be StrChar4 (to avoid conflicts with my StrChar3
version) and this is the timing result:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
12406   cycles for crt_strchr, match long string
2710    cycles for crt_strchr, no match long string
27037   cycles for KRBOld, match long string
3477    cycles for KRBNew, match long string
2934    cycles for KRBNew2, match long string
2260    cycles for KRBNew3, match long string
1286    cycles for KRBNew3, match null in long string
1932    cycles for KRBLingo, match long string
1289    cycles for KRBLingo, match null in long string
27      cycles for KRBLingo, match null in short string
17994   cycles for KRBOld, no match long string
3168    cycles for KRBNew, no match long string
2900    cycles for KRBNew2, no match long string
2258    cycles for KRBNew3, no match long string
1933    cycles for KRBLingo, no match long string
7538    cycles for crt_strchr, match long string
12374   cycles for crt_strchr, no match long string
21052   cycles for KRBOld, match long string
3351    cycles for KRBNew, match long string
3082    cycles for KRBNew2, match long string
2382    cycles for KRBNew3, match long string
1358    cycles for KRBNew3, match null in long string
2041    cycles for KRBLingo, match long string
1389    cycles for KRBLingo, match null in long string
27      cycles for KRBLingo, match null in short string
16540   cycles for KRBOld, no match long string
3374    cycles for KRBNew, no match long string
3097    cycles for KRBNew2, no match long string
2298    cycles for KRBNew3, no match long string
1941    cycles for KRBLingo, no match long string
Codesizes:
dostrchr:       12
KRBOld: 30
KRBNew: 97
KRBNew2:        141
KRBNew3:        219
KRBLingo:       154
--- ok ---


Dave.