I want to start another thread with this instead of complicating the "Compare
Two Strings" thread. With my success in speeding up string compare, I wanted to
see if I could speed up a strchr using SSE. I use a character search in the
process of splitting up a huge buffer of variable length text strings so I can
do a string sort. I came up with an SSE method that worked, and it saved me
about a minute total time in the sort.
I thought I would see what the other experts could do to even improve on that.
Here are the ground rules:
The test mimics a "C" strchr call thus should have a null check.
The caller supplies the string pointer as the first argument, and a
character (as an int) as the second argument.
The function returns a pointer to the first matching character in the string
or a null if there was no match.
Pretty simple, really.
I supplied a KRBOld version which I had been using before, and a KRBNew version
that uses SSE. I supplied two test cases for each version, one expecting a match
at the end of a 5000 byte string, and the other expecting a no match. I also
supplied the timing for both conditions using crt_strchr as a reference point.
The code is basically the strcomp code with a partial replacement of the data,
new testing calls for the new functions replacing the old testing calls,
modification of the length reporting code to reference the new functions that
are called, and all of the strcomp test code.
Here are my timings:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
7549 cycles for crt_strchr, match long string
10357 cycles for crt_strchr, no match long string
17374 cycles for KRBOld, match long string
3166 cycles for KRBNew, match long string
18711 cycles for KRBOld, no match long string
3169 cycles for KRBNew, no match long string
8079 cycles for crt_strchr, match long string
10829 cycles for crt_strchr, no match long string
20738 cycles for KRBOld, match long string
3211 cycles for KRBNew, match long string
20033 cycles for KRBOld, no match long string
3172 cycles for KRBNew, no match long string
Codesizes:
KRBOld: 33
KRBNew: 97
--- ok ---
Dave.
Note that in KRBNew you might get a wrong position if the matching char is after the nullbyte. Rare case but the bug potential is high.
JJ,
Thank you very much for finding this. This would be one of those hard to find bugs. I will rework this and re-post.
My actual code for splitting a huge buffer always worked because I back scanned the buffer looking for the last CRLF, and only included that length in my forward scan (and I set the processed length such that I would seek and re-read the data following that CRLF for the next buffer full). It is only my short test case in these timing tests that will fail.
The solution is to do both the char test and also the null test, saving both of the character positions, then return the character position if it is less than the null position, else return a null. If there is no match for either of the tests, use a -1 as irs position, then if the other test does find a match, the character positions will be correct for determining what to return. You also must be careful to check if the supplied character is a null, in which case you would want to return the actual position of the null character and not a null return.
Dave.
I fixed my code to correctly detect a found character (match not allowed
following the null). I added test code to verify correct match/nomatch results
under conditional assembly with /DDEBUGHALT, however, the following source code
does not assemble correctly, it drops the invoke:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
lea esi, Src3
lea edi, Src6
invoke crt_strchr, esi, edi
ifdef DEBUGHALT
cmp eax,offset Match
jnz Bad1
endif
counter_end
This is the .lst file content for that:
0000009F C7 05 0000141C R 2 mov __counter__loop__counter__, LOOP_COUNT
00002710
000000A9 33 C0 2 xor eax, eax
000000AB 0F A2 2 cpuid
000000B0 2 ??001E:
000000B0 8D 35 00000001 R 1 lea esi, Src3
000000B6 8D 3D 000013C0 R 1 lea edi, Src6
000000C7 3D 00001387 R 1 cmp eax,offset Match
000000CC 0F 85 00000F5D 1 jnz Bad1
000000D2 83 2D 0000141C R 2 sub __counter__loop__counter__, 1
01
000000D9 75 D5 2 jnz __counter__loop__label__
I guess I need to inform Microsoft of another bug I have found, for whatever
good that will do. They will just tell me to buy VS15 or whatever when they get
around to fixing it.
To get around this, I will move the invoke of crt_strchr into a proc as was done
for the other tests which correctly assemble the call:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
push offset Src3
mov bl,'!'
push ebx
call KRBOld
ifdef DEBUGHALT
cmp eax,offset Match
jnz Bad3
endif
counter_end
Giving:
0000027F C7 05 0000141C R 2 mov __counter__loop__counter__, LOOP_COUNT
00002710
00000289 33 C0 2 xor eax, eax
0000028B 0F A2 2 cpuid
00000290 2 ??0026:
00000290 68 00000001 R 1 push offset Src3
00000295 B3 21 1 mov bl,'!'
00000297 53 1 push ebx
00000298 E8 00000E2F 1 call KRBOld
0000029D 3D 00001387 R 1 cmp eax,offset Match
000002A2 0F 85 00000D89 1 jnz Bad3
000002A8 83 2D 0000141C R 2 sub __counter__loop__counter__, 1
01
000002AF 75 DF 2 jnz __counter__loop__label__
I found where the above assembly goes wrong by trying to reproduce the error to
send an error report to Microsoft. It is not the code that is bad, but it is the
.lst file that is bad. The code actually exists in the .obj and .exe, it just
doesn't show up in the .lst. For the test for Microsoft, I left the invoke
crt_strchr in the main code and did not call a proc containing the invoke. I
executed the .exe where the .lst showed no code for the invoke of crt_strchr and
the execution shows that crt_strchr is called and timed.
I also noticed that the original assembly was done with the masm32 version of ml:
Microsoft (R) Macro Assembler Version 8.00.50727.104 06/24/10 11:25:27
so I deleted the path and assembled with masm from Visual Studio 2008:
Microsoft (R) Macro Assembler Version 9.00.21022.08 06/24/10 11:23:32
This made no difference in the .lst file, the invoke is still missing.
The following is the timing with these corrections:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
8277 cycles for crt_strchr, match long string
9251 cycles for crt_strchr, no match long string
20395 cycles for KRBOld, match long string
3188 cycles for KRBNew, match long string
2895 cycles for KRBNew2, match long string
20096 cycles for KRBOld, no match long string
3172 cycles for KRBNew, no match long string
2914 cycles for KRBNew2, no match long string
9085 cycles for crt_strchr, match long string
7969 cycles for crt_strchr, no match long string
22138 cycles for KRBOld, match long string
3328 cycles for KRBNew, match long string
3042 cycles for KRBNew2, match long string
20061 cycles for KRBOld, no match long string
3177 cycles for KRBNew, no match long string
2913 cycles for KRBNew2, no match long string
Codesizes:
dostrchr: 12
KRBOld: 27
KRBNew: 97
KRBNew2: 141
--- ok ---
Dave.
On my Core Duo the timings are:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
Find character in string: long string 5000 bytes.
7126 cycles for crt_strchr, match long string
7149 cycles for crt_strchr, no match long string
10052 cycles for KRBOld, match long string
2817 cycles for KRBNew, match long string
3172 cycles for KRBNew2, match long string
10071 cycles for KRBOld, no match long string
2935 cycles for KRBNew, no match long string
3176 cycles for KRBNew2, no match long string
7131 cycles for crt_strchr, match long string
7104 cycles for crt_strchr, no match long string
10076 cycles for KRBOld, match long string
2819 cycles for KRBNew, match long string
3204 cycles for KRBNew2, match long string
10051 cycles for KRBOld, no match long string
2817 cycles for KRBNew, no match long string
3221 cycles for KRBNew2, no match long string
Codesizes:
dostrchr: 12
KRBOld: 27
KRBNew: 97
KRBNew2: 141
--- ok ---
Probably AMD and INTEL have different performances due to their
inner architecture, but in case you want to compare. :P
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9730 cycles for crt_strchr, match long string
9699 cycles for crt_strchr, no match long string
13352 cycles for KRBOld, match long string
3375 cycles for KRBNew, match long string
3459 cycles for KRBNew2, match long string
13084 cycles for KRBOld, no match long string
3378 cycles for KRBNew, no match long string
3508 cycles for KRBNew2, no match long string
9698 cycles for crt_strchr, match long string
9697 cycles for crt_strchr, no match long string
13360 cycles for KRBOld, match long string
3384 cycles for KRBNew, match long string
3473 cycles for KRBNew2, match long string
13082 cycles for KRBOld, no match long string
3374 cycles for KRBNew, no match long string
3507 cycles for KRBNew2, no match long string
Much faster KRBNew2:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9688 cycles for crt_strchr, match long string
9704 cycles for crt_strchr, no match long string
13356 cycles for KRBOld, match long string
2696 cycles for KRBNew, match long string
2496 cycles for KRBNew2, match long string
13107 cycles for KRBOld, no match long string
2706 cycles for KRBNew, no match long string
2494 cycles for KRBNew2, no match long string
9719 cycles for crt_strchr, match long string
9702 cycles for crt_strchr, no match long string
13363 cycles for KRBOld, match long string
2709 cycles for KRBNew, match long string
2507 cycles for KRBNew2, match long string
13126 cycles for KRBOld, no match long string
2705 cycles for KRBNew, no match long string
2490 cycles for KRBNew2, no match long string
I cheated, though: Src3 is 16-byte aligned, and movups became movaps. The proper way to do it is to run the first iteration with movups, then align the source and enter the main loop. Excerpt from MasmBasic's StrLen routine:
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
mov edx, eax ; save pointer to string
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we are done
and edx, -16 ; align initial pointer to 16-byte boundary
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@: pcmpeqb xmm0, [eax] ; ---- inner loop ----
....
JJ,
Aha! From one of the experts. Thank you for the hint. I will add that to my version 3 and see what relative timing differences it makes.
Overall, what do you think of my approach to the problem, both for strcomp and strchar?
Dave.
On a 1.3 GHz Athlon:
AMD Athlon(tm) 4 Processor (SSE1)
Find character in string: long string 5000 bytes.
9122 cycles for crt_strchr, match long string
9061 cycles for crt_strchr, no match long string
20704 cycles for KRBOld, match long string
3664 cycles for KRBNew, match long string
4330 cycles for KRBNew2, match long string
15481 cycles for KRBOld, no match long string
3800 cycles for KRBNew, no match long string
4328 cycles for KRBNew2, no match long string
9056 cycles for crt_strchr, match long string
9095 cycles for crt_strchr, no match long string
20721 cycles for KRBOld, match long string
3657 cycles for KRBNew, match long string
4346 cycles for KRBNew2, match long string
15528 cycles for KRBOld, no match long string
3819 cycles for KRBNew, no match long string
4323 cycles for KRBNew2, no match long string
Queue
The call to the first version (KRBNew) was corrected to eliminate the
verification test because it fails - it finds the @ after the null and this is
wrong. The version was retained only for the timing comparison.
Version 1 (KRBNew1) was only used to create the report for the .lst assembly
error for Microsoft.
The second version (KRBNew2) is valid, but uses all unaligned xmm loads for
scanning and thus is slower than it could be.
This third version (KRBNew3) adds a lot of code (almost duplicates the code) but
executes quicker. It forces 16 byte alignment of the scan register after the
first 16 bytes are processed (some slight re-process because of the backup the
first time but then the following aligned loads are quicker). All of the other
logic remains the same, so the speed improvement is due only to the aligned
loads. Thank you for the tip JJ.
An extra check had been added to KRBNew2 to allow a search for a null at the end
of the string instead of searching for any other character, thus completely
paralleling the crt_strchr functionality. A test and a verification was added to
this version to verify that this functionality actually works correctly.
The following is the timing with this new version:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
7550 cycles for crt_strchr, match long string
7552 cycles for crt_strchr, no match long string
20435 cycles for KRBOld, match long string
3124 cycles for KRBNew, match long string
2895 cycles for KRBNew2, match long string
2389 cycles for KRBNew3, match long string
1291 cycles for KRBNew3, match null in long string
17252 cycles for KRBOld, no match long string
3271 cycles for KRBNew, no match long string
3113 cycles for KRBNew2, no match long string
2560 cycles for KRBNew3, no match long string
7664 cycles for crt_strchr, match long string
7580 cycles for crt_strchr, no match long string
20035 cycles for KRBOld, match long string
3168 cycles for KRBNew, match long string
2759 cycles for KRBNew2, match long string
2379 cycles for KRBNew3, match long string
1287 cycles for KRBNew3, match null in long string
19655 cycles for KRBOld, no match long string
3166 cycles for KRBNew, no match long string
2898 cycles for KRBNew2, no match long string
2255 cycles for KRBNew3, no match long string
Codesizes:
dostrchr: 12
KRBOld: 30
KRBNew: 97
KRBNew2: 141
KRBNew3: 219
--- ok ---
Dave.
AMD Phenom(tm) II X6 1055T Processor (SSE3)
Find character in string: long string 5000 bytes.
7543 cycles for crt_strchr, match long string
7552 cycles for crt_strchr, no match long string
20041 cycles for KRBOld, match long string
2224 cycles for KRBNew, match long string
1939 cycles for KRBNew2, match long string
1751 cycles for KRBNew3, match long string
858 cycles for KRBNew3, match null in long string
10865 cycles for KRBOld, no match long string
2254 cycles for KRBNew, no match long string
1953 cycles for KRBNew2, no match long string
1754 cycles for KRBNew3, no match long string
7548 cycles for crt_strchr, match long string
7544 cycles for crt_strchr, no match long string
20042 cycles for KRBOld, match long string
2257 cycles for KRBNew, match long string
1938 cycles for KRBNew2, match long string
1747 cycles for KRBNew3, match long string
858 cycles for KRBNew3, match null in long string
20049 cycles for KRBOld, no match long string
2266 cycles for KRBNew, no match long string
1964 cycles for KRBNew2, no match long string
1761 cycles for KRBNew3, no match long string
Codesizes:
dostrchr: 12
KRBOld: 30
KRBNew: 97
KRBNew2: 141
KRBNew3: 219
--- ok ---
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Find character in string: long string 5000 bytes.
9705 cycles for crt_strchr, match long string
3460 cycles for KRBNew2, match long string
2489 cycles for KRBNew3, match long string
1148 cycles for KRBNew3, match null in long string
3515 cycles for KRBNew2, no match long string
2530 cycles for KRBNew3, no match long string
"I cheated, though: Src3 is 16-byte aligned, and movups became movaps. The proper way to do it is to run the first iteration with movups, then align the source and enter the main loop. Excerpt from MasmBasic's StrLen routine:"
and
"Thank you for the tip JJ."
Dave,
JJ is just a THIEF of my code here: (http://www.masm32.com/board/index.php?topic=11353.75) :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Find character in string: long string 5000 bytes.
7144 cycles for crt_strchr, match long string
7215 cycles for crt_strchr, no match long string
10213 cycles for KRBOld, match long string
2873 cycles for KRBNew, match long string
3076 cycles for KRBNew2, match long string
1505 cycles for KRBNew3, match long string
729 cycles for KRBNew3, match null in long string
1057 cycles for KRBLingo, match long string
680 cycles for KRBLingo, match null in long string
10225 cycles for KRBOld, no match long string
2858 cycles for KRBNew, no match long string
3077 cycles for KRBNew2, no match long string
1509 cycles for KRBNew3, no match long string
1056 cycles for KRBLingo, no match long string
7183 cycles for crt_strchr, match long string
7256 cycles for crt_strchr, no match long string
10204 cycles for KRBOld, match long string
2959 cycles for KRBNew, match long string
3119 cycles for KRBNew2, match long string
1481 cycles for KRBNew3, match long string
751 cycles for KRBNew3, match null in long string
1054 cycles for KRBLingo, match long string
724 cycles for KRBLingo, match null in long string
10103 cycles for KRBOld, no match long string
2967 cycles for KRBNew, no match long string
3118 cycles for KRBNew2, no match long string
1474 cycles for KRBNew3, no match long string
1074 cycles for KRBLingo, no match long string
Codesizes:
dostrchr: 12
KRBOld: 30
KRBNew: 97
KRBNew2: 141
KRBNew3: 219
KRBLingo: 154
--- ok ---
and my code:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Align 16
KRBNew4 proc
pop ecx
pop edx
pop eax
movdqu xmm2, [eax]
test edx, edx
je mystrl
movd xmm0, edx
punpcklbw xmm0, xmm0
pxor xmm1, xmm1
pshuflw xmm0, xmm0, 0
pcmpeqb xmm1, xmm2
pshufd xmm0, xmm0, 0
pcmpeqb xmm2, xmm0
por xmm2, xmm1
pmovmskb edx, xmm2
test edx, edx
jne @f +29
and eax, -16
@@:
pcmpeqb xmm1, [eax+16]
movdqa xmm2, xmm0
pcmpeqb xmm2, [eax+16]
por xmm1, xmm2
add eax, 16
pmovmskb edx, xmm1
test edx, edx
je @b
bsf edx, edx
add eax, edx
xor edx, edx
cmp [eax],dl
cmove eax, edx
jmp ecx
align 16
mystrl:
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pmovmskb edx, xmm2
test edx, edx
jne @f +15
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
add eax, 16
pmovmskb edx, xmm0
test edx, edx
je @b
bsf edx, edx
add eax, edx
jmp ecx
KRBNew4 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Lingo,
Had I known of your code or had you posted the hint, I would have given you credit.
Remember my comment when I posted my solution to "Compare Two Strings"?
I have not analyzed Lingo's method completely, but what can I say. His code is
good, and fast, and good and fast. He even returns the result of the comparison
(I think it does, I did not spend too much time verifying all of the possible
conditions).
Your times are impressive. I will have to examine your code to learn new ways to beat the clock.
Dave.
Quote from: lingo on June 26, 2010, 12:37:05 AM
Dave,
JJ is just a THIEF of my code here: (http://www.masm32.com/board/index.php?topic=11353.75) :lol
Lingo, I can only repeat what I wrote in replies #83 and #90. Or try to get a patent for all code that contains pcmpeqb.
Lingo,
"I'M Shocked! I'M Shocked to know that there's been gambling going on in Rick's
place!" A paraphrased quote from "Casablanca".
Lingo code that does not work? What is happening here?
I modified your code (strchar3a.zip) as follows. I added a short string to the
data because it appeared from the code that you would walk off of the end of a
short string without detecting the null (the appearance was my error, I took the
offset addition as a comment and thought you were jumping to the following @@:
symbol with @F and not (@F + 15)).
ALIGN 16
db 1
Src8 db "Shortstr"
Src9 db 16 dup(0)
First of all, I just added a comment to use the new short string but kept the
original code and assembled using Masm 9.0 from Visual Studio 2008:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
push offset Src3
; push offset Src8
xor eax, eax
push eax
int 3
call KRBNew4
ifdef DEBUGHALT
cmp eax,offset Src9
jnz Bad7
endif
counter_end
print str$(eax), 9, "cycles for KRBLingo, match null in long string", 13, 10
I executed the code and stepped into KRBLingo (AKA KRBNew4) and copied the
dissasembly code (this is using Visual Studio 2008).
00402EB0 pop ecx
00402EB1 pop edx
00402EB2 pop eax
00402EB3 movdqu xmm2,xmmword ptr [eax]
00402EB7 test edx,edx
00402EB9 je 00402F20
00402EBB movd xmm0,edx
00402EBF and eax,0FFFFFFF0h
00402EC2 punpcklbw xmm0,xmm0
00402EC6 pxor xmm1,xmm1
00402ECA pshuflw xmm0,xmm0,0
00402ECF pcmpeqb xmm1,xmm2
00402ED3 pshufd xmm0,xmm0,0
00402ED8 pcmpeqb xmm2,xmm0
00402EDC por xmm2,xmm1
00402EE0 pmovmskb edx,xmm2
00402EE4 test edx,edx
00402EE6 jne 00402F05
00402EE8 pcmpeqb xmm1,xmmword ptr [eax+10h]
00402EED movdqa xmm2,xmm0
00402EF1 pcmpeqb xmm2,xmmword ptr [eax+10h]
00402EF6 por xmm1,xmm2
00402EFA add eax,10h
00402EFD pmovmskb edx,xmm1
00402F01 test edx,edx
00402F03 je 00402EE8
00402F05 bsf edx,edx
00402F08 add eax,edx
00402F0A xor edx,edx
00402F0C cmp byte ptr [eax],dl
00402F0E cmove eax,edx
00402F11 jmp ecx
00402F13 lea esp,[esp]
00402F1A lea ebx,[ebx]
00402F20 pxor xmm0,xmm0
00402F24 pcmpeqb xmm2,xmm0
00402F28 pmovmskb edx,xmm2
00402F2C test edx,edx
00402F2E jne 00402F42
00402F30 and eax,0FFFFFFF0h
00402F33 pcmpeqb xmm0,xmmword ptr [eax+10h]
00402F38 add eax,10h
00402F3B pmovmskb edx,xmm0
00402F3F test edx,edx
00402F41 je 00402F33
00402F43 bsf edx,edx
00402F46 add eax,edx
00402F48 jmp ecx
I started to execute the code by stepping and it appeared to be correctly
executing for the long string. Then I stopped execution, and modified the code
to use the short string:
ALIGN 16
db 1
Src8 db "Shortstr"
Src9 db 16 dup(0)
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
; push offset Src3
push offset Src8
xor eax, eax
push eax
int 3
call KRBNew4
ifdef DEBUGHALT
cmp eax,offset Src9
jnz Bad7
endif
counter_end
print str$(eax), 9, "cycles for KRBLingo, match null in long string", 13, 10
I executed the code and stepped into KRBLingo (AKA KRBNew4) and copied the
dissasembly code (this is using Visual Studio 2008). I have put a copy of the
first dissasembly data on the same lines as the second dissassembly data
(everything matched):
With short string With long string from above
00402EB0 pop ecx 00402EB0 pop ecx
00402EB1 pop edx 00402EB1 pop edx
00402EB2 pop eax 00402EB2 pop eax
00402EB3 movdqu xmm2,xmmword ptr [eax] 00402EB3 movdqu xmm2,xmmword ptr [eax]
00402EB7 test edx,edx 00402EB7 test edx,edx
00402EB9 je 00402F20 00402EB9 je 00402F20
00402EBB movd xmm0,edx 00402EBB movd xmm0,edx
00402EBF and eax,0FFFFFFF0h 00402EBF and eax,0FFFFFFF0h
00402EC2 punpcklbw xmm0,xmm0 00402EC2 punpcklbw xmm0,xmm0
00402EC6 pxor xmm1,xmm1 00402EC6 pxor xmm1,xmm1
00402ECA pshuflw xmm0,xmm0,0 00402ECA pshuflw xmm0,xmm0,0
00402ECF pcmpeqb xmm1,xmm2 00402ECF pcmpeqb xmm1,xmm2
00402ED3 pshufd xmm0,xmm0,0 00402ED3 pshufd xmm0,xmm0,0
00402ED8 pcmpeqb xmm2,xmm0 00402ED8 pcmpeqb xmm2,xmm0
00402EDC por xmm2,xmm1 00402EDC por xmm2,xmm1
00402EE0 pmovmskb edx,xmm2 00402EE0 pmovmskb edx,xmm2
00402EE4 test edx,edx 00402EE4 test edx,edx
00402EE6 jne 00402F05 00402EE6 jne 00402F05
00402EE8 pcmpeqb xmm1,xmmword ptr [eax+10h] 00402EE8 pcmpeqb xmm1,xmmword ptr [eax+10h]
00402EED movdqa xmm2,xmm0 00402EED movdqa xmm2,xmm0
00402EF1 pcmpeqb xmm2,xmmword ptr [eax+10h] 00402EF1 pcmpeqb xmm2,xmmword ptr [eax+10h]
00402EF6 por xmm1,xmm2 00402EF6 por xmm1,xmm2
00402EFA add eax,10h 00402EFA add eax,10h
00402EFD pmovmskb edx,xmm1 00402EFD pmovmskb edx,xmm1
00402F01 test edx,edx 00402F01 test edx,edx
00402F03 je 00402EE8 00402F03 je 00402EE8
00402F05 bsf edx,edx 00402F05 bsf edx,edx
00402F08 add eax,edx 00402F08 add eax,edx
00402F0A xor edx,edx 00402F0A xor edx,edx
00402F0C cmp byte ptr [eax],dl 00402F0C cmp byte ptr [eax],dl
00402F0E cmove eax,edx 00402F0E cmove eax,edx
00402F11 jmp ecx 00402F11 jmp ecx
00402F13 lea esp,[esp] 00402F13 lea esp,[esp]
00402F1A lea ebx,[ebx] 00402F1A lea ebx,[ebx]
00402F20 pxor xmm0,xmm0 00402F20 pxor xmm0,xmm0
00402F24 pcmpeqb xmm2,xmm0 00402F24 pcmpeqb xmm2,xmm0
00402F28 pmovmskb edx,xmm2 00402F28 pmovmskb edx,xmm2
00402F2C test edx,edx 00402F2C test edx,edx
00402F2E jne 00402F42 00402F2E jne 00402F42
00402F30 and eax,0FFFFFFF0h 00402F30 and eax,0FFFFFFF0h
00402F33 pcmpeqb xmm0,xmmword ptr [eax+10h] 00402F33 pcmpeqb xmm0,xmmword ptr [eax+10h]
00402F38 add eax,10h 00402F38 add eax,10h
00402F3B pmovmskb edx,xmm0 00402F3B pmovmskb edx,xmm0
00402F3F test edx,edx 00402F3F test edx,edx
00402F41 je 00402F33 00402F41 je 00402F33
00402F43 bsf edx,edx 00402F43 bsf edx,edx
00402F46 add eax,edx 00402F46 add eax,edx
00402F48 jmp ecx 00402F48 jmp ecx
So far, so good! (That is what was heard on the 83rd floor of the Empire State
Building as the man that just jumped off of the top of the building passed by).
I stepped through the entry code and got to mystrl:. The next 4 instructions
execute as expected, and the null at the end of the short string was detected.
Then the fun begins. I am sitting on the jne and notice that the displayed code
does not look the same as before, instead it looks like this:
What it is now displayed as: What it originally was displayed as:
00402F20 pxor xmm0,xmm0
00402F24 pcmpeqb xmm2,xmm0
00402F28 pmovmskb edx,xmm2
00402F2C test edx,edx
00402F2E jne 00402F42
00402F30 and eax,0FFFFFFF0h
00402F33 pcmpeqb xmm0,xmmword ptr [eax+10h]
00402F38 add eax,10h
00402F3B pmovmskb edx,xmm0
00402F3F test edx,edx
00402F41 db 74h 00402F41 je 00402F33
00402F42 lock bsf edx,edx 00402F43 bsf edx,edx
00402F46 add eax,edx 00402F46 add eax,edx
00402F48 jmp ecx 00402F48 jmp ecx
As soon as I step the jne I get the following in a message box:
Unhandled exception at 0x00402f42 in strchar4.exe: 0xC000001E:
An attempt was made to execute an invalid lock sequence.
I do not know why Visual Studio displayed that code as it did (well, I think the
initial display is "as coded and never executed", but as soon as you hit the
jne, Visual Studio calculates the target address and re-interpretes the data
around that point). However, the following source code also seems wrong (the
reason I was testing with a short string):
mystrl:
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pmovmskb edx, xmm2
test edx, edx
jne @f +15 ; I read this as @f with a comment of +15
; or jump to @@:
; but it really was @f+15
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
add eax, 16
pmovmskb edx, xmm0
test edx, edx
je @b
bsf edx, edx
add eax, edx
jmp ecx
The problem is the +15. It should really be:
mystrl:
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pmovmskb edx, xmm2
test edx, edx
jne @f +16 <--------------------
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
add eax, 16
pmovmskb edx, xmm0
test edx, edx
je @b
bsf edx, edx
add eax, edx
jmp ecx
Actually what it should really be is:
mystrl:
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pmovmskb edx, xmm2
test edx, edx
jne Found <-----------------------
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
add eax, 16
pmovmskb edx, xmm0
test edx, edx
je @b
Found: <-----------------------
bsf edx, edx
add eax, edx
jmp ecx
What may have happened is that you used a different assembler, and it created
one of the following instructions with one less byte than my masm 9.0 (so for my
assembly, the offset needs to be +16), but that is exactly why I would never try
to jump to a label +/- an offset - give it a new label.
This new version works for the short string and for the long string.
I changed all the code to be StrChar4 (to avoid conflicts with my StrChar3
version) and this is the timing result:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
Find character in string: long string 5000 bytes.
12406 cycles for crt_strchr, match long string
2710 cycles for crt_strchr, no match long string
27037 cycles for KRBOld, match long string
3477 cycles for KRBNew, match long string
2934 cycles for KRBNew2, match long string
2260 cycles for KRBNew3, match long string
1286 cycles for KRBNew3, match null in long string
1932 cycles for KRBLingo, match long string
1289 cycles for KRBLingo, match null in long string
27 cycles for KRBLingo, match null in short string
17994 cycles for KRBOld, no match long string
3168 cycles for KRBNew, no match long string
2900 cycles for KRBNew2, no match long string
2258 cycles for KRBNew3, no match long string
1933 cycles for KRBLingo, no match long string
7538 cycles for crt_strchr, match long string
12374 cycles for crt_strchr, no match long string
21052 cycles for KRBOld, match long string
3351 cycles for KRBNew, match long string
3082 cycles for KRBNew2, match long string
2382 cycles for KRBNew3, match long string
1358 cycles for KRBNew3, match null in long string
2041 cycles for KRBLingo, match long string
1389 cycles for KRBLingo, match null in long string
27 cycles for KRBLingo, match null in short string
16540 cycles for KRBOld, no match long string
3374 cycles for KRBNew, no match long string
3097 cycles for KRBNew2, no match long string
2298 cycles for KRBNew3, no match long string
1941 cycles for KRBLingo, no match long string
Codesizes:
dostrchr: 12
KRBOld: 30
KRBNew: 97
KRBNew2: 141
KRBNew3: 219
KRBLingo: 154
--- ok ---
Dave.
Dave, please just redownload my code and test it again... :wink
Lingo,
Please explain in detail what you want me to do to "test it again".
Do you want me to test your .exe? If I do this, how can I test a short string? Your function does not fail with a long string and those were the only tests included with the timing driver calls. To test a short string requires that I re-assemble the source file and then I get into the same problem.
Have you updated your source and produced a new .zip? If so where is it? The path I see in the post for StrChar3a.zip is somewhere in the MASM32 site. Had the .zip been a link to your site, then downloading an updated copy of the .zip would potentially get an updated version that would work.
What assembler did you use to assemble your version? You will note that my makeit.bat had removed the \masm32\bin\ prefix for the ml command and I just call ml and get the Visual Studio masm (9.0). I assembled with masm (8.0) by adding back the \masm32\bin\ and got the same result, the same bad jne @f+15 that goes to the middle of the two byte instruction je @b instead of the start of the next instruction bsf edx,edx.
Please give me some more detail and I will be glad to test it again.
Dave.
I changed it a bit to become more clear... :wink
Lingo,
I downloaded the zip and looked at the code. Ugh! I then executed the .exe without re-assembling (this is on my laptop without any development tools). Since it was based on my StrChar4.zip, it contained the short string test, and this test (all of them) all executed. The times were impressive. The code was short, but ugly! One symbol in the middle of 141 bytes, and 2 forward jumps to that symbol with declared offsets of +43, +29, and 3 back jumps to that symbol with offsets of 0, +78, and +62. If you are trying to impress, you didn't make it this time.
Dave.
Quote from: KeepingRealBusy on June 26, 2010, 08:18:35 PM
Lingo,
... If you are trying to impress, you didn't make it this time.
He is not trying to impress anybody. He obfuscates his code by not commenting it and with these "odd" jumps because he is scared shitless that I might "steal" his miraculous code. To be honest, usually his code is faster than mine; but then, it often takes a while, with a little help from some friends here on the forum, until it's bug-free :green
JJ,
The odd jumps are ok, if and only if one knows what assembler is used to assemble the code. As I found out, his jne @f +15 jumped right into the middle of the other je @b and crashed. Code generation is an art and not a science, there are several ways to code some instructions, different lengths to be sure, and these can cause uncountable debug misery.
Dave.
Quote from: KeepingRealBusy on June 26, 2010, 09:52:59 PM
Code generation is an art and not a science
You what? :lol Art has fluffy edges and completely unnecessary features, code generation is logic it contains only what it must to logically complete the task
Quote from: oex on June 26, 2010, 10:00:16 PM
Quote from: KeepingRealBusy on June 26, 2010, 09:52:59 PM
Code generation is an art and not a science
You what? :lol Art has fluffy edges and completely unnecessary features, code generation is logic it contains only what it must to logically complete the task
Well the issue is going to be that only AMD and Intel have really accurate models of their cores, and quite a lot of the fine detail change between CPUs. Some of the differences can profoundly effect the performance/choice of instructions which perform the same task.
I would say there is an art to generating effective code, and that Lingo's code is pretty artful. A compiler can apply all the logic it wants, it's not going to "understand" your algorithm to any real degree.
FCS - Fancy Code Syndrome
Quote from: clive on June 26, 2010, 10:43:04 PM
I would say there is an art to generating effective code, and that Lingo's code is pretty artful.
I especially liked his way of interleaving two different threads at the beginning (threads - maybe that is a bad term to use in a programming example because this is not a multi-core threading example), both building the argument for the character test while at the same time testing the first input for nulls, doing this by different indentations. I liked that. I also had not thought of using those sse instructions that way.
Dave.
If I turn on DEBUGHALT, it seems like every algo is failing on match null. Is anyone else having that problem?
Queue
Queue,
I did not try to re-assemble, just executed the .exe, and all tests ran. Without re-assembly, I cannot turn on DEBUGHALT. Later I will move this new .zip to my development system and try.
I just looked at the .asm file. It looks OK and when I tested my version (StrChar4.zip) I did test both ways, with DEBUGHALT, and then without DEBUGHALT to get accurate timing, and both versions ran all tests.
Have you tried to make the same test with my version?
Dave.
Yeah, I've tried with StrChar4, and Lingo's later 4a, and with either, after the cycle count for ''KRBNew3 match long string'' prints, it crashes (INT 3).
Edit - Ok, must be my processor (or OS)? I WAS trying it on a 1.3 GHz Athlon on Win98SE and it was failing on the match nulls, but on a 2.8 GHz Core2Duo on WinXP, it works fine.
Edit2 - If I comment out Dummy3 so that the data is aligned, my Athlon only fails on the match null in short string check for KRBLingo.
Queue
Queue,
I just move this to my development system, modified the makeit.bat file to define DEBUGHALT and emit a .lst file, modified the .asm to add a .list directive following the .data directive, and assembled. I checked the .list file and the check code was assembled. I executed the new .exe and all tests ran and verified, no halts.
Note: Lingo did not include a makeit.bat file, not has he replied to my question about which assembler he is using. I had to copy my makeit.bat file from my StrChar4 directory (which I included in my StrChar4.zip file).
Dave.
It looks like it's pure luck that it's succeeding on my Athlon when data's aligned. Adding arbitrary lengths of misalignment to the data makes it succeed or fail. For example, if I add between 10 and 16 bytes of padding before Src3, it succeeds on all algos, but between 1 and 9 makes it fail. For the short string null check, I need between 8 and 15 bytes of padding (I assume this is making it long enough to no longer be a short string?).
While I really have no idea what I'm doing, I can assemble and link this code, and it's interesting to me that it fails so arbitrarily on my old computer.
Queue
Quote from: Queue on June 27, 2010, 01:28:58 AMWhile I really have no idea what I'm doing, I can assemble and link this code, and it's interesting to me that it fails so arbitrarily on my old computer.
To illustrate what these "unorthodox" jumps do, here a snippet:
include \masm32\include\masm32rt.inc
.code
start: mov ebx, -1
jmp @F+3 ; attention "artwork"
@@: nop
L0: inc ebx
L1: inc ebx
L2: inc ebx
L3: inc ebx
L4: inc ebx
L5: inc ebx
MsgBox 0, str$(ebx), "I hit label No. :", MB_OK
exit
end start
That code works also without the L1, L2 etc labels. Now what happens in real life is that
- the clever author of the code simply writes his code with "normal" labels, e.g. jxx MyGoodLabel
- then launches Olly or whatever to see what is the offset against a reference label (@@, i.e. the nop in the example above)
- replaces jmp MyGoodLabel with jmp @F+123
- deletes MyGoodLabel in the hope that who reads his code is too stupid to get the trick.
The real fun comes up when people use different assemblers, because in complex code the offsets *may* differ on rare occasions. If that happens, the jmp @F+123 ends in no man's land, and you have "arbitrarily" behaving source code that is meant to drive you insane.
No, the jumps don't fail for me. That is something I actually do understand about assembler and jumps in general. Actually, as an exercise, I put in the labels Lingo had chopped out so I could try and follow what was going on. Also took a stab at converting the SSE to MMX with... some degree of success. What I meant by it's failing is the SSE-related stuff isn't working properly on my old Athlon; I don't know enough about SSE and my processor to know why, it's just how it is.
Queue
So which SSE level does the first line give you?
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
SSE1
that could be the problem :P
most of the SSE code in here requires at least SSE2 support
Quote from: jj2007
To illustrate what these "unorthodox" jumps do, here a snippet:
include \masm32\include\masm32rt.inc
.code
start: mov ebx, -1
jmp @F+3 ; attention "artwork"
@@: nop
L0: inc ebx
L1: inc ebx
L2: inc ebx
L3: inc ebx
L4: inc ebx
L5: inc ebx
MsgBox 0, str$(ebx), "I hit label No. :", MB_OK
exit
end start
A quick static analysis suggests this doesn't work the way you wanted it too. L1 will report 4, L0 5.
Quote from: jj2007 on June 27, 2010, 08:06:16 AM
The real fun comes up when people use different assemblers, because in complex code the offsets *may* differ on rare occasions. If that happens, the jmp @F+123 ends in no man's land, and you have "arbitrarily" behaving source code that is meant to drive you insane.
Is there any way to tell from examining the .exe which assembler created the .obj, and which loader was used?
Quote from: dedndave on June 27, 2010, 09:54:42 AM
that could be the problem :P
most of the SSE code in here requires at least SSE2 support
Well, that answers that. Why does an SSE1 processor not just outright crash the program when SSE2 opcodes are used?
Queue
"Is there any way to tell from examining the .exe which assembler created the .obj"
Yes, if you have a MS "Rich" signature after DOS Header of your exe file, ONLY... :wink
Quote from: clive on June 27, 2010, 01:31:33 PM
A quick static analysis suggests this doesn't work the way you wanted it too. L1 will report 4, L0 5.
Thanks, Clive. Here is the correct version.
include \masm32\include\masm32rt.inc
.code
start: mov ebx, 7
jmp @F+5 ; attention "artwork"
@@: nop
L0: dec ebx
L1: dec ebx
L2: dec ebx
L3: dec ebx
L4: dec ebx
L5: dec ebx
MsgBox 0, str$(ebx), "I hit label No. :", MB_OK
exit
end start
Quote from: lingo on June 27, 2010, 04:48:32 PM
"Is there any way to tell from examining the .exe which assembler created the .obj"
Yes, if you have a MS "Rich" signature after DOS Header of your exe file, ONLY... :wink
The Rich header is added by the Microsoft linker. I know of some differences in structure between linkers, but not between assemblers. For example, if the code and import sections are merged, that results in a different layout between MS's LINK and POLINK. The resource secton's organization can vary based on how the RES file was converted to object code and by which linker. Different linkers also tend to use slightly different section flags, especially when merging sections.
Queue
Quote from: Queue on June 27, 2010, 05:39:14 PM
I know of some differences in structure between linkers, but not between assemblers.
In general, there are no differences in the code generated by various versions of MASM, from ml v6.15 onwards.
Jwasm generates something else for .if eax, but that is not particularly relevant. However, there is one little bug that can make a difference for the jmp @F+123 "artwork": Jwasm generates long jumps in situations where a short jump would be sufficient. That can be cumbersome indeed :bg
"The Rich header is added by the Microsoft linker"
Congratulation, you just reinvent the wheel....
Will be better to tell as why and how to use it?
For your info Rich is just a part of PRODITEM structure for "end of tallies" rather than a "header"... :wink
Quote from: lingo on June 27, 2010, 06:01:23 PM
"The Rich header is added by the Microsoft linker"
Congratulation, you just reinvent the wheel....
Will be better to tell as why and how to use it?
For your info Rich is just a part of PRODITEM structure for "end of tallies" rather than a "header"... :wink
Wait, what? You said a way to tell the difference between assemblers is the presence of the Rich header. I simply pointed out it's the linker that adds the Rich header, not the assembler.
The Rich header is used to identify which versions of various MS utilities were used to make the executable.
http://ntcore.com/files/richsign.htm
Queue
I wonder how many hits that site will get today?
Interesting read.
Dave.
"The Rich header is used to identify which versions of various MS utilities were used to make the executable."
and
"It seems that the fixed value of the Rich Signature represents the product version, whereas the other entries represent file versions. Of course, this is only a guess. We would have to dig further to be 100% sure. At the end, the value of @comp.id is quite harmless, compared to what other people thought it could be." by Daniel Pistelli
I know this document very well but he (and you) don't know exactly what is inside (what is the product(tool) name and it's version and subversion
and how MS define the product identifiers and tags used to identify which MS tool built any particular object file)
With other words: He (and you) is unable to translate all the bytes between DanS and Rich in readable way exactly... :wink
"You said a way to tell the difference between assemblers is the presence of the Rich header"
Yes, but only for MASM assemblers of MS with different versions.
Quote from: lingo on June 27, 2010, 07:57:16 PM
"The Rich header is used to identify which versions of various MS utilities were used to make the executable."
and
"It seems that the fixed value of the Rich Signature represents the product version, whereas the other entries represent file versions. Of course, this is only a guess. We would have to dig further to be 100% sure. At the end, the value of @comp.id is quite harmless, compared to what other people thought it could be." by Daniel Pistelli
I know this document very well but he (and you) don't know exactly what is inside (what is the product(tool) name and it's version and subversion
and how MS define the product identifiers and tags used to identify which MS tool built any particular object file)
With other words: He (and you) is unable to translate all the bytes between DanS and Rich in readable way exactly... :wink
"You said a way to tell the difference between assemblers is the presence of the Rich header"
Yes, but only for MASM assemblers of MS with different versions.
Lingo,
Even more restrictive than that:
"undocumented Rich Signature produced by Microsoft compilers. I'm not completely sure when this signature was introduced, I wrongly believed that I had been introduced with Visual Studio 2003, but I was shown that it is present even in VC++ 6 executables." by Daniel Pistelli.
Just assemblingASM source with MASM does not create a Rich signature in the executable (.exe), it takes the C compiler (examine any of the .exe files in this topic).
Dave.
"Just assemblingASM source with MASM does not create a Rich signature in the executable (.exe),"
After successfully assembled ASM source with MASM we have .obj file with @comp.id number in it."
Later if we use MS Linker (not patched) we will have .exe file with Rich signature in it (with masked @comp.id inside). :wink
You are right, I forgot that the makeit.bat was using PoLink.