News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM

Previous topic - Next topic

UtillMasm

c:\china>\masm32\bin\ml.exe /c /coff InstrCi11_BL_BSF.asm
Microsoft (R) Macro Assembler Version 6.15.8803
        Patched for you by promethee [ECL] in the year 2001 - enjoy
Copyright (C) Microsoft Corp 1981-2000.  All rights reserved.

Assembling: InstrCi11_BL_BSF.asm

c:\china>\masm32\bin\link.exe /subsystem:console InstrCi11_BL
_BSF.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

c:\china>InstrCi11_BL_BSF.exe
Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1748     InstrJJ (jj2007)
2817     BMLinDB (Lingo)
10305    InString (Masm32 Lib)
InstrJJ : InString = 16 %

Code sizes:
InstrJJ      = 319
BMLinDB      = 395
InStringL    = 305

c:\china>

jj2007

> InstrJJ : InString = 16 %

Thanks, UtillMasm. One third faster than the previous version, not too bad :8)

fearless

Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1148     InstrJJ (jj2007)
2725     BMLinDB (Lingo)
9355     InString (Masm32 Lib)
InstrJJ : InString = 12 %

Code sizes:
InstrJJ      = 319
BMLinDB      = 395
InStringL    = 305
ƒearless

jj2007

Quote from: fearless on May 11, 2009, 03:41:20 PM

Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (SSE4)
..
InstrJJ : InString = 12 %


Thanks. For a P4 I get now 25%, for the T2400 16%, and a Quad with SSE4 gives factor 8 compared to non-SSE2. Interesting pattern...
Quote

redskull

Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1071     InstrJJ (jj2007)
2728     BMLinDB (Lingo)
9324     InString (Masm32 Lib)
InstrJJ : InString = 11 %

Code sizes:
InstrJJ      = 319
BMLinDB      = 395
InStringL    = 305
Strange women, lying in ponds, distributing swords, is no basis for a system of government

jj2007

Quote from: redskull on May 11, 2009, 04:16:54 PM
Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1071     InstrJJ (jj2007)
2728     BMLinDB (Lingo)
9324     InString (Masm32 Lib)
InstrJJ : InString = 11 %



Thanks. Not a Quad but SSE4. On my old one-core CPU, I get much less of an improvement:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1717     InstrJJ (jj2007)
2780     BMLinDB (Lingo)
10196    InString (Masm32 Lib)
InstrJJ : InString = 16 %


Lingo's algo is SSE2, too, and pretty stable, so I guess Intel has made progress on a few very specific instructions that I use in the inner loop.

L1: movdqa xmm2, [esi]
Lu: movdqa xmm1, xmm0 ; xmm0 has first byte of pattern
lea esi, [esi+16] ; position in mainstring
pcmpeqb xmm1, xmm2 ; compare memory to pattern byte
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern byte
pcmpeqb xmm1, xmm2 ; look for zero byte (xmm1 is either 0 or ff)
pmovmskb eax, xmm1 ; set byte mask in eax for zero delimiter byte
test eax, eax ; db 02eh ; branch hint: not taken, no effect
jne ZBF ; zero byte found? check/clear ebp, then ChkNull
test edx, edx ; db 03eh ; branch hint: taken, no effect
je L1 ; pattern byte found? no=go back, yes=repeat the exercise with a word

FORTRANS

Hi,


AMD Athlon(tm) 64 Processor 3200+ (SSE2)
Test correctness for InstrJJ: OK
Average cycle counts:
3106     InstrJJ (jj2007)
3872     BMLinDB (Lingo)
13736    InString (Masm32 Lib)
InstrJJ : InString = 22 %

Code sizes:
InstrJJ      = 319
BMLinDB      = 395
InStringL    = 305


Regards,

Steve N.

Mark Jones


AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
2985     InstrJJ (jj2007)
3872     BMLinDB (Lingo)
13725    InString (Masm32 Lib)
InstrJJ : InString = 21 %

Code sizes:
InstrJJ      = 319
BMLinDB      = 395
InStringL    = 305
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Thanks to all who tested this code. I have run into a very odd problem that's driving me nuts:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
2636     InstrJJ (jj2007)
2778     BMLinDB (Lingo)
10198    InString (Masm32 Lib)
InstrJJ : InString = 25 %

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1719     InstrJJ (jj2007)
2778     BMLinDB (Lingo)
10088    InString (Masm32 Lib)
InstrJJ : InString = 17 %

The only difference is the IFNOP switch in line 6 of the attachment.
If set to 1, it inserts an apparently harmless

   .if res1            ;; any global variable will do, but not .if eax
      nop
   .endif

into the main loop of the timings macro:
  REPEAT REP_CT
  if IFNOP
.if res1 ;; any global variable will do, but not .if eax
nop
.endif
endif
invoke Sleep, 50
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
tmp$
counter_end
add globcycles, eax
inc globcount
if DisplayCycles
.if eax>65536
shr eax, 10
.endif
out$ CATSTR <print str$(eax), 9, ">, Description, <", 13, 10>
out$
endif
  ENDM


Otherwise the code is identical. Any ideas?
::)

[attachment deleted by admin]

Jimg

Why are you so surprised?  I see this all the time on amd. 
You move the call to the routine by n bytes.  I didn't actually assemble it since I don't have sse2, but check for odd/even address, etc.
Some instructions prefer to be on an even address while others an odd address.

dedndave

that is a HUGE difference, though
it makes me wonder if it is worth the effort to optimize, beyond a certain point
get it running sufficiently fast, then insert nop's behind each instruction to tune it in - lol

jj2007

Quote from: Jimg on May 11, 2009, 08:14:15 PM
Some instructions prefer to be on an even address while others an odd address.

Both data and code are aligned 16 plus a fixed offset. But it is an "alignment" issue - I can reproduce the effect by inserting a given number of nops before the getkey+exit...

Do the two exes produce different results on your CPUs??

:(

EDIT: I managed to get the difference between the slow and fast calls to the algo:

;CPU Disasm FAST
;Address               Hex dump                       Command
;004036D2              ? 68 6C614000                  push offset InstrCi11_BL_BSF.0040616C
;004036D7              ? 68 47574000                  push offset InstrCi11_BL_BSF.00405747
;004036DC              . 6A 01                        push 1
;004036DE              . E8 3D0E0000                  call 00404520 #######
;004036E3              . 832D 047A4000 01             sub dword ptr [407A04], 1

;CPU Disasm slow
;004036D2              ? 68 6C614000                  push offset InstrCi11_BL_BSF.0040616C
;004036D7              ? 68 47574000                  push offset InstrCi11_BL_BSF.00405747
;004036DC              . 6A 01                        push 1
;004036DE              . E8 2D0E0000                  call 00404510 #######
;004036E3              . 832D 047A4000 01             sub dword ptr [407A04], 1


A code caching issue maybe??

jj2007

Quote from: dedndave on May 11, 2009, 09:10:33 PM
that is a HUGE difference, though
it makes me wonder if it is worth the effort to optimize, beyond a certain point
get it running sufficiently fast, then insert nop's behind each instruction to tune it in - lol


Inserting nops inside the algo helps sometimes :bg

But here it's from where you call it that makes the difference. The algo itself, and the data addresses, are unchanged... same algo size, no sudden conversions from near to far jumps etc...

dedndave

i don't think the stack cares if you push an odd number or an even one
it has to be related to the EIP, itself
perhaps the lag occurs upon returning from the call, as opposed to making the call
so that, after the routine is done, it wants an even addy to return to ?
it does have to re-fill the queue at that point

now, we just need an assembler that will even-align (16-byte align - ouch) the return addy of calls - what a difference !
hint - hint GoAsm guys

MichaelW

I haven't been following this thread, but if the alignment of the return address is a problem it can readily be aligned by any one of several methods. For example:

    ; push arguments
    push OFFSET return_label
    push OFFSET procedure
    ret
    align 16
  return_label:


Note that the OFFSET directive is not necessary, it was just added for clarity.
eschew obfuscation