Optimizing string procedures

Started by brethren, November 29, 2010, 03:30:39 PM

dedndave

you could add another section to test for opposite case, too   :U

dedndave

later....
here is a variation on the theme
it may not be the fastest way, but it is ~60% faster than a rep scasb version
it uses all registers, preserves none, and passes the address in a register
it could easily be made into an INVOKE'able function, though
in my application, Form Feed is used to clear screen, so it was included
StrScan PROC

;scan a line of text and return the number of characters preceding the first "special" character
;special characters include 0,9,10,12,13
;
;Call With: ESI = offset of string
;
;  Returns: ECX = bytes preceding special char
;           EDX = special char identifier
;                 0 (12) form feed
;                 4 (0) null terminator
;                 5 (9) tab
;                 6 (13) carriage return
;                 7 (10) line feed
;           ZF  = set if ECX = 0, otherwise cleared

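        or      ecx,-1                  ;ECX = dword counter, starts at -1
        mov     ebx,7F7F7F7Fh           ;EBX = low-7-bit mask for the zero-byte test
        mov     edi,9090909h            ;EDI = first XOR pattern (tab in every byte)
        xor     ebp,ebp                 ;EBP = result accumulator

sScan0: push    edi                     ;save the tab pattern for the next dword
        mov     eax,[esi]               ;fetch 4 characters
        push    ecx                     ;save the dword counter
        dec     ebp                     ;EBP = FFFFFFFFh (nothing found yet)
        push    5050505h                ;seed used to derive the next XOR patterns
        mov     ecx,5                   ;5 tests: 0, 9, 13, 10, 12
        jmp short sScan2

sScan1: pop     edx
        xor     eax,edi                 ;turn the next special char into zero bytes
        xor     edi,edx                 ;derive the following XOR pattern
        ror     ebp,1                   ;move earlier results down one bit
        sub     edx,2020202h
        and     edi,7070707h
        push    edx

sScan2: mov     edx,eax                 ;zero-byte test:
        and     edx,ebx
        add     edx,ebx
        or      edx,eax
        or      edx,ebx                 ;EDX byte = 7Fh if EAX byte is 0, else FFh
        and     ebp,edx                 ;clear bit 7 of each matching byte
        dec     ecx
        jnz     sScan1

        pop     edx                     ;discard the pattern seed
        pop     ecx                     ;restore the dword counter
        add     esi,4
        inc     ecx
        inc     ebp                     ;EBP = 0 if no special char in this dword
        pop     edi
        jz      sScan0

        bsf     eax,ebp                 ;bit index of the first special char found
        shl     ecx,2                   ;dwords to bytes
        lea     edx,[eax+1]
        shr     eax,3                   ;byte index within the dword
        and     edx,7                   ;EDX = special char identifier
        add     ecx,eax                 ;ECX = total bytes preceding; sets ZF
        ret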

StrScan ENDP
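
Editor's note: the routine above tests four characters at a time by XORing the dword with a repeated pattern and then looking for zero bytes. The C sketch below shows only that underlying idea. It is not a translation of dedndave's code: the names `load32`, `zero_bytes` and `scan_special` are made up for illustration, it returns only the byte count (no identifier in EDX), and it assumes that 4 bytes are readable at every offset it touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load 4 bytes as a little-endian dword, like "mov eax,[esi]" on x86. */
static uint32_t load32(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return (uint32_t)u[0] | (uint32_t)u[1] << 8 |
           (uint32_t)u[2] << 16 | (uint32_t)u[3] << 24;
}

/* 0x80 in every byte position where w holds a zero byte, 0x00 elsewhere.
   Same and/add/or/or sequence as the assembly, followed by a NOT. */
static uint32_t zero_bytes(uint32_t w)
{
    uint32_t t = (w & 0x7F7F7F7Fu) + 0x7F7F7F7Fu; /* bit 7 set if low 7 bits != 0 */
    return ~(t | w | 0x7F7F7F7Fu);
}

/* Number of bytes before the first 0, 9, 10, 12 or 13.
   ASSUMPTION (hypothetical helper): the caller guarantees that the
   4 bytes starting at every scanned offset are readable. */
size_t scan_special(const char *s)
{
    static const uint32_t pattern[5] = {
        0x00000000u, 0x09090909u, 0x0A0A0A0Au, 0x0C0C0C0Cu, 0x0D0D0D0Du
    };
    assert(s != NULL);
    for (size_t off = 0; ; off += 4) {
        uint32_t w = load32(s + off);
        uint32_t hit = 0;
        for (int k = 0; k < 5; k++)
            hit |= zero_bytes(w ^ pattern[k]); /* XOR makes a match a zero byte */
        if (hit != 0) {
            size_t i = 0;
            while ((hit & 0x80u) == 0) {       /* find the lowest marked byte */
                hit >>= 8;
                i++;
            }
            return off + i;
        }
    }
}
```

For example, `scan_special("hello\tworld")` stops at the tab. The assembly version folds the five tests into one accumulator with `ror` so that it can also report which character was hit.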

qWord

hi dedndave,
what about using SSEx?
what_ever proc psz: ptr CHAR
   
    .data
        align 16
        _9 db 16 dup (9)
        _10 db 16 dup (10)
        _12 db 16 dup (12)
        _13 db 16 dup (13)
    .code
   
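    pxor xmm5,xmm5                      ; xmm5 = 16 zero bytes (null terminator)
    mov edx,psz
    xor ecx,ecx                         ; ecx = offset of the current 16-byte block
@@:
    movdqu xmm0,OWORD ptr [edx+ecx]     ; unaligned load of 16 characters
    movdqa xmm1,xmm0                    ; one copy per character to test
    movdqa xmm2,xmm0
    movdqa xmm3,xmm0
    movdqa xmm4,xmm0
    pcmpeqb xmm0,OWORD ptr _9           ; each matching byte becomes 0FFh
    pcmpeqb xmm1,OWORD ptr _10
    pcmpeqb xmm2,OWORD ptr _12
    pcmpeqb xmm3,OWORD ptr _13
    pcmpeqb xmm4,xmm5
    por xmm0,xmm1                       ; merge the five results
    por xmm2,xmm3
    por xmm0,xmm2
    por xmm0,xmm4
    pmovmskb eax,xmm0                   ; one bit per byte
    test eax,eax
    jnz @F                              ; a special character was found
    lea ecx,[ecx+16]                    ; LEA leaves the flags unchanged,
    jz @B                               ; so ZF is still set and this always loops
@@:
    bsf eax,eax                         ; index of the first match in the block
    lea eax,[eax+ecx]                   ; eax = offset of the first special character
    ret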
   
what_ever endp
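
Editor's note: for readers more at home with intrinsics, here is a rough C sketch of the same compare, merge and movemask sequence. The function name `scan_special_sse2` is hypothetical, the bit scan is a plain loop standing in for `bsf`, and a scalar fallback is included for targets without SSE2. Like the assembly, it assumes that a full 16 bytes can be read from each block it touches, so the buffer may need padding.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Returns the offset of the first 0, 9, 10, 12 or 13 byte in s.
   ASSUMPTION: 16 bytes are readable from the start of every block scanned. */
size_t scan_special_sse2(const char *s)
{
    assert(s != NULL);
    const __m128i tab = _mm_set1_epi8(9);
    const __m128i lf  = _mm_set1_epi8(10);
    const __m128i ff  = _mm_set1_epi8(12);
    const __m128i cr  = _mm_set1_epi8(13);
    const __m128i nul = _mm_setzero_si128();

    for (size_t off = 0; ; off += 16) {
        /* movdqu: unaligned 16-byte load */
        __m128i v = _mm_loadu_si128((const __m128i *)(s + off));
        /* pcmpeqb + por: 0xFF in every byte that matches any pattern */
        __m128i m = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(v, tab), _mm_cmpeq_epi8(v, lf)),
            _mm_or_si128(_mm_cmpeq_epi8(v, ff),  _mm_cmpeq_epi8(v, cr)));
        m = _mm_or_si128(m, _mm_cmpeq_epi8(v, nul));
        /* pmovmskb: collapse to one bit per byte */
        unsigned mask = (unsigned)_mm_movemask_epi8(m);
        if (mask != 0) {
            size_t i = 0;                 /* portable stand-in for bsf */
            while ((mask & 1u) == 0) {
                mask >>= 1;
                i++;
            }
            return off + i;
        }
    }
}
#else
/* Scalar fallback for targets without SSE2. */
size_t scan_special_sse2(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    while (s[i] != 0 && s[i] != 9 && s[i] != 10 && s[i] != 12 && s[i] != 13)
        i++;
    return i;
}
#endif
```

The unaligned load is what lets this run on strings at arbitrary addresses; only the constant patterns in the assembly version need 16-byte alignment.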

dedndave

thanks qWord
but, these are un-aligned strings to be placed in a display buffer
the special characters are "stopping points", where the string is terminated, or the char is otherwise expanded
once an expanded char has been acted upon, you pick up where you left off in the input string
there is no practical way to align them
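
Editor's note: the scan, expand, resume loop described here might look roughly like the C sketch below. Everything in it is an assumption made for illustration: the names `run_length` and `expand_line`, the 8-column tab stops, and the choice to end the line at CR, LF, form feed or the terminator. The simple byte loop in `run_length` stands in for whichever fast scanner is used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAB_WIDTH 8   /* assumed tab stop spacing */

/* Hypothetical stand-in for the scanner: bytes before the first special char. */
static size_t run_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\t' && s[n] != '\n' &&
           s[n] != '\f' && s[n] != '\r')
        n++;
    return n;
}

/* Copy one line from src into out (capacity cap), expanding tabs to spaces.
   Stops at NUL, LF, CR, form feed, or when out is full.
   Returns the number of characters written; out is NUL-terminated. */
size_t expand_line(const char *src, char *out, size_t cap)
{
    size_t col = 0;
    assert(src != NULL && out != NULL);
    if (cap == 0)
        return 0;
    for (;;) {
        size_t n = run_length(src);          /* plain text up to the stopping point */
        if (n > cap - 1 - col)
            n = cap - 1 - col;               /* clip to the space that is left */
        memcpy(out + col, src, n);
        col += n;
        src += n;
        if (*src != '\t' || col >= cap - 1)
            break;                           /* end of line or buffer full */
        size_t stop = (col / TAB_WIDTH + 1) * TAB_WIDTH;
        if (stop > cap - 1)
            stop = cap - 1;
        memset(out + col, ' ', stop - col);  /* expand the tab */
        col = stop;
        src++;                               /* pick up right after the tab */
    }
    out[col] = '\0';
    return col;
}
```

Because each resume point lands wherever the previous special character happened to be, the input pointer is at an arbitrary address on every call, which is why alignment cannot be relied on.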

jj2007

Quote from: dedndave on March 23, 2011, 02:13:31 AM
thanks qWord
but, these are un-aligned strings to be placed in a display buffer

qWord's code does not require alignment.

dedndave

oh - cool
i will give it a try
maybe i can get my feet wet with SSE and try to understand it - lol

lingo

If someone needs speed, he upgrades to a faster CPU, and only AFTER that is it OK to try MASM optimization...
Only idiots optimize for archaic CPUs.. :lol

dedndave

yah - you got a new machine - now you are gonna harp
who gives a shit what you think, lingo...
....absolutely noone

lingo

The forum's idiots, like dedndave, who say "no" or
"....absolutely noone", should be so kind as to read and compare
the timings of the same algos on different CPUs here...

For example:

Prescott P4:
110 cycles for agner fog StrLen-masmlib

i7-2600K:
28 cycles for agner fog StrLen-masmlib

Is it possible to compensate for the difference with MASM optimization?  :lol

hutch--

Guys,

Try to keep the aggro out of here; we have other subforums for "debates". I doubt there is any argument that a late i7 is faster than a Prescott PIV, but that does not help anyone who owns or uses an old timer and speed is relative to the box running it, not what someone else owns. I use my i7 Win64 box to watch TV and movies; I would rather develop on my Core2 quad.  :P

dedndave

good point, Hutch
if you optimize for a brand new processor, your code will be fast on only a small portion of computers that are in use
whereas, if you optimize on a processor that is 5 to 10 years old, it will run well on the vast majority of machines
it will run well on newer machines, simply because they are fast anyways - lol

as for lingo - he shits on everything i do - which - i don't care
but - more than once, he has taken all the fun out of a thread with his assholeism

redskull

For what it's worth, optimization for a P4 is a whole different bag of tricks than almost every other processor.  In many cases, 'optimized' code actually runs slower on a P4 than non-optimized versions, and vice-versa.  Only an "idiot" would compare the same algorithm on a PIV to a non-PIV; they are apples and oranges.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

lingo

"but that does not help anyone who owns or uses an old timer and speed is relative to the box running it, "

I'm not so crazy to keep at home ten different configurations with archaic
CPUs just to create ten different optimized versions of the same algo... :lol

donkey

Quote from: lingo on March 24, 2011, 02:16:04 PM
"but that does not help anyone who owns or uses an old timer and speed is relative to the box running it, "

I'm not so crazy to keep at home ten different configurations with archaic
CPUs just to create ten different optimized versions of the same algo... :lol

Except to get a rise out of Hutch, I gave up optimizing code a long time ago. The work that you might spend 20 minutes or more on could save a couple of milliseconds of runtime and be negated by things like the ADD EAX,1 / INC EAX type changes from processor to processor. That said, there are some things that will always give you a boost. Any way that you can find to process more data simultaneously will always help, though I would have to see timings on short strings for the XMM version of lstrlen above. It looks like it would do well with large documents, so in that case it's a practical exercise. For X64 programming, the instruction sizes can be huge depending on addressing modes, and the FASTCALL convention is bloated to say the least, so I have lately been looking at reducing code size in order to make up for it, not so much for speed but for overall compactness. Compactness doesn't make a lot of difference, especially considering that my main PC has 8GB of memory and 1 TB of hard disk space, but I like it so I do it anyway.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama


lingo

 "For what it's worth, optimization for a P4 is a whole different bag of tricks than almost every other processor.
In many cases, 'optimized' code actually runs slower on a P4 than non-optimized versions, and vice-versa.."


Many thanks to our guru of code optimization...you just opened my eyes, but I can't see your P4 Prescott code here (see my reply #11) :lol