News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

MASM for FUN - #0.2 [string reversing]

Started by frktons, May 20, 2010, 08:24:50 PM

Previous topic - Next topic

clive

Atom
516     cycles for RevString
427     cycles for RevString2
419     cycles for RevStr
1663    cycles for szRev

Sizes:
64      RevString
64      RevString2
65      RevStr
72      szRev
It could be a random act of randomness. Those happen a lot as well.

jj2007

A factor 4... that's quite a big difference. I wonder what the Atom doesn't like in szRev's inner loop:
  @@:
    mov cl, [esi+eax]       ; load end pairs
    mov dl, [edi]
    mov [esi+eax], dl       ; swap end pairs
    mov [edi], cl
    sub edi, 1
    add eax, 1
    jnz @B                  ; exit on half length

dedndave

RevStr runs fastest on the prescott dual core, as well
hard to know why only the celeron M runs the new one faster   :red
337     cycles for RevString
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
474     cycles for RevString2
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..
328     cycles for RevStr
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
600     cycles for szRev
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..


at any rate, the biggest change i made was a simple improvement in loop structure
if your loop ends with a JMP instruction, it is usually a sign that something can be done better

in this example, we check the loop control (count) prior to executing the body of the loop
this leaves us with the ugly tell-tale sign of a JMP instruction at the end
;loop fall-through entry begins here

Loop01: cmp     ecx,0           ;used for example only
        jz      Loop03

Loop02:

;code representing the body of the loop goes here

        jmp     Loop01

;process continues after completion

Loop03:

by placing the control test at the end, and using JMP (usually SHORT) at entry,
we can avoid executing that JMP instruction with each pass of the loop
;loop fall-through entry begins here

        jmp short Loop02

Loop01:

;code representing the body of the loop goes here

Loop02: cmp     ecx,0           ;used for example only
        jnz     Loop01

;process continues after completion

Loop03:

notice, there is no difference in total size
we have simply moved the JMP from inside the loop to outside the loop

jj2007

Dave,
What are your timings? What happens if you put if 0 in line 109? It could be the lodsd...

dedndave

it's a little better...
339     cycles for RevString
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
396     cycles for RevString2
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..
333     cycles for RevStr
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
599     cycles for szRev
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..

i liked the first code - simple and easy to understand - lol
(except for the individual-byte loop structure, as mentioned above)

hutch--

Here is a quick twiddle of the test piece. I removed some of the stuff from szRev so it was doing that same amount of work and the others and it dropped in timings about 40%. Misaligned the string by 3 bytes and all of the DWORD versions fell over timings wise like normal.

I wrote it as byte reads and writes because its alignment insensitive.


291     cycles for RevString
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
283     cycles for RevString2
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..
286     cycles for RevStr
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
225     cycles for szRev
his is a string, 100 characters long, that serves a variety of purposes, such as
testing my algos..T

Sizes:
64      RevString
64      RevString2
65      RevStr
62      szRev

--- ok ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

i like the idea, Hutch, but i think there is a little bug in there
what processor were those times on ?

prescott dual core 3 GHz
442     cycles for RevString
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
579     cycles for RevString2
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..
442     cycles for RevStr
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
418     cycles for szRev
his is a string, 100 characters long, that serves a variety of purposes, such as
testing my algos..T

notice the string at the end   :red
i think you have to increment the length or something

jj2007

Bug corrected. Celeron M:
314     cycles for RevString
..sogla ym gnitset sa hcus ,
249     cycles for RevString2
This is a string, 100 charac
313     cycles for RevStr
..sogla ym gnitset sa hcus ,
263     cycles for szRev
This is a string, 100 charac

Sizes:
64      RevString
64      RevString2
65      RevStr
59      szRev


RevString2 is a bit faster now because I shamelessly copied a line from Hutch.... :bg

What is still missing is
    invoke StrLen, esi
    cmp eax, 1
    jbe TooShort

Otherwise szRev chokes on one-byte or null strings.

dedndave

prescott
430     cycles for RevString
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
439     cycles for RevString2
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..
428     cycles for RevStr
..sogla ym gnitset sa hcus ,sesoprup fo yteirav a sevres taht ,gnol sretcarahc 0
01 ,gnirts a si sihT
404     cycles for szRev
This is a string, 100 characters long, that serves a variety of purposes, such a
s testing my algos..

hutch--

If it actually mattered which I am not convinced about, using the Agner Fog style StrLen unrolled by 4 locally would see some gains in speed but it would need to be corrected for the leading alignment first.

I didn't fix the 1 byte error as I was testing the idea, it would need both the counter corrected and the test for 1 or 0 length.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: dedndave on May 22, 2010, 04:53:17 PM
prescott
Where is lingo when I need him? A little rant on legacy CPUs??

Wait for my pshufb version. It may take a while though until I have enough bucks for a brand new SSE3 puter :bg

dedndave

oh - we are calling this a legacy CPU, already ?
:lol
that was quick - seems like i just bought the damn thing

jj2007

Quote from: hutch-- on May 22, 2010, 04:57:09 PM
using the Agner Fog style StrLen unrolled by 4 locally would see some gains in speed

That was actually the line I copied from you:
if newversion_shamelessly_copied_from_Hutch
invoke StrLen, esi
lea edi, [esi+eax-4] ; you must know the end of the string
else
lea edi, [esi+len(esi)-4] ; szLen is horribly slow
endif

I use a faster SSE2 strlen in MasmBasic, but it would be a bit of an overkill here.

dedndave

i would think a quick SCASB to find the null should work as fast as anything
with a little adjustment, that could leave EDI pointing at the end of the string, too
what do i know - lol

jj2007

Quote from: dedndave on May 22, 2010, 05:24:16 PM
i would think a quick SCASB to find the null should work as fast as anything
with a little adjustment, that could leave EDI pointing at the end of the string, too
what do i know - lol

repne scasb is horribly slow, I am afraid :(