Hex string to dword optimization

Brett Kuntz · April 05, 2005, 09:06:31 PM

Hello, I wrote this proc to convert a variable-length string of hexadecimal characters into a dword. I've tried to make it as fast as possible, but it runs in 190 ms versus 195 ms for the C version (a 2.5% speed increase). I was wondering if you guys know further ways to optimize it:

(i:DWORD is a pointer to the string)

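Hex2Str proc i:DWORD                    ; converts a hex string to a dword, despite the name

    mov eax, 0                          ; eax = accumulated result
    mov edx, [i]                        ; edx -> current character
    jmp first
    top:
    inc edx                             ; advance to the next character
    first:
    shl eax, 4                          ; make room for the next nibble
    movsx ecx, byte ptr [edx]
    sub ecx, 48                         ; '0'..'9' -> 0..9
    cmp ecx, 10
    jnb @F                              ; unsigned compare: not a decimal digit
        add eax, ecx
        movsx ecx, byte ptr [edx+1]
        test ecx, ecx                   ; is the next byte the terminating zero?
        jnz top
        ret
    @@:
    movsx ecx, byte ptr [edx]
    sub ecx, 65                         ; 'A'..'F' -> 0..5
    cmp ecx, 6
    jnb @F                              ; not an uppercase hex letter
        lea eax, [eax+ecx+10]           ; add the digit value 10..15
        movsx ecx, byte ptr [edx+1]
        test ecx, ecx
        jnz top
        ret
    @@:
    movsx ecx, byte ptr [edx]
    sub ecx, 97                         ; 'a'..'f' -> 0..5
    cmp ecx, 6
    jnb @F                              ; not a lowercase hex letter either
        lea eax, [eax+ecx+10]
        movsx ecx, byte ptr [edx+1]
        test ecx, ecx
        jnz top
        ret
    @@:
    mov eax, -1                         ; invalid character (or empty string)
    ret

Hex2Str endp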
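
For readers who find the control flow easier to follow in C, here is a minimal sketch of the same logic. It is not the C version that was timed above (that code was not posted); the function name `hex_to_dword_ref` is made up for illustration, and the sketch mirrors the proc's behaviour as I read it, including returning -1 (0xFFFFFFFF) for an empty string or an invalid character.

```c
#include <stdint.h>

/* Hypothetical C rendering of the proc above, for illustration only. */
uint32_t hex_to_dword_ref(const char *s)
{
    uint32_t value = 0;
    do {
        unsigned c = (unsigned char)*s;

        value <<= 4;                      /* shl eax, 4 */
        if (c - '0' < 10u)                /* unsigned range check, like sub/cmp/jnb */
            value += c - '0';
        else if (c - 'A' < 6u)
            value += c - 'A' + 10;
        else if (c - 'a' < 6u)
            value += c - 'a' + 10;
        else
            return 0xFFFFFFFFu;           /* mirrors "mov eax, -1" */
    } while (*++s != '\0');               /* stop when the next byte is the terminator */

    return value;
}
```

Note that, as in the proc, the error value cannot be told apart from the valid input "FFFFFFFF".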

Mark_Larson · April 06, 2005, 02:22:05 AM

You definitely posted this question on the right forum :) All the optimization freaks hang out here ;) I don't have time to write code until tomorrow, but for now I'll point you in the right direction for making it run faster, and I'll post some code tomorrow. First up, I have a webpage on optimizing in assembly language: www.mark.masmcode.com. It has 60 tips/tricks for speeding up your code that you can't do in a high-level language. Intel has a PDF manual on their website on optimizing for their processors. And Agner Fog wrote a great optimization PDF: www.agner.org/assem. It's the very first PDF on that page. Learning how to optimize is a fine art. You really have to understand how the processor works to make your code run like the wind. So what are some things you can do to make it faster?

1) On non-P4 processors, aligning code is very important. On all processors, aligning data is important.

2) No processor runs as fast as it can if you do write-after-write or read-after-write accesses. Both cause the pipeline to stall. What are they? A read-after-write access is reading a register right after you wrote to it. It causes a stall because the processor has to wait for the value to be determined before it can be used. Write-after-write is the same thing, except that it means writing to a register while the processor still hasn't completed the first write. Here's an example:

mov eax,10
and ebx,eax    ; read-after-write stall. We just wrote the value 10 to EAX and now we are reading it when we AND it with EBX.

mov eax,10
xor eax,20    ; write-after-write stall. We just wrote the value 10 to EAX and now we are XORing a 20 into it.

If you look at the code you wrote, you have several places where you write a value to a register and then immediately compare it or do some other operation with it. One of those places is shown after this paragraph. How can you fix it? You can move other instructions that you need anyway in between the accesses to ECX. For instance, you can exchange the "shl eax,4" with the line after it to help break up the write-after-write stall that is going to occur. Things are actually more complex, because both stall types can span multiple lines: sometimes one line of padding between writing a register and reading it isn't enough. It depends on how long the original write instruction takes to execute. So you might have to play around with moving lines around and re-timing your code to find the best arrangement.

    shl eax, 4
    movsx ecx, byte ptr [edx]
    sub ecx, 48
    cmp ecx, 10

3) On a P4, ADD/SUB is faster than INC/DEC. If you are running your code on a P4, use ADD/SUB instead of the INC/DEC I saw in your code.

4) Lookup table. This is going to be your BIG speed-up if you can get it to work. You basically use the value you read from the string as an index into a table that does the conversion for you. That will simplify a lot of your code and get rid of a lot of your comparisons.

5) Unrolling your loop. See if you can completely or semi-completely unroll your loop.

6) (This isn't an optimization.) Change your MOVSX to MOVZX. You are dealing with unsigned ASCII data.

7) You use "mov eax,0" to set EAX to 0. You want to use "xor eax,eax" instead. It's generally faster and it breaks dependency chains.
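
To make tip 4 concrete, here is a minimal sketch of the lookup-table idea, written in C so the logic is easy to see before translating it to assembly. All names (`hex_lut`, `hex_lut_init`, `hex_to_dword_lut`) are made up for illustration, and the separate success flag is my own choice rather than anything from the original proc. The table is indexed with an unsigned byte, which is also the point behind tip 6.

```c
#include <stdint.h>

/* 256-entry table: digit value 0..15 for hex characters, -1 for everything else.
   Hypothetical helper; it has to be filled once before the first conversion. */
static int8_t hex_lut[256];

void hex_lut_init(void)
{
    for (int i = 0; i < 256; i++)
        hex_lut[i] = -1;
    for (int i = 0; i < 10; i++)
        hex_lut['0' + i] = (int8_t)i;
    for (int i = 0; i < 6; i++) {
        hex_lut['A' + i] = (int8_t)(10 + i);
        hex_lut['a' + i] = (int8_t)(10 + i);
    }
}

/* Returns 1 and stores the value in *out on success, 0 on an empty string
   or an invalid character. One table read replaces the three range checks. */
int hex_to_dword_lut(const char *s, uint32_t *out)
{
    uint32_t value = 0;

    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++) {
        int d = hex_lut[(unsigned char)*s];   /* unsigned index, never negative */
        if (d < 0)
            return 0;
        value = (value << 4) | (uint32_t)d;
    }
    *out = value;
    return 1;
}
```

Call `hex_lut_init()` once at startup, then `hex_to_dword_lut("1A2b", &v)` leaves 0x1A2B in `v`. Whether this actually beats the compare chain on a given processor still has to be timed.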

Everything I covered here is on my optimization webpage. Good luck.

Brett Kuntz · April 06, 2005, 03:44:33 AM

I'll read all your links to get a better idea of it all. I knew about the register burn (read-after-write/write-after-write) but don't have any ideas on how to remove it from my code. I'm on an AMD system and am looking for AMD optimizations. I don't understand code aligning, though I think it's when you align the code to 4 bytes? Why does this make execution faster? Or is that covered in your guide? Are there non-PDF guides? I don't and won't install Adobe, as it's a pile.

gabor · April 06, 2005, 10:10:57 AM

Hi!

In my experience, INC and DEC are slower not only on a P4. I've got such results on an AMD Athlon 2500+ and on an Intel P3 too. OK, on the Intel the difference was not that significant.

And when talking about optimization, how about this:

Is it possible that it's sometimes faster to use memory variables instead of only registers? This is kind of strange, but please look at my code below:

Proc1      PROC USES ebx ecx edx esi edi,src:DWORD,dst:DWORD,bytes:DWORD
                mov     edx,bytes
                mov     esi,src
                mov     edi,dst
                add     edx,esi
; the important loop starts here...
@_1:
                mov     al,[esi]
                xor     ecx,ecx
                add     esi,1

                mov     cl,[stream_offs]
                mov     ebx,[edi]
                shl     eax,cl
                add     cl,[code_size]
                or      ebx,eax
                mov     al,cl
                mov     [edi],ebx
                shr     cl,3
                and     al,07h
                add     edi,ecx
                mov     [stream_offs],al
; ---------------
                cmp     esi,edx
                jnz     @_1
; ...ends here
                mov     eax,edi
                sub     eax,dst

                ret
Proc1      ENDP

I measure the elapsed time for executing this procedure, so I've set up the timing around the procedure call.

Did I mess something up, or can it be that this pipeline stall stuff has such a heavy impact on speed that inserting some instructions, even ones accessing a memory variable, is faster? Please give me a reasonable answer!

Another issue about alignment: does it make any sense to take care of the instruction size? I mean, should I write the instructions in such an order that their sizes align correctly, say on 4-byte boundaries? Is this possible at all?

Thanks for your patience!

Greets, Gábor

roticv · April 06, 2005, 12:08:06 PM

Maybe it is because your code is not aligned to 16 bytes...

Anyway, here is some MMX code for htod (not by me):

mmxbC9 dq 0C9C9C9C9C9C9C9C9h
mmxb39 dq '99999999'
mmxb07 dq 0707070707070707h

; < esi-string (upper or lower case)
; > eax-number
    movq mm0, [esi]
    movq mm1, mm0
    paddb mm0, [mmxbC9]
    pcmpgtb mm1, [mmxb39]
    pandn mm1, [mmxb07]
    paddb mm0, mm1
    movq mm1, mm0
    psllw mm0,12
    psllw mm1,4
    psrlw mm0,8
    psrlw mm1,12
    paddb mm0, mm1
    packuswb mm0, mm0
    movd eax, mm0
    bswap eax
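
The arithmetic in the MMX snippet is easier to follow one byte at a time. The following C sketch is my reading of what each lane does, not code from the original author; the name `htod8_scalar` is made up. It assumes exactly 8 valid hex characters and, like the MMX code, does no validation.

```c
#include <stdint.h>

/* Scalar walk-through of the per-byte steps in the MMX routine above.
   Hypothetical helper: expects exactly 8 valid hex characters. */
uint32_t htod8_scalar(const char *s)
{
    uint32_t value = 0;

    for (int i = 0; i < 8; i++) {
        uint8_t c = (uint8_t)s[i];
        uint8_t t = (uint8_t)(c + 0xC9);   /* paddb mm0, [mmxbC9]: wraps modulo 256 */

        if (c <= '9')                      /* pcmpgtb + pandn: 7 for digits, 0 for letters */
            t = (uint8_t)(t + 7);          /* paddb mm0, mm1 */

        /* '0'..'9' now hold 00h..09h, 'A'..'F' hold 0Ah..0Fh and 'a'..'f'
           hold 2Ah..2Fh, so only the low nibble matters. The word shifts and
           packuswb in the MMX code merge those nibbles pairwise into bytes. */
        value = (value << 4) | (t & 0x0Fu);
    }
    return value;                          /* first character ends up most significant,
                                              which is what the final bswap achieves */
}
```

For example, `htod8_scalar("DEADbeef")` gives 0xDEADBEEF. Shorter strings would need padding or a different routine, since both versions always consume 8 characters.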

Ian_B · October 24, 2005, 02:46:59 PM

See my code in a related thread on this subject:

http://www.masmforum.com/simple/index.php?topic=2221.msg23452#msg23452

IanB
