News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Binary string to dword

Started by MichaelW, January 03, 2005, 09:36:45 AM

Previous topic - Next topic

MichaelW

#15
This appears to produce the correct results:

btodw5 proc uses ebx esi String:DWORD
    mov   esi, String
    xor   ebx, ebx
  align 4
  @@:
    mov   ecx, [esi]
    sub   ecx, 30303030h
    test  ecx, 80000000h
    jnz   dobytes
    add   esi, 4
    bswap ecx
    mov   eax, 01020408h ; Good stuff Mirno :)
    mul   ecx
    shl   ebx, 4
    shr   eax, 24
    or    ebx, eax
    jmp   @B
  dobytes:
    movzx ecx, byte ptr [esi]
    add   esi, 1
    test  ecx, ecx
    je    fini
    lea   ebx, [ebx*2 + ecx - 30h]
    jmp   dobytes
  fini:
    mov   eax, ebx
    ret
btodw5 endp


But, for the cases where you are not handling a multiple of four bytes, I think it will be slower than this:

btodw4 proc uses ebx esi edi String:DWORD
    mov edx, String
    xor eax, eax
align 4
  @@:
    movzx ecx, byte ptr [edx]
    movzx ebx, byte ptr [edx+1]
    movzx esi, byte ptr [edx+2]
    movzx edi, byte ptr [edx+3]
    add   edx, 4
    test  ecx, ecx
    jz    fini
    lea   eax, [eax*2 + ecx - 30h]
    test  ebx, ebx
    jz    fini
    lea   eax, [eax*2 + ebx - 30h]
    test  esi, esi
    jz    fini
    lea   eax, [eax*2 + esi - 30h]
    test  edi, edi
    jz    fini
    lea   eax, [eax*2 + edi - 30h]
    jmp   @B
  fini:
    ret
btodw4 endp


Mark,

Could you post the P4 clock counts for reference?
eschew obfuscation

Ratch

#16
MichaelW,
     I reworked your code to make it smaller, faster and fixed a bug.  A test of 80808080H is needed to find the zero byte in a string such as BYTE '111111010',0,'111111'.  Ratch


;-------------------------------------------------------------------------------
;*******************************************************************************
; ABS_ABDW:  Converts a ASCII binary string into an arithmetic binary DWORD.   *
;                                                                              *
; Called by: PUSH <address of string>                                          *
;            CALL ABS_ABDW                                                     *
;                                                                              *
; Returns:   EAX=arithmetic binary value                                       *
;                                                                              *
; Notes:     This subroutine conforms with WIN32 conventions regarding         *
;            register preservation and stack balancing. EBX,EBP,ESI,EDI are    *
;            preserved.                                                        *
;                                                                              *
;            This subroutine makes the assumption that the string length will  *
;            not exceed 32 bytes, and will be terminated by a zero byte.       *
;            Furthermore, since it is a ASCII binary string, it is assumed     *
;            that the string will contain only ASCII zeros and ones.           *
;                                                                              *
; Coder:     Ratch                                                             *
;*******************************************************************************
ABS_ABDW:              ;it all begins here
PUSH ESI              ;guard register for Windows
XOR ECX,ECX           ;clear the collector
MOV ESI,[ESP+2*DWORD] ;ESI=address of string

ALIGN 4           ;line it up
@@:
MOV EDX,[ESI]     ;the whole word and nothing but the word
SUB EDX,'0000'    ;convert to binary bytes
TEST EDX,80808080H ;any zero bytes?
JNZ @F            ;yes, do one byte at a time
MOV EAX,08040201H ;compaction constant
MUL EDX           ;compaction operation
SHL ECX,4         ;make room for a nibble
SHR EAX,24        ;line up nibble
ADD ESI,DWORD     ;next DWORD
OR  ECX,EAX       ;tack on nibble
JMP @B            ;do next DWORD

@@:
MOVZX EDX,BYTE PTR [ESI]  ;one byte at a time
TEST EDX,EDX              ;zero byte?
JE @F                     ;yes, all done
INC ESI                   ;advance pointer one byte
LEA   ECX,[ECX*2+EDX-'0'] ;tack on nibble
JMP @B                    ;do next byte

@@:
MOV EAX,ECX ;hand off for Windows
POP ESI     ;restore for Windows

RET DWORD ;return to sender and balance the stack

MichaelW

Ratch,

You are correct about needing to test with 80808080h. I ran some tests before I decided to use 80000000h, but I didn't test all possible combinations. I was not aware that reversing the compaction constant would reverse the order of the bytes at their final destination. Doing so allowed me to remove the bswap from my procedure, which did speed it up somewhat, but even so yours is faster by  ~8%.

:U
eschew obfuscation

MichaelW

The attachment contains the two best implementations (IMO) along with a new test app that uses my clock_ctr macros. I took the liberty of reformatting the code posted by Petroizki and Ratch (I hope neither of you will be offended).


[attachment deleted by admin]
eschew obfuscation

Mark_Larson

#19
Quote from: MichaelW on January 06, 2005, 12:03:17 AM

Mark,

Could you post the P4 clock counts for reference?


I've been bad.  Played WOW ( World of Warcraft) all weekend ;).  So didn't run anyting.

For the new code on my P3 the small version runs in 59 cycles ,and the fast version runs in 33 cycles.


When I cut and paste btodw_fast into the framework I already have for timing it runs in 40 cycles.  My version sets the priority to REALTIME before doing any timings.  I am not sure what the difference are.  If I have time later I will post my code.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

MichaelW

If I eliminate the code that compensates for the loop overhead, the cycle count goes from 39 to 45.

In the process of checking the above I noticed that the cycle counts for btodw_small increase from 59 to 72. The loop in the procedure is already aligned, and aligning the procedure, or the invoke that calls it, does not correct the problem. This leads me to suspect that something in the macros needs to be aligned.
eschew obfuscation

Mark_Larson

#21
  Let me cut and paste the timing portion of my code.  To calculate the 6 cycles of overhead that I subtract out below, I simply commented out the call to the procedure.  You automated that part in your start macro. 

  I think I figured out your bug.  Some people who ran your code posted negative times.  You can't have a negative time unless the portion oof the start macro that calculates the loop overhead is calculating too large of a value.

When you get the macro's flushed out, we need to make that post a sticky.  We get asked all the time for how to do timing code.



LOOP_CNT    equ     10000000

    invoke GetCurrentProcess
    invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS

    cpuid
    rdtsc
    push edx
    push eax

    mov ecx,LOOP_CNT
align 4
outer_loop:
    push ecx

    invoke btodw_fast,ADDR buffer

    pop  ecx


    dec  ecx
    jnz  outer_loop


    cpuid
    rdtsc
    pop  esi
    sub  eax,esi
    pop  edi
    sbb  edx,edi

    emms

    mov  esi,LOOP_CNT
    div  esi
;accumulates 6 cycles of overhead AFTER the division
   sub eax,6
    invoke wsprintf,ADDR time_msg,ADDR time_fmt,eax
    invoke StdOut,ADDR time_msg
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

MichaelW

The only reason I can see for the loop overhead being wrong is alignment. And experimenting with adding variable numbers of NOPs at the start of the program, I determined that the result is definitely affected by the overall alignment. So I did essentially as Agner Fog did in his WTEST.ASM, and modified the clockctr_begin macro adding an ALIGN 16 in front of the test data, the reference loop, and the test loop. After doing so the cycle counts stabilized at 81 and 39, independent of the overall alignment. New code in the attachment.



[attachment deleted by admin]
eschew obfuscation

Mark_Larson

Quote from: MichaelW on January 10, 2005, 10:29:29 PM
The only reason I can see for the loop overhead being wrong is alignment. And experimenting with adding variable numbers of NOPs at the start of the program, I determined that the result is definitely affected by the overall alignment. So I did essentially as Agner Fog did in his WTEST.ASM, and modified the clockctr_begin macro adding an ALIGN 16 in front of the test data, the reference loop, and the test loop. After doing so the cycle counts stabilized at 81 and 39, independent of the overall alignment. New code in the attachment.



WOO WOO.   Good job. :)  The 39 is only one cycle off what I got ( which was 40).  So it sounds like you got it tweaked and working :) 
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Vortex

Hi MichaelW,

The results on my PIV 2.66 GHz:
btodw_small : 63 clocks
btodw_fast  : 61 clocks

MichaelW

Hi Vortex,

Interesting, a > 2:1 advantage on a P3 and a negligible advantage on a P4.

I had previously tried correcting what appeared to me to be dependency problems with the btodw_fast code (moving one instruction), and altering the alignment, but no improvement on a P3 so I discarded the changes. I have now added a btodw_fast_p4 procedure that includes these two changes. I also added a prompt at the start of the program, because a short delay after startup seems to improve the consistency of the results for the first test. Running on a P3 the clock cycle counts are 81, 39, 39.  New code in the attachment.



[attachment deleted by admin]
eschew obfuscation

Vortex

Hi MichaelW,

Thanks for the new release, here the latest results:

1st  run:61,62,70
2nd run:61,62,74
3rd  run:60,63,68

MichaelW

Quote
1st run:61,62,70
2nd run:61,62,74
3rd run:60,63,68
Obviously, my optimizations weren't :bg I suspect the call to call variation in cycle counts indicates some serious problems with the code. I think the (original) fast version should be faster than the small version on any processor. I would like to know what the problems are, but unfortunately I gave my only P4 system to a nephew early last year.
eschew obfuscation