News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Bin$

Started by jj2007, June 20, 2008, 01:39:06 AM

Previous topic - Next topic

jj2007

I hope I am not reinventing the wheel, but I didn't find a Bin$ macro in the library. Str$ and Hex$ are there but Bin$ is not, so I rolled my own. Here it is, for comments and fine-tuning.

include \masm32\include\masm32rt.inc

Bin$ MACRO dwArg:REQ
  cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
  if cc eq 0
mov eax, dwArg
  endif
  call Dword2Bin
  EXITM <edx>
ENDM

.code
MyApp1  db "Dword2Bin: press Control to see GetKeyState (VK_CONTROL)", 0
MyApp2  db "Dword2Bin: press CAPS to see GetKeyState (VK_CAPITAL)", 0
MyApp3  db "Dword2Bin: binary representation of GetKeyState (VK_CAPITAL)", 0

start: ; direct call: 0FC0FC0AAh =  11111100000011111100000010101010b = -66076502 decimal

invoke MessageBox, 0, Bin$(0FC0FC0AAh), addr MyApp1, MB_OK

; indirect call, using a result in eax:

invoke GetKeyState, VK_CONTROL
invoke MessageBox, 0, Bin$(eax), addr MyApp2, MB_OK

invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, Bin$(eax), addr MyApp3, MB_OK

exit

Dword2Bin proc uses ebx
  ifndef Dw2BinBuffer
.data?
Dw2BinBuffer db 36 dup (?) ; 32 are needed, we pad 4 for alignment
.code
  endif
mov edx, offset Dw2BinBuffer+32
xor ecx, ecx
mov [edx], cl ; null terminator
add ecx, 31

@@: xor ebx, ebx
dec edx ; on exit, edx will point to the string
sar eax, 1 ; we work from right to left
adc ebx, 48 ; +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx], bl
dec ecx ; decrement counter
jge @B ; dec sets only the sign flag

ret
Dword2Bin endp
end start

GregL

jj2007,

Your code works fine, it correctly handled everything I threw at it.

The procedure in the masm32 library is dw2bin_ex.

Here is a macro I wrote that uses it:

BinString MACRO n:REQ
    IFNDEF szBinStringBuffer
     .DATA
         szBinStringBuffer BYTE 40 DUP(0)
     .CODE   
    ENDIF
    INVOKE dw2bin_ex, n, ADDR szBinStringBuffer
    mov eax, OFFSET szBinStringBuffer
    EXITM <eax>
ENDM





jj2007

Thanks for pointing me to dw2bin_ex, Greg.

I had a look at dw2bin_ex now and must admit it's very fast - about 140 cycles as compared to 460 for my own. So I tried to optimise my code, and Dword2Bin2 does it in 335 cycles - see attachment, benchmarks welcome.

You may ask "why roll your own if the lib routine is three times as fast?". Well, the only justification is that dw2bin_ex performs badly on the LeanAndMeanPoints test, where LMP=cycles * sqrt(size):

139 cycles timing dw2bin_ex     size 2140 bytes   139*sqrt(2140)=6384
462 cycles timing Dword2Bin     size 31 bytes      462*sqrt(31)=2572
335 cycles timing Dword2Bin2   size 57 bytes      335*sqrt(57)=2529

Here is the winning code, for fine-tuning and suggestions. What struck me is that:
- apparently it is a little bit faster when I activate the SizeTest switch
- it is a lot faster with the extra NOP !

Any explanation?

Dword2Bin2 proc uses ebx

if SizeTest
mov ecx, offset gL2
sub ecx, offset pStart
add ecx, 4
mov cs2, ecx
endif

pStart:
mov edx, offset Dw2BinBuffer
xor ecx, ecx
mov [edx+32], cl ; null terminator
add ecx, 7 ; 31

; 4* setc bl plus add ebx, 30303030h is a lot slower

@@: sub ebx, ebx ; 2 cycle
; ## this nop looks superfluous but leads to a decrease of up to 78 cycles ...! #################
nop
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

mov [edx+4*ecx], ebx
dec ecx ; 1 decrement counter
; sub ecx, 1 ; 1 but in practice 9 cycles slower
jge @B ; dec sets only the sign flag
ret
Dword2Bin2 endp
gL2:  ; global label for getting code size

[attachment deleted by admin]

hutch--

jj,

Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and its about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intened task. The real limit on the speed of this algo is the number of memory accesses. It is probably possible to make a faster version but it would be an interesting table design and may be a lot larger again.

Timings

00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 original

672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
766 dw2bin_ex
671 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
734 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex

dw2bin_ex2 timing average = 671
dw2bin_ex  timing average = 750
Press any key to continue ...


Source

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    dw2bin_ex2 PROTO :DWORD,:DWORD

    EXTERNDEF bintable:DWORD

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    lcnt equ <50000000>

main proc

    LOCAL buf1[64]:BYTE
    LOCAL buf2[64]:BYTE
    LOCAL ptr1  :DWORD
    LOCAL ptr2  :DWORD
    LOCAL tot1  :DWORD
    LOCAL tot2  :DWORD

    mov ptr1, ptr$(buf1)
    mov ptr2, ptr$(buf2)

    mov tot1, 0
    mov tot2, 0

    invoke dw2bin_ex2,1234,ptr1
    print ptr1," dw2bin_ex2",13,10

    invoke dw2bin_ex,1234,ptr2
    print ptr2," original",13,10,13,10

    push esi

    REPEAT 8

  ; =====================================

    invoke GetTickCount
    push eax

    mov esi, lcnt
  @@:
    invoke dw2bin_ex2,1234,ptr1
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tot1, eax
    print str$(eax)," dw2bin_ex2",13,10

  ; =====================================

    invoke GetTickCount
    push eax

    mov esi, lcnt
  @@:
    invoke dw2bin_ex,1234,ptr1
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tot2, eax
    print str$(eax)," dw2bin_ex",13,10

  ; =====================================

    ENDM

    shr tot1, 3
    shr tot2, 3

    print chr$(13,10),"dw2bin_ex2 timing average = "
    print str$(tot1),13,10

    print "dw2bin_ex  timing average = "
    print str$(tot2),13,10

    pop esi

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

dw2bin_ex2 proc var:DWORD,buffer:DWORD

    push ebx

    mov ebx, OFFSET bintable
    mov edx, [esp+12]

    movzx eax, BYTE PTR [esp+8][3]
    mov ecx, [ebx+eax*8]
    mov [edx], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+4], ecx

    movzx eax, BYTE PTR [esp+8][2]
    mov ecx, [ebx+eax*8]
    mov [edx+8], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+12], ecx

    movzx eax, BYTE PTR [esp+8][1]
    mov ecx, [ebx+eax*8]
    mov [edx+16], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+20], ecx

    movzx eax, BYTE PTR [esp+8]
    mov ecx, [ebx+eax*8]
    mov [edx+24], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+28], ecx

    mov BYTE PTR [edx+32], 0

    pop ebx

    ret 8

dw2bin_ex2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 22, 2008, 12:58:54 AM
Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and its about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intended task.
That explains everything :bg
Was there ever a more compact dw2bin_no_ex version? As said above, dw2bin_ex performs poorly on the LeanAndMeanPoints test, where LAMPs = cycles * sqrt(size):

139 cycles timing dw2bin_ex     size 2140 bytes   139*sqrt(2140)=6384 LAMPs, poor
462 cycles timing Dword2Bin     size 31 bytes      462*sqrt(31)=2572 LAMPS, almost good
335 cycles timing Dword2Bin2   size 57 bytes      335*sqrt(57)=2529 LAMPS, good

Imho there should be a compact medium fast dw2bin_lean_and_mean version. The ex is great for streaming, yes, but I checked my 0.6 MB Dashboard source for BIN$ and found a mere 7 entries, all of them in non-critical loops. Many small rarely used functions should be optimised for size, while frequently used algos should focus on speed. An economist's 2 cents worth  :wink

I got no reply re the pretty significant speed oddities mentioned above - stalls??

hutch--

 :bg

JJ,

I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.

It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

JJ,

Just a quick look at your last posted algo and the comment about some stalls.

The SETx byte will be slow due to partial read of a 32 bit register.

Your other slow instructions are SAL SAR and ADC.

See if you can break it down into lower level primitives even if its a bit longer in byte count.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

A little bit faster -

00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 pbin

687 dw2bin_ex2
407 pbin
687 dw2bin_ex2
391 pbin
687 dw2bin_ex2
406 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin

dw2bin_ex2 timing average = 687
pbin       timing average = 400


align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
pbin PROC PUBLIC num:DWORD,buf:DWORD
    push ebx
    push esi
    push edi
   
    mov ecx,16[esp]
    mov edx,20[esp]
    mov edi,offset bintable
   
    movzx eax,cl
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    mov 24[edx],ebx
    mov 28[edx],esi

    movzx eax,ch
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    shr ecx,16
    mov 16[edx],ebx
    mov 20[edx],esi

    movzx eax,cl
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    mov 8[edx],ebx
    mov 12[edx],esi

    movzx eax,ch
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    sub eax,eax
    mov [edx],ebx
    mov 4[edx],esi

    mov 32[edx],al

    pop edi
    pop esi
    pop ebx
    ret 8
pbin ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Light travels faster than sound, that's why some people seem bright until you hear them.

hutch--

Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 23, 2008, 02:59:38 AM
I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.

It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.

So you put more emphasis on the mean but we agree basically that lean and mean is good :bg

Quote from: hutch-- on June 23, 2008, 06:30:08 AM
sinsi,

:U

60 cycles timing pbin            2140 bytes     2776 LAMPs
82 cycles timing dw2bin_ex       2140 bytes     3793 LAMPs
299 cycles timing Dword2Bin2     57 bytes       2257 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The choice gets harder with Sinsi's LAMPs approaching mine...
I wrote a little macro for measuring LAMPs:


MinLampSize = 200 ; "default size" in case of external APIs, e.g. crt_strstr

LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
ffree ST(7) ;; free the stack for pushing
push csize
fild dword ptr [esp] ;; move into ST (0)
pop eax
fsqrt
mov eax, cycles
.if eax==0
add eax, MinLampSize
.endif
push eax
fild dword ptr [esp] ;; push cycles on FPU
fmul ST, ST(1) ;; multiply with sqrt(size)
fistp dword ptr [esp]
pop eax
invoke dwtoa, eax, addr LampsBuffer
EXITM <offset LampsBuffer>
ENDM

.data?
LampsBuffer dd 4 dup (?)


Sinsi, the result for your code is faked because I couldn't find this bit:
    mov edi, offset bintable ; where defined??
Attached code is complete except for this line. There is a UsePbin = 0 switch on top, if you manage to activate it, that would be great.


[attachment deleted by admin]

hutch--

JJ,

Its a seperate module in the masm32 library. One VERY LARGE table.  :bg
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

Quote from: jj2007 on June 23, 2008, 09:06:16 AM
Sinsi, the result for your code is faked because I couldn't find this bit:
    mov edi, offset bintable ; where defined??
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm

Is there any way of cutting the size of the table down? Maybe somehow using the high bit to cut it down to 128? I can't think now, too pissed... :bg

[edit]
Tried your code with UsePbin=1 and got

33 cycles timing pbin            0 bytes        0 LAMPs
53 cycles timing dw2bin_ex       2140 bytes     2452 LAMPs
164 cycles timing Dword2Bin      31 bytes       913 LAMPs
130 cycles timing Dword2Bin2     57 bytes       981 LAMPs

[/edit]
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on June 23, 2008, 09:33:34 AM
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm
include \masm32\include\masm32rt.inc

Line 1 of my code ;-)

My version of masm32rt.inc has 3217 bytes, dated 25.05.2005. No bintbl in there, and ml chokes when I tried to add the include manually. So I copied the whole table into my source and ran it again:

59 cycles timing pbin            2145 bytes     2733 LAMPs
79 cycles timing dw2bin_ex       2140 bytes     3655 LAMPs
648 cycles timing Dword2Bin      31 bytes       3608 LAMPs
304 cycles timing Dword2Bin2     57 bytes       2295 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is on (and costs some cycles)

55 cycles timing pbin            2145 bytes     2547 LAMPs    :clap:
79 cycles timing dw2bin_ex       2140 bytes     3655 LAMPs
644 cycles timing Dword2Bin      31 bytes       3586 LAMPs
299 cycles timing Dword2Bin2     57 bytes       2257 LAMPs  :cheekygreen:

LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is off (means I added the previously calculated sizes by hand)

Seriously: Masm is attractive because it produces fast and compact code. Especially newbies are impressed when they have to start counting in bytes rather than kBytes or MBytes again. So a "LAMP" style optimisation rule makes sense when designing a big library. Those who need a fast routine for streaming should find it, those who want to boast with compact and still fast code should be served, too.

[attachment deleted by admin]

sinsi

Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on June 23, 2008, 10:14:39 AM
Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?

Northern Italy, on the shores of Lago Maggiore.

After some reflection, we have a new winner now:

56 cycles timing pbin            149 bytes      684 LAMPs :cheekygreen:
80 cycles timing dw2bin_ex       2140 bytes     3701 LAMPs :naughty:
303 cycles timing Dword2Bin2     57 bytes       2288 LAMPs :clap:


[attachment deleted by admin]