The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on June 20, 2008, 01:39:06 AM

Title: Bin$
Post by: jj2007 on June 20, 2008, 01:39:06 AM
I hope I am not reinventing the wheel, but I didn't find a Bin$ macro in the library. Str$ and Hex$ are there but Bin$ is not, so I rolled my own. Here it is, for comments and fine-tuning.

include \masm32\include\masm32rt.inc

Bin$ MACRO dwArg:REQ
  cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
  if cc eq 0
mov eax, dwArg
  endif
  call Dword2Bin
  EXITM <edx>
ENDM

.code
MyApp1  db "Dword2Bin: press Control to see GetKeyState (VK_CONTROL)", 0
MyApp2  db "Dword2Bin: press CAPS to see GetKeyState (VK_CAPITAL)", 0
MyApp3  db "Dword2Bin: binary representation of GetKeyState (VK_CAPITAL)", 0

start: ; direct call: 0FC0FC0AAh =  11111100000011111100000010101010b = -66076502 decimal

invoke MessageBox, 0, Bin$(0FC0FC0AAh), addr MyApp1, MB_OK

; indirect call, using a result in eax:

invoke GetKeyState, VK_CONTROL
invoke MessageBox, 0, Bin$(eax), addr MyApp2, MB_OK

invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, Bin$(eax), addr MyApp3, MB_OK

exit

Dword2Bin proc uses ebx
  ifndef Dw2BinBuffer
.data?
Dw2BinBuffer db 36 dup (?) ; 33 are needed (32 digits + null terminator), padded to 36 for alignment
.code
  endif
mov edx, offset Dw2BinBuffer+32
xor ecx, ecx
mov [edx], cl ; null terminator
add ecx, 31

@@: xor ebx, ebx
dec edx ; on exit, edx will point to the string
sar eax, 1 ; we work from right to left
adc ebx, 48 ; +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx], bl
dec ecx ; decrement counter
jge @B ; loop until ecx goes negative (dec updates the sign flag but leaves carry alone)

ret
Dword2Bin endp
end start
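
For readers who want to check outputs without assembling anything, the loop in Dword2Bin can be modelled in a few lines of Python. This is only a sketch of the shift-and-carry idea, not the posted code itself; the function name dword2bin and the variable names are chosen here to mirror the registers.

```python
def dword2bin(value):
    """Model of the Dword2Bin loop: 32 shifts, least significant bit first."""
    eax = value & 0xFFFFFFFF          # treat the argument as a 32-bit DWORD
    buf = bytearray(33)               # 32 digits plus the null terminator
    edx = 32                          # index of the terminator, as in the proc
    buf[edx] = 0
    for _ in range(32):               # ecx counts 31 down to 0
        edx -= 1                      # dec edx: move one character to the left
        carry = eax & 1               # sar eax, 1 shifts the low bit into carry
        eax >>= 1
        buf[edx] = 48 + carry         # adc ebx, 48 gives "0" or "1"
    return buf[:32].decode("ascii")


print(dword2bin(0xFC0FC0AA))
```

The string is filled from its last character towards its first, which is why edx points at the start of the string when the loop ends.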
Title: Re: Bin$
Post by: GregL on June 21, 2008, 03:24:19 AM
jj2007,

Your code works fine, it correctly handled everything I threw at it.

The procedure in the masm32 library is dw2bin_ex.

Here is a macro I wrote that uses it:

BinString MACRO n:REQ
    IFNDEF szBinStringBuffer
     .DATA
         szBinStringBuffer BYTE 40 DUP(0)
     .CODE   
    ENDIF
    INVOKE dw2bin_ex, n, ADDR szBinStringBuffer
    mov eax, OFFSET szBinStringBuffer
    EXITM <eax>
ENDM




Title: Re: Bin$
Post by: jj2007 on June 21, 2008, 10:28:30 PM
Thanks for pointing me to dw2bin_ex, Greg.

I had a look at dw2bin_ex now and must admit it's very fast - about 140 cycles as compared to 460 for my own. So I tried to optimise my code, and Dword2Bin2 does it in 335 cycles - see attachment, benchmarks welcome.

You may ask "why roll your own if the lib routine is three times as fast?". Well, the only justification is that dw2bin_ex performs badly on the LeanAndMeanPoints test, where LMP=cycles * sqrt(size):

139 cycles timing dw2bin_ex     size 2140 bytes   139*sqrt(2140)=6384
462 cycles timing Dword2Bin     size 31 bytes      462*sqrt(31)=2572
335 cycles timing Dword2Bin2   size 57 bytes      335*sqrt(57)=2529
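
The LAMPs arithmetic is easy to reproduce. The sketch below assumes the score is simply rounded to the nearest integer; the helper name lamps is made up for illustration.

```python
import math


def lamps(cycles, size):
    """Lean And Mean Points: cycles * sqrt(size), rounded to an integer."""
    return round(cycles * math.sqrt(size))


for name, cycles, size in [("dw2bin_ex", 139, 2140),
                           ("Dword2Bin", 462, 31),
                           ("Dword2Bin2", 335, 57)]:
    print(f"{name:12} {cycles:4} cycles {size:5} bytes {lamps(cycles, size):6} LAMPs")
```

With these inputs the second and third rows match the figures above. The first row comes out nearer 6430 here; 6384 is what 138 cycles would give, so that figure may stem from a slightly different timing run.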

Here is the winning code, for fine-tuning and suggestions. What struck me is that:
- apparently it is a little bit faster when I activate the SizeTest switch
- it is a lot faster with the extra NOP !

Any explanation?

Dword2Bin2 proc uses ebx

if SizeTest
mov ecx, offset gL2
sub ecx, offset pStart
add ecx, 4
mov cs2, ecx
endif

pStart:
mov edx, offset Dw2BinBuffer
xor ecx, ecx
mov [edx+32], cl ; null terminator
add ecx, 7 ; 31

; 4* setc bl plus add ebx, 30303030h is a lot slower

@@: sub ebx, ebx ; 2 cycle
; ## this nop looks superfluous but leads to a decrease of up to 78 cycles ...! #################
nop
sar eax, 1 ; 2 cycles - we read from right to left (lowest bit first)
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from right to left (lowest bit first)
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from right to left (lowest bit first)
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from right to left (lowest bit first)
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1

mov [edx+4*ecx], ebx
dec ecx ; 1 decrement counter
; sub ecx, 1 ; 1 but in practice 9 cycles slower
jge @B ; loop until ecx goes negative (dec updates the sign flag but leaves carry alone)
ret
Dword2Bin2 endp
gL2:  ; global label for getting code size
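
The packing trick in Dword2Bin2 (four characters built in ebx, then one DWORD store) can be modelled as follows. This is a sketch under the assumption of a little-endian store, which is what x86 does; the name dword2bin2 is used here only for the model.

```python
import struct


def dword2bin2(value):
    """Model of Dword2Bin2: build four ASCII digits per DWORD, store right to left."""
    eax = value & 0xFFFFFFFF
    buf = bytearray(33)               # buf[32] stays 0 as the terminator
    for ecx in range(7, -1, -1):      # eight groups, rightmost group first
        ebx = 0
        for i in range(4):
            if i:
                ebx = (ebx << 8) & 0xFFFFFFFF   # sal ebx, 8
            ebx += 48 + (eax & 1)     # adc ebx, 48 after sar eax, 1
            eax >>= 1
        struct.pack_into("<I", buf, 4 * ecx, ebx)   # mov [edx+4*ecx], ebx
    return buf[:32].decode("ascii")


print(dword2bin2(0xFC0FC0AA))
```

The first bit read ends up in the highest byte of ebx, and a little-endian store puts that byte last in memory, so the lowest bit still lands at the right-hand end of its group.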

[attachment deleted by admin]
Title: Re: Bin$
Post by: hutch-- on June 22, 2008, 12:58:54 AM
jj,

Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and it's about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intended task. The real limit on the speed of this algo is the number of memory accesses. It is probably possible to make a faster version but it would be an interesting table design and may be a lot larger again.

Timings

00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 original

672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
766 dw2bin_ex
671 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
734 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex

dw2bin_ex2 timing average = 671
dw2bin_ex  timing average = 750
Press any key to continue ...


Source

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    dw2bin_ex2 PROTO :DWORD,:DWORD

    EXTERNDEF bintable:DWORD

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    lcnt equ <50000000>

main proc

    LOCAL buf1[64]:BYTE
    LOCAL buf2[64]:BYTE
    LOCAL ptr1  :DWORD
    LOCAL ptr2  :DWORD
    LOCAL tot1  :DWORD
    LOCAL tot2  :DWORD

    mov ptr1, ptr$(buf1)
    mov ptr2, ptr$(buf2)

    mov tot1, 0
    mov tot2, 0

    invoke dw2bin_ex2,1234,ptr1
    print ptr1," dw2bin_ex2",13,10

    invoke dw2bin_ex,1234,ptr2
    print ptr2," original",13,10,13,10

    push esi

    REPEAT 8

  ; =====================================

    invoke GetTickCount
    push eax

    mov esi, lcnt
  @@:
    invoke dw2bin_ex2,1234,ptr1
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tot1, eax
    print str$(eax)," dw2bin_ex2",13,10

  ; =====================================

    invoke GetTickCount
    push eax

    mov esi, lcnt
  @@:
    invoke dw2bin_ex,1234,ptr1
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tot2, eax
    print str$(eax)," dw2bin_ex",13,10

  ; =====================================

    ENDM

    shr tot1, 3
    shr tot2, 3

    print chr$(13,10),"dw2bin_ex2 timing average = "
    print str$(tot1),13,10

    print "dw2bin_ex  timing average = "
    print str$(tot2),13,10

    pop esi

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

dw2bin_ex2 proc var:DWORD,buffer:DWORD

    push ebx

    mov ebx, OFFSET bintable
    mov edx, [esp+12]

    movzx eax, BYTE PTR [esp+8][3]
    mov ecx, [ebx+eax*8]
    mov [edx], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+4], ecx

    movzx eax, BYTE PTR [esp+8][2]
    mov ecx, [ebx+eax*8]
    mov [edx+8], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+12], ecx

    movzx eax, BYTE PTR [esp+8][1]
    mov ecx, [ebx+eax*8]
    mov [edx+16], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+20], ecx

    movzx eax, BYTE PTR [esp+8]
    mov ecx, [ebx+eax*8]
    mov [edx+24], ecx
    mov ecx, [ebx+eax*8+4]
    mov [edx+28], ecx

    mov BYTE PTR [edx+32], 0

    pop ebx

    ret 8

dw2bin_ex2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
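
The table-driven approach reads more easily in a high-level model. The sketch below assumes bintable holds 256 entries of eight ASCII digits each, which is what the eax*8 indexing suggests; BINTABLE and dw2bin_table are stand-in names, not the library's own.

```python
# Hypothetical stand-in for the masm32 bintable: 256 entries of 8 ASCII digits.
BINTABLE = [format(i, "08b").encode("ascii") for i in range(256)]


def dw2bin_table(value):
    """Model of dw2bin_ex2: one table lookup per byte, high byte first."""
    value &= 0xFFFFFFFF
    out = bytearray(33)               # out[32] stays 0 as the terminator
    for pos, shift in enumerate((24, 16, 8, 0)):
        byte = (value >> shift) & 0xFF
        out[pos * 8:pos * 8 + 8] = BINTABLE[byte]   # two DWORD copies in the asm
    return out[:32].decode("ascii")


print(dw2bin_table(1234))
print(sum(len(entry) for entry in BINTABLE), "bytes of table data")
```

Each input byte costs one index calculation and two DWORD copies, so the work is dominated by memory accesses, at the price of 2048 bytes of table.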
Title: Re: Bin$
Post by: jj2007 on June 22, 2008, 09:46:07 PM
Quote from: hutch-- on June 22, 2008, 12:58:54 AM
Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and it's about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intended task.
That explains everything :bg
Was there ever a more compact dw2bin_no_ex version? As said above, dw2bin_ex performs poorly on the LeanAndMeanPoints test, where LAMPs = cycles * sqrt(size):

139 cycles timing dw2bin_ex     size 2140 bytes   139*sqrt(2140)=6384 LAMPs, poor
462 cycles timing Dword2Bin     size 31 bytes      462*sqrt(31)=2572 LAMPs, almost good
335 cycles timing Dword2Bin2   size 57 bytes      335*sqrt(57)=2529 LAMPs, good

Imho there should be a compact medium fast dw2bin_lean_and_mean version. The ex is great for streaming, yes, but I checked my 0.6 MB Dashboard source for BIN$ and found a mere 7 entries, all of them in non-critical loops. Many small rarely used functions should be optimised for size, while frequently used algos should focus on speed. An economist's 2 cents worth  :wink

I got no reply re the pretty significant speed oddities mentioned above - stalls??
Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 02:59:38 AM
 :bg

JJ,

I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.

It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.
Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 04:44:15 AM
JJ,

Just a quick look at your last posted algo and the comment about some stalls.

The SETx byte will be slow due to partial read of a 32 bit register.

Your other slow instructions are SAL SAR and ADC.

See if you can break it down into lower level primitives even if it's a bit longer in byte count.
Title: Re: Bin$
Post by: sinsi on June 23, 2008, 05:48:21 AM
A little bit faster -

00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 pbin

687 dw2bin_ex2
407 pbin
687 dw2bin_ex2
391 pbin
687 dw2bin_ex2
406 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin

dw2bin_ex2 timing average = 687
pbin       timing average = 400


align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
pbin PROC PUBLIC num:DWORD,buf:DWORD
    push ebx
    push esi
    push edi
   
    mov ecx,16[esp]
    mov edx,20[esp]
    mov edi,offset bintable
   
    movzx eax,cl
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    mov 24[edx],ebx
    mov 28[edx],esi

    movzx eax,ch
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    shr ecx,16
    mov 16[edx],ebx
    mov 20[edx],esi

    movzx eax,cl
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    mov 8[edx],ebx
    mov 12[edx],esi

    movzx eax,ch
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    sub eax,eax
    mov [edx],ebx
    mov 4[edx],esi

    mov 32[edx],al

    pop edi
    pop esi
    pop ebx
    ret 8
pbin ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 06:30:08 AM
sinsi,

:U
Title: Re: Bin$
Post by: jj2007 on June 23, 2008, 09:06:16 AM
Quote from: hutch-- on June 23, 2008, 02:59:38 AM
I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.

It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.

So you put more emphasis on the mean but we agree basically that lean and mean is good :bg

Quote from: hutch-- on June 23, 2008, 06:30:08 AM
sinsi,

:U

60 cycles timing pbin            2140 bytes     2776 LAMPs
82 cycles timing dw2bin_ex       2140 bytes     3793 LAMPs
299 cycles timing Dword2Bin2     57 bytes       2257 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The choice gets harder with Sinsi's LAMPs approaching mine...
I wrote a little macro for measuring LAMPs:


MinLampSize = 200 ; "default size" in case of external APIs, e.g. crt_strstr

LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
ffree ST(7) ;; free the stack for pushing
push csize
fild dword ptr [esp] ;; move into ST (0)
pop eax
fsqrt
mov eax, cycles
.if eax==0
add eax, MinLampSize
.endif
push eax
fild dword ptr [esp] ;; push cycles on FPU
fmul ST, ST(1) ;; multiply with sqrt(size)
fistp dword ptr [esp]
pop eax
invoke dwtoa, eax, addr LampsBuffer
EXITM <offset LampsBuffer>
ENDM

.data?
LampsBuffer dd 4 dup (?)


Sinsi, the result for your code is faked because I couldn't find this bit:
    mov edi, offset bintable ; where defined??
Attached code is complete except for this line. There is a UsePbin = 0 switch on top, if you manage to activate it, that would be great.


[attachment deleted by admin]
Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 09:32:30 AM
JJ,

It's a separate module in the masm32 library. One VERY LARGE table.  :bg
Title: Re: Bin$
Post by: sinsi on June 23, 2008, 09:33:34 AM
Quote from: jj2007 on June 23, 2008, 09:06:16 AM
Sinsi, the result for your code is faked because I couldn't find this bit:
    mov edi, offset bintable ; where defined??
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm

Is there any way of cutting the size of the table down? Maybe somehow using the high bit to cut it down to 128? I can't think now, too pissed... :bg

[edit]
Tried your code with UsePbin=1 and got

33 cycles timing pbin            0 bytes        0 LAMPs
53 cycles timing dw2bin_ex       2140 bytes     2452 LAMPs
164 cycles timing Dword2Bin      31 bytes       913 LAMPs
130 cycles timing Dword2Bin2     57 bytes       981 LAMPs

[/edit]
Title: Re: Bin$
Post by: jj2007 on June 23, 2008, 10:01:56 AM
Quote from: sinsi on June 23, 2008, 09:33:34 AM
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm
include \masm32\include\masm32rt.inc

Line 1 of my code ;-)

My version of masm32rt.inc is 3217 bytes, dated 25.05.2005. No bintbl in there, and ml choked when I tried to add the include manually. So I copied the whole table into my source and ran it again:

59 cycles timing pbin            2145 bytes     2733 LAMPs
79 cycles timing dw2bin_ex       2140 bytes     3655 LAMPs
648 cycles timing Dword2Bin      31 bytes       3608 LAMPs
304 cycles timing Dword2Bin2     57 bytes       2295 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is on (and costs some cycles)

55 cycles timing pbin            2145 bytes     2547 LAMPs    :clap:
79 cycles timing dw2bin_ex       2140 bytes     3655 LAMPs
644 cycles timing Dword2Bin      31 bytes       3586 LAMPs
299 cycles timing Dword2Bin2     57 bytes       2257 LAMPs  :cheekygreen:

LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is off (means I added the previously calculated sizes by hand)

Seriously: Masm is attractive because it produces fast and compact code. Especially newbies are impressed when they have to start counting in bytes rather than kBytes or MBytes again. So a "LAMP" style optimisation rule makes sense when designing a big library. Those who need a fast routine for streaming should find it, those who want to boast with compact and still fast code should be served, too.

[attachment deleted by admin]
Title: Re: Bin$
Post by: sinsi on June 23, 2008, 10:14:39 AM
Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?
Title: Re: Bin$
Post by: jj2007 on June 23, 2008, 01:10:20 PM
Quote from: sinsi on June 23, 2008, 10:14:39 AM
Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?

Northern Italy, on the shores of Lago Maggiore.

After some reflection, we have a new winner now:

56 cycles timing pbin            149 bytes      684 LAMPs :cheekygreen:
80 cycles timing dw2bin_ex       2140 bytes     3701 LAMPs :naughty:
303 cycles timing Dword2Bin2     57 bytes       2288 LAMPs :clap:


[attachment deleted by admin]
Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 02:16:50 PM
JJ,

sinsi's code still uses the bintable so it is larger than you have listed. The table adds 2k of data.
Title: Re: Bin$
Post by: jj2007 on June 23, 2008, 03:09:17 PM
Quote from: hutch-- on June 23, 2008, 02:16:50 PM
JJ,

sinsi's code still uses the bintable so it is larger than you have listed. The table adds 2k of data.

No, it doesn't.

Sinsi, does the sub eax, eax have a function? I timed it repeatedly with and without but could not find a real difference.

    mov esi,[edi+eax*8+4]
    sub eax, eax  <--
    mov [edx],ebx
Title: Re: Bin$
Post by: NightWare on June 23, 2008, 08:55:42 PM
Quote from: sinsi on June 23, 2008, 09:33:34 AM
Is there any way of cutting the size of the table down? Maybe somehow using the high bit to cut it down to 128? I can't think now, too pissed... :bg
hmm, my attempt :
.DATA
ALIGN 4
BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
DWORD "0001","1001","0101","1101","0011","1011","0111","1111"

.CODE
ALIGN 16
;
; Syntax :
; mov eax,value
; mov esi,String address
; call nwDw2Bin
;
nwDw2Bin PROC
mov BYTE PTR [esi+32],0
mov ecx,28
Label1: mov edx,eax
and edx,00000000Fh
mov edx,DWORD PTR [BinaryTable+edx*4]
shr eax,4
mov DWORD PTR [esi+ecx],edx
sub ecx,DWORD
jns Label1
ret
nwDw2Bin ENDP
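
The reversed-looking literals in BinaryTable make sense once byte order is taken into account. The model below assumes MASM places the first character of a quoted DWORD literal in the most significant byte, so that a little-endian store writes the characters back to front; that assumption fits the table as posted but is not spelled out in the post. The names LITERALS, TABLE and nw_dw2bin are illustrative.

```python
import struct

# The sixteen literals exactly as written in the BinaryTable definition.
LITERALS = ["0000", "1000", "0100", "1100", "0010", "1010", "0110", "1110",
            "0001", "1001", "0101", "1101", "0011", "1011", "0111", "1111"]

# Assumption: the first character of each literal is the most significant byte.
TABLE = [int.from_bytes(s.encode("ascii"), "big") for s in LITERALS]


def nw_dw2bin(value):
    """Model of nwDw2Bin: one 16-entry lookup per nibble, lowest nibble first."""
    eax = value & 0xFFFFFFFF
    buf = bytearray(33)               # buf[32] stays 0 as the terminator
    ecx = 28
    while ecx >= 0:                   # jns: keep looping while ecx is not negative
        edx = TABLE[eax & 0xF]        # and edx, 0Fh plus the table read
        eax >>= 4                     # shr eax, 4
        struct.pack_into("<I", buf, ecx, edx)   # little-endian DWORD store
        ecx -= 4                      # sub ecx, DWORD
    return buf[:32].decode("ascii")


print(nw_dw2bin(0xF700FF00))
```

Under that assumption the table costs 64 bytes instead of 2048, in exchange for eight lookups per DWORD rather than four.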
Title: Re: Bin$
Post by: jj2007 on June 23, 2008, 09:44:05 PM
Welcome on board!

Testing correctness of results for BIN$, pbin2 and nwDw2Bin
11110111000000001111111100000000        BIN$
11110111000000001111111100000000        pbin2
11110111000000001111111100000000        nwDw2Bin
11110111000000001111111100000000        original value

00001000111111110000000011111111
00001000111111110000000011111111
00001000111111110000000011111111        nwDw2Bin
00001000111111110000000011111111

00100011111111000000001111111100
00100011111111000000001111111100
00100011111111000000001111111100        nwDw2Bin
00100011111111000000001111111100

10101010101010101010101010101010
10101010101010101010101010101010
10101010101010101010101010101010        nwDw2Bin
10101010101010101010101010101010

00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000000000000000000        nwDw2Bin
GetKeyState (VK_CAPITAL), i.e. Caps Lock

40 cycles timing BIN$            139 bytes      472 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
67 cycles timing nwDw2Bin        101 bytes      673 LAMPs
128 cycles timing dw2bin_ex      2140 bytes     5921 LAMPs
351 cycles timing Dword2Bin2     57 bytes       2650 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

[attachment deleted by admin]
Title: Re: Bin$
Post by: hutch-- on June 23, 2008, 11:25:31 PM
Just as a note, I am buried in the middle of a mountain of work at the moment so I don't have enough time to track this topic in real detail but it would be very useful to get a conclusion that produced two separate algos, one for streaming like the table based masm32 lib version and a much smaller one for a non "_ex" procedure.

Something I remember is an algorithm of this type that Michael Webster wrote a couple of years ago that would handle byte, word and dword sized binary conversions but I don't know where it is any longer.
Title: Re: Bin$
Post by: sinsi on June 23, 2008, 11:59:48 PM
Quote from: jj2007 on June 23, 2008, 03:09:17 PM
Quote from: hutch-- on June 23, 2008, 02:16:50 PM
JJ,

sinsi's code still uses the bintable so it is larger than you have listed. The table adds 2k of data.

No, it doesn't.

    mov edi,offset bintable

yes it does.

Quote
Sinsi, does the sub eax, eax have a function? I timed it repeatedly with and without but could not find a real difference.

    mov esi,[edi+eax*8+4]
    sub eax, eax  <--
    mov [edx],ebx


    sub eax,eax
    mov [edx],ebx
    mov 4[edx],esi

    mov 32[edx],al   ;<---

I was always told to use a register...old habits die hard.
Title: Re: Bin$
Post by: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced  :red :
mov BYTE PTR [esi+32],0
mov ecx,32
Label1: mov edx,eax
and edx,00000000Fh
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx*4]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
jnz Label1
Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 04:24:09 AM
NightWare,

I have just had a quick play and formatted your algo into a library module. I wonder how much slower it would be with arguments passed on the stack rather than in registers ?


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .486                      ; maximum processor model
    .model flat, stdcall      ; memory model & calling convention
    option casemap :none      ; case sensitive

    ; Syntax :
    ; mov eax, DWORD value
    ; mov esi, String address
    ; call nwDw2Bin

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    align 16

  ; ------------------------------------
  ; NightWare's DWORD to bin string algo
  ; ------------------------------------

nwDw2Bin PROC

    push esi

    .data
      BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
                  DWORD "0001","1001","0101","1101","0011","1011","0111","1111"
    .code

    mov BYTE PTR [esi+32], 0
    mov ecx, 32

  lbl0:
    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    sub ecx, 4
    mov DWORD PTR [esi+ecx], edx
    jnz lbl0

    pop esi

    ret

nwDw2Bin ENDP

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end
Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 06:20:26 AM
It got bigger but it also got faster.

Results

00000000000000000000010011010010 sinsi_ex
00000000000000000000010011010010 NightWare
00000000000000000000010011010010 hutch_ex2

656 NightWare
547 sinsi_ex
657 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
657 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
657 hutch_ex2
672 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2

NightWare timing average = 658
sinsi_ex  timing average = 547
hutch_ex2 timing average = 656
Press any key to continue ...


Algo

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

    align 16

  ; ---------------------------------------------
  ; Modified NightWare's DWORD to bin string algo
  ; ---------------------------------------------

NightWare PROC value:DWORD,buffer:DWORD

    .data
      align 16
      BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
                  DWORD "0001","1001","0101","1101","0011","1011","0111","1111"
    .code

    push esi
    push edi

    mov eax, [esp+4][8]     ;; value
    mov esi, [esp+8][8]     ;; buffer

    mov BYTE PTR [esi+32], 0
    mov ecx, 32

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-4], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-8], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-12], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-16], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-20], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-24], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-28], edx

    mov edx, eax
    and edx, 00000000Fh
    shr eax, 4
    mov edx, DWORD PTR [BinaryTable+edx*4]
    mov DWORD PTR [esi+ecx-32], edx

    pop edi
    pop esi

    ret 8

NightWare ENDP

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 06:21:23 AM
I apologise for the fact that the attachments I supplied have apparently become so confusing that you no longer dare read them. As said above, Sinsi's modified code no longer needs a table, and in the final "BIN$" version it is also by far the fastest of all previous versions (including Nightware's code). So there is no longer a need for separate fast vs compact codes.

Cheers, JJ
Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 07:32:54 AM
Here is a tweak of sinsi's version, basically some instruction re-ordering and swapping the names of some registers.


00000000000000000000010011010010 sinsi_ex
00000000000000000000010011010010 NightWare
00000000000000000000010011010010 hutch_ex2

657 NightWare
516 sinsi_ex
656 hutch_ex2
656 NightWare
516 sinsi_ex
657 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
657 NightWare
516 sinsi_ex
656 hutch_ex2
656 NightWare
516 sinsi_ex
657 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2

NightWare timing average = 656
sinsi_ex  timing average = 515
hutch_ex2 timing average = 656
Press any key to continue ...


The algo

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

sinsi_ex proc var:DWORD,buffer:DWORD

    mov ecx,4[esp]
      push ebx
    mov edx,12[esp]
      push esi
    movzx esi,cl
      push edi

    mov edi,offset bintable

    mov ebx,[edi+esi*8]
    mov eax,[edi+esi*8+4]
      movzx esi,ch
    mov 24[edx],ebx
    mov 28[edx],eax

    mov ebx,[edi+esi*8]
    mov eax,[edi+esi*8+4]
      shr ecx,16
      movzx esi,cl
    mov 16[edx],ebx
    mov 20[edx],eax

    mov ebx,[edi+esi*8]
    mov eax,[edi+esi*8+4]
      movzx esi,ch
      mov BYTE PTR 32[edx], 0
    mov 8[edx],ebx
    mov 12[edx],eax

    mov ecx,[edi+esi*8]
    mov eax,[edi+esi*8+4]
      pop edi
      pop esi
      pop ebx
    mov [edx],ecx
    mov 4[edx],eax

    ret 8

sinsi_ex endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


JJ, would you just post your fast algo so I can put it into a test piece ?
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 07:51:12 AM
Cycles            Code             size including tables etc.
47 cycles timing BIN$            139 bytes
56 cycles timing pbin2           147 bytes
107 cycles timing nwDw2Bin       101 bytes
107 cycles timing dw2bin_ex      2140 bytes
402 cycles timing Dword2Bin2     57 bytes

Usage:
invoke MessageBox, 0, BIN$(11111000001111100000111111110000b), chr$("A binary string:"), MB_OK
or
mov xxx, BIN$(11111000001111100000111111110000b)
or
mov edx, offset Dw2BinBuffer
mov eax, 11111000001111100000111111110000b
call Dword2Bin
invoke MessageBox, 0, addr Dw2BinBuffer, chr$("A binary string:"), MB_OK

The macro:
BIN$ MACRO dwArg:REQ, tgt:=<0>
  cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
  if cc eq 0
mov eax, dwArg ;; eax passes value to translate
  endif
  if @SizeStr(tgt) gt 1
mov edx, tgt ;; dest buffer - if 0, use Dw2BinBuffer
  else
mov edx, offset Dw2BinBuffer
  endif
  call Dword2Bin
  EXITM <edx>
ENDM

.data?
Dw2BinBuffer dd 9 dup (?) ; 33 bytes are needed (32 digits + null terminator), padded to 36 for alignment
Dw2BinTable dd 2048/4 dup (?) ; former bintable now in .data? section


The proc
align 4
Dword2Bin proc
  ifndef Dw2BinBuffer ; credits to Sinsi of the Masm32 Forum
.data?
Dw2BinBuffer dd 36/4 dup (?) ; 33 bytes are needed (32 digits + null terminator), padded to 36 for alignment
Dw2BinTable dd 2048/4 dup (?) ; old bintable
.code
  endif
  push ebx
  push esi
  push edi

  mov edi, offset Dw2BinTable ; old \masm32\m32lib\bintbl.asm
  mov ecx, [edi]
  .if ecx==0 ; seems you have to initialise your bintable...
push eax ; save value
push edx ; save destination
mov edx, offset Dw2BinTable+2048 ; destination bintable
mov esi, 63 ; outer loop counter
mov edi, 0FCFDFEFFh ; seed for creating table
btInit:
mov eax, edi
sub edi, 04040404h
mov ecx, 31 ; inner loop counter

@@: xor ebx, ebx
dec edx ; on exit, edx will point to the string
sar eax, 1 ; we work from right to left
adc ebx, 48 ; +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx], bl
dec ecx ; decrement inner counter
jge @B ; dec sets only the sign flag

dec esi ; decrement outer counter
jge btInit ; dec sets only the sign flag
mov edi, edx
pop edx ; get destination
pop eax ; get value
  .endif

  movzx ecx,  al ; the value to translate
  mov ebx, [edi+ecx*8]
  mov esi, [edi+ecx*8+4]
  mov 24[edx], ebx
  mov 28[edx], esi

  movzx ecx,  ah
  mov ebx, [edi+ecx*8]
  mov esi, [edi+ecx*8+4]
  shr eax, 16
  mov 16[edx], ebx
  mov 20[edx], esi

  movzx ecx,  al
  mov ebx, [edi+ecx*8]
  mov esi, [edi+ecx*8+4]
  mov 8[edx], ebx
  mov 12[edx], esi

  movzx ecx,  ah
  mov ebx, [edi+ecx*8]
  mov esi, [edi+ecx*8+4]
  mov [edx], ebx
  mov 4[edx], esi
  mov 32[edx], ch ; null terminator
  pop edi
  pop esi
  pop ebx

  ret ; not: 8
Dword2Bin endp
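
The table initialisation inside the .if block is compact enough to be worth checking separately. The Python model below follows the same steps (seed 0FCFDFEFFh, subtract 04040404h per outer pass, write 32 digits backwards per pass); build_table is a name invented for this sketch.

```python
def build_table():
    """Model of the btInit loop: fill a 2048-byte table from the top down."""
    table = bytearray(2048)
    edx = 2048                        # offset Dw2BinTable+2048
    edi = 0xFCFDFEFF                  # seed: bytes 252, 253, 254, 255
    for _ in range(64):               # esi counts 63 down to 0
        eax = edi
        edi = (edi - 0x04040404) & 0xFFFFFFFF   # next four byte values
        for _ in range(32):           # inner loop, ecx 31 down to 0
            edx -= 1
            table[edx] = 48 + (eax & 1)         # sar eax, 1 / adc ebx, 48
            eax >>= 1
    return table


table = build_table()
print(table[5 * 8:5 * 8 + 8].decode("ascii"))   # entry for byte value 5
```

In this model every entry i holds the eight binary digits of i, and the first DWORD of the finished table is "0000", which is non-zero and so serves as the "already initialised" marker tested by .if ecx==0.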
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 08:28:27 AM
Quote from: sinsi on June 23, 2008, 11:59:48 PM

    mov edi,offset bintable

yes it does.
Sorry, I forgot to say that I ruthlessly eliminated the table from your code ;-)


    sub eax,eax
    mov [edx],ebx
    mov 4[edx],esi

    mov 32[edx],al   ;<---

I was always told to use a register...old habits die hard.
Quote

    movzx eax,ch
    mov ebx,[edi+eax*8]
    mov esi,[edi+eax*8+4]
    ; sub eax, eax
    mov [edx],ebx
    mov 4[edx],esi

    mov 32[edx],ah  ; al is non-zero

If you agree...  :bg
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 08:39:51 AM
Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced  :red :

Thanks. The effect is not very clear, maybe the increased counter compensates part of the gain:
mov ecx, 28+4   ; +4 because sub ecx was shifted up??

EDIT:
I just saw that you changed the jns to jnz at the end of the loop.
If I use mov ecx, 32 and jnz, as in your last post, I get garbage.
If I use mov ecx, 28 and jnz, code produces correct results and is faster.
EDIT (2):
Forget Edit (1) and see next post below.

nwDw2Bin PROC
mov BYTE PTR [esi+32],0
mov ecx, 28
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax,4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
; shr eax,4 moved up to reduce stalls
mov DWORD PTR [esi+ecx+4], edx
; sub ecx,DWORD moved up to reduce stalls
jnz Label1 ; ex jns Label1
ret
Title: Re: Bin$
Post by: sinsi on June 24, 2008, 08:54:45 AM
Quote from: jj2007 on June 24, 2008, 08:28:27 AM
Sorry, I forgot to say that I ruthlessly eliminated the table from your code ;-)
Being ruthless is what asm is all about...

Quote from: jj2007 on June 24, 2008, 08:28:27 AM
If you agree...  :bg
Everyone hates a smartarse. :bdg
Anyway, it's hard to see through beer goggles. :dazzled:
Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 09:15:32 AM
OK,

I have untangled JJ's code and put it into the test bed. It passed data by registers so I modified both sinsi's algo and NightWare's and get these timings. My own algo has run out of legs so I did not bother.

By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster. The technique of creating the table dynamically in memory could be useful for a WORD version with a 64k table, far too big for initialised data but no big deal in terms of allocated memory. It would make possible WORD sized reads and writes which should be nearly twice as fast.


01001001100101100000001011010010 sinsi_ex
01001001100101100000001011010010 NightWare
01001001100101100000001011010010 hutch_ex2
01001001100101100000001011010010 JJ BIN$

593 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
593 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
454 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$

NightWare timing average = 593
sinsi_ex  timing average = 453
hutch_ex2 timing average = 656
JJ BIN$   timing average = 500
Press any key to continue ...




Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 09:22:06 AM
Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced  :red :

Your version:
nwDw2Bin PROC
mov BYTE PTR [esi+32], 0
mov ecx, 32 ; 28 would produce garbage
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
mov DWORD PTR [esi+ecx], edx
jnz Label1
ret
nwDw2Bin ENDP


My edits:
nwDw2BinJJ PROC
mov BYTE PTR [esi+32], 0
mov ecx, 28 ; works (JJ)
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
; shr eax, 4 moved up to reduce stalls (NightWare)
mov DWORD PTR [esi+ecx+4], edx ; JJ: +4 is one byte longer but faster
; sub ecx, DWORD moved up to reduce stalls (NightWare)
jnz Label1 ; jns Label1
ret
nwDw2BinJJ ENDP


Timings:
53 cycles timing BIN$            139 bytes      625 LAMPs
61 cycles timing pbin2           147 bytes      740 LAMPs
89 cycles timing nwDw2Bin        101 bytes      894 LAMPs
74 cycles timing nwDw2BinJJ      102 bytes      747 LAMPs
304 cycles timing Dword2Bin2     59 bytes       2335 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The difference seems in the factor 28/32. Hmmm...
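
A small C model of the two loops may help to see where a 28/32 factor could come from. It is a sketch under two assumptions: `DWORD` evaluates to 4 in `sub ecx, DWORD`, and the flags tested at the end of the loop are those left by `sub` (the `mov` in between does not change them). `groups_written` is a made-up name; it counts the passes and records which 4-character groups of the buffer get stored.

```c
/* Models:  sub ecx, 4 / mov [esi+ecx+disp], edx / jnz (or jns) Label1
   start       : initial value of ecx
   disp        : constant displacement in the store (0 or 4 in the posts)
   loop_on_jns : 0 = loop ends when ecx reaches 0 (jnz),
                 1 = loop ends when ecx goes negative (jns)
   mask        : bit n is set if the group at buffer offset 4*n was written.
   The caller must pick values that keep every offset within 0..28. */
static int groups_written(int start, int disp, int loop_on_jns, unsigned *mask)
{
    int ecx = start, count = 0;
    *mask = 0;
    for (;;) {
        ecx -= 4;                              /* sub ecx, DWORD            */
        *mask |= 1u << ((ecx + disp) / 4);     /* store one 4-char group    */
        count++;
        if (loop_on_jns ? (ecx < 0) : (ecx == 0))
            break;                             /* branch falls through      */
    }
    return count;
}
```

Under this model, `mov ecx, 32` with `[esi+ecx]` and `jnz` makes 8 passes and covers offsets 28 down to 0. `mov ecx, 28` with `[esi+ecx+4]` and `jnz` makes only 7 passes and never stores at offset 0, so the first four digits would only look right if the buffer still held them from an earlier call. With `jns` instead of `jnz`, the 28 variant makes all 8 passes.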

Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 09:37:23 AM
Quote from: hutch-- on June 24, 2008, 09:15:32 AM
By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster.

NightWare timing average = 595
sinsi_ex  timing average = 492
hutch_ex2 timing average = 861
JJ BIN$   timing average = 625

My version, now with OPTION PROLOGUE:NONE for BIN$ but not including sinsi's algo with the external bintable; pbin2 is sinsi plus generated table:
47 cycles timing BIN$            139 bytes      554 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
88 cycles timing nwDw2Bin        101 bytes      884 LAMPs
67 cycles timing nwDw2BinJJ      102 bytes      677 LAMPs
299 cycles timing Dword2Bin2     59 bytes       2297 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
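
The LAMPs column can be reproduced with integer arithmetic. This is a small C sketch, assuming the posted figures are cycles * sqrt(size) rounded to the nearest integer, which agrees with the rows above. `lamps` is a hypothetical helper name.

```c
/* LAMPs = cycles * sqrt(size), rounded to the nearest integer.
   Integer-only: returns the largest n with (2n - 1)^2 <= 4 * cycles^2 * size,
   which is the same as floor(cycles * sqrt(size) + 0.5). */
static unsigned long lamps(unsigned long cycles, unsigned long size_bytes)
{
    unsigned long long target = 4ULL * cycles * cycles * size_bytes;
    unsigned long long n = 0;
    while ((2 * n + 1) * (2 * n + 1) <= target)
        n++;
    return (unsigned long)n;
}
```

For example, `lamps(47, 139)` gives 554 and `lamps(299, 59)` gives 2297. Because size enters only as a square root, halving the cycle count improves the score much more than halving the code size.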

The modified version of NightWare, see previous post, looks also promising (EDIT:), especially since the 101/102 bytes include already the 64 bytes of .data BinTable
Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 09:44:26 AM
JJ,

If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code. With sinsi's code I reordered some of the instructions to increase the instruction count between memory reads and writes to prevent or at least slow down the read-after-write stalls, and it got faster for doing so.

For small non-critical code I would use NightWare's original as it is small and only has a 64 byte table but with the speed advantage of sinsi's code, it would have to be the one to use for performance reasons.

While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library. Compliments on your BIN$ macro and the algo it calls, it does perform well.
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 10:00:14 AM
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version

That does the trick, it seems:
48 cycles timing BIN$            139 bytes      566 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
80 cycles timing nwDw2Bin        101 bytes      804 LAMPs
69 cycles timing nwDw2BinJJ      102 bytes      697 LAMPs
49 cycles timing NightWare       234 bytes      750 LAMPs
296 cycles timing Dword2Bin2     56 bytes       2215 LAMPs



Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 10:17:14 AM
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
Compliments on your BIN$ macro and the algo it calls, it does perform well.

Thanxalot, Hutch. Your version of BIN$ still has a little bug (ok in dw2binFinal.zip posted above) :

BIN$ MACRO dwArg:REQ, tgt:=<0>
  cc INSTR <dwArg>, <eax>   ;; to avoid a mov eax, eax
  if cc
   mov eax, dwArg         ;; eax passes value to translate
  endif
  if @SizeStr(tgt) gt 1                  ;; BUG: "if tgt" won't work correctly
   mov edx, tgt         ;; dest buffer - if 0, use Dw2BinBuffer
  else
   mov edx, offset Dw2BinBuffer
  endif
  call Dword2Bin
  EXITM <edx>
ENDM

Quote
While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library.

BIN$ accepts one required argument and returns a pointer to the result.  I chose to pass the value in eax, since all Win32 API's return values in that register. As it stands, it looks foolproof...

invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, BIN$(eax), chr$("Caps Lock in Bit 0:"), MB_OK
Title: Re: Bin$
Post by: MichaelW on June 24, 2008, 10:26:00 AM
This is 27 bytes and 141 cycles on a P3.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 40 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

dw2bin proc dwNumber:DWORD, pszString:DWORD
    mov eax, [esp+8]
    mov ecx, 31
  @@:
    xor edx, edx
    shr DWORD PTR [esp+4], 1
    adc edx, '0'
    mov [eax+ecx], dl
    dec ecx
    jns @B
    ret 8
dw2bin endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke dw2bin, 0, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, 01010101h, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, -1, ADDR buff
    print ADDR buff,13,10,13,10

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke dw2bin, 01010101h, ADDR buff
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.
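
A C sketch of the byte-grouped format suggested here, for illustration only. The single space between bytes and the name `dw2bin_spaced` are assumptions; the thread does not show what separator a formatted variant should use.

```c
#include <stdint.h>

/* Writes 32 binary digits, MSB first, with one space between bytes:
   35 characters plus the terminating NUL, so out needs 36 bytes. */
static void dw2bin_spaced(uint32_t value, char *out)
{
    int pos = 0;
    for (int bit = 31; bit >= 0; bit--) {
        out[pos++] = (char)('0' + ((value >> bit) & 1u));
        if (bit % 8 == 0 && bit != 0)   /* byte boundary, but not at the end */
            out[pos++] = ' ';
    }
    out[pos] = '\0';
}
```

With this layout 01010101h comes out as four groups of `00000001`, which makes it easy to see at a glance that bit 0 of every byte is set.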
Title: Re: Bin$
Post by: jj2007 on June 24, 2008, 12:54:47 PM
Quote from: MichaelW on June 24, 2008, 10:26:00 AM
For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.

Never satisfied? Your wish is my command...

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b),
   chr$("BIN$ plain:"), MB_OK

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b, f),
   chr$("BIN$ formatted:"), MB_OK

Fortunately the timings are not affected, but wow, that cost me almost a hundred LAMPs :wink

47 cycles timing BIN$            180 bytes      631 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
92 cycles timing nwDw2Bin        101 bytes      925 LAMPs
67 cycles timing nwDw2BinJJ      102 bytes      677 LAMPs
49 cycles timing NightWare       234 bytes      750 LAMPs
297 cycles timing Dword2Bin2     56 bytes       2223 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

Title: Re: Bin$
Post by: hutch-- on June 24, 2008, 02:33:07 PM
Michael,

I plugged it in and tested it and the result is correct but it's very slow against the rest. It also affected the timings of the other test algos but was running about 10 times slower than the others. It may just be a very bad stall on my PIV.
Title: Re: Bin$
Post by: NightWare on June 24, 2008, 09:31:39 PM
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code.
hmm, last shr eax,4 is useless  :wink, i've also unrolled the code and made a few changes :
; unrolled version
mov BYTE PTR [esi+32],0
mov ecx,32

mov edx,eax
and edx,00000000Fh
shr eax,2
mov edx,DWORD PTR [BinaryTable+edx*4]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch ; 0Fh * 4
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mixing both should give a good result...

EDIT : something like :
; mov ecx,value
; mov edx,buffer
; no need to push/pop esi/edi

mov BYTE PTR [edx+32],0

mov eax,ecx
and eax,00000000Fh
shr ecx,2
mov eax,DWORD PTR [BinaryTable+eax*4]
mov DWORD PTR [edx+28],eax

mov eax,ecx
and eax,00000003Ch ; 0Fh * 4
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+24],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+20],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+16],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+12],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+8],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+4],eax

and ecx,00000003Ch
mov ecx,DWORD PTR [BinaryTable+ecx]
mov DWORD PTR [edx],ecx
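
The index trick in the unrolled code can be modelled in C: shift right by 2 once, then mask with 3Ch, so that each nibble arrives already multiplied by 4 and no scaled addressing is needed. This is a sketch only. It assumes `BinaryTable` holds sixteen 4-character entries from "0000" to "1111" in ascending order, which the 64-byte size mentioned earlier suggests but this post does not show; `dw2bin_table` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: 16 entries of 4 ASCII digits, 64 bytes of data in total. */
static const char BinaryTable[] =
    "0000" "0001" "0010" "0011" "0100" "0101" "0110" "0111"
    "1000" "1001" "1010" "1011" "1100" "1101" "1110" "1111";

/* Convert a dword to 32 binary digits plus NUL; out needs 33 bytes. */
static void dw2bin_table(uint32_t value, char *out)
{
    /* Lowest nibble: index scaled explicitly (and 0Fh, then *4). */
    uint32_t off = (value & 0xFu) * 4u;
    memcpy(out + 28, BinaryTable + off, 4);
    value >>= 2;                          /* shr by 2, not 4               */

    /* Remaining nibbles: "and 3Ch" yields nibble*4 directly. */
    for (int pos = 24; pos >= 0; pos -= 4) {
        off = value & 0x3Cu;
        memcpy(out + pos, BinaryTable + off, 4);
        value >>= 4;
    }
    out[32] = '\0';
}
```

The loop is kept here for brevity; the posted version unrolls it so that every store uses a fixed displacement.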
Title: Re: Bin$
Post by: hutch-- on June 27, 2008, 12:17:44 AM
I moved this topic to the LAB so it would not get lost in a hurry.
Title: Re: Bin$
Post by: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol
        mov BYTE PTR [edx+32],0

        mov eax, ecx
        and ecx, 00000000Fh
        mov ecx, DWORD PTR [BinaryTable+ecx*4]
        shr eax, 2
mov DWORD PTR [edx+28],ecx

mov ecx, eax
and eax, 00000003Ch ; 0Fh * 4
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+24],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+20],ecx

mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+16],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+12],ecx

mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+8],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+4], ecx

and eax, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx],ecx


Title: Re: Bin$
Post by: NightWare on June 27, 2008, 02:33:12 AM
Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol
:U
Title: Re: Bin$
Post by: jj2007 on June 27, 2008, 08:03:22 AM
Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol

Not so easy to get stable timings, but I would vote for the NightWare/Lingo algo:

69 cycles timing BIN$            180 bytes      926 LAMPs
128 cycles timing pbin2          147 bytes      1552 LAMPs
124 cycles timing nwDw2Bin       101 bytes      1246 LAMPs
103 cycles timing nwDw2BinJJ     102 bytes      1040 LAMPs
84 cycles timing NightWare       204 bytes      1200 LAMPs
89 cycles timing BinLingo        204 bytes      1271 LAMPs

48 cycles timing BIN$            180 bytes      644 LAMPs
87 cycles timing pbin2           147 bytes      1055 LAMPs
115 cycles timing nwDw2Bin       101 bytes      1156 LAMPs
71 cycles timing nwDw2BinJJ      102 bytes      717 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs
75 cycles timing BinLingo        204 bytes      1071 LAMPs

48 cycles timing BIN$            180 bytes      644 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
121 cycles timing nwDw2Bin       101 bytes      1216 LAMPs
72 cycles timing nwDw2BinJJ      102 bytes      727 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs
42 cycles timing BinLingo        204 bytes      600 LAMPs

53 cycles timing BIN$            180 bytes      711 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
88 cycles timing nwDw2Bin        101 bytes      884 LAMPs
83 cycles timing nwDw2BinJJ      102 bytes      838 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
43 cycles timing BinLingo        204 bytes      614 LAMPs

Just for fun, here is one with the original library function:
48 cycles timing BIN$            180 bytes      644 LAMPs
58 cycles timing pbin2           147 bytes      703 LAMPs
55 cycles timing NightWare       204 bytes      786 LAMPs
42 cycles timing BinLingo        204 bytes      600 LAMPs
82 cycles timing dw2bin_ex       2140 bytes     3793 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

Full source attached, with some switches.

Title: Re: Bin$
Post by: drizz on June 27, 2008, 03:04:18 PM
here's my attempt  :8):

;; for 32 bytes
bintab equ BinaryTable
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
DwToBin proc dwValue:dword,pBuffer:dword
mov eax,[esp+4];dwValue
mov edx,[esp+8];pBuffer
push edi
push esi
push ebx
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
shr eax,16
mov [edx+7*4],ebx
mov [edx+6*4],ecx
mov [edx+5*4],edi
mov [edx+4*4],esi
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
mov [edx+3*4],ebx
mov [edx+2*4],ecx
mov [edx+1*4],edi
mov [edx+0*4],esi
mov byte ptr [edx+32],0
pop ebx
pop esi
pop edi
ret 2*4
DwToBin endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF

and here's my regular function
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
;returns size in eax
DwToBin proc dwValue:dword,pBuffer:dword;buff max 33bytes
push edi
mov ecx,31
mov edx,[esp+4][4];dwValue
mov edi,[esp+8][4];pBuffer
bsr eax,edx
jz @2
sub ecx,eax
shl edx,cl
mov ecx,eax
@1: add edx,edx
inc edi
mov al,'0' shr 1
adc al,al
dec ecx
mov [edi-1],al
jns @1
mov [edi],dl
mov eax,edi
pop edi
sub eax,[esp+8];pBuffer
ret 2*4
@2: mov word ptr [edi],'0'
mov eax,1
pop edi
ret 2*4
DwToBin endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
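
For readers who want the behaviour of the second ("regular") function without tracing the flags: as far as it can be read from the listing, it skips leading zero bits, writes the remaining digits plus a terminator, and returns the string length, with "0" as the result for a zero input. A C sketch of that behaviour follows; `dw2bin_trimmed` is a made-up name and the loop stands in for `bsr`.

```c
#include <stdint.h>

/* Writes the binary digits of value without leading zeros, NUL-terminated,
   and returns the number of digits written (1..32). out needs 33 bytes. */
static unsigned dw2bin_trimmed(uint32_t value, char *out)
{
    unsigned len = 0;
    int top = 31;

    /* Find the highest set bit (what bsr does); stays at 0 for value == 0. */
    while (top > 0 && ((value >> top) & 1u) == 0)
        top--;

    for (int bit = top; bit >= 0; bit--)
        out[len++] = (char)('0' + ((value >> bit) & 1u));
    out[len] = '\0';
    return len;
}
```

The returned length lets a caller append the string to a larger buffer without a separate length scan.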
Title: Re: Bin$
Post by: lingo on June 27, 2008, 04:10:59 PM
Not the same but faster again :lol:


mov   eax, esp
lea   esp, [edx+32]
mov   BYTE PTR [edx+32],0
mov   edx, eax

shld  eax, ecx, 30
and   ecx, 00000000Fh
push  DWORD PTR [BinaryTable+ecx*4]

shld  ecx, eax, 28
and   eax, 00000003Ch
push  DWORD PTR [BinaryTable+eax]
shld  eax, ecx, 28

and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
mov   esp, edx
ret


My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1


37 cycles timing BIN$            180 bytes      496 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
32 cycles timing nwDw2BinJJ      102 bytes      323 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

Title: Re: Bin$
Post by: qWord on June 27, 2008, 04:55:03 PM
Here is an SIMD version (uses SSSE3). I've tested it with aligned and unaligned move (movdqu/movdqa)

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

align 16

_qword proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        shufmsk     db  8 dup (1)
                    db  8 dup (0)
        bitmsk      db  128,64,32,16,8,4,2,1
                    db  128,64,32,16,8,4,2,1
        ascmsk      db  16 dup (31h)

    .code

    movdqa xmm3,OWORD ptr [shufmsk]
    movdqa xmm4,OWORD ptr [bitmsk]
    movdqa xmm5,OWORD ptr [ascmsk]

    pinsrw xmm0,edx,0
    shr edx,16
    pxor xmm2,xmm2
    pinsrw xmm1,edx,0

    pshufb xmm1,xmm3
    pand xmm1,xmm4
    pcmpeqb xmm1,xmm2
    paddsb xmm1,xmm5
    movdqu OWORD ptr [ecx],xmm1

    align 8
    pshufb xmm0,xmm3
    pand xmm0,xmm4   
    pcmpeqb xmm0,xmm2
    paddsb xmm0,xmm5
    movdqu OWORD ptr [ecx+16],xmm0

    mov BYTE ptr [ecx+32],0

    ret
_qword endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


my results(Core2Duo):

with movdqu:

NightWare timing average = 347
sinsi_ex  timing average = 401
hutch_ex2 timing average = 680
JJ BIN$   timing average = 386
mqword   timing average = 376

with movdqa:

NightWare timing average = 337
sinsi_ex  timing average = 401
hutch_ex2 timing average = 676
JJ BIN$   timing average = 376
mqword   timing average = 241


Title: Re: Bin$
Post by: jj2007 on June 27, 2008, 06:31:18 PM
Quote from: lingo on June 27, 2008, 04:10:59 PM
Not the same but faster again :lol:

My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1

37 cycles timing BIN$            180 bytes      496 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
32 cycles timing nwDw2BinJJ      102 bytes      323 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

And shorter, too. Will have to study some of these exotic lingoish opcodes ;-)

However, my Celeron does not seem to like it much. Interesting how big the differences between processors are in this case.

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
76 cycles timing BinLingo        204 bytes      1085 LAMPs **** previous version

C:\MASM32\GFA2MASM>bl
39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
60 cycles timing nwDw2Bin        101 bytes      603 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs **** latest version
Title: Re: Bin$
Post by: jj2007 on June 27, 2008, 07:13:16 PM
Quote from: drizz on June 27, 2008, 03:04:18 PM
here's my attempt  :8):

Good start, Drizz!

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
60 cycles timing nwDw2Bin        101 bytes      603 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)


Title: Re: Bin$
Post by: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it    :U


mmx_dw2bin proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        bitmsk      db  128,64,32,16,8,4,2,1
        ascmsk      db  8 dup (031h)
    .code

    bswap edx
    movq mm6,QWORD ptr [bitmsk]
    movq mm7,QWORD ptr [ascmsk]
    pxor mm5,mm5

    movd mm0,edx
    punpcklbw mm0,mm0

    movq mm1,mm0
    pshufw mm1,mm0,0
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx],mm1

    movq mm2,mm0
    pshufw mm2,mm0,001010101y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+8],mm2

    movq mm1,mm0
    pshufw mm1,mm0,010101010y
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx+16],mm1

    movq mm2,mm0
    pshufw mm2,mm0,011111111y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+24],mm2

    mov BYTE ptr [ecx+32],0

    ret
mmx_dw2bin endp
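
The per-byte step that both the SSSE3 and the mmx/SSE1 versions rely on can be modelled lane by lane in C: broadcast one byte to eight lanes, `pand` with the bit mask, `pcmpeqb` against zero, then add 31h. This is a sketch only; `byte_to_bits` and `dw2bin_lanes` are made-up names, and the inner loop stands in for what the packed instructions do in parallel.

```c
#include <stdint.h>

/* One input byte -> eight ASCII digits, MSB first. */
static void byte_to_bits(uint8_t b, char out[8])
{
    static const uint8_t bitmsk[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };
    for (int lane = 0; lane < 8; lane++) {
        uint8_t masked = (uint8_t)(b & bitmsk[lane]);   /* pand              */
        uint8_t cmp = (masked == 0) ? 0xFF : 0x00;      /* pcmpeqb with zero */
        /* 31h + FFh wraps to 30h ('0'); 31h + 00h stays 31h ('1'). */
        out[lane] = (char)(uint8_t)(0x31 + cmp);
    }
}

/* Convert a dword to 32 binary digits plus NUL; out needs 33 bytes. */
static void dw2bin_lanes(uint32_t value, char *out)
{
    for (int i = 0; i < 4; i++)   /* most significant byte first */
        byte_to_bits((uint8_t)(value >> (24 - 8 * i)), out + 8 * i);
    out[32] = '\0';
}
```

The compare yields all ones for a clear bit, so adding it to '1' (31h) acts as a subtraction of one; that is why the ASCII mask is 31h rather than 30h.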




results on Core2Duo:

37 cycles timing BIN$            180 bytes      496 LAMPs
43 cycles timing pbin2           147 bytes      521 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
42 cycles timing nwDw2BinJJ      102 bytes      424 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs
34 cycles timing b2aDrizzAt      235 bytes      521 LAMPs
24 cycles timing mmx_dw2bin      177 bytes      319 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)




Title: Re: Bin$
Post by: jj2007 on June 27, 2008, 09:28:52 PM
Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it    :U

Very interesting. Here are my timings:

40 cycles timing BIN$            180 bytes      537 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
62 cycles timing nwDw2Bin        101 bytes      623 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
179 cycles timing mmx_dw2bin     129 bytes      2033 LAMPs

(sorry, I played a bad trick:
mov ecx, offset Dw2BinBuffer
inc ecx
call mmx_dw2bin
... which misaligns the target)

Without this bad trick, your code performs indeed excellently:
40 cycles timing BIN$            180 bytes      537 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
65 cycles timing nwDw2Bin        101 bytes      653 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs
35 cycles timing mmx_dw2bin      129 bytes      398 LAMPs

Fast and short, congratulations!
mmx is very sensitive to misalignment, but in the case of a BIN$ macro we can safely assume that we are able to align the target, so imho we have a winning code here  :cheekygreen:

EDIT: I add the modified code; for consistency with the other algos, I exchanged the variable and destination registers as follows:
   ;var in edx NEW: eax
   ;buffer in ecx NEW: edx


Title: Re: Bin$
Post by: jj2007 on June 27, 2008, 11:19:23 PM
Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it    :U

Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
Title: Re: Bin$
Post by: qWord on June 28, 2008, 12:17:56 AM
Quote
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.

you're right. - I've forgotten :lol
Title: Re: Bin$
Post by: drizz on June 28, 2008, 08:14:18 AM
Quote from: jj2007 on June 27, 2008, 11:19:23 PM
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
It's not that hard to replace pshufw. Anyway credits go to qWord!  :U
movq mm6,QWORD ptr [bitmsk]
movq mm7,QWORD ptr [ascmsk]
pxor mm5,mm5
movd mm0,edx
punpcklbw mm0,mm0
punpcklwd mm1,mm0
movq mm2,mm0
punpckhwd mm3,mm0
punpcklwd mm0,mm0
punpckhwd mm1,mm1
punpckhwd mm2,mm2
punpckhwd mm3,mm3
punpckldq mm0,mm0
punpckhdq mm1,mm1
punpckldq mm2,mm2
punpckhdq mm3,mm3
pand mm0,mm6
pand mm1,mm6
pand mm2,mm6
pand mm3,mm6
pcmpeqb mm0,mm5
pcmpeqb mm1,mm5
pcmpeqb mm2,mm5
pcmpeqb mm3,mm5
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
movq [ecx+24],mm0
movq [ecx+16],mm1
movq [ecx+8],mm2
movq [ecx],mm3
mov BYTE ptr [ecx+32],0
edit: removed bswap
edit2: removed some more instructions
Title: Re: Bin$
Post by: jj2007 on June 28, 2008, 09:08:12 AM
Looks good. The old qWord version seems to be an edge faster, see attachment qw.exe

Title: Re: Bin$
Post by: drizz on June 28, 2008, 09:28:36 AM
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).
Title: Re: Bin$
Post by: jj2007 on June 28, 2008, 09:32:25 AM
Quote from: drizz on June 28, 2008, 09:28:36 AM
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).
That was the point indeed - for a general purpose library this is clearly the better solution. And in terms of LAMPs it beats the hell out of the other algos. My BIN$ algo (inspired by Sinsi) is pretty fast on the Celeron but sucks on real Pentiums.

EDIT:
Timings on a Celeron

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs
40 cycles timing mmx_dw2bin      132 bytes      460 LAMPs (new Drizz mmx variant)

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
66 cycles timing nwDw2BinJJ      102 bytes      667 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
36 cycles timing mmx_dw2bin      129 bytes      409 LAMPs (old qWord xmm variant)

LAMPs = Lean And Mean Points = cycles * sqrt(size)
Title: Re: Bin$
Post by: qWord on June 28, 2008, 09:47:41 AM
Quote
The old qWord version seems to be an edge faster

interesting, on my Core2Duo drizz's version is 2~3 clocks faster


sse1:
24 cycles timing mmx_dw2bin      129 bytes      273 LAMPs
drizz's:
22 cycles timing mmx_dw2bin      132 bytes      253 LAMPs


EDIT: sorry, i've forgotten to delete bswap and adjust the pshufw instructions =>
sse1:
18 cycles timing mmx_dw2bin     
drizz's:
22 cycles timing mmx_dw2bin     
Title: Re: Bin$
Post by: jj2007 on June 28, 2008, 10:17:50 AM
Quote from: qWord on June 28, 2008, 09:47:41 AM
Quote
The old qWord version seems to be an edge faster

interesting, on my Core2Duo drizz's version is 2~3 clocks faster
Could you do me a favour and time the CAT$ macro (http://www.masm32.com/board/index.php?topic=9437.0)?
Title: Re: Bin$
Post by: qWord on June 28, 2008, 10:22:49 AM
Sorry, it was a false statement by me, see my previous post
Title: Re: Bin$
Post by: jj2007 on June 28, 2008, 10:26:37 AM
Quote from: qWord on June 28, 2008, 10:22:49 AM
Sorry, it was a false statement by me, see my previous post
No problem; I was just curious how the Core2Duo performs on the CAT$ algo.
Title: Re: Bin$
Post by: DoomyD on July 03, 2008, 05:08:03 PM
I know it's a bit late, but I came across this topic and thought I could give this a shot =P
Quote
dw2binstr   proc
   ;Value  - EAX
   ;Buffer - EBX
   mov      edx, eax   ;EDX holds the value
   mov      ecx, 8      ;Bits per byte
   @@:
      mov      eax, edx
      and      eax, 01010101h         ;Filters the low bit of every byte
      or      eax, 30303030h         ;ASCII conversion
      mov      byte ptr [ebx+31], al   ;Placing the bytes into the buffer
      mov      byte ptr [ebx+23], ah
      ror      eax, 16
      mov      byte ptr [ebx+15], al
      mov      byte ptr [ebx+07], ah
      dec      ebx                  ;Going backwards (buffer-wise)
      ror      edx,1               ;Setting the next set of bits
      dec      ecx                  ;Loop back
      jnz      @B
   retn
dw2binstr   endp
So... what do you think?
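
The design of this routine is easier to follow in a C model: each pass takes the low bit of all four bytes at once, turns them into ASCII with a single OR, and scatters them to four positions eight characters apart. The sketch assumes the buffer is 33 bytes; the terminating NUL is an addition here, since the posted routine does not write one, and `dw2bin_swar` is a made-up name.

```c
#include <stdint.h>

/* Convert a dword to 32 binary digits plus NUL; out needs 33 bytes. */
static void dw2bin_swar(uint32_t value, char *out)
{
    for (int i = 0; i < 8; i++) {
        /* Bit i of every byte, moved to bit 0 of that byte, then made ASCII. */
        uint32_t t = ((value >> i) & 0x01010101u) | 0x30303030u;
        out[31 - i] = (char)(t & 0xFF);          /* from bits 0..7   */
        out[23 - i] = (char)((t >> 8) & 0xFF);   /* from bits 8..15  */
        out[15 - i] = (char)((t >> 16) & 0xFF);  /* from bits 16..23 */
        out[7 - i]  = (char)((t >> 24) & 0xFF);  /* from bits 24..31 */
    }
    out[32] = '\0';   /* not in the posted asm, added for this sketch */
}
```

The model indexes with `i` instead of decrementing the buffer pointer, so the pointer the caller passed in stays valid afterwards.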
Title: Re: Bin$
Post by: jj2007 on July 03, 2008, 05:59:39 PM
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...

Title: Re: Bin$
Post by: qWord on July 03, 2008, 06:39:44 PM
hey jj,

maybe the following code could be interesting for you:



;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
;                                                  ;             
;   sse1_dw2hex: converts a dword-value to an      ;             
;                ASC-hex-string                    ;             
;                                                  ;             
;       eax = dwValue                              ;             
;       edx = lpBuffer , should be aligned to 8    ;             
;                                                  ;             
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
align 16                                                         
sse1_dw2hex proc ;var:DWORD,buffer:DWORD                         
                                                                 
    .data                                                       
        align 16                                                 
        d2h_bitmsk  dw 4 dup (0f00fh)                           
        d2h_cmpmsk  db 8 dup (9)                                 
        d2h_09_msk  db 8 dup (030h)                             
        d2h_AF_msk  db 8 dup (7)                                 
    .code                                                       
                                                                 
    ;bswap eax                          ;<== insert for mmx only
    movq mm4,QWORD ptr [d2h_bitmsk]     ;       |               
    movq mm5,QWORD ptr [d2h_cmpmsk]     ;       |               
    movq mm6,QWORD ptr [d2h_09_msk]     ;       |               
    movq mm7,QWORD ptr [d2h_AF_msk]     ;       |               
                                        ;       |               
    movd mm1,eax                        ;       |               
    punpcklbw mm1,mm1                   ;       V               
    pshufw mm1,mm1,000011011y           ;<== delete for mmx only
                                                                 
    pand mm1,mm4                                                 
    movq mm0,mm1                                                 
    psrlw mm0,12                                                 
    psllw mm1,8                                                 
                                                                 
    por mm0,mm1                                                 
    movq mm2,mm0                                                 
    pcmpgtb mm2,mm5                                             
                                                                 
    pand mm2,mm7                                                 
    paddb mm2,mm6                                               
    paddb mm2,mm0                                               
                                                                 
    movq QWORD ptr [edx],mm2                                     
    mov BYTE ptr [edx+8],0                                       
                                                                 
    ret                                                         
sse1_dw2hex endp                                                 
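
A plain scalar routine with the same register interface (eax = value, edx = buffer) can be handy as a reference when checking the output of the MMX/SSE version. This is only a minimal sketch, assuming a console build with the MASM32 runtime; RefDw2Hex, HexDigits and RefBuf are made-up names, not part of masm32.lib or of qWord's code:

```asm
include \masm32\include\masm32rt.inc

.data
HexDigits db "0123456789ABCDEF"
.data?
RefBuf    db 12 dup (?)             ; 8 digits + terminator, padded

.code
; RefDw2Hex (hypothetical helper): eax = value, edx = buffer of at least 9 bytes
RefDw2Hex proc uses ebx
    mov ecx, 8                      ; 8 hex digits per dword
@@: rol eax, 4                      ; next most significant nibble -> bits 0..3
    mov ebx, eax
    and ebx, 0Fh                    ; isolate the nibble
    mov bl, [HexDigits+ebx]         ; look up its ASCII digit
    mov [edx], bl
    inc edx
    dec ecx
    jnz @B
    mov byte ptr [edx], 0           ; zero terminator
    ret
RefDw2Hex endp

start:
    mov eax, 1234ABCDh
    mov edx, offset RefBuf
    call RefDw2Hex
    print offset RefBuf, 13, 10     ; prints 1234ABCD
    exit
end start
```

After eight rotations by 4 bits eax holds its original value again, so the routine does not destroy the input.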
Title: Re: Bin$
Post by: DoomyD on July 03, 2008, 07:35:38 PM
Quote from: jj2007 on July 03, 2008, 05:59:39 PM
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...
Hmm... weird...
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).

EDIT: My outputs:
30 cycles timing BIN$      180 bytes 402 LAMPs
44 cycles timing pbin2      147 bytes 533 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
42 cycles timing nwDw2BinJJ 102 bytes 424 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
28 cycles timing BinLingo 187 bytes 383 LAMPs
34 cycles timing b2aDrizzAt 235 bytes 521 LAMPs
22 cycles timing mmx_dw2bin 132 bytes 253 LAMPs
75 cycles timing dw2binstr 49 bytes 525 LAMPs
32 Cycles
I'm beginning to wonder if it has to do with my CPU...

[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 03, 2008, 07:59:58 PM
Quote from: DoomyD on July 03, 2008, 07:35:38 PM
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).
Shows 101 cycles for me, still slow compared to the 40 of the BIN$ and mmx_ variants. But it's indeed weird that I see 215 cycles on my puter, while your exe performs in 101; and you saw from my source that there is not much overhead.

I use timers.asm: \Masm32\macros\TIMERS.ASM 10095 bytes of 15.02.2005

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
40 cycles timing mmx_dw2bin      132 bytes      460 LAMPs
215 cycles timing dw2binstr      49 bytes       1505 LAMPs


LAMPs = Lean And Mean Points = cycles * sqrt(size)
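
As a quick check of the formula against the first line of the table: 39 cycles at 180 bytes gives 39 * sqrt(180) = 39 * 13.416... = 523.2, shown as 523 LAMPs. The same calculation can be sketched with the FPU; this assumes a console build with the MASM32 runtime, and nCycles, nBytes and nLamps are made-up names used only for illustration:

```asm
include \masm32\include\masm32rt.inc

.data
nCycles dd 39                   ; example input: the BIN$ line of the table
nBytes  dd 180
nLamps  dd 0

.code
start:
    finit                       ; default rounding mode: round to nearest
    fild  nBytes                ; st(0) = 180
    fsqrt                       ; st(0) = sqrt(180) = 13.416...
    fimul nCycles               ; st(0) = 39 * 13.416... = 523.2...
    fistp nLamps                ; store rounded result: 523
    print str$(nLamps), 13, 10  ; prints 523
    exit
end start
```

Because size enters only as a square root, halving the cycle count improves the score more than halving the code size does.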
Title: Re: Bin$
Post by: jj2007 on July 03, 2008, 08:02:35 PM
Quote from: qWord on July 03, 2008, 06:39:44 PM
hey jj,
perhaps the following code could be interesting for you:
How much improvement?
Look for mmx_dw2bin in the previously attached source, change the if 0 to if 1, and adapt your old code.

EDIT:
39 cycles timing BIN$            180 bytes      523 LAMPs
23 cycles timing ssemmxonly      97 bytes       227 LAMPs

Could you PLEASE tune it a little bit? Say, 3 cycles less, just to get a round figure?  :cheekygreen:

EDIT (2):
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red

EDIT (3):
39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
63 cycles timing nwDw2Bin        101 bytes      633 LAMPs
66 cycles timing nwDw2BinJJ      102 bytes      667 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
39 cycles timing QwordMmx        132 bytes      448 LAMPs xxxxxx
216 cycles timing dw2binstr      49 bytes       1512 LAMPs
350 cycles timing Dword2Bin2     56 bytes       2619 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)


I renamed your mmx_dw2bin to QwordMmx. Still the best LAMPs score  :toothy

Attached the latest build with sources, as asm and rtf
Question on the latter:
This displays just fine in (MS) WordPad and (jj) RichMasm (http://www.masm32.com/board/index.php?topic=9044.msg65608#msg65608), but (MS) Word has a serious problem with the (MS Windows) System font – they seem not as compatible as they should... any ideas ?

[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 03, 2008, 08:43:07 PM
Quote from: DoomyD on July 03, 2008, 07:35:38 PM
75 cycles timing dw2binstr 49 bytes 525 LAMPs

32 Cycles
I'm beginning to wonder if it has to do with my CPU...

Nope, your CPU is fine, you are measuring different cycle counts on the same puter; so it's the code, not the CPU. Mind posting your source?
Title: Re: Bin$
Post by: qWord on July 03, 2008, 11:04:53 PM
Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Still the best LAMPs score
nice to see    :green

Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red
sorry for confusing you  :bg

i attached a file with a modified version of dw2hex

EDIT:
 
Quote
How much improvement?
  a quick speed test shows that see2_dw2hex is approx. 4 times faster than dw2str from masm32.lib





[attachment deleted by admin]
Title: Re: Bin$
Post by: DoomyD on July 04, 2008, 05:10:48 AM
Attached

[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 04, 2008, 05:45:04 AM
Quote from: DoomyD on July 04, 2008, 05:10:48 AM
Attached

Mystery solved: Your code runs twice as fast because there is only one loop...

counter_begin 100000h,HIGH_PRIORITY_CLASS
mov eax,00010010001101001010101111001101b ;1234ABCDh
mov ebx,offset str1
invoke dw2binstr
counter_end


My version:

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ecx, 11111000001111100000111111110000b
mov edx, offset Dw2BinBuffer
call dw2binstr
mov ecx, 00001111100000111110000011111111b
mov edx, offset Dw2BinBuffer
call dw2binstr
counter_end



P.S.: In the standard Masm32 installation, libs sit in \masm32\lib\
   include   \masm32\include\windows.inc
   include   \masm32\macros\timers.asm
   include   \masm32\macros\macros.asm
   
   include       \masm32\include\masm32.inc
   includelib    \masm32\lib\masm32.lib

   include       \masm32\include\kernel32.inc
   includelib    \masm32\lib\kernel32.lib
Title: Re: Bin$
Post by: DoomyD on July 28, 2008, 12:34:00 AM
Finally, I found the time to take a closer look at it...
I modified drizz's modification, and squeezed another cycle out of it =) (although it takes 300,000,000 loops to actually see it :lol)
__QwordMmx proc
movq mm6, QWORD ptr [bitmsk]
movq mm7, QWORD ptr [ascmsk]
pxor mm5, mm5
movd mm0, eax

punpcklbw mm0, mm0

punpckhdq mm2, mm0
punpckldq mm0, mm0

punpckhwd mm0, mm0
punpckhwd mm2, mm2

punpckhdq mm1, mm0
punpckldq mm0, mm0
punpckhdq mm3, mm2
punpckldq mm2, mm2

punpckhdq mm1, mm1
punpckhdq mm3, mm3

pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6

pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5

paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7

movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+8],mm2
movq [edx],mm3
mov BYTE ptr [edx+32],0
retn
__QwordMmx endp
Core 2 Duo x32:
25 - QwordMmx - qWord
22 - QwordMmx - drizz
21 - QwordMmx - new
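
When shuffling the unpack instructions around like this, it is easy to end up with the four byte groups in the wrong order, so a slow but obvious scalar routine is useful for comparing outputs. A minimal sketch, assuming a console build with the MASM32 runtime and the same interface as the MMX procs (eax = value, edx = buffer); RefDw2Bin and RefBuf are made-up names:

```asm
include \masm32\include\masm32rt.inc

.data?
RefBuf db 36 dup (?)                ; 32 digits + terminator, padded

.code
; RefDw2Bin (hypothetical helper): eax = value, edx = buffer of at least 33 bytes
RefDw2Bin proc
    mov ecx, 32                     ; one character per bit
@@: shl eax, 1                      ; most significant bit -> carry flag
    mov byte ptr [edx], '0'         ; mov leaves the carry untouched
    adc byte ptr [edx], 0           ; '0' + carry = '0' or '1'
    inc edx
    dec ecx
    jnz @B
    mov byte ptr [edx], 0           ; zero terminator
    ret
RefDw2Bin endp

start:
    mov eax, 12345678h
    mov edx, offset RefBuf
    call RefDw2Bin
    print offset RefBuf, 13, 10     ; prints 00010010001101000101011001111000
    exit
end start
```

Unlike the MMX versions, this one destroys eax (it is shifted out completely), so reload the value before calling the routine under test.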

[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 28, 2008, 02:34:30 AM
loop count: 300000000
31 - QwordMmx - qWord
33 - QwordMmx - drizz
31 - QwordMmx - new

Celeron M ...
Title: Re: Bin$
Post by: DoomyD on July 28, 2008, 12:03:42 PM
Noticed I can cap another line =)
qWordMmxOpt proc
movq mm7, QWORD ptr [ascmsk]
movq mm6, QWORD ptr [bitmsk]
pxor mm5, mm5
movd mm0, eax

punpcklbw mm0, mm0
punpckhdq mm2, mm0

punpcklwd mm0, mm0
punpckhwd mm2, mm2

punpckhdq mm1, mm0
punpckhdq mm3, mm2

punpckldq mm0, mm0
punpckhdq mm1, mm1
punpckldq mm2, mm2
punpckhdq mm3, mm3

pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6

pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5

paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7

movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+08],mm2
movq [edx+00],mm3
mov BYTE ptr [edx+32],0
retn
qWordMmxOpt endp
Loop count: 2000000000
Method: timer_begin\timer_end:
10104 {00010010001101000101011001111000} QwordMmx(1) - qWord
10051 {00010010001101000101011001111000} QwordMmx(2) - qWord
9837 {00010010001101000101011001111000} _QwordMmx(1) - drizz
9885 {00010010001101000101011001111000} _QwordMmx(2) - drizz
9245 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
9191 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 28, 2008, 01:56:24 PM
I timed it twice, the 2nd time with reduced loop count, but results are stable on a P4, 3.4 GHz:

Loop count: 2000000000
Method: timer_begin\timer_end:
11319 {00010010001101000101011001111000} QwordMmx(1) - qWord
11452 {00010010001101000101011001111000} QwordMmx(2) - qWord
14740 {00010010001101000101011001111000} _QwordMmx(1) - drizz
14475 {00010010001101000101011001111000} _QwordMmx(2) - drizz
13840 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
13833 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

Loop count: 200000000
Method: timer_begin\timer_end:
1086 {00010010001101000101011001111000} QwordMmx(1) - qWord
1088 {00010010001101000101011001111000} QwordMmx(2) - qWord
1450 {00010010001101000101011001111000} _QwordMmx(1) - drizz
1455 {00010010001101000101011001111000} _QwordMmx(2) - drizz
1385 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
1387 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD


My timings with your previous code on the Celeron M saw your code on par with qWord... which processor are you using?
Title: Re: Bin$
Post by: DoomyD on July 28, 2008, 02:17:33 PM
Intel Core 2 Duo 6300  @ 1.8 GHz
Model: x86 615
Title: Re: Bin$
Post by: jj2007 on July 28, 2008, 03:18:38 PM
I integrated your code into the package. LAMP-wise you made it, compliments... but NightWare has a little edge on speed, at least on my P4:

48 cycles timing BIN$            180 bytes      644 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
86 cycles timing nwDw2Bin        101 bytes      864 LAMPs
69 cycles timing nwDw2BinJJ      102 bytes      697 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs ****
86 cycles timing BinLingo        187 bytes      1176 LAMPs
50 cycles timing b2aDrizzAt      235 bytes      766 LAMPs
54 cycles timing MmxQword        132 bytes      620 LAMPs
52 cycles timing MmxDoomy        110 bytes      545 LAMPs ****
81 cycles timing dw2bin_ex       2140 bytes     3747 LAMPs


LAMPs = Lean And Mean Points = cycles * sqrt(size)

EDIT: Results for Celeron M - Doomy clearly in the lead (but BIN$ also ok for speed)

32 cycles timing BIN$            180 bytes      429 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
54 cycles timing nwDw2Bin        101 bytes      543 LAMPs
57 cycles timing nwDw2BinJJ      102 bytes      576 LAMPs
39 cycles timing NightWare       204 bytes      557 LAMPs
36 cycles timing BinLingo        187 bytes      492 LAMPs
43 cycles timing b2aDrizzAt      235 bytes      659 LAMPs
33 cycles timing MmxQword        132 bytes      379 LAMPs
32 cycles timing MmxDoomy        110 bytes      336 LAMPs
60 cycles timing dw2bin_ex       2140 bytes     2776 LAMPs



[attachment deleted by admin]
Title: Re: Bin$
Post by: jj2007 on July 28, 2008, 08:10:24 PM
3 cycles less for everybody - a bin$ is always 32 bytes long, so no need for poking a zero terminator. Celeron M timings:

30 cycles timing BIN$            136 bytes      350 LAMPs
39 cycles timing NightWare       204 bytes      557 LAMPs
36 cycles timing BinLingo        187 bytes      492 LAMPs
33 cycles timing MmxQword        132 bytes      379 LAMPs
29 cycles timing MmxDoomy        106 bytes      299 LAMPs
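
On dropping the terminator: this works as long as byte 32 of the buffer is already zero, for example because the buffer is defined with an initialised terminator. A minimal sketch of that idea, assuming a console build with the MASM32 runtime; FillBin and BinBuf are made-up names, and a plain scalar loop stands in for the real converters:

```asm
include \masm32\include\masm32rt.inc

.data
BinBuf db 32 dup ('?'), 0           ; terminator written once, at assembly time

.code
; FillBin (hypothetical helper): eax = value
; writes exactly 32 characters to BinBuf and never touches byte 32
FillBin proc
    mov edx, offset BinBuf
    mov ecx, 32
@@: shl eax, 1                      ; most significant bit -> carry flag
    mov byte ptr [edx], '0'
    adc byte ptr [edx], 0           ; '0' or '1'
    inc edx
    dec ecx
    jnz @B
    ret                             ; no "mov byte ptr [edx], 0" needed
FillBin endp

start:
    mov eax, 0FC0FC0AAh
    call FillBin
    print offset BinBuf, 13, 10     ; prints 11111100000011111100000010101010
    mov eax, 1
    call FillBin
    print offset BinBuf, 13, 10     ; prints 00000000000000000000000000000001
    exit
end start
```

With a caller-supplied buffer the caller has to guarantee the zero at offset 32 itself, otherwise the string runs on into whatever follows.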

[attachment deleted by admin]
Title: Re: Bin$
Post by: DoomyD on July 29, 2008, 07:36:41 AM
Quote
29 cycles timing BIN$         136 bytes   338 LAMPs
45 cycles timing pbin2         144 bytes   540 LAMPs
49 cycles timing nwDw2Bin    101 bytes   492 LAMPs
43 cycles timing nwDw2BinJJ    102 bytes   434 LAMPs
28 cycles timing NightWare    200 bytes   396 LAMPs
27 cycles timing BinLingo    183 bytes   365 LAMPs
33 cycles timing b2aDrizzAt    235 bytes   506 LAMPs
21 cycles timing MmxQword    128 bytes   238 LAMPs
19 cycles timing MmxDoomy    106 bytes   196 LAMPs
52 cycles timing dw2bin_ex    2140 bytes   2406 LAMPs
Looking at the source though, I should point out that the algorithm uses the same data resources as qWord's mmx; I'll include them as separate.
By the way: I don't think it could be shortened any more than that - Here's my final code:
m_mmx_dw2bin macro Value:REQ, lpBuffer
LOCAL mmx_dw2bin_buffer

IFNDEF mmx_dw2bin_enabled
  mmx_dw2bin_enabled equ <1>
  .data
   align 8
   mmx_dw2bin_ascmsk db 8 dup (31h)
   mmx_dw2bin_bitmsk db 80h,40h,20h,10h,08h,04h,02h,01h
ENDIF

.code
  even
  mov  eax, Value
  movq mm7, qword ptr [mmx_dw2bin_ascmsk]
  movq mm6, qword ptr [mmx_dw2bin_bitmsk]
  movd mm0, eax
 
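  punpcklbw mm0, mm0        ; mm0 = b0 b0 b1 b1 b2 b2 b3 b3 (low to high)
  punpckldq mm2, mm0        ; high dword of mm2 = b0 b0 b1 b1 (low dword is don't-care)
 
  punpckhwd mm0, mm0        ; mm0 = b2 x4, b3 x4
  punpckldq mm1, mm0        ; high dword of mm1 = b2 x4
  punpckhwd mm2, mm2        ; mm2 = b0 x4, b1 x4
  punpckldq mm3, mm2        ; high dword of mm3 = b0 x4
 
  punpckhdq mm0, mm0        ; mm0 = b3 x8 (most significant byte, first 8 chars)
  punpckhdq mm1, mm1        ; mm1 = b2 x8
  punpckhdq mm2, mm2        ; mm2 = b1 x8
  punpckhdq mm3, mm3        ; mm3 = b0 x8 (least significant byte, last 8 chars)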

  pandn mm0, mm6
  pandn mm1, mm6
  pandn mm2, mm6
  pandn mm3, mm6
 
  pcmpeqb mm0, mm6
  pcmpeqb mm1, mm6
  pcmpeqb mm2, mm6
  pcmpeqb mm3, mm6
 
  paddb mm0,mm7
  paddb mm1,mm7
  paddb mm2,mm7
  paddb mm3,mm7
 
 
  IFB <lpBuffer>
   .data
    mmx_dw2bin_buffer db 32 dup (0),0
    align 4
   .code
   mov  eax,offset mmx_dw2bin_buffer
  ELSE
   mov  eax,lpBuffer
  ENDIF
 
  movq [eax+00],mm0
  movq [eax+08],mm1
  movq [eax+16],mm2
  movq [eax+24],mm3
endm
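
For anyone puzzling over the pandn / pcmpeqb / paddb sequence in the macro: each of the 32 output bytes goes through the same three steps, and they can be replayed on a single byte with ordinary instructions. This is just an illustrative sketch, assuming a console build with the MASM32 runtime; BitToChar and DemoBuf are made-up names, and 31h and the masks 80h..01h are the constants from mmx_dw2bin_ascmsk and mmx_dw2bin_bitmsk:

```asm
include \masm32\include\masm32rt.inc

.data
DemoBuf db 9 dup (0)            ; 8 characters + terminator

.code
; BitToChar (hypothetical helper): al = one byte of the value,
; cl = single-bit mask (80h..01h); returns al = '1' or '0'
BitToChar proc
    not  al                     ; pandn, part 1: invert the value byte
    and  al, cl                 ; pandn, part 2: al = cl if the bit was clear, 0 if set
    cmp  al, cl                 ; pcmpeqb against the mask
    sete al                     ; al = 1 if the bit was clear, 0 if set
    neg  al                     ; al = 0FFh or 0, as pcmpeqb leaves it
    add  al, 31h                ; paddb: 0FFh+31h wraps to 30h ('0'), 0+31h = '1'
    ret
BitToChar endp

start:
    mov esi, offset DemoBuf
    mov cl, 80h                 ; start with the mask for bit 7
@@: mov al, 0A5h                ; test byte 10100101b
    call BitToChar
    mov [esi], al
    inc esi
    shr cl, 1                   ; next mask; becomes zero after bit 0
    jnz @B
    print offset DemoBuf, 13, 10    ; prints 10100101
    exit
end start
```

Comparing against the mask itself instead of against zero is what lets this version do without the pxor mm5, mm5 of the earlier procs.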