
Bin$

Started by jj2007, June 20, 2008, 01:39:06 AM


lingo

Not the same but faster again :lol:


mov   eax, esp
lea   esp, [edx+32]
mov   BYTE PTR [edx+32],0
mov   edx, eax

shld  eax, ecx, 30
and   ecx, 00000000Fh
push  DWORD PTR [BinaryTable+ecx*4]

shld  ecx, eax, 28
and   eax, 00000003Ch
push  DWORD PTR [BinaryTable+eax]
shld  eax, ecx, 28

and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
mov   esp, edx
ret
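
For readers who want to follow the table-lookup idea in a higher-level language, here is a rough C model. It is only a sketch under assumptions: BinaryTable itself is not shown in the post, so the 16-entry table `nibble_text` and the function name `dw2bin_table` are hypothetical stand-ins, and the loop writes the buffer front to back instead of pushing dwords from the end of the buffer as the assembly does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for BinaryTable: one 4-character string per nibble value. */
static const char *const nibble_text[16] = {
    "0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111",
    "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"
};

/* Writes the 32 binary digits of value (most significant first) plus a
   terminating zero into out, which must hold at least 33 bytes. */
void dw2bin_table(uint32_t value, char *out)
{
    for (int i = 0; i < 8; i++) {
        /* Select nibble i, counting from the most significant one. */
        unsigned nibble = (value >> (28 - 4 * i)) & 0xFu;
        /* Copy the four ASCII digits for that nibble (one dword in the asm). */
        memcpy(out + 4 * i, nibble_text[nibble], 4);
    }
    out[32] = '\0';
}
```

The point of the technique is that eight 4-byte copies replace 32 single-bit tests; the assembly gets the same effect with eight pushes into the target buffer.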


My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1


37 cycles timing BIN$            180 bytes      496 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
32 cycles timing nwDw2BinJJ      102 bytes      323 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
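
As a worked check of the formula, a small C sketch follows. The names `lamps` and `newton_sqrt` are made up for illustration, and rounding to the nearest integer is an assumption that happens to match the figures posted above (for example 37 cycles at 180 bytes gives 496).

```c
/* Square root by Newton iteration, so the sketch does not depend on libm. */
static double newton_sqrt(double x)
{
    if (x <= 0.0)
        return 0.0;
    double r = x < 1.0 ? 1.0 : x;      /* start at or above the root */
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* LAMPs = cycles * sqrt(size), rounded to the nearest integer (assumed). */
long lamps(int cycles, int size_bytes)
{
    return (long)(cycles * newton_sqrt((double)size_bytes) + 0.5);
}
```

Because size enters only through its square root, halving the cycle count helps the score far more than halving the code size.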

[attachment deleted by admin]

qWord

Here is a SIMD version (uses SSSE3). I've tested it with both unaligned and aligned moves (movdqu/movdqa).

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

align 16

_qword proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        shufmsk     db  8 dup (1)
                    db  8 dup (0)
        bitmsk      db  128,64,32,16,8,4,2,1
                    db  128,64,32,16,8,4,2,1
        ascmsk      db  16 dup (31h)

    .code

    movdqa xmm3,OWORD ptr [shufmsk]
    movdqa xmm4,OWORD ptr [bitmsk]
    movdqa xmm5,OWORD ptr [ascmsk]

    pinsrw xmm0,edx,0
    shr edx,16
    pxor xmm2,xmm2
    pinsrw xmm1,edx,0

    pshufb xmm1,xmm3
    pand xmm1,xmm4
    pcmpeqb xmm1,xmm2
    paddsb xmm1,xmm5
    movdqu OWORD ptr [ecx],xmm1

    align 8
    pshufb xmm0,xmm3
    pand xmm0,xmm4   
    pcmpeqb xmm0,xmm2
    paddsb xmm0,xmm5
    movdqu OWORD ptr [ecx+16],xmm0

    mov BYTE ptr [ecx+32],0

    ret
_qword endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
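
The mask, compare and add steps of the routine above can be modelled one byte lane at a time in plain C. This is a scalar sketch, not the SIMD code itself; `dw2bin_mask` is a hypothetical name, and the inner loop stands for the eight lanes that pand, pcmpeqb and paddsb process at once.

```c
#include <stdint.h>

/* Scalar model: out receives 32 ASCII digits (MSB first) and a terminating
   zero, so it must hold at least 33 bytes. */
void dw2bin_mask(uint32_t value, char *out)
{
    static const uint8_t bitmsk[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };

    for (int byte = 0; byte < 4; byte++) {
        /* pshufb step: one source byte is broadcast to eight lanes. */
        uint8_t b = (uint8_t)(value >> (24 - 8 * byte));

        for (int lane = 0; lane < 8; lane++) {
            uint8_t masked = b & bitmsk[lane];                   /* pand    */
            uint8_t cmp = (masked == 0) ? 0xFF : 0x00;           /* pcmpeqb */
            /* Add 31h: 0xFF (-1) gives 30h = '0', 0x00 gives 31h = '1'. */
            out[8 * byte + lane] = (char)(uint8_t)(cmp + 0x31);
        }
    }
    out[32] = '\0';
}
```

Note the inverted logic: the compare yields all ones where the bit is clear, and adding 31h to -1 lands on '0', so no separate negation is needed.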


My results (Core2Duo):

with movdqu:

NightWare timing average = 347
sinsi_ex  timing average = 401
hutch_ex2 timing average = 680
JJ BIN$   timing average = 386
mqword   timing average = 376

with movdqa:

NightWare timing average = 337
sinsi_ex  timing average = 401
hutch_ex2 timing average = 676
JJ BIN$   timing average = 376
mqword   timing average = 241


[attachment deleted by admin]
FPU in a trice: SmplMath
It's that simple!

jj2007

Quote from: lingo on June 27, 2008, 04:10:59 PM
Not the same but faster again :lol:

My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1

37 cycles timing BIN$            180 bytes      496 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
32 cycles timing nwDw2BinJJ      102 bytes      323 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

And shorter, too. Will have to study some of these exotic lingoish opcodes ;-)

However, my Celeron does not seem to like it much. Interesting how big the differences between processors are in this case.

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
76 cycles timing BinLingo        204 bytes      1085 LAMPs **** previous version

C:\MASM32\GFA2MASM>bl
39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
60 cycles timing nwDw2Bin        101 bytes      603 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs **** latest version

jj2007

Quote from: drizz on June 27, 2008, 03:04:18 PM
here's my attempt  :8):

Good start, Drizz!

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
60 cycles timing nwDw2Bin        101 bytes      603 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)


[attachment deleted by admin]

qWord

i have ported my idea to mmx, so more people can test it    :U


mmx_dw2bin proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        bitmsk      db  128,64,32,16,8,4,2,1
        ascmsk      db  8 dup (031h)
    .code

    bswap edx
    movq mm6,QWORD ptr [bitmsk]
    movq mm7,QWORD ptr [ascmsk]
    pxor mm5,mm5

    movd mm0,edx
    punpcklbw mm0,mm0

    movq mm1,mm0
    pshufw mm1,mm0,0
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx],mm1

    movq mm2,mm0
    pshufw mm2,mm0,001010101y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+8],mm2

    movq mm1,mm0
    pshufw mm1,mm0,010101010y
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx+16],mm1

    movq mm2,mm0
    pshufw mm2,mm0,011111111y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+24],mm2

    mov BYTE ptr [ecx+32],0

    ret
mmx_dw2bin endp
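
To make the pshufw immediates above easier to read, here is a small C model of how the instruction picks its words. `pshufw_model` is an illustrative name; the register is modelled as four 16-bit words with index 0 as the lowest word.

```c
#include <stdint.h>

/* Model of pshufw dst, src, imm8: destination word i takes the source word
   selected by bits 2i+1..2i of the immediate. */
void pshufw_model(uint16_t dst[4], const uint16_t src[4], uint8_t imm)
{
    uint16_t tmp[4];

    /* Copy first so the model also works when dst and src are the same array. */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[i];

    for (int i = 0; i < 4; i++)
        dst[i] = tmp[(imm >> (2 * i)) & 3];
}
```

With this model, the immediates 00000000y, 01010101y, 10101010y and 11111111y broadcast word 0, 1, 2 and 3 respectively, which is how each doubled byte from punpcklbw ends up in all eight byte lanes.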




results on Core2Duo:

37 cycles timing BIN$            180 bytes      496 LAMPs
43 cycles timing pbin2           147 bytes      521 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
42 cycles timing nwDw2BinJJ      102 bytes      424 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs
34 cycles timing b2aDrizzAt      235 bytes      521 LAMPs
24 cycles timing mmx_dw2bin      177 bytes      319 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)




[attachment deleted by admin]

jj2007

Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it    :U

Very interesting. Here are my timings:

40 cycles timing BIN$            180 bytes      537 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
62 cycles timing nwDw2Bin        101 bytes      623 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
179 cycles timing mmx_dw2bin     129 bytes      2033 LAMPs

(sorry, I played a bad trick:
mov ecx, offset Dw2BinBuffer
inc ecx
call mmx_dw2bin
... which misaligns the target)

Without this bad trick, your code indeed performs excellently:
40 cycles timing BIN$            180 bytes      537 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
65 cycles timing nwDw2Bin        101 bytes      653 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs
35 cycles timing mmx_dw2bin      129 bytes      398 LAMPs

Fast and short, congratulations!
mmx is very sensitive to misalignment, but in the case of a BIN$ macro we can safely assume that we are able to align the target, so imho we have a winning code here  :cheekygreen:
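
The effect of the "inc ecx" trick on the target address can be shown with a simple pointer check. A minimal C sketch, assuming a C11 compiler for `_Alignas`; `dw2bin_buffer` and `is_aligned` are illustrative names, not taken from the attached code.

```c
#include <stddef.h>
#include <stdint.h>

/* A 33-byte result rounded up to 48 bytes, forced onto a 16-byte boundary. */
_Alignas(16) char dw2bin_buffer[48];

/* Returns nonzero if p is a multiple of alignment (a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Here `is_aligned(dw2bin_buffer, 8)` holds, while `is_aligned(dw2bin_buffer + 1, 8)` does not, so every movq store to the incremented address is unaligned and some of them cross a 16-byte boundary.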

EDIT: I have attached the modified code; for consistency with the other algos, I swapped the variable and destination registers as follows:
   ;var in edx NEW: eax
   ;buffer in ecx NEW: edx


[attachment deleted by admin]

jj2007

Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it    :U

Of minor practical relevance: pshufw needs SSE1 (it operates on the mm registers but was introduced with SSE), not just MMX.

qWord

Quote
Of minor practical relevance: pshufw needs SSE1 (it operates on the mm registers but was introduced with SSE), not just MMX.

You're right, I had forgotten :lol:

drizz

Quote from: jj2007 on June 27, 2008, 11:19:23 PM
Of minor practical relevance: pshufw needs SSE1 (it operates on the mm registers but was introduced with SSE), not just MMX.
It's not that hard to replace pshufw. Anyway credits go to qWord!  :U
movq mm6,QWORD ptr [bitmsk]
movq mm7,QWORD ptr [ascmsk]
pxor mm5,mm5
movd mm0,edx
punpcklbw mm0,mm0
punpcklwd mm1,mm0
movq mm2,mm0
punpckhwd mm3,mm0
punpcklwd mm0,mm0
punpckhwd mm1,mm1
punpckhwd mm2,mm2
punpckhwd mm3,mm3
punpckldq mm0,mm0
punpckhdq mm1,mm1
punpckldq mm2,mm2
punpckhdq mm3,mm3
pand mm0,mm6
pand mm1,mm6
pand mm2,mm6
pand mm3,mm6
pcmpeqb mm0,mm5
pcmpeqb mm1,mm5
pcmpeqb mm2,mm5
pcmpeqb mm3,mm5
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
movq [ecx+24],mm0
movq [ecx+16],mm1
movq [ecx+8],mm2
movq [ecx],mm3
mov BYTE ptr [ecx+32],0
edit: removed bswap
edit2: removed some more instructions
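
The unpack sequences above can be traced with a small C model, which also shows why the uninitialised contents of mm1 and mm3 do no harm. This is a sketch under assumptions: `mm_t`, `broadcast_word` and the `garbage` parameter are hypothetical names, and each register is modelled as four 16-bit words with w[0] lowest.

```c
#include <stdint.h>

/* A 64-bit MMX register modelled as four 16-bit words, w[0] lowest. */
typedef struct { uint16_t w[4]; } mm_t;

/* Each helper returns what the destination register d holds afterwards. */
static mm_t punpcklwd(mm_t d, mm_t s) { mm_t r = {{ d.w[0], s.w[0], d.w[1], s.w[1] }}; return r; }
static mm_t punpckhwd(mm_t d, mm_t s) { mm_t r = {{ d.w[2], s.w[2], d.w[3], s.w[3] }}; return r; }
static mm_t punpckldq(mm_t d, mm_t s) { mm_t r = {{ d.w[0], d.w[1], s.w[0], s.w[1] }}; return r; }
static mm_t punpckhdq(mm_t d, mm_t s) { mm_t r = {{ d.w[2], d.w[3], s.w[2], s.w[3] }}; return r; }

/* Broadcasts word k (0..3) of mm0 to all four words, using the same
   instruction sequences as the post. garbage stands for whatever mm1 or mm3
   contained beforehand. */
mm_t broadcast_word(mm_t mm0, int k, mm_t garbage)
{
    mm_t t;
    switch (k) {
    case 0:                                   /* mm0 path */
        t = punpcklwd(mm0, mm0);
        return punpckldq(t, t);
    case 1:                                   /* mm1 path */
        t = punpcklwd(garbage, mm0);          /* g0 W0 g1 W1 */
        t = punpckhwd(t, t);                  /* g1 g1 W1 W1 */
        return punpckhdq(t, t);               /* W1 W1 W1 W1 */
    case 2:                                   /* mm2 path */
        t = punpckhwd(mm0, mm0);
        return punpckldq(t, t);
    default:                                  /* mm3 path */
        t = punpckhwd(garbage, mm0);          /* g2 W2 g3 W3 */
        t = punpckhwd(t, t);                  /* g3 g3 W3 W3 */
        return punpckhdq(t, t);               /* W3 W3 W3 W3 */
    }
}
```

In the mm1 and mm3 paths the stale words always end up in the low half, and the final punpckhdq discards them, so no register needs to be cleared first.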
The truth cannot be learned ... it can only be recognized.

jj2007

Looks good. The old qWord version seems to be an edge faster, see attachment qw.exe

[attachment deleted by admin]

drizz

Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
Yes, I know, but SSE1 means P3 and above; this modified version will now work on all MMX-capable CPUs (e.g. a 13-year-old Pentium).

jj2007

Quote from: drizz on June 28, 2008, 09:28:36 AM
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
Yes, I know, but SSE1 means P3 and above; this modified version will now work on all MMX-capable CPUs (e.g. a 13-year-old Pentium).
That was the point indeed - for a general purpose library this is clearly the better solution. And in terms of LAMPs it beats the hell out of the other algos. My BIN$ algo (inspired by Sinsi) is pretty fast on the Celeron but sucks on real Pentiums.

EDIT:
Timings on a Celeron

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
64 cycles timing nwDw2BinJJ      102 bytes      646 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
46 cycles timing b2aDrizzAt      235 bytes      705 LAMPs
40 cycles timing mmx_dw2bin      132 bytes      460 LAMPs (new Drizz mmx variant)

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
66 cycles timing nwDw2BinJJ      102 bytes      667 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
36 cycles timing mmx_dw2bin      129 bytes      409 LAMPs (old qWord xmm variant)

LAMPs = Lean And Mean Points = cycles * sqrt(size)

qWord

Quote
The old qWord version seems to be an edge faster

Interesting, on my Core2Duo drizz's version is 2~3 clocks faster


sse1:
24 cycles timing mmx_dw2bin      129 bytes      273 LAMPs
drizz's:
22 cycles timing mmx_dw2bin      132 bytes      253 LAMPs


EDIT: Sorry, I forgot to delete the bswap and adjust the pshufw instructions =>
sse1:
18 cycles timing mmx_dw2bin     
drizz's:
22 cycles timing mmx_dw2bin     

jj2007

Quote from: qWord on June 28, 2008, 09:47:41 AM
Quote
The old qWord version seems to be an edge faster

Interesting, on my Core2Duo drizz's version is 2~3 clocks faster
Could you do me a favour and time the CAT$ macro?

qWord

Sorry, that was a false statement on my part; see my previous post.