Bin$

lingo · June 27, 2008, 04:10:59 PM

Not the same but faster again :lol:


mov   eax, esp
lea   esp, [edx+32]
mov   BYTE PTR [edx+32],0
mov   edx, eax

shld  eax, ecx, 30
and   ecx, 00000000Fh
push  DWORD PTR [BinaryTable+ecx*4]

shld  ecx, eax, 28 
and   eax, 00000003Ch 
push  DWORD PTR [BinaryTable+eax]	
shld  eax, ecx, 28

and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

mov   ecx, eax
and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
shr   ecx, 4

mov   eax, ecx
and   ecx, 3Ch
push  DWORD PTR [BinaryTable+ecx]
shr   eax, 4

and   eax, 3Ch
push  DWORD PTR [BinaryTable+eax]
mov   esp, edx 
ret
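
For readers who would rather not trace the shld/and/push sequence, here is a minimal C sketch of the table-driven idea. It assumes that BinaryTable (not shown in the post) holds 16 DWORD entries, one 4-character group per nibble value. The names nibble_table and bin_from_table are made up for illustration; this models the approach and is not lingo's routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of BinaryTable: one 4-character group per nibble value. */
static const char nibble_table[16][5] = {
    "0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111",
    "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"
};

/* Writes the 32 binary digits of value into out, followed by a NUL. */
void bin_from_table(uint32_t value, char out[33])
{
    assert(out != NULL);
    /* Like the pushes in the asm, fill the buffer from its end:
       the least significant nibble lands in the last 4 characters. */
    for (int group = 7; group >= 0; --group) {
        memcpy(out + 4 * group, nibble_table[value & 0xF], 4);
        value >>= 4;
    }
    out[32] = '\0';
}
```

Calling bin_from_table(0xA5, buf) with a 33-byte buf leaves 24 zeros followed by 10100101. The asm gets the same effect without a loop by pointing esp at the end of the buffer and pushing one table entry per nibble.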


My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1


37 cycles timing BIN$            180 bytes      496 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
32 cycles timing nwDw2BinJJ      102 bytes      323 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
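
The LAMPs column can be recomputed from the other two. A small C sketch follows, assuming the result is rounded to the nearest integer (which appears to match the figures quoted here). The function names are hypothetical, and the square root uses a hand-rolled Newton iteration only so that the sketch builds without the math library.

```c
#include <assert.h>

/* Square root by Newton's iteration; more than enough steps for sizes in bytes. */
static double newton_sqrt(double x)
{
    assert(x >= 0.0);
    if (x == 0.0)
        return 0.0;
    double r = x > 1.0 ? x : 1.0;   /* start at or above the root */
    for (int i = 0; i < 64; ++i)
        r = 0.5 * (r + x / r);
    return r;
}

/* LAMPs = cycles * sqrt(size), rounded to nearest (rounding is an assumption). */
unsigned lamps(unsigned cycles, unsigned size_bytes)
{
    return (unsigned)(cycles * newton_sqrt((double)size_bytes) + 0.5);
}
```

For example, lamps(37, 180) gives 496 and lamps(27, 187) gives 369, the values listed for BIN$ and BinLingo above. Because size enters only as a square root, halving the cycle count helps the score far more than halving the code size.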

qWord · June 27, 2008, 04:55:03 PM

Here is a SIMD version (uses SSSE3). I've tested it with unaligned and aligned moves (movdqu/movdqa).

OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 

align 16

_qword proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        shufmsk     db  8 dup (1)
                    db  8 dup (0)
        bitmsk      db  128,64,32,16,8,4,2,1
                    db  128,64,32,16,8,4,2,1
        ascmsk      db  16 dup (31h)

    .code
 
    movdqa xmm3,OWORD ptr [shufmsk]
    movdqa xmm4,OWORD ptr [bitmsk]
    movdqa xmm5,OWORD ptr [ascmsk]

    pinsrw xmm0,edx,0
    shr edx,16
    pxor xmm2,xmm2
    pinsrw xmm1,edx,0

    pshufb xmm1,xmm3
    pand xmm1,xmm4 
    pcmpeqb xmm1,xmm2
    paddsb xmm1,xmm5
    movdqu OWORD ptr [ecx],xmm1

    align 8
    pshufb xmm0,xmm3
    pand xmm0,xmm4   
    pcmpeqb xmm0,xmm2
    paddsb xmm0,xmm5
    movdqu OWORD ptr [ecx+16],xmm0

    mov BYTE ptr [ecx+32],0

    ret
_qword endp

OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef
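
What the pshufb / pand / pcmpeqb / paddsb chain does to each byte lane can be modelled in scalar C. This is only a sketch of the arithmetic, one lane at a time, and the name bin_by_mask is hypothetical. The constants mirror bitmsk and ascmsk from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the SIMD lanes: 32 output bytes, most significant bit first. */
void bin_by_mask(uint32_t value, char out[33])
{
    static const uint8_t bitmsk[8] = {128, 64, 32, 16, 8, 4, 2, 1};
    const uint8_t ascmsk = 0x31;                 /* ASCII '1' */

    assert(out != NULL);
    for (int byte = 0; byte < 4; ++byte) {
        /* pshufb: the same source byte is copied into 8 adjacent lanes */
        uint8_t src = (uint8_t)(value >> (24 - 8 * byte));
        for (int lane = 0; lane < 8; ++lane) {
            uint8_t masked = src & bitmsk[lane];          /* pand              */
            uint8_t cmp = masked == 0 ? 0xFF : 0x00;      /* pcmpeqb with zero */
            /* paddsb: -1 + 0x31 = 0x30 ('0'), 0 + 0x31 = 0x31 ('1');
               neither sum saturates, so a wrapping add gives the same byte */
            out[8 * byte + lane] = (char)(uint8_t)(cmp + ascmsk);
        }
    }
    out[32] = '\0';
}
```

The compare is inverted on purpose: a clear bit produces 0xFF (that is, -1), which pulls '1' down to '0', while a set bit leaves '1' untouched.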

My results (Core2Duo):

with movdqu:

NightWare timing average = 347
sinsi_ex  timing average = 401
hutch_ex2 timing average = 680
JJ BIN$   timing average = 386
mqword   timing average = 376

with movdqa:

NightWare timing average = 337
sinsi_ex  timing average = 401
hutch_ex2 timing average = 676
JJ BIN$   timing average = 376
mqword   timing average = 241

jj2007 · June 27, 2008, 06:31:18 PM

Quote from: lingo on June 27, 2008, 04:10:59 PM
Not the same but faster again :lol:

My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1

37 cycles timing BIN$ 180 bytes 496 LAMPs
42 cycles timing pbin2 147 bytes 509 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
32 cycles timing nwDw2BinJJ 102 bytes 323 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
27 cycles timing BinLingo 187 bytes 369 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

And shorter, too. Will have to study some of these exotic lingoish opcodes ;-)

However, my Celeron does not seem to like it much. Interesting how big the differences between processors are in this case.

39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
76 cycles timing BinLingo 204 bytes 1085 LAMPs **** previous version

C:\MASM32\GFA2MASM>bl
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
60 cycles timing nwDw2Bin 101 bytes 603 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs **** latest version

jj2007 · June 27, 2008, 07:13:16 PM

Quote from: drizz on June 27, 2008, 03:04:18 PM
here's my attempt :8):

Good start, Drizz!

39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
60 cycles timing nwDw2Bin 101 bytes 603 LAMPs
65 cycles timing nwDw2BinJJ 102 bytes 656 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

qWord · June 27, 2008, 09:16:36 PM

i have ported my idea to mmx, so more people can test it :U

mmx_dw2bin proc ;var:DWORD,buffer:DWORD
   ;var in edx
   ;buffer in ecx

    .data
        align 16
        bitmsk      db  128,64,32,16,8,4,2,1
        ascmsk      db  8 dup (031h)
    .code

    bswap edx
    movq mm6,QWORD ptr [bitmsk]
    movq mm7,QWORD ptr [ascmsk]
    pxor mm5,mm5

    movd mm0,edx
    punpcklbw mm0,mm0

    movq mm1,mm0
    pshufw mm1,mm0,0
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx],mm1

    movq mm2,mm0
    pshufw mm2,mm0,001010101y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+8],mm2

    movq mm1,mm0
    pshufw mm1,mm0,010101010y
    pand mm1,mm6
    pcmpeqb mm1,mm5
    paddsb mm1,mm7
    movq QWORD ptr [ecx+16],mm1

    movq mm2,mm0
    pshufw mm2,mm0,011111111y
    pand mm2,mm6
    pcmpeqb mm2,mm5
    paddsb mm2,mm7
    movq QWORD ptr [ecx+24],mm2

    mov BYTE ptr [ecx+32],0

    ret
mmx_dw2bin endp
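
The four pshufw immediates are easier to read once the encoding is spelled out: each pair of bits in the immediate selects which source word goes to the corresponding destination word. A hedged C model (pshufw_model is a made-up name, with words held in a plain array instead of an MMX register):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of pshufw dst, src, imm: destination word i is taken from
   source word number (imm >> 2*i) & 3. */
void pshufw_model(uint16_t dst[4], const uint16_t src[4], uint8_t imm)
{
    uint16_t tmp[4];

    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 4; ++i)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; ++i)      /* copy last, so dst may alias src */
        dst[i] = tmp[i];
}
```

With that reading, the immediates 0, 01010101y, 10101010y and 11111111y pick word 0, 1, 2 and 3 respectively for all four destination words. After punpcklbw has doubled every byte into a word, this broadcasts one input byte across all eight lanes.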




results on Core2Duo:

37 cycles timing BIN$            180 bytes      496 LAMPs
43 cycles timing pbin2           147 bytes      521 LAMPs
36 cycles timing nwDw2Bin        101 bytes      362 LAMPs
42 cycles timing nwDw2BinJJ      102 bytes      424 LAMPs
30 cycles timing NightWare       204 bytes      428 LAMPs
27 cycles timing BinLingo        187 bytes      369 LAMPs
34 cycles timing b2aDrizzAt      235 bytes      521 LAMPs
24 cycles timing mmx_dw2bin      177 bytes      319 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

jj2007 · June 27, 2008, 09:28:52 PM

Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it :U

Very interesting. Here are my timings:

40 cycles timing BIN$ 180 bytes 537 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
62 cycles timing nwDw2Bin 101 bytes 623 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
179 cycles timing mmx_dw2bin 129 bytes 2033 LAMPs

(sorry, I played a bad trick:

	mov ecx, offset Dw2BinBuffer
	inc ecx			; deliberately move the destination off its alignment
	call mmx_dw2bin

... which misaligns the target)

Without this bad trick, your code performs indeed excellently:
40 cycles timing BIN$ 180 bytes 537 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
65 cycles timing nwDw2Bin 101 bytes 653 LAMPs
65 cycles timing nwDw2BinJJ 102 bytes 656 LAMPs
54 cycles timing NightWare 204 bytes 771 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs
35 cycles timing mmx_dw2bin 129 bytes 398 LAMPs

Fast and short, congratulations!
MMX is very sensitive to misalignment, but for a BIN$ macro we can safely assume that we control the alignment of the target, so IMHO we have a winner here :cheekygreen:
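
To make the alignment point concrete, here is a small C sketch of an aligned destination buffer and an alignment check. The names dw2bin_buffer, aligned_target and is_aligned are hypothetical, and the 16-byte figure is simply a choice that satisfies both the 8-byte movq stores and a movdqa variant.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* 32 digits + NUL, rounded up; alignas makes the start a multiple of 16. */
static alignas(16) char dw2bin_buffer[48];

/* Nonzero if p is a multiple of alignment (which must be a power of two). */
int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* The buffer a BIN$-style macro could hand to the conversion routine. */
char *aligned_target(void)
{
    return dw2bin_buffer;
}
```

Here is_aligned(aligned_target(), 8) holds, while the pointer one byte further on fails the check. That second case is the "bad trick" situation behind the 179-cycle timing.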

EDIT: I have attached the modified code; for consistency with the other algos, I swapped the variable and destination registers as follows:
;var in edx NEW: eax
;buffer in ecx NEW: edx

[attachment deleted by admin]

jj2007 · June 27, 2008, 11:19:23 PM

Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it :U

Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
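
A library that wants to choose between the two variants at run time would look at the CPUID feature flags. The sketch below only decodes the EDX value returned by CPUID leaf 1 (bit 23 = MMX, bit 25 = SSE); the function names are hypothetical, and fetching the value is left to the caller.

```c
#include <stdint.h>

/* Feature bits in EDX after CPUID with EAX = 1. */
enum {
    CPUID1_EDX_MMX = 1u << 23,
    CPUID1_EDX_SSE = 1u << 25
};

/* Nonzero if the plain MMX instructions are available. */
int has_mmx(uint32_t cpuid1_edx)
{
    return (cpuid1_edx & CPUID1_EDX_MMX) != 0;
}

/* pshufw on mm registers arrived with SSE, so test the SSE bit. */
int can_use_pshufw(uint32_t cpuid1_edx)
{
    return (cpuid1_edx & CPUID1_EDX_SSE) != 0;
}
```

With GCC or Clang on x86 the EDX value can be obtained via __get_cpuid(1, &eax, &ebx, &ecx, &edx) from <cpuid.h>; in MASM it is a cpuid instruction with eax = 1.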

qWord · June 28, 2008, 12:17:56 AM

Quote
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.

You're right, I'd forgotten :lol:

drizz · June 28, 2008, 08:14:18 AM

Quote from: jj2007 on June 27, 2008, 11:19:23 PM
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.

It's not that hard to replace pshufw. Anyway credits go to qWord! :U

	movq mm6,QWORD ptr [bitmsk]
	movq mm7,QWORD ptr [ascmsk]
	pxor mm5,mm5
	movd mm0,edx
	punpcklbw mm0,mm0
	punpcklwd mm1,mm0
	movq mm2,mm0
	punpckhwd mm3,mm0
	punpcklwd mm0,mm0
	punpckhwd mm1,mm1
	punpckhwd mm2,mm2
	punpckhwd mm3,mm3
	punpckldq mm0,mm0
	punpckhdq mm1,mm1
	punpckldq mm2,mm2
	punpckhdq mm3,mm3
	pand mm0,mm6
	pand mm1,mm6
	pand mm2,mm6
	pand mm3,mm6
	pcmpeqb mm0,mm5
	pcmpeqb mm1,mm5
	pcmpeqb mm2,mm5
	pcmpeqb mm3,mm5
	paddb mm0,mm7
	paddb mm1,mm7
	paddb mm2,mm7
	paddb mm3,mm7
	movq [ecx+24],mm0
	movq [ecx+16],mm1
	movq [ecx+8],mm2
	movq [ecx],mm3
	mov BYTE ptr [ecx+32],0

edit: removed bswap
edit2: removed some more instructions
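
It may not be obvious why punpcklwd mm1,mm0 and punpckhwd mm3,mm0 are safe when mm1 and mm3 were never initialised. The C model below replays the unpack cascade on byte arrays, with a garbage filler standing in for the old register contents; it is a sketch for study only, and mm_t, punpck and broadcast_bytes are invented names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[8]; } mm_t;      /* one 64-bit MMX register */

/* punpck{l,h}{bw,wd,dq} dst, src: interleave elem-byte elements taken from
   the low (high = 0) or high (high = 1) half of dst and src. */
static mm_t punpck(mm_t dst, mm_t src, int elem, int high)
{
    mm_t r;
    int base = high ? 4 : 0;
    assert(elem == 1 || elem == 2 || elem == 4);
    for (int i = 0; i < 4 / elem; ++i) {
        memcpy(&r.b[2 * i * elem], &dst.b[base + i * elem], (size_t)elem);
        memcpy(&r.b[(2 * i + 1) * elem], &src.b[base + i * elem], (size_t)elem);
    }
    return r;
}

/* out[k] ends up holding byte k of value (k = 0 is the low byte) 8 times. */
void broadcast_bytes(uint32_t value, mm_t out[4])
{
    mm_t mm0 = {{0}}, mm1, mm2, mm3;
    assert(out != NULL);
    memset(&mm1, 0xCC, sizeof mm1);         /* stand-in for stale contents */
    memset(&mm3, 0xCC, sizeof mm3);
    for (int i = 0; i < 4; ++i)             /* movd mm0,edx */
        mm0.b[i] = (uint8_t)(value >> (8 * i));

    mm0 = punpck(mm0, mm0, 1, 0);           /* punpcklbw mm0,mm0 */
    mm1 = punpck(mm1, mm0, 2, 0);           /* punpcklwd mm1,mm0 */
    mm2 = mm0;                              /* movq mm2,mm0      */
    mm3 = punpck(mm3, mm0, 2, 1);           /* punpckhwd mm3,mm0 */
    mm0 = punpck(mm0, mm0, 2, 0);           /* punpcklwd mm0,mm0 */
    mm1 = punpck(mm1, mm1, 2, 1);           /* punpckhwd mm1,mm1 */
    mm2 = punpck(mm2, mm2, 2, 1);           /* punpckhwd mm2,mm2 */
    mm3 = punpck(mm3, mm3, 2, 1);           /* punpckhwd mm3,mm3 */
    mm0 = punpck(mm0, mm0, 4, 0);           /* punpckldq mm0,mm0 */
    mm1 = punpck(mm1, mm1, 4, 1);           /* punpckhdq mm1,mm1 */
    mm2 = punpck(mm2, mm2, 4, 0);           /* punpckldq mm2,mm2 */
    mm3 = punpck(mm3, mm3, 4, 1);           /* punpckhdq mm3,mm3 */

    out[0] = mm0; out[1] = mm1; out[2] = mm2; out[3] = mm3;
}
```

In the model the filler bytes never reach the result: every later step keeps only the half of the register that came from mm0. This is also why the stores run from [ecx+24] for mm0 (low byte) down to [ecx] for mm3 (high byte) with no bswap needed.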

jj2007 · June 28, 2008, 09:08:12 AM

Looks good. The old qWord version seems to be an edge faster, see attachment qw.exe

[attachment deleted by admin]

drizz · June 28, 2008, 09:28:36 AM

Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.

yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).

jj2007 · June 28, 2008, 09:32:25 AM

Quote from: drizz on June 28, 2008, 09:28:36 AM
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).

That was the point indeed - for a general purpose library this is clearly the better solution. And in terms of LAMPs it beats the hell out of the other algos. My BIN$ algo (inspired by Sinsi) is pretty fast on the Celeron but sucks on real Pentiums.

EDIT:
Timings on a Celeron

39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
54 cycles timing NightWare 204 bytes 771 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs
40 cycles timing mmx_dw2bin 132 bytes 460 LAMPs (new Drizz mmx variant)

39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
66 cycles timing nwDw2BinJJ 102 bytes 667 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
36 cycles timing mmx_dw2bin 129 bytes 409 LAMPs (old qWord xmm variant)

LAMPs = Lean And Mean Points = cycles * sqrt(size)

qWord · June 28, 2008, 09:47:41 AM

Quote
The old qWord version seems to be an edge faster

Interesting, on my Core2Duo drizz's version is 2~3 clocks faster

sse1:
24 cycles timing mmx_dw2bin      129 bytes      273 LAMPs
drizz's:
22 cycles timing mmx_dw2bin      132 bytes      253 LAMPs

EDIT: Sorry, I forgot to delete the bswap and adjust the pshufw instructions =>

sse1:
18 cycles timing mmx_dw2bin      
drizz's:
22 cycles timing mmx_dw2bin

jj2007 · June 28, 2008, 10:17:50 AM

Quote from: qWord on June 28, 2008, 09:47:41 AM
Quote
The old qWord version seems to be an edge faster

interesting, on my Core2Duo drizzs version is 2~3 clocks faster

Could you do me a favour and time the CAT$ macro?

qWord · June 28, 2008, 10:22:49 AM

Sorry, that was a false statement on my part; see my previous post.
