Bin$

jj2007 · June 28, 2008, 10:26:37 AM

Quote from: qWord on June 28, 2008, 10:22:49 AM
Sorry, it was a false statement by me; see my previous post

No problem; I was just curious how the Core2Duo performs on the CAT$ algo.

DoomyD · July 03, 2008, 05:08:03 PM

I know it's a bit late, but I came across this topic and thought I could give this a shot =P

Quote

dw2binstr   proc
   ;Value - EAX
   ;Buffer - EBX
   mov      edx, eax   ;EDX holds the value
   mov      ecx, 8      ;Bits per byte
   @@:
      mov      eax, edx
      and      eax, 01010101h         ;Filters the low bit of every byte
      or      eax, 30303030h         ;ASCII conversion
      mov      byte ptr [ebx+31], al   ;Placing the bytes into the buffer
      mov      byte ptr [ebx+23], ah
      ror      eax, 16
      mov      byte ptr [ebx+15], al
      mov      byte ptr [ebx+07], ah
      dec      ebx                  ;Going backwards (buffer-wise)
      ror      edx,1               ;Setting the next set of bits
      dec      ecx                  ;Loop back
      jnz      @B
   retn
dw2binstr   endp

So... what do you think?
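For readers who want to follow the trick without an assembler, here is a minimal C sketch of what the loop above appears to do: each pass isolates bit 0 of all four bytes at once, turns them into ASCII digits, and stores one digit into each 8-character group. The function name `dw2binstr_model` and the trailing terminator are assumptions added for illustration; the posted routine does not write a terminator.

```c
#include <stdint.h>

/* C model of the dw2binstr loop (hypothetical name, for illustration only).
   Each pass handles bit 0 of all four bytes and writes four ASCII digits. */
static void dw2binstr_model(uint32_t value, char buf[33])
{
    for (int i = 0; i < 8; i++) {
        /* and eax, 01010101h / or eax, 30303030h: four '0'/'1' chars in one dword */
        uint32_t digits = (value & 0x01010101u) | 0x30303030u;

        buf[31 - i] = (char)(digits & 0xFF);          /* byte 0: bits 0..7   */
        buf[23 - i] = (char)((digits >> 8) & 0xFF);   /* byte 1: bits 8..15  */
        buf[15 - i] = (char)((digits >> 16) & 0xFF);  /* byte 2: bits 16..23 */
        buf[7 - i]  = (char)((digits >> 24) & 0xFF);  /* byte 3: bits 24..31 */

        value = (value >> 1) | (value << 31);         /* ror edx, 1 */
    }
    buf[32] = '\0';  /* assumption: added so the result is a C string */
}
```

Calling `dw2binstr_model(0x1234ABCD, buf)` fills `buf` with `00010010001101001010101111001101`. Note that the assembly version also leaves EBX decremented by 8 and EDX rotated on return, which the C model does not reproduce.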

jj2007 · July 03, 2008, 05:59:39 PM

Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...

[attachment deleted by admin]

qWord · July 03, 2008, 06:39:44 PM

hey jj,

perhaps the following code could be interesting for you:

Code Select


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
;                                                  ;             
;   sse1_dw2hex: converts a dword-value to an      ;             
;                ASC-hex-string                    ;             
;                                                  ;             
;       eax = dwValue                              ;             
;       edx = lpBuffer , should be aligned to 8    ;             
;                                                  ;             
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
align 16                                                         
sse1_dw2hex proc ;var:DWORD,buffer:DWORD                         
                                                                 
    .data                                                        
        align 16                                                 
        d2h_bitmsk  dw 4 dup (0f00fh)                            
        d2h_cmpmsk  db 8 dup (9)                                 
        d2h_09_msk  db 8 dup (030h)                              
        d2h_AF_msk  db 8 dup (7)                                 
    .code                                                        
                                                                 
    ;bswap eax                          ;<== insert for mmx only 
    movq mm4,QWORD ptr [d2h_bitmsk]     ;       |                
    movq mm5,QWORD ptr [d2h_cmpmsk]     ;       |                
    movq mm6,QWORD ptr [d2h_09_msk]     ;       |                
    movq mm7,QWORD ptr [d2h_AF_msk]     ;       |                
                                        ;       |                
    movd mm1,eax                        ;       |                
    punpcklbw mm1,mm1                   ;       V                
    pshufw mm1,mm1,000011011y           ;<== delete for mmx only 
                                                                 
    pand mm1,mm4                                                 
    movq mm0,mm1                                                 
    psrlw mm0,12                                                 
    psllw mm1,8                                                  
                                                                 
    por mm0,mm1                                                  
    movq mm2,mm0                                                 
    pcmpgtb mm2,mm5                                              
                                                                 
    pand mm2,mm7                                                 
    paddb mm2,mm6                                                
    paddb mm2,mm0                                                
                                                                 
    movq QWORD ptr [edx],mm2                                     
    mov BYTE ptr [edx+8],0                                       
                                                                 
    ret                                                          
sse1_dw2hex endp
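As a cross-check on the routine above, the nibble-to-ASCII step can be modelled in scalar C. This is only a sketch of the arithmetic as it reads from the listing (compare against 9, mask with 7, add 30h, add the nibble), not a translation of the MMX shuffles; `dw2hex_model` is a hypothetical name.

```c
#include <stdint.h>

/* Scalar model of the hex conversion (hypothetical name, illustration only).
   Writes 8 uppercase hex digits, most significant nibble first, plus a terminator. */
static void dw2hex_model(uint32_t value, char buf[9])
{
    for (int i = 0; i < 8; i++) {
        uint8_t nibble = (uint8_t)((value >> (28 - 4 * i)) & 0x0F);

        /* pcmpgtb against 9: all ones where the nibble is A..F, zero otherwise */
        uint8_t above9 = (nibble > 9) ? 0xFF : 0x00;

        /* pand with 7, paddb 30h, paddb nibble */
        buf[i] = (char)((above9 & 7) + 0x30 + nibble);
    }
    buf[8] = '\0';  /* mirrors mov BYTE ptr [edx+8],0 */
}
```

The extra 7 bridges the gap between `'9'` (39h) and `'A'` (41h), so `dw2hex_model(0x1234ABCD, buf)` produces `1234ABCD`.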

DoomyD · July 03, 2008, 07:35:38 PM

Quote from: jj2007 on July 03, 2008, 05:59:39 PM
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...

Hmm... weird...
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).

EDIT: My outputs:

Code Select

30 cycles timing BIN$     	 180 bytes	402 LAMPs
44 cycles timing pbin2     	 147 bytes	533 LAMPs
36 cycles timing nwDw2Bin	 101 bytes	362 LAMPs
42 cycles timing nwDw2BinJJ	 102 bytes	424 LAMPs
30 cycles timing NightWare	 204 bytes	428 LAMPs
28 cycles timing BinLingo	 187 bytes	383 LAMPs
34 cycles timing b2aDrizzAt	 235 bytes	521 LAMPs
22 cycles timing mmx_dw2bin	 132 bytes	253 LAMPs
75 cycles timing dw2binstr	 49 bytes	525 LAMPs

Code Select

32 Cycles

I'm beginning to wonder if it has to do with my CPU...

[attachment deleted by admin]

jj2007 · July 03, 2008, 07:59:58 PM

Quote from: DoomyD on July 03, 2008, 07:35:38 PM
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).

Shows 101 cycles for me, still slow compared to the 40 of the BIN$ and mmx_ variants. But it's indeed weird that I see 215 cycles on my puter, while your exe performs in 101; and you saw from my source that there is not much overhead.

I use timers.asm: \Masm32\macros\TIMERS.ASM 10095 bytes of 15.02.2005

Code Select

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
40 cycles timing mmx_dw2bin      132 bytes      460 LAMPs
215 cycles timing dw2binstr      49 bytes       1505 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)
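The LAMPs column can be reproduced with a few lines of C. This sketch assumes the scores are rounded to the nearest integer, which matches the rows in the table but is not stated in the thread. The small Newton iteration stands in for `sqrt` only so the snippet needs no math library; both function names are hypothetical.

```c
/* Square root by Newton iteration (stand-in for sqrt from <math.h>). */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;
    double r = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* LAMPs = cycles * sqrt(size), rounded to nearest (rounding is an assumption). */
static long lamps(int cycles, int size_bytes)
{
    return (long)(cycles * sqrt_newton((double)size_bytes) + 0.5);
}
```

For example, `lamps(39, 180)` gives 523 and `lamps(215, 49)` gives 1505, matching the BIN$ and dw2binstr rows.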

jj2007 · July 03, 2008, 08:02:35 PM

Quote from: qWord on July 03, 2008, 06:39:44 PM
hey jj,
eventually the following code could be interesting for you:

How much improvement?
Look for mmx_dw2bin in the previously attached source, change the if 0 to if 1, and adapt your old code.

EDIT:
39 cycles timing BIN$ 180 bytes 523 LAMPs
23 cycles timing ssemmxonly 97 bytes 227 LAMPs

Could you PLEASE tune it a little bit? Say, 3 cycles less, just to get a round figure? :cheekygreen:

EDIT (2):
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red

EDIT (3):

Code Select

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
63 cycles timing nwDw2Bin        101 bytes      633 LAMPs
66 cycles timing nwDw2BinJJ      102 bytes      667 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
39 cycles timing QwordMmx        132 bytes      448 LAMPs xxxxxx
216 cycles timing dw2binstr      49 bytes       1512 LAMPs
350 cycles timing Dword2Bin2     56 bytes       2619 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

I renamed your mmx_dw2bin to QwordMmx. Still the best LAMPs score :toothy

Attached the latest build with sources, as asm and rtf
Question on the latter:
This displays just fine in (MS) WordPad and (jj) RichMasm, but (MS) Word has a serious problem with the (MS Windows) System font – they seem not as compatible as they should be... any ideas?

[attachment deleted by admin]

jj2007 · July 03, 2008, 08:43:07 PM

Quote from: DoomyD on July 03, 2008, 07:35:38 PM
Code Select Expand
75 cycles timing dw2binstr 49 bytes 525 LAMPs

Code Select Expand
32 Cycles

I'm beginning to wonder if it has to do with my CPU...

Nope, your CPU is fine, you are measuring different cycle counts on the same puter; so it's the code, not the CPU. Mind posting your source?

qWord · July 03, 2008, 11:04:53 PM

Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Still the best LAMPs score

nice to see :green

Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red

sorry for confusing you :bg

I attached a file with a modified version of dw2hex

EDIT:

Quote
How much improvement?

a quick speed test shows that sse2_dw2hex is approx. 4 times faster than dw2str from masm32.lib

[attachment deleted by admin]

DoomyD · July 04, 2008, 05:10:48 AM

Attached

[attachment deleted by admin]

jj2007 · July 04, 2008, 05:45:04 AM

Quote from: DoomyD on July 04, 2008, 05:10:48 AM
Attached

Mystery solved: your code runs twice as fast because your timing loop contains only one call, while mine contains two...

Code Select

			counter_begin 100000h,HIGH_PRIORITY_CLASS
				mov		eax,00010010001101001010101111001101b ;1234ABCDh
				mov		ebx,offset str1
				invoke	dw2binstr
			counter_end

My version:

Code Select

	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
		mov ecx, 11111000001111100000111111110000b
		mov edx, offset Dw2BinBuffer
		call dw2binstr
		mov ecx, 00001111100000111110000011111111b
		mov edx, offset Dw2BinBuffer
		call dw2binstr
	counter_end

P.S.: In the standard Masm32 installation, libs sit in \masm32\lib\
   include   \masm32\include\windows.inc
   include   \masm32\macros\timers.asm
   include   \masm32\macros\macros.asm

   include       \masm32\include\masm32.inc
   includelib    \masm32\lib\masm32.lib

   include       \masm32\include\kernel32.inc
   includelib    \masm32\lib\kernel32.lib
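To compare numbers from the two harnesses, the reported figure can be divided by the number of calls inside the timed loop. A minimal C sketch, assuming the counter macros report cycles per loop iteration (the 215 versus 101 figures suggest this, but the thread does not state it outright); `cycles_per_call` is a hypothetical helper name.

```c
/* Hypothetical helper: normalise a per-iteration cycle count by the
   number of calls made inside the timed loop. */
static double cycles_per_call(double cycles_per_iteration, int calls_per_iteration)
{
    return cycles_per_iteration / calls_per_iteration;
}
```

With the figures from the thread, `cycles_per_call(215, 2)` is 107.5, close to the 101 cycles measured with the single-call loop.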

DoomyD · July 28, 2008, 12:34:00 AM

Finally, I found the time to take a closer look at it...
I modified drizz's modification, and squeezed another cycle out of it =) (although it takes 300,000,000 loops to actually see it :lol)

Code Select

__QwordMmx proc
	movq mm6, QWORD ptr [bitmsk]
	movq mm7, QWORD ptr [ascmsk]
	pxor mm5, mm5
	movd mm0, eax
	
	punpcklbw mm0, mm0
	
	punpckhdq mm2, mm0
	punpckldq mm0, mm0
	
	punpckhwd mm0, mm0
	punpckhwd mm2, mm2

	punpckhdq mm1, mm0
	punpckldq mm0, mm0
	punpckhdq mm3, mm2
	punpckldq mm2, mm2

	punpckhdq mm1, mm1
	punpckhdq mm3, mm3

	pand mm0, mm6
	pand mm1, mm6
	pand mm2, mm6
	pand mm3, mm6
	
	pcmpeqb mm0, mm5
	pcmpeqb mm1, mm5
	pcmpeqb mm2, mm5
	pcmpeqb mm3, mm5
	
	paddb mm0,mm7
	paddb mm1,mm7
	paddb mm2,mm7
	paddb mm3,mm7
	
	movq [edx+24],mm0
	movq [edx+16],mm1
	movq [edx+8],mm2
	movq [edx],mm3
	mov BYTE ptr [edx+32],0	
	retn
__QwordMmx endp
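The data flow of the routine above can be modelled in plain C. The `punpck` chain replicates each input byte into all eight lanes of one MMX register; the sketch below does the same per byte and then applies the mask, compare and add steps lane by lane. The contents of `bitmsk` and `ascmsk` are never shown in the thread, so the values used here (80h down to 01h, and eight `'1'` characters) are assumptions chosen because they make the arithmetic produce a binary string; `qwordmmx_model` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed constants: the thread does not show bitmsk or ascmsk. */
static const uint8_t BITMSK[8] = {0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01};
static const uint8_t ASCMSK = '1';

/* Scalar model of __QwordMmx (hypothetical name, illustration only). */
static void qwordmmx_model(uint32_t value, char buf[33])
{
    for (int k = 0; k < 4; k++) {                 /* k = 0 is the lowest byte */
        uint8_t b = (uint8_t)(value >> (8 * k));  /* byte replicated 8x by the punpck chain */
        char *dst = buf + 24 - 8 * k;             /* lowest byte is stored at [edx+24] */

        for (int lane = 0; lane < 8; lane++) {
            uint8_t masked  = b & BITMSK[lane];              /* pand    */
            uint8_t is_zero = (masked == 0) ? 0xFF : 0x00;   /* pcmpeqb against zero */
            dst[lane] = (char)(uint8_t)(is_zero + ASCMSK);   /* paddb: FFh + '1' wraps to '0' */
        }
    }
    buf[32] = '\0';  /* mirrors mov BYTE ptr [edx+32],0 */
}
```

With these assumed masks, `qwordmmx_model(0x12345678, buf)` yields `00010010001101000101011001111000`. The wrap-around in the add is what lets a single `paddb` replace a separate select between `'0'` and `'1'`.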

Core 2 Duo x32:

Code Select

25 - QwordMmx - qWord
22 - QwordMmx - drizz
21 - QwordMmx - new

[attachment deleted by admin]

jj2007 · July 28, 2008, 02:34:30 AM

loop count: 300000000
31 - QwordMmx - qWord
33 - QwordMmx - drizz
31 - QwordMmx - new

Celeron M ...

DoomyD · July 28, 2008, 12:03:42 PM

Noticed I can cut another line =)

Code Select

qWordMmxOpt proc
	movq mm7, QWORD ptr [ascmsk]
	movq mm6, QWORD ptr [bitmsk]
	pxor mm5, mm5
	movd mm0, eax
	
	punpcklbw mm0, mm0
	punpckhdq mm2, mm0
	
	punpcklwd mm0, mm0
	punpckhwd mm2, mm2

	punpckhdq mm1, mm0
	punpckhdq mm3, mm2
	
	punpckldq mm0, mm0
	punpckhdq mm1, mm1
	punpckldq mm2, mm2
	punpckhdq mm3, mm3

	pand mm0, mm6
	pand mm1, mm6
	pand mm2, mm6
	pand mm3, mm6
	
	pcmpeqb mm0, mm5
	pcmpeqb mm1, mm5
	pcmpeqb mm2, mm5
	pcmpeqb mm3, mm5
	
	paddb mm0,mm7
	paddb mm1,mm7
	paddb mm2,mm7
	paddb mm3,mm7
	
	movq [edx+24],mm0
	movq [edx+16],mm1
	movq [edx+08],mm2
	movq [edx+00],mm3
	mov BYTE ptr [edx+32],0
	retn
qWordMmxOpt endp

Code Select

Loop count: 2000000000
Method: timer_begin\timer_end:
10104 {00010010001101000101011001111000} QwordMmx(1) - qWord
10051 {00010010001101000101011001111000} QwordMmx(2) - qWord
9837 {00010010001101000101011001111000} _QwordMmx(1) - drizz
9885 {00010010001101000101011001111000} _QwordMmx(2) - drizz
9245 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
9191 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

[attachment deleted by admin]

jj2007 · July 28, 2008, 01:56:24 PM

I timed it twice, the 2nd time with reduced loop count, but results are stable on a P4, 3.4 GHz:

Code Select

Loop count: 2000000000
Method: timer_begin\timer_end:
11319 {00010010001101000101011001111000} QwordMmx(1) - qWord
11452 {00010010001101000101011001111000} QwordMmx(2) - qWord
14740 {00010010001101000101011001111000} _QwordMmx(1) - drizz
14475 {00010010001101000101011001111000} _QwordMmx(2) - drizz
13840 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
13833 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

Loop count: 200000000
Method: timer_begin\timer_end:
1086 {00010010001101000101011001111000} QwordMmx(1) - qWord
1088 {00010010001101000101011001111000} QwordMmx(2) - qWord
1450 {00010010001101000101011001111000} _QwordMmx(1) - drizz
1455 {00010010001101000101011001111000} _QwordMmx(2) - drizz
1385 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
1387 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

My timings with your previous code on the Celeron M saw your code on par with qWord... which processor are you using?
