Bin$

Started by jj2007, June 20, 2008, 01:39:06 AM

jj2007

Quote from: qWord on June 28, 2008, 10:22:49 AM
Sorry, that was a false statement by me; see my previous post
No problem; I was just curious how the Core2Duo performs on the CAT$ algo.

DoomyD

I know it's a bit late, but I came across this topic and thought I could give this a shot =P
dw2binstr   proc
   ;Value  - EAX
   ;Buffer - EBX
   mov      edx, eax   ;EDX holds the value
   mov      ecx, 8      ;Bits per byte
   @@:
      mov      eax, edx
      and      eax, 01010101h         ;Filters the low bit of every byte
      or      eax, 30303030h         ;ASCII conversion
      mov      byte ptr [ebx+31], al   ;Placing the bytes into the buffer
      mov      byte ptr [ebx+23], ah
      ror      eax, 16
      mov      byte ptr [ebx+15], al
      mov      byte ptr [ebx+07], ah
      dec      ebx                  ;Going backwards (buffer-wise)
      ror      edx,1               ;Setting the next set of bits
      dec      ecx                  ;Loop back
      jnz      @B
   retn
dw2binstr   endp
So... what do you think?
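
For readers who want to check the output of the loop above, here is a minimal C sketch of the same idea. The function name `dw2binstr_model` is made up for illustration, and a right shift stands in for the `ror`, which yields the same masked bits. It is a model for checking results, not a replacement for the assembly.

```c
#include <stdint.h>

/* C model of the loop above: each pass handles one bit position in all
   four bytes of the value at once. The buffer must hold at least 32 chars;
   like the assembly, this writes no terminating zero. */
void dw2binstr_model(uint32_t value, char *buf)
{
    for (int i = 0; i < 8; i++) {
        /* keep bit i of every byte, then turn each 0/1 into '0'/'1' */
        uint32_t t = ((value >> i) & 0x01010101u) | 0x30303030u;
        buf[31 - i] = (char)(t & 0xFF);          /* bit i      (AL in the asm) */
        buf[23 - i] = (char)((t >> 8) & 0xFF);   /* bit i + 8  (AH in the asm) */
        buf[15 - i] = (char)((t >> 16) & 0xFF);  /* bit i + 16 (AL after ror 16) */
        buf[7 - i]  = (char)((t >> 24) & 0xFF);  /* bit i + 24 (AH after ror 16) */
    }
}
```

Eight passes fill all 32 characters, which is why the routine is so short: the loop body does four stores per iteration instead of one.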

jj2007

Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...

[attachment deleted by admin]

qWord

hey jj,

the following code might be interesting for you:



;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
;                                                  ;             
;   sse1_dw2hex: converts a dword-value to an      ;             
;                ASC-hex-string                    ;             
;                                                  ;             
;       eax = dwValue                              ;             
;       edx = lpBuffer , should be aligned to 8    ;             
;                                                  ;             
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;             
align 16                                                         
sse1_dw2hex proc ;var:DWORD,buffer:DWORD                         
                                                                 
    .data                                                       
        align 16                                                 
        d2h_bitmsk  dw 4 dup (0f00fh)                           
        d2h_cmpmsk  db 8 dup (9)                                 
        d2h_09_msk  db 8 dup (030h)                             
        d2h_AF_msk  db 8 dup (7)                                 
    .code                                                       
                                                                 
    ;bswap eax                          ;<== insert for mmx only
    movq mm4,QWORD ptr [d2h_bitmsk]     ;       |               
    movq mm5,QWORD ptr [d2h_cmpmsk]     ;       |               
    movq mm6,QWORD ptr [d2h_09_msk]     ;       |               
    movq mm7,QWORD ptr [d2h_AF_msk]     ;       |               
                                        ;       |               
    movd mm1,eax                        ;       |               
    punpcklbw mm1,mm1                   ;       V               
    pshufw mm1,mm1,000011011y           ;<== delete for mmx only
                                                                 
    pand mm1,mm4                                                 
    movq mm0,mm1                                                 
    psrlw mm0,12                                                 
    psllw mm1,8                                                 
                                                                 
    por mm0,mm1                                                 
    movq mm2,mm0                                                 
    pcmpgtb mm2,mm5                                             
                                                                 
    pand mm2,mm7                                                 
    paddb mm2,mm6                                               
    paddb mm2,mm0                                               
                                                                 
    movq QWORD ptr [edx],mm2                                     
    mov BYTE ptr [edx+8],0                                       
                                                                 
    ret                                                         
sse1_dw2hex endp                                                 
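
The digit fix-up in the routine above (compare against 9, mask with 7, add 30h) can be sketched in scalar C as follows. The name `dw2hex_model` is hypothetical, and the sketch works one nibble at a time rather than eight bytes in parallel, so it only illustrates the arithmetic.

```c
#include <stdint.h>

/* Scalar model of the digit fix-up: '0' + nibble, plus 7 more when the
   nibble is above 9 so that 10..15 land on 'A'..'F'. */
void dw2hex_model(uint32_t value, char *buf)   /* buf: at least 9 bytes */
{
    for (int i = 0; i < 8; i++) {
        uint32_t nibble = (value >> (28 - 4 * i)) & 0xFu;  /* most significant nibble first */
        buf[i] = (char)('0' + nibble + (nibble > 9 ? 7 : 0));
    }
    buf[8] = '\0';  /* same terminator the routine stores at [edx+8] */
}
```

The constant 7 is the gap between '9' (39h) and 'A' (41h) minus one, which is why the same addition works for every nibble.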
FPU in a trice: SmplMath
It's that simple!

DoomyD

Quote from: jj2007 on July 03, 2008, 05:59:39 PM
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...
Hmm... weird...
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).

EDIT: My outputs:
30 cycles timing BIN$      180 bytes 402 LAMPs
44 cycles timing pbin2      147 bytes 533 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
42 cycles timing nwDw2BinJJ 102 bytes 424 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
28 cycles timing BinLingo 187 bytes 383 LAMPs
34 cycles timing b2aDrizzAt 235 bytes 521 LAMPs
22 cycles timing mmx_dw2bin 132 bytes 253 LAMPs
75 cycles timing dw2binstr 49 bytes 525 LAMPs
32 Cycles
I'm beginning to wonder if it has to do with my CPU...

[attachment deleted by admin]

jj2007

Quote from: DoomyD on July 03, 2008, 07:35:38 PM
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).
Shows 101 cycles for me, still slow compared to the 40 of the BIN$ and mmx_ variants. But it's indeed weird that I see 215 cycles on my puter, while your exe performs in 101; and you saw from my source that there is not much overhead.

I use timers.asm: \Masm32\macros\TIMERS.ASM 10095 bytes of 15.02.2005

39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
61 cycles timing nwDw2Bin        101 bytes      613 LAMPs
65 cycles timing nwDw2BinJJ      102 bytes      656 LAMPs
54 cycles timing NightWare       204 bytes      771 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
40 cycles timing mmx_dw2bin      132 bytes      460 LAMPs
215 cycles timing dw2binstr      49 bytes       1505 LAMPs


LAMPs = Lean And Mean Points = cycles * sqrt(size)

jj2007

Quote from: qWord on July 03, 2008, 06:39:44 PM
hey jj,
the following code might be interesting for you:
How much improvement?
Look for mmx_dw2bin in the previously attached source, change the if 0 to if 1, and adapt your old code.

EDIT:
39 cycles timing BIN$            180 bytes      523 LAMPs
23 cycles timing ssemmxonly      97 bytes       227 LAMPs

Could you PLEASE tune it a little bit? Say, 3 cycles less, just to get a round figure?  :cheekygreen:

EDIT (2):
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red

EDIT (3):
39 cycles timing BIN$            180 bytes      523 LAMPs
45 cycles timing pbin2           147 bytes      546 LAMPs
63 cycles timing nwDw2Bin        101 bytes      633 LAMPs
66 cycles timing nwDw2BinJJ      102 bytes      667 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
70 cycles timing BinLingo        187 bytes      957 LAMPs
45 cycles timing b2aDrizzAt      235 bytes      690 LAMPs
39 cycles timing QwordMmx        132 bytes      448 LAMPs xxxxxx
216 cycles timing dw2binstr      49 bytes       1512 LAMPs
350 cycles timing Dword2Bin2     56 bytes       2619 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)


I renamed your mmx_dw2bin to QwordMmx. Still the best LAMPs score  :toothy

Attached the latest build with sources, as asm and rtf
Question on the latter:
This displays just fine in (MS) WordPad and (jj) RichMasm, but (MS) Word has a serious problem with the (MS Windows) System font – they don't seem to be as compatible as they should be... any ideas?

[attachment deleted by admin]

jj2007

Quote from: DoomyD on July 03, 2008, 07:35:38 PM
75 cycles timing dw2binstr 49 bytes 525 LAMPs

32 Cycles
I'm beginning to wonder if it has to do with my CPU...

Nope, your CPU is fine, you are measuring different cycle counts on the same puter; so it's the code, not the CPU. Mind posting your source?

qWord

Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Still the best LAMPs score
nice to see    :green

Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red
sorry for confusing you  :bg

i attached a file with a modified version of dw2hex

EDIT:
 
Quote
How much improvement?
  a quick speed test shows that see2_dw2hex is approx. 4 times faster than dw2str from masm32.lib





[attachment deleted by admin]

DoomyD

Attached

[attachment deleted by admin]

jj2007

Quote from: DoomyD on July 04, 2008, 05:10:48 AM
Attached

Mystery solved: Your code runs twice as fast because your timed loop contains only one call, while mine contains two...

counter_begin 100000h,HIGH_PRIORITY_CLASS
mov eax,00010010001101001010101111001101b ;1234ABCDh
mov ebx,offset str1
invoke dw2binstr
counter_end


My version:

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ecx, 11111000001111100000111111110000b
mov edx, offset Dw2BinBuffer
call dw2binstr
mov ecx, 00001111100000111110000011111111b
mov edx, offset Dw2BinBuffer
call dw2binstr
counter_end
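
To compare the two harnesses, the reported figure has to be divided by the number of calls inside the timed body. A minimal sketch of that normalization, with a made-up helper name and ignoring any loop overhead the timing macros may or may not subtract:

```c
/* Cycles attributable to one call when the timed body contains several. */
double cycles_per_call(double cycles_per_iteration, int calls_in_body)
{
    return cycles_per_iteration / calls_in_body;
}
```

With the numbers posted earlier, 215 cycles for two calls comes to 107.5 per call, close to the 101 seen with the single-call build, and 75 cycles for two calls comes to 37.5, close to the 32 reported there.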



P.S.: In the standard Masm32 installation, libs sit in \masm32\lib\
   include   \masm32\include\windows.inc
   include   \masm32\macros\timers.asm
   include   \masm32\macros\macros.asm
   
   include       \masm32\include\masm32.inc
   includelib    \masm32\lib\masm32.lib

   include       \masm32\include\kernel32.inc
   includelib    \masm32\lib\kernel32.lib

DoomyD

Finally, I found the time to take a closer look at it...
I modified drizz's modification and squeezed another cycle out of it =) (although it takes 300,000,000 loops to actually see it :lol)
__QwordMmx proc
movq mm6, QWORD ptr [bitmsk]
movq mm7, QWORD ptr [ascmsk]
pxor mm5, mm5
movd mm0, eax

punpcklbw mm0, mm0

punpckhdq mm2, mm0
punpckldq mm0, mm0

punpckhwd mm0, mm0
punpckhwd mm2, mm2

punpckhdq mm1, mm0
punpckldq mm0, mm0
punpckhdq mm3, mm2
punpckldq mm2, mm2

punpckhdq mm1, mm1
punpckhdq mm3, mm3

pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6

pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5

paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7

movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+8],mm2
movq [edx],mm3
mov BYTE ptr [edx+32],0
retn
__QwordMmx endp
Core 2 Duo x32:
25 - QwordMmx - qWord
22 - QwordMmx - drizz
21 - QwordMmx - new
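
For anyone tracing what the QwordMmx variants compute, here is a scalar C sketch of the mask, compare and add steps. The contents of `bitmsk` and `ascmsk` are not shown in the thread, so the values below are assumptions (one bit per byte from 80h down to 01h, and eight bytes of 31h); the function name `qwordmmx_model` is likewise hypothetical.

```c
#include <stdint.h>

/* Scalar model of the MMX routine: every output character comes from one
   source byte ANDed with a single-bit mask. ASSUMPTION: bitmsk holds
   80h,40h,...,01h from the lowest address up and ascmsk is eight bytes of
   31h; neither table appears in the thread. */
void qwordmmx_model(uint32_t value, char *buf)   /* buf: at least 33 bytes */
{
    static const uint8_t bitmsk[8] = {0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01};
    const uint8_t ascmsk = 0x31;  /* '1' */

    for (int q = 0; q < 4; q++) {                          /* one 8-byte store per source byte */
        uint8_t src = (uint8_t)(value >> (24 - 8 * q));    /* most significant byte first */
        for (int j = 0; j < 8; j++) {
            uint8_t masked = src & bitmsk[j];              /* pand */
            uint8_t cmp = (masked == 0) ? 0xFF : 0x00;     /* pcmpeqb against zero */
            buf[8 * q + j] = (char)(uint8_t)(cmp + ascmsk); /* paddb: FFh + 31h wraps to '0' */
        }
    }
    buf[32] = '\0';  /* terminator, as stored at [edx+32] */
}
```

The compare against zero inverts the bit (set bit gives 00h, clear bit gives FFh), and the byte-wise wraparound of the add turns that into '1' or '0' without a branch.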

[attachment deleted by admin]

jj2007

loop count: 300000000
31 - QwordMmx - qWord
33 - QwordMmx - drizz
31 - QwordMmx - new

Celeron M ...

DoomyD

Noticed I can cap another line =)
qWordMmxOpt proc
movq mm7, QWORD ptr [ascmsk]
movq mm6, QWORD ptr [bitmsk]
pxor mm5, mm5
movd mm0, eax

punpcklbw mm0, mm0
punpckhdq mm2, mm0

punpcklwd mm0, mm0
punpckhwd mm2, mm2

punpckhdq mm1, mm0
punpckhdq mm3, mm2

punpckldq mm0, mm0
punpckhdq mm1, mm1
punpckldq mm2, mm2
punpckhdq mm3, mm3

pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6

pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5

paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7

movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+08],mm2
movq [edx+00],mm3
mov BYTE ptr [edx+32],0
retn
qWordMmxOpt endp
Loop count: 2000000000
Method: timer_begin\timer_end:
10104 {00010010001101000101011001111000} QwordMmx(1) - qWord
10051 {00010010001101000101011001111000} QwordMmx(2) - qWord
9837 {00010010001101000101011001111000} _QwordMmx(1) - drizz
9885 {00010010001101000101011001111000} _QwordMmx(2) - drizz
9245 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
9191 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

[attachment deleted by admin]

jj2007

I timed it twice, the 2nd time with reduced loop count, but results are stable on a P4, 3.4 GHz:

Loop count: 2000000000
Method: timer_begin\timer_end:
11319 {00010010001101000101011001111000} QwordMmx(1) - qWord
11452 {00010010001101000101011001111000} QwordMmx(2) - qWord
14740 {00010010001101000101011001111000} _QwordMmx(1) - drizz
14475 {00010010001101000101011001111000} _QwordMmx(2) - drizz
13840 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
13833 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD

Loop count: 200000000
Method: timer_begin\timer_end:
1086 {00010010001101000101011001111000} QwordMmx(1) - qWord
1088 {00010010001101000101011001111000} QwordMmx(2) - qWord
1450 {00010010001101000101011001111000} _QwordMmx(1) - drizz
1455 {00010010001101000101011001111000} _QwordMmx(2) - drizz
1385 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
1387 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD


My timings with your previous code on the Celeron M showed your code on par with qWord's... Which processor are you using?