Bin$

Started by jj2007, June 20, 2008, 01:39:06 AM

DoomyD

Intel Core 2 Duo 6300 @ 1.8 GHz
Model: x86 615

jj2007

#76
I integrated your code into the package. LAMP-wise you made it, congratulations... but NightWare has a slight edge on speed, at least on my P4:

48 cycles timing BIN$            180 bytes      644 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
86 cycles timing nwDw2Bin        101 bytes      864 LAMPs
69 cycles timing nwDw2BinJJ      102 bytes      697 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs ****
86 cycles timing BinLingo        187 bytes      1176 LAMPs
50 cycles timing b2aDrizzAt      235 bytes      766 LAMPs
54 cycles timing MmxQword        132 bytes      620 LAMPs
52 cycles timing MmxDoomy        110 bytes      545 LAMPs ****
81 cycles timing dw2bin_ex       2140 bytes     3747 LAMPs


LAMPs = Lean And Mean Points = cycles * sqrt(size)
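
To check the formula against the table, here is a minimal Python sketch. The function name `lamps` is made up for illustration, and rounding to the nearest integer is an assumption that happens to reproduce the figures posted above.

```python
import math

def lamps(cycles: int, size_bytes: int) -> int:
    """Lean And Mean Points: cycles * sqrt(code size in bytes), rounded."""
    return round(cycles * math.sqrt(size_bytes))

print(lamps(48, 180))  # BIN$ on the P4     -> 644
print(lamps(52, 110))  # MmxDoomy on the P4 -> 545
```

Because size enters only through its square root, a routine has to grow about four times larger before it costs as much as doubling its cycle count.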

EDIT: Results for Celeron M - Doomy clearly in the lead (but BIN$ also ok for speed)

32 cycles timing BIN$            180 bytes      429 LAMPs
42 cycles timing pbin2           147 bytes      509 LAMPs
54 cycles timing nwDw2Bin        101 bytes      543 LAMPs
57 cycles timing nwDw2BinJJ      102 bytes      576 LAMPs
39 cycles timing NightWare       204 bytes      557 LAMPs
36 cycles timing BinLingo        187 bytes      492 LAMPs
43 cycles timing b2aDrizzAt      235 bytes      659 LAMPs
33 cycles timing MmxQword        132 bytes      379 LAMPs
32 cycles timing MmxDoomy        110 bytes      336 LAMPs
60 cycles timing dw2bin_ex       2140 bytes     2776 LAMPs




jj2007

3 cycles fewer for everybody: a bin$ is always 32 bytes long, so there is no need to poke a zero terminator. Celeron M timings:

30 cycles timing BIN$            136 bytes      350 LAMPs
39 cycles timing NightWare       204 bytes      557 LAMPs
36 cycles timing BinLingo        187 bytes      492 LAMPs
33 cycles timing MmxQword        132 bytes      379 LAMPs
29 cycles timing MmxDoomy        106 bytes      299 LAMPs
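
The fixed-length argument can be illustrated outside assembly. This is a rough Python sketch, and the name `bin32` is hypothetical (it is not one of the routines timed here). Every 32-bit value maps to exactly 32 characters, so a caller that knows the length does not need a terminator to find the end.

```python
def bin32(value: int) -> str:
    """Return the 32-character binary text of an unsigned 32-bit value."""
    return format(value & 0xFFFFFFFF, "032b")

for v in (0, 5, 0xFFFFFFFF):
    s = bin32(v)
    print(len(s), s)  # the length is 32 for every input
```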


DoomyD

#78
Quote
29 cycles timing BIN$            136 bytes      338 LAMPs
45 cycles timing pbin2           144 bytes      540 LAMPs
49 cycles timing nwDw2Bin        101 bytes      492 LAMPs
43 cycles timing nwDw2BinJJ      102 bytes      434 LAMPs
28 cycles timing NightWare       200 bytes      396 LAMPs
27 cycles timing BinLingo        183 bytes      365 LAMPs
33 cycles timing b2aDrizzAt      235 bytes      506 LAMPs
21 cycles timing MmxQword        128 bytes      238 LAMPs
19 cycles timing MmxDoomy        106 bytes      196 LAMPs
52 cycles timing dw2bin_ex       2140 bytes     2406 LAMPs

Looking at the source, though, I should point out that the algorithm uses the same data resources as qWord's MMX version; I'll include them separately.
By the way: I don't think it could be shortened any more than that. Here's my final code:

m_mmx_dw2bin macro Value:REQ, lpBuffer
LOCAL mmx_dw2bin_buffer

IFNDEF mmx_dw2bin_enabled
  mmx_dw2bin_enabled equ <1>
  .data
   align 8
   mmx_dw2bin_ascmsk db 8 dup (31h)
   mmx_dw2bin_bitmsk db 80h,40h,20h,10h,08h,04h,02h,01h
ENDIF

.code
  even
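  mov  eax, Value
  movq mm7, qword ptr [mmx_dw2bin_ascmsk]   ; mm7 = eight bytes of 31h ('1')
  movq mm6, qword ptr [mmx_dw2bin_bitmsk]   ; mm6 = one bit per byte, MSB first
  movd mm0, eax                             ; mm0 low dword = Value (bytes b0..b3)
 
  punpcklbw mm0, mm0    ; double each byte: b0 b0 b1 b1 b2 b2 b3 b3
  punpckldq mm2, mm0    ; mm2 high dword = b0 b0 b1 b1 (low dword is don't-care)
 
  punpckhwd mm0, mm0    ; mm0 = b2 x4, b3 x4
  punpckldq mm1, mm0    ; mm1 high dword = b2 x4
  punpckhwd mm2, mm2    ; mm2 = b0 x4, b1 x4
  punpckldq mm3, mm2    ; mm3 high dword = b0 x4
 
  punpckhdq mm0, mm0    ; mm0 = b3 x8 (most significant byte)
  punpckhdq mm1, mm1    ; mm1 = b2 x8
  punpckhdq mm2, mm2    ; mm2 = b1 x8
  punpckhdq mm3, mm3    ; mm3 = b0 x8 (least significant byte)

  pandn mm0, mm6        ; (NOT byte) AND mask: the mask bit survives only where the value bit is 0
  pandn mm1, mm6
  pandn mm2, mm6
  pandn mm3, mm6
 
  pcmpeqb mm0, mm6      ; FFh where the bit was 0, 00h where it was 1
  pcmpeqb mm1, mm6
  pcmpeqb mm2, mm6
  pcmpeqb mm3, mm6
 
  paddb mm0,mm7         ; FFh+31h wraps to 30h ('0'), 00h+31h = 31h ('1')
  paddb mm1,mm7
  paddb mm2,mm7
  paddb mm3,mm7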
 
 
  IFB <lpBuffer>
   .data
    mmx_dw2bin_buffer db 32 dup (0),0
    align 4
   .code
   mov  eax,offset mmx_dw2bin_buffer
  ELSE
   mov  eax,lpBuffer
  ENDIF
 
  movq [eax+00],mm0
  movq [eax+08],mm1
  movq [eax+16],mm2
  movq [eax+24],mm3
endm
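
For readers who want to follow the byte-lane logic without an assembler, here is a rough Python model of the macro above. It is only a sketch: `dw2bin_model` is a made-up name, and the loop emulates one lane at a time what `pandn`, `pcmpeqb` and `paddb` do to eight bytes at once, so it says nothing about timing.

```python
BITMSK = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]  # mmx_dw2bin_bitmsk
ASCMSK = 0x31                                              # one byte of mmx_dw2bin_ascmsk

def dw2bin_model(value: int) -> str:
    """Emulate m_mmx_dw2bin for a 32-bit value, one byte lane at a time."""
    out = bytearray()
    # Most significant byte first: mm0..mm3 are stored at offsets 0, 8, 16, 24.
    for shift in (24, 16, 8, 0):
        b = (value >> shift) & 0xFF              # byte broadcast to all 8 lanes
        for mask in BITMSK:
            lane = (~b & 0xFF) & mask            # pandn: (NOT b) AND mask
            eq = 0xFF if lane == mask else 0x00  # pcmpeqb against the mask
            out.append((eq + ASCMSK) & 0xFF)     # paddb, wrapping at 8 bits
    return out.decode("ascii")

print(dw2bin_model(0x80000001))
```

A cleared bit leaves the mask intact, compares equal (FFh) and wraps to 30h; a set bit gives 00h and ends up as 31h, which is why the ASCII mask holds '1' rather than '0'.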