bin2byte_ex

Started by jj2007, November 16, 2009, 09:13:31 PM

lingo

"Dave,
I think it means the processor you are using is not that fast with sse code"


I agree with Hutch but...everything written by JJ can be optimized with ease: :bdg
C:\My Documents\ASM\bin2byte>bin2byte
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

133     cycles for Bin2Dw       (drizz+JJ, variable size)
52      cycles for BinValSSE_bt (JJ, variable size)
13      cycles for BinValJJ (SSE2)
10      cycles for simd_bin2byteLingo (SSE2)
14      cycles for BinValJL (SSE2)
14      cycles for BinValLingo (SSE2)
25      cycles for simd_bin2byte
47      cycles for bin2byte_exLib

135     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
13      cycles for BinValJJ (SSE2)
10      cycles for simd_bin2byteLingo (SSE2)
14      cycles for BinValJL (SSE2)
14      cycles for BinValLingo (SSE2)
25      cycles for simd_bin2byte
48      cycles for bin2byte_exLib

133     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
13      cycles for BinValJJ (SSE2)
10      cycles for simd_bin2byteLingo (SSE2)
14      cycles for BinValJL (SSE2)
14      cycles for BinValLingo (SSE2)
25      cycles for simd_bin2byte
47      cycles for bin2byte_exLib

Testing BinValJJ (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765
0, 3, 12, 48, 192, 255, over8: 255, 85

32 bytes Bin2Dw (drizz+JJ)
199 bytes BinValSSE_bt (JJ)
74 bytes BinValJJ
45 bytes BinValJL
45 bytes BinValLingo (Lingo)
80 bytes simd_bin2byteLingo(Lingo)
472 bytes bin2byte_ex (Masm32 library)
                                                 - hit any key -

3 cycles less... :bdg

jj2007

Quote from: lingo on November 23, 2009, 05:06:53 AM
I agree with Hutch but...everything written by JJ can be optimized with ease: :bdg

I love teamwork, especially with my old friend Lingo :green

So you managed to improve qWord's algo, and it's now as fast as mine. Fantastic :thumbu

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
160     cycles for Bin2Dw       (drizz+JJ, variable size)
77      cycles for BinValSSE_bt (JJ, variable size)
16      cycles for BinValJJ (SSE2)
16      cycles for simd_bin2byteLingo (SSE2)
22      cycles for BinValJL (SSE2)
56      cycles for BinValLingo (SSE2)
38      cycles for simd_bin2byte
61      cycles for bin2byte_exLib

sinsi

Everything is relative...

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)

133     cycles for Bin2Dw       (drizz+JJ, variable size)
57      cycles for BinValSSE_bt (JJ, variable size)
14      cycles for BinValJJ (SSE2)
10      cycles for simd_bin2byteLingo (SSE2)
18      cycles for BinValJL (SSE2)
16      cycles for BinValLingo (SSE2)
40      cycles for simd_bin2byte
47      cycles for bin2byte_exLib

Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on November 23, 2009, 08:21:37 AM
Everything is relative...

Indeed. On my Prescott, my version and Lingo's (actually, it's qWord's plus "improvements") show similar results. Now, as usual, the only remaining problem is to convince Lingo's algo to return the correct result :green

Edit: Problem solved, two small changes made Lingo's code work. New timings and attachment below. It is actually one or two cycles faster than mine, at least on a P4, but keep in mind that his code relies on xmm7 remaining unchanged after the creation of the bvjTable1. For more detail, search for xmm7 inside the code.

Also new:
1. BinValJJvs, another variant that works with variable size input.
2. Changed to cyct_ macros for more stable timings on P4.


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
160     Bin2Dw       (drizz+JJ, variable size)
62      BinValJJvs   (SSE2, variable size)
16      BinValJJ     (SSE2, 8 bits)
23      BinValJL     (SSE2)
25      BinValLingo  (SSE2)
16      simd_bin2byteLingo
54      bin2byte_exLib

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
177     Bin2Dw       (drizz+JJ, variable size)
180     BinValJJvs   (SSE2, variable size)
40      BinValJJ     (SSE2, 8 bits)
141     BinValJL     (SSE2)
142     BinValLingo  (SSE2)
41      simd_bin2byteLingo
67      bin2byte_exLib

178     Bin2Dw       (drizz+JJ, variable size)
183     BinValJJvs   (SSE2, variable size)
41      BinValJJ     (SSE2, 8 bits)
140     BinValJL     (SSE2)
143     BinValLingo  (SSE2)
38      simd_bin2byteLingo
67      bin2byte_exLib

178     Bin2Dw       (drizz+JJ, variable size)
182     BinValJJvs   (SSE2, variable size)
41      BinValJJ     (SSE2, 8 bits)
137     BinValJL     (SSE2)
140     BinValLingo  (SSE2)
40      simd_bin2byteLingo
65      bin2byte_exLib

Testing simd_bin2byteLingo (2 lines must match):
0, 3, 12, 48, 192, 255
0, 3, 12, 48, 192, 255

Code sizes:
32 bytes Bin2Dw (drizz+JJ)
160 bytes BinValJJvs
74 bytes BinValJJ
45 bytes BinValJL
45 bytes BinValLingo (Lingo)
80 bytes simd_bin2byteLingo
472 bytes bin2byte_ex (Masm32 library)

mineiro

I was playing with this one for a while.
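OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
bin2byte_min0_SSE proc lpword:DWORD
mov eax,[esp+4]                 ; eax -> 8 ASCII chars, "0" or "1"
movq xmm0,[eax]                 ; load the 8 chars, upper half of xmm0 is zeroed
pcmpeqb xmm0,msk                ; bytes equal to "1" become 0FFh, all others 00h
pmovmskb eax,xmm0               ; one bit per byte; the first char lands in bit 0
movzx eax,byte ptr [table+eax]  ; table reverses the bit order
ret 4
bin2byte_min0_SSE endp
align 16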
table db  00h, 80h, 40h,  0C0h, 20h,  0A0h, 60h,  0E0h, 10h, 90h, 50h,  0D0h, 30h,  0B0h, 70h,  0F0h
db  08h, 88h, 48h,  0C8h, 28h,  0A8h, 68h,  0E8h, 18h, 98h, 58h,  0D8h, 38h,  0B8h, 78h,  0F8h
db  04h, 84h, 44h,  0C4h, 24h,  0A4h, 64h,  0E4h, 14h, 94h, 54h,  0D4h, 34h,  0B4h, 74h,  0F4h
db  0Ch, 8Ch, 4Ch,  0CCh, 2Ch,  0ACh, 6Ch,  0ECh, 1Ch, 9Ch, 5Ch,  0DCh, 3Ch,  0BCh, 7Ch,  0FCh
db  02h, 82h, 42h,  0C2h, 22h,  0A2h, 62h,  0E2h, 12h, 92h, 52h,  0D2h, 32h,  0B2h, 72h,  0F2h
db  0Ah, 8Ah, 4Ah,  0CAh, 2Ah,  0AAh, 6Ah,  0EAh, 1Ah, 9Ah, 5Ah,  0DAh, 3Ah,  0BAh, 7Ah,  0FAh
db  06h, 86h, 46h,  0C6h, 26h,  0A6h, 66h,  0E6h, 16h, 96h, 56h,  0D6h, 36h,  0B6h, 76h,  0F6h
db  0Eh, 8Eh, 4Eh,  0CEh, 2Eh,  0AEh, 6Eh,  0EEh, 1Eh, 9Eh, 5Eh,  0DEh, 3Eh,  0BEh, 7Eh,  0FEh
db  01h, 81h, 41h,  0C1h, 21h,  0A1h, 61h,  0E1h, 11h, 91h, 51h,  0D1h, 31h,  0B1h, 71h,  0F1h
db  09h, 89h, 49h,  0C9h, 29h,  0A9h, 69h,  0E9h, 19h, 99h, 59h,  0D9h, 39h,  0B9h, 79h,  0F9h
db  05h, 85h, 45h,  0C5h, 25h,  0A5h, 65h,  0E5h, 15h, 95h, 55h,  0D5h, 35h,  0B5h, 75h,  0F5h
db  0Dh, 8Dh, 4Dh,  0CDh, 2Dh,  0ADh, 6Dh,  0EDh, 1Dh, 9Dh, 5Dh,  0DDh, 3Dh,  0BDh, 7Dh,  0FDh
db  03h, 83h, 43h,  0C3h, 23h,  0A3h, 63h,  0E3h, 13h, 93h, 53h,  0D3h, 33h,  0B3h, 73h,  0F3h
db  0Bh, 8Bh, 4Bh,  0CBh, 2Bh,  0ABh, 6Bh,  0EBh, 1Bh, 9Bh, 5Bh,  0DBh, 3Bh,  0BBh, 7Bh,  0FBh
db  07h, 87h, 47h,  0C7h, 27h,  0A7h, 67h,  0E7h, 17h, 97h, 57h,  0D7h, 37h,  0B7h, 77h,  0F7h
db  0Fh, 8Fh, 4Fh,  0CFh, 2Fh,  0AFh, 6Fh,  0EFh, 1Fh, 9Fh, 5Fh,  0DFh, 3Fh,  0BFh, 7Fh,  0FFh
msk dq 3131313131313131h        ; 16 bytes of ASCII "1"
dq 3131313131313131h
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
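
For anyone reading along without an assembler at hand, here is a rough scalar model of what bin2byte_min0_SSE appears to do: compare each character with "1", collect one bit per byte, then undo the reversed bit order with a lookup. It is only a sketch in Python, not part of the testbed, and the names pcmpeqb_pmovmskb, reverse_bits8, TABLE and bin2byte_model are made up for illustration. It also shows that the 256-byte table is plain bit reversal, so it can be generated rather than typed in.

```python
def pcmpeqb_pmovmskb(chunk: bytes, needle: int) -> int:
    """Model pcmpeqb followed by pmovmskb: bit i is set when byte i equals needle."""
    mask = 0
    for i, b in enumerate(chunk):
        if b == needle:
            mask |= 1 << i
    return mask


def reverse_bits8(x: int) -> int:
    """Reverse the order of the 8 bits in x."""
    r = 0
    for _ in range(8):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


# Hypothetical stand-in for the 256-entry "table" in the posted code.
TABLE = bytes(reverse_bits8(i) for i in range(256))


def bin2byte_model(text: bytes) -> int:
    """Convert 8 ASCII '0'/'1' characters to a byte value (first char is the MSB)."""
    if len(text) != 8:
        raise ValueError("expected exactly 8 characters")
    mask = pcmpeqb_pmovmskb(text, 0x31)  # 0x31 is ASCII "1"
    return TABLE[mask]
```

With this model, bin2byte_model(b"00000011") gives 3: the two trailing "1" characters set bits 6 and 7 of the mask (0C0h), and the table entry at 0C0h is 03h, as in the posted table.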


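OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
bin2byte_min_SSE2 proc lpword:DWORD
mov eax,[esp+4]                 ; eax -> 8 ASCII chars, "0" or "1"
movq xmm1,[eax]                 ; load the 8 chars
movq xmm0,xmm1
psrlw xmm1,8                    ; high byte of each word -> low byte
psllw xmm0,8                    ; low byte of each word -> high byte
por xmm0,xmm1                   ; bytes are now swapped within each word
pshuflw xmm0,xmm0,00011011b     ; reverse the four low words: all 8 bytes reversed
pcmpeqb xmm0,qword ptr [masker] ; bytes equal to "1" become 0FFh
pmovmskb eax,xmm0               ; last char lands in bit 0, so al holds the result
ret 4
bin2byte_min_SSE2 endp
align 16
masker dq 3131313131313131h     ; 16 bytes of ASCII "1"
dq 3131313131313131h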
OPTION PROLOGUE : PrologueDef
OPTION EPILOGUE : EpilogueDef
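
The table-free variant seems to work by reversing the 8 input bytes before the compare, so the mask comes out in the right bit order. A small Python model of that idea follows, under the assumption that I read the shuffle constant correctly; swap_bytes_in_words, pshuflw and bin2byte_sse2_model are illustrative names, and only the low 64 bits of the register are modelled.

```python
def swap_bytes_in_words(q: int) -> int:
    """Model psrlw 8 / psllw 8 / por: swap the two bytes of each 16-bit word."""
    hi_to_lo = (q >> 8) & 0x00FF00FF00FF00FF
    lo_to_hi = (q << 8) & 0xFF00FF00FF00FF00
    return hi_to_lo | lo_to_hi


def pshuflw(q: int, imm8: int) -> int:
    """Model pshuflw on the low 64 bits: result word i is the source word
    selected by bits 2i+1..2i of imm8."""
    out = 0
    for i in range(4):
        sel = (imm8 >> (2 * i)) & 3
        word = (q >> (16 * sel)) & 0xFFFF
        out |= word << (16 * i)
    return out


def bin2byte_sse2_model(text: bytes) -> int:
    """Convert 8 ASCII '0'/'1' characters to a byte value without a table."""
    if len(text) != 8:
        raise ValueError("expected exactly 8 characters")
    q = int.from_bytes(text, "little")   # movq: byte 0 is the first character
    q = swap_bytes_in_words(q)
    q = pshuflw(q, 0b00011011)           # word order 3,2,1,0: full byte reversal
    reversed_bytes = q.to_bytes(8, "little")
    # pcmpeqb + pmovmskb: bit i is set when byte i is ASCII "1"
    return sum(1 << i for i, b in enumerate(reversed_bytes) if b == 0x31)
```

The trade-off against the table version is a few extra shuffle instructions in exchange for no 256-byte lookup, which may matter on CPUs where the table access is the slow part.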

jj2007

Quote from: mineiro on January 30, 2012, 05:17:12 AM
I was playing with this one for a while.

Very fast indeed :U
Attached the testbed.

P4:
32      bin2byte_min0_SSE
36      bin2byte_min_SSE2
39      BinValJJ     (SSE2, 8 bits)
41      simd_bin2byteLingo
139     BinValLingo  (SSE2)

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

156     Bin2Dw       (drizz+JJ, variable size)
175     BinValJJvs   (SSE2, variable size)
42      BinValJJ     (SSE2, 8 bits)
137     BinValJL     (SSE2)
141     BinValLingo  (SSE2)
40      simd_bin2byteLingo
35      bin2byte_min0_SSE
41      bin2byte_min_SSE2
64      bin2byte_exLib

151*    Bin2Dw       (drizz+JJ, variable size)
175     BinValJJvs   (SSE2, variable size)
42      BinValJJ     (SSE2, 8 bits)
140     BinValJL     (SSE2)
140     BinValLingo  (SSE2)
41      simd_bin2byteLingo
34      bin2byte_min0_SSE
38      bin2byte_min_SSE2
66      bin2byte_exLib

176     Bin2Dw       (drizz+JJ, variable size)
175     BinValJJvs   (SSE2, variable size)
41      BinValJJ     (SSE2, 8 bits)
139     BinValJL     (SSE2)
140     BinValLingo  (SSE2)
40      simd_bin2byteLingo
35      bin2byte_min0_SSE
38      bin2byte_min_SSE2
65      bin2byte_exLib


165     Bin2Dw       (drizz+JJ, variable size)
174     BinValJJvs   (SSE2, variable size)
38      BinValJJ     (SSE2, 8 bits)
136     BinValJL     (SSE2)
140     BinValLingo  (SSE2)
40      simd_bin2byteLingo
33      bin2byte_min0_SSE
39      bin2byte_min_SSE2
67      bin2byte_exLib

204     Bin2Dw       (drizz+JJ, variable size)
173     BinValJJvs   (SSE2, variable size)
42      BinValJJ     (SSE2, 8 bits)
135     BinValJL     (SSE2)
141     BinValLingo  (SSE2)
39      simd_bin2byteLingo
34      bin2byte_min0_SSE
40      bin2byte_min_SSE2
63      bin2byte_exLib

179     Bin2Dw       (drizz+JJ, variable size)
174     BinValJJvs   (SSE2, variable size)
42      BinValJJ     (SSE2, 8 bits)
137     BinValJL     (SSE2)
140     BinValLingo  (SSE2)
41      simd_bin2byteLingo
33      bin2byte_min0_SSE
41      bin2byte_min_SSE2
67      bin2byte_exLib


i am trying to think of a case where this function needs to be really fast   :P

what is the asterisk for ?

jj2007

Quote from: dedndave on January 30, 2012, 01:25:35 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

156     Bin2Dw       (drizz+JJ, variable size)
...
151*    Bin2Dw       (drizz+JJ, variable size)


i am trying to think of a case where this function needs to be really fast   :P

what is the asterisk for ?

Stands for "not many cases, therefore unreliable". The cyct macros are adaptations of MichaelW's originals aimed at getting more stable timings from a P4.
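
To illustrate the idea only: the sketch below is a guess at the general principle, not the actual cyct macro logic, which is not shown in this thread. It assumes the harness collects many per-run cycle counts, reports the most frequent one, and appends an asterisk when that value is backed by too few samples. The function name summarize_timings and the min_share threshold are hypothetical.

```python
from collections import Counter


def summarize_timings(samples, min_share=0.5):
    """Return (cycles, flag): the most frequent cycle count, and "*" when
    that count accounts for less than min_share of all samples."""
    if not samples:
        raise ValueError("no samples")
    counts = Counter(samples)
    # Most frequent value wins; on a tie, prefer the smaller cycle count.
    cycles, hits = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    flag = "" if hits / len(samples) >= min_share else "*"
    return cycles, flag
```

Called with samples that mostly agree, the flag stays empty; with widely scattered samples, the reported value gets the asterisk.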

dedndave

yah - these look-up-table functions play games on P4's
i think you will find a lot of variation with different table addresses
they are inherently fast - and, of course - the tables are usually in the .DATA or .DATA? section
that, and the P4's older cache scheme, tend to accentuate the timing variations
you may find it interesting to put the table in the .CODE section, near the PROC

jj2007

New timings and code sizes:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

160     Bin2Dw       (drizz+JJ, variable size)
61      BinValJJvs   (SSE2, variable size)
16      BinValJJ     (SSE2, 8 bits)
23      BinValJL     (SSE2)
25      BinValLingo  (SSE2)
16      simd_bin2byteLingo
15      bin2byte_min0_SSE
28      bin2byte_min_SSE2
54      bin2byte_exLib

Code sizes:
32 bytes Bin2Dw (drizz+JJ)
160 bytes BinValJJvs
71 bytes BinValJJ
45 bytes BinValJL
45 bytes BinValLingo (Lingo)
80 bytes simd_bin2byteLingo
304 bytes .._min0_SSE (mineiro)
472 bytes bin2byte_ex (Masm32 library)

mineiro

Here is a 386 version, with timings:
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

133     Bin2Dw       (drizz+JJ, variable size)
23      bin2byte_min_386 (fixed size)
47      BinValJJvs   (SSE2, variable size)
13      BinValJJ     (SSE2, 8 bits)
16      BinValJL     (SSE2)
16      BinValLingo  (SSE2)
67      simd_bin2byteLingo
10      bin2byte_min0_SSE
13      bin2byte_min_SSE2
96*     bin2byte_exLib

133     Bin2Dw       (drizz+JJ, variable size)
23      bin2byte_min_386 (fixed size)
47      BinValJJvs   (SSE2, variable size)
13      BinValJJ     (SSE2, 8 bits)
16      BinValJL     (SSE2)
16      BinValLingo  (SSE2)
12      simd_bin2byteLingo
10      bin2byte_min0_SSE
13      bin2byte_min_SSE2
46      bin2byte_exLib

133     Bin2Dw       (drizz+JJ, variable size)
23      bin2byte_min_386 (fixed size)
47      BinValJJvs   (SSE2, variable size)
13      BinValJJ     (SSE2, 8 bits)
16      BinValJL     (SSE2)
16      BinValLingo  (SSE2)
12      simd_bin2byteLingo
10      bin2byte_min0_SSE
13      bin2byte_min_SSE2
46      bin2byte_exLib


dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

177     Bin2Dw       (drizz+JJ, variable size)
46      bin2byte_min_386 (fixed size)
175     BinValJJvs   (SSE2, variable size)
39      BinValJJ     (SSE2, 8 bits)
140     BinValJL     (SSE2)
139     BinValLingo  (SSE2)
41      simd_bin2byteLingo
33      bin2byte_min0_SSE
38      bin2byte_min_SSE2
66      bin2byte_exLib

178     Bin2Dw       (drizz+JJ, variable size)
46      bin2byte_min_386 (fixed size)
171     BinValJJvs   (SSE2, variable size)
41      BinValJJ     (SSE2, 8 bits)
141     BinValJL     (SSE2)
141     BinValLingo  (SSE2)
41      simd_bin2byteLingo
34      bin2byte_min0_SSE
38      bin2byte_min_SSE2
64      bin2byte_exLib

177     Bin2Dw       (drizz+JJ, variable size)
44      bin2byte_min_386 (fixed size)
175     BinValJJvs   (SSE2, variable size)
41      BinValJJ     (SSE2, 8 bits)
139     BinValJL     (SSE2)
138     BinValLingo  (SSE2)
41      simd_bin2byteLingo
35      bin2byte_min0_SSE
41      bin2byte_min_SSE2
66      bin2byte_exLib

clive

AMD Phenom(tm) II X6 1055T Processor (SSE3)

122     Bin2Dw       (drizz+JJ, variable size)
31      bin2byte_min_386 (fixed size)
80      BinValJJvs   (SSE2, variable size)
68      BinValJJ     (SSE2, 8 bits)
36      BinValJL     (SSE2)
41      BinValLingo  (SSE2)
67      simd_bin2byteLingo
71      bin2byte_min0_SSE
37      bin2byte_min_SSE2
46      bin2byte_exLib

123     Bin2Dw       (drizz+JJ, variable size)
31      bin2byte_min_386 (fixed size)
80      BinValJJvs   (SSE2, variable size)
68      BinValJJ     (SSE2, 8 bits)
36      BinValJL     (SSE2)
41      BinValLingo  (SSE2)
87      simd_bin2byteLingo
71      bin2byte_min0_SSE
38*     bin2byte_min_SSE2
45      bin2byte_exLib

122     Bin2Dw       (drizz+JJ, variable size)
31      bin2byte_min_386 (fixed size)
80      BinValJJvs   (SSE2, variable size)
68      BinValJJ     (SSE2, 8 bits)
36      BinValJL     (SSE2)
41*     BinValLingo  (SSE2)
67      simd_bin2byteLingo
71      bin2byte_min0_SSE
37      bin2byte_min_SSE2
46      bin2byte_exLib
It could be a random act of randomness. Those happen a lot as well.