bin2byte_ex

Started by jj2007, November 16, 2009, 09:13:31 PM


oex

Awesome, ty Hutch. Write, write, write, read, read, read then?

Which SIMD instruction set is pxor in? It's not compiling for me and I can't even find which version of SSE supports it. I'm restricted to SSE2.
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

dedndave

if it isn't assembling, it may be the version of masm you are using
version 6.14 does not support sse2 (i think that's correct)

this site has a list of instructions for a few different levels of simd
http://softpixel.com/~cwright/programming/simd/

this is a link for masm 6.15
http://www.4shared.com/file/139758027/648a9665/ML_online.html

jj2007

Does anybody know a Microsoft(tm) binary string converter that can be called from Masm? Convert.ToInt32 uses mscorlib.dll, not included in Masm32...

In the meantime, I have added another SSE2 algo for the fixed 8-bit section, BinVal0, and a modified version of simd_bin2byte (qWord). Both run at 31 cycles.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
160     cycles for Bin2Dw       (drizz+JJ, variable size)
77      cycles for BinValSSE_bt (JJ, variable size)
31      cycles for BinVal0 (SSE2)
50      cycles for BinVal2 (non-SSE2)
38      cycles for simd_bin2byte
31      cycles for simd_bin2byteB
61      cycles for bin2byte_exLib

oex

Weird, I have linker 6.15.8803 (with processor pack) and it doesn't like pxor... According to that link it's SSE2, dave

lingo

oex,
The problem is not just in the tools... :wink
For example: everything written by JJ can be optimized with ease:
C:\My Documents\ASM\bin2byte>bin2byte
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

139     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
14      cycles for BinValLingo (SSE2)
22      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
47      cycles for bin2byte_exLib

133     cycles for Bin2Dw       (drizz+JJ, variable size)
55      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
15      cycles for BinValLingo (SSE2)
22      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
47      cycles for bin2byte_exLib

133     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
15      cycles for BinValLingo (SSE2)
24      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
46      cycles for bin2byte_exLib

Testing BinVal0 (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765=1431655765
0, 3, 12, 48, 192, 255, over8: 255, 85=1431655765

Code size:
32 bytes Bin2Dw (drizz+JJ)
199 bytes BinValSSE_bt (JJ)
50 bytes BinVal0 (JJ)
42 bytes BinValLingo (Lingo)
50 bytes simd_bin2byte (qWord)
47 bytes simd_bin2byteB (qWord+JJ)
472 bytes bin2byte_ex (Masm32 library)

Hit any key to get outta here

2 cycles and 8 bytes less...  :lol

jj2007

Quote from: lingo on November 22, 2009, 07:52:01 AM
oex,
The problem is not just in the tools... :wink
For example: everything written by JJ can be optimized with easy:
....
2 cycles and 8 bytes less...  :lol

Lingo is clearly our champion! However, for the sake of correctness: The mask1 qwords are part of the "code size", so it is +8 bytes :bg
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
160     cycles for Bin2Dw       (drizz+JJ, variable size)
77      cycles for BinValSSE_bt (JJ, variable size)
30      cycles for BinVal0 (SSE2)
25      cycles for BinValLingo (SSE2)
50      cycles for BinVal2 (non-SSE2)
38      cycles for simd_bin2byte
31      cycles for simd_bin2byteB
61      cycles for bin2byte_exLib


There is another bit that I don't quite understand:
Testing BinValLingo (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765=1431655765
0, 0, 0, 0, 0, 0, over8: 0, 0=1431655765

It looks as if Lingo's algo returns 0 by default ::)
But most probably, I've done some manipulation while copying his algo... please check carefully

EDIT: With this minor modification, we really got a new champion: Lingo :cheekygreen:

Quote:
align 16
; mask1 dq 0Fh,0   ; Lingo's version
mask1 dq 3131313131313131h

BinValLingo proc    ; lpSrc
   pop   ecx      
   pop   eax      
   mov   edx, [eax]
   mov   eax, [eax+4]
   bswap   edx
   bswap eax
   movd   xmm0, edx
   movd   xmm1, eax
   punpckldq   xmm1, xmm0
   pcmpeqb      xmm1, xmmword ptr [mask1]
   pmovmskb   eax, xmm1   
   ; and eax, 0FFh      ; no longer necessary with the correct mask (and, by the way, NOT faster than movzx eax, al ;-)
   jmp ecx
BinValLingo endp
BinValLingo_END:

Correct results, and 4 cycles faster than my last baby, BinVal0, and even 2 bytes shorter

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

160     cycles for Bin2Dw       (drizz+JJ, variable size)
76      cycles for BinValSSE_bt (JJ, variable size)
29      cycles for BinVal0 (SSE2)
25      cycles for BinValLingo (SSE2)
31      cycles for simd_bin2byteB
61      cycles for bin2byte_exLib

Testing BinValLingo (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765=1431655765
0, 3, 12, 48, 192, 255, over8: 255, 85=1431655765

Code size:
32 bytes Bin2Dw (drizz+JJ)
199 bytes BinValSSE_bt (JJ)
47 bytes BinVal0 (JJ)
45 bytes BinValLingo (Lingo)
82 bytes BinVal2 (JJ)
47 bytes simd_bin2byteB (qWord+JJ)
472 bytes bin2byte_ex (Masm32 library)

oex

:D nice 1 Lingo. Give me a few months until I've finished my current project and I'll get some proper practice in to give you guys a run for your money :). Started playing with SSE a few weeks ago so I have some catching up to do :)

In the meantime I'll keep an eye out for any other little compos you guys have running and see if I can post something to warm up :)

Best I could do this time was 10% slower than your b2dw1 attempt and 400 bytes (would post but it is calculating a couple of values the wrong way round and I can't find the bug lol)

dedndave

prescott - with my usual "funny" numbers - lol

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

182     cycles for Bin2Dw       (drizz+JJ, variable size)
186     cycles for BinValSSE_bt (JJ, variable size)
117     cycles for BinVal0 (SSE2)
88      cycles for BinValLingo (SSE2)
136     cycles for simd_bin2byteB
82      cycles for bin2byte_exLib

202     cycles for Bin2Dw       (drizz+JJ, variable size)
190     cycles for BinValSSE_bt (JJ, variable size)
131     cycles for BinVal0 (SSE2)
124     cycles for BinValLingo (SSE2)
137     cycles for simd_bin2byteB
93      cycles for bin2byte_exLib

180     cycles for Bin2Dw       (drizz+JJ, variable size)
187     cycles for BinValSSE_bt (JJ, variable size)
130     cycles for BinVal0 (SSE2)
123     cycles for BinValLingo (SSE2)
120     cycles for simd_bin2byteB
71      cycles for bin2byte_exLib

with my machine, it's hard to pick a big winner, other than bin2byte_exLib, which is kinda large
you can see why it is hard for me to optimize code

hutch--

Dave,

I think it means the processor you are using is not that fast with sse code. As is usually the case with narrow algos posted in here, they go the way of being very fast on limited hardware and don't run on most other hardware.

On my quad the SSE versions are clearly faster.

This is the version that lingo posted. I played with the modified lib version and got another 2% off it but it will never be as fast as an SSE version.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)

134     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
15      cycles for BinValLingo (SSE2)
22      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
45      cycles for bin2byte_exLib

133     cycles for Bin2Dw       (drizz+JJ, variable size)
52      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
15      cycles for BinValLingo (SSE2)
22      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
45      cycles for bin2byte_exLib

133     cycles for Bin2Dw       (drizz+JJ, variable size)
53      cycles for BinValSSE_bt (JJ, variable size)
17      cycles for BinVal0 (SSE2)
15      cycles for BinValLingo (SSE2)
22      cycles for simd_bin2byte
19      cycles for simd_bin2byteB
45      cycles for bin2byte_exLib

Testing BinVal0 (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765=1431655765
0, 3, 12, 48, 192, 255, over8: 255, 85=1431655765

Code size:
32 bytes Bin2Dw (drizz+JJ)
199 bytes BinValSSE_bt (JJ)
50 bytes BinVal0 (JJ)
42 bytes BinValLingo (Lingo)
50 bytes simd_bin2byte (qWord)
47 bytes simd_bin2byteB (qWord+JJ)
440 bytes bin2byte_ex (Masm32 library)

Hit any key to get outta here
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

EDIT start: New attachment solves two little problems with ML 6.15 and JWasm (remember ML 6.14 as included in Masm32 does not know SSE2....):

; movsd xmm0, QWORD ptr [eax] ; mlv615 and JWasm choked:
; Error A2084: Invalid operand size for instruction
movlps xmm0, QWORD ptr [eax] ; workaround, same cycle count and one byte shorter


; xmmword is unknown to earlier ML versions, but oword is the same
pcmpeqb xmm1, oword ptr [bvMask]

EDIT end

Timing the Prescott is very tricky indeed. Here is my "winner" for the fixed 8-bit version - I baptised it BinValJL because it's kind of a synthesis between Lingo's and my code:

align 16 ; absolutely needed, otherwise pcmpeqb chokes
bvMask dd "1111", "1111"
BinValJL proc ; lpSrc ; 45 bytes including the mask; 8 bits fixed length
pop ecx ; ret address
pop edx ; pointer to src$
mov eax, [edx] ; get 4 chars
bswap eax
movd xmm0, eax
mov eax, [edx+4] ; get 4 more
bswap eax
movd xmm1, eax
punpckldq xmm1, xmm0 ; interleave low dwords
; compare packed bytes to "1"
pcmpeqb xmm1, xmmword ptr [bvMask]
pmovmskb eax, xmm1 ; copy the sign bit of each byte of xmm1 into eax (the byte mask is the result)
jmp ecx
BinValJL endp
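
For readers who do not read MASM, here is a minimal C sketch of the same compare-and-movemask idea. It is not the code from the attachment: the function name binval8 is made up for illustration, the reversal loop stands in for the two bswap + punpckldq steps, and the scalar branch is only there so the sketch also builds where SSE2 is not available. Like the assembly version, it assumes the source holds at least 8 characters and treats anything that is not "1" as a zero bit.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* binval8 is a hypothetical name, not part of Masm32 or of the thread's code.
   Converts exactly 8 characters of '0'/'1' to a byte value; the first
   character becomes the most significant bit. */
unsigned binval8(const char *src)
{
    assert(src != NULL);

    /* Reverse the 8 characters so that the LAST one lands in byte 0.
       pmovmskb puts byte 0 into bit 0, so this makes the first
       character the top bit of the result. */
    unsigned char rev[8];
    for (int i = 0; i < 8; i++)
        rev[i] = (unsigned char)src[7 - i];

#ifdef HAVE_SSE2
    /* Load 8 bytes into the low half of an XMM register, upper half zeroed. */
    __m128i v    = _mm_loadl_epi64((const __m128i *)rev);
    __m128i ones = _mm_set1_epi8('1');
    __m128i eq   = _mm_cmpeq_epi8(v, ones);          /* pcmpeqb: 0FFh where byte == '1' */
    return (unsigned)_mm_movemask_epi8(eq) & 0xFFu;  /* pmovmskb: one bit per byte */
#else
    /* Scalar model of pcmpeqb + pmovmskb on the low 8 bytes. */
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (rev[i] == '1')
            mask |= 1u << i;
    return mask;
#endif
}
```

Called with "00000011", "00001100", "00110000" and "11000000" it returns 3, 12, 48 and 192. Those inputs are my guess at the strings behind the "0, 3, 12, 48, 192, 255" test line.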


Quote:
29      cycles for BinVal0 (SSE2)
23      cycles for BinValJL (SSE2)
25      cycles for BinValLingo (SSE2)

Now, the other issue is being able to read in everything that looks vaguely like a binary string. For that purpose, I would pick Bin2Dw - kind of slow, but only 32 bytes short; and I don't believe that you typically need to read in a Bin$ in an innermost loop that runs a million times...
:bg
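
As an illustration of the variable-size case, here is a small scalar sketch in C. It is not Bin2Dw itself (its source is in the attachment, not in this post). The name bin_to_u32 and the rule "stop at the first character that is not 0 or 1" are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bin_to_u32 is a hypothetical helper, not the Bin2Dw routine from the thread.
   Reads '0'/'1' characters from s until any other character (including
   the terminating zero) and returns the accumulated value. With more
   than 32 digits the oldest bits are simply shifted out. */
uint32_t bin_to_u32(const char *s)
{
    assert(s != NULL);

    uint32_t value = 0;
    while (*s == '0' || *s == '1') {
        /* Shift the previous bits up and append the new one at the bottom. */
        value = (value << 1) | (uint32_t)(*s - '0');
        s++;
    }
    return value;
}
```

With 32 alternating digits starting with 0 this gives 1431655765 (55555555h), the value that appears at the end of the test lines above. A loop like this costs a few cycles per digit, which is the trade-off against the fixed 8-bit SSE2 versions.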

hutch--

Here is the difference that hardware makes. This is the test bed run on my 3.8 gig PIV.


Genuine Intel(R) CPU 3.80GHz (SSE3)

180     cycles for Bin2Dw       (drizz+JJ, variable size)
185     cycles for BinValSSE_bt (JJ, variable size)
154     cycles for BinVal0 (SSE2)
150     cycles for BinValLingo (SSE2)
77      cycles for simd_bin2byte
137     cycles for simd_bin2byteB
103     cycles for bin2byte_exLib

180     cycles for Bin2Dw       (drizz+JJ, variable size)
185     cycles for BinValSSE_bt (JJ, variable size)
153     cycles for BinVal0 (SSE2)
150     cycles for BinValLingo (SSE2)
77      cycles for simd_bin2byte
137     cycles for simd_bin2byteB
83      cycles for bin2byte_exLib

179     cycles for Bin2Dw       (drizz+JJ, variable size)
186     cycles for BinValSSE_bt (JJ, variable size)
114     cycles for BinVal0 (SSE2)
150     cycles for BinValLingo (SSE2)
79      cycles for simd_bin2byte
139     cycles for simd_bin2byteB
72      cycles for bin2byte_exLib

Testing BinVal0 (2 lines must match):
0, 3, 12, 48, 192, 255, over8: -1, 1431655765=1431655765
0, 3, 12, 48, 192, 255, over8: 255, 85=1431655765

Code size:
32 bytes Bin2Dw (drizz+JJ)
199 bytes BinValSSE_bt (JJ)
50 bytes BinVal0 (JJ)
42 bytes BinValLingo (Lingo)
50 bytes simd_bin2byte (qWord)
47 bytes simd_bin2byteB (qWord+JJ)
440 bytes bin2byte_ex (Masm32 library)

Hit any key to get outta here

GregL

Quote from: jj2007
; movsd xmm0, QWORD ptr [eax] ; mlv615 and JWasm choked:

To use SSE2 MOVSD and CMPSD with ML 6.15 you need macros:


; ML assembler version 6.15 supports XMM instructions, except the
; instructions MOVSD and CMPSD which are confused with integer string
; instructions with the same names.

MOVSD_ MACRO A, B
  DB  0F2H
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB  0F2H
  CMPPS A, B, C
ENDM

jj2007

Thanks, Greg - I vaguely remembered these macros but couldn't find them.
@qWord: If you prefer the movsd macro over movlps, I'll change the source again. Let me know please.

Edit: The MOVSD_ macro works fine with ML 6.15 but chokes with ML 9.0 and JWasm with "error A2070:invalid instruction operands". In order to avoid version chaos, I suggest sticking with movlps - it's shorter and equally fast.

jj2007

OK, I give up - I've tried everything, but I cannot push it under 16 cycles...

Quote:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

160     cycles for Bin2Dw       (drizz+JJ, variable size)
77      cycles for BinValSSE_bt (JJ, variable size)
16      cycles for BinValJJ (SSE2)
23      cycles for BinValJL (SSE2)
25      cycles for BinValLingo (SSE2)
38      cycles for simd_bin2byte
32      cycles for simd_bin2byteB
61      cycles for bin2byte_exLib

dedndave

very cool Jochen   :U
although, i thought "final" was a concept beyond assembly programmers - lol
prescott times

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

180     cycles for Bin2Dw       (drizz+JJ, variable size)
187     cycles for BinValSSE_bt (JJ, variable size)
40      cycles for BinValJJ (SSE2)
88      cycles for BinValJL (SSE2)
140     cycles for BinValLingo (SSE2)
58      cycles for simd_bin2byte
187     cycles for simd_bin2byteB
104     cycles for bin2byte_exLib

193     cycles for Bin2Dw       (drizz+JJ, variable size)
188     cycles for BinValSSE_bt (JJ, variable size)
40      cycles for BinValJJ (SSE2)
86      cycles for BinValJL (SSE2)
116     cycles for BinValLingo (SSE2)
58      cycles for simd_bin2byte
183     cycles for simd_bin2byteB
89      cycles for bin2byte_exLib

180     cycles for Bin2Dw       (drizz+JJ, variable size)
187     cycles for BinValSSE_bt (JJ, variable size)
57      cycles for BinValJJ (SSE2)
103     cycles for BinValJL (SSE2)
111     cycles for BinValLingo (SSE2)
58      cycles for simd_bin2byte
189     cycles for simd_bin2byteB
98      cycles for bin2byte_exLib