News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

bin2byte_ex

Started by jj2007, November 16, 2009, 09:13:31 PM

Previous topic - Next topic

dedndave

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

168     cycles for BinValSSE
512     cycles for BinVal0
93      cycles for BinVal3
119     cycles for BinToDw (drizz)
135     cycles for BinToDw2 (drizz+JJ)
113     cycles for b2dw1 (Lingo)
79      cycles for simd_bin2byte
105     cycles for bin2byte_exLib

164     cycles for BinValSSE
380     cycles for BinVal0
85      cycles for BinVal3
120     cycles for BinToDw (drizz)
120     cycles for BinToDw2 (drizz+JJ)
115     cycles for b2dw1 (Lingo)
78      cycles for simd_bin2byte
107     cycles for bin2byte_exLib

165     cycles for BinValSSE
497     cycles for BinVal0
86      cycles for BinVal3
119     cycles for BinToDw (drizz)
120     cycles for BinToDw2 (drizz+JJ)
113     cycles for b2dw1 (Lingo)
78      cycles for simd_bin2byte
104     cycles for bin2byte_exLib


Testing BinValSSE (2 lines must match):
0, 3, 12, 48, 192, 255, 1431655765=1431655765
0, 3, 12, 48, 192, 255, 1431655765=1431655765

Code size:
182 bytes BinValSSE
31 bytes BinVal0
85 bytes BinVal3
50 bytes SSE2 qWord
25 bytes BinToDw
25 bytes BinToDw2
133 bytes b2dw1
8760 bytes bin2byte_ex (Masm32 library)

hutch--

I tweaked the library version but got little gain in speed, about 2%. First mod was 70% slower so I split the input bin string into 2 and fed it through two FSM trees and it is marginally faster but far smaller.

Here is the result using JJs test bed.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)

277     cycles for BinVal1
37      cycles for BinVal2
37      cycles for BinVal3
100     cycles for BinToDw (drizz)
67      cycles for BinToDw2 (drizz+JJ)
44      cycles for b2dw1 (Lingo)
27      cycles for simd_bin2byte
47      cycles for bin2byte_exLib

264     cycles for BinVal1
37      cycles for BinVal2
37      cycles for BinVal3
70      cycles for BinToDw (drizz)
70      cycles for BinToDw2 (drizz+JJ)
44      cycles for b2dw1 (Lingo)
27      cycles for simd_bin2byte
48      cycles for bin2byte_exLib

263     cycles for BinVal1
37      cycles for BinVal2
37      cycles for BinVal3
98      cycles for BinToDw (drizz)
67      cycles for BinToDw2 (drizz+JJ)
44      cycles for b2dw1 (Lingo)
27      cycles for simd_bin2byte
47      cycles for bin2byte_exLib


Testing b2dw1 (2 lines must match):
0, 3, 12, 48, 192, 255, 1431655765=1431655765
0, 3, 12, 48, 192, 255, -1431655765=1431655765

Code size:
27 bytes BinVal1
89 bytes BinVal2
85 bytes BinVal3
50 bytes SSE2 qWord
25 bytes BinToDw
25 bytes BinToDw2
133 bytes b2dw1
474 bytes bin2byte_ex (Masm32 library)

Hit any key to get outta here
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on November 18, 2009, 02:15:32 AM
I tweaked the library version but got little gain in speed, about 2%. First mod was 70% slower so I split the input bin string into 2 and fed it through two FSM trees and it is marginally faster but far smaller.

Here is the result using JJs test bed.


That looks pretty competitive, thanks Hutch. Attached new version with your version and the BinValSSE, with changes by Dave.

P.S.: I also changed the include lines to make the testbed compatible for those who are not proud owners of MasmBasic :green

dedndave

prescott...

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

141     cycles for BinValSSE
352     cycles for BinVal0
85      cycles for BinVal3
121     cycles for BinToDw (drizz)
119     cycles for BinToDw2 (drizz+JJ)
114     cycles for b2dw1 (Lingo)
79      cycles for simd_bin2byte
107     cycles for bin2byte_exLib

140     cycles for BinValSSE
253     cycles for BinVal0
86      cycles for BinVal3
119     cycles for BinToDw (drizz)
119     cycles for BinToDw2 (drizz+JJ)
114     cycles for b2dw1 (Lingo)
78      cycles for simd_bin2byte
103     cycles for bin2byte_exLib

142     cycles for BinValSSE
360     cycles for BinVal0
85      cycles for BinVal3
120     cycles for BinToDw (drizz)
132     cycles for BinToDw2 (drizz+JJ)
117     cycles for b2dw1 (Lingo)
89      cycles for simd_bin2byte
82      cycles for bin2byte_exLib


Testing bin2byte_exLib (2 lines must match):
0, 3, 12, 48, 192, 255, 1431655765=1431655765
0, 3, 12, 48, 192, 255, 85=1431655765
Last item different: Apparently, the algo does not recognize formats different
from '10101010', i.e. it needs fixed 8 chars or 8 chars plus zero delimiter
Code size:
180 bytes BinValSSE
31 bytes BinVal0
85 bytes BinVal3
50 bytes SSE2 qWord
25 bytes BinToDw
25 bytes BinToDw2
133 bytes b2dw1
472 bytes bin2byte_ex (Masm32 library)

jj2007

#19
One more. I have fumbled with a byte-size table (BinValSSE_bt), but it's kind of slow. In the end, a small modification of drizz' algo seems what I was looking for: A flexible and reasonably fast algo. Bin2Dw handles 101010, 000b, 111y, i.e. it will interpret anything until the char is neither 0 nor 1. And it's exactly 2 paras short :bg

QuoteBin2Dw proc    ; pszbinstr:DWORD      ; this one stops when char is not "1" or "0"
   mov edx, [esp+4]      ; credits drizz with mods by jj2007
   xor eax, eax
   push ecx         ; MasmBasic preserves ecx ;-)
   jmp byte1

@@:   and ecx, 1
   lea eax, [eax*2+ecx]   ; multiply eax by 2, add one if the char was "1"
byte1:   mov cl, byte ptr [edx]   ; get next char
   inc edx
   cmp cl, "1"
   ja @F
   cmp cl, "0"
   jae @B

@@:   pop ecx
   ret 4
Bin2Dw endp

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

274     cycles for BinValSSE_bt
322     cycles for BinValSSE_wt
294     cycles for BinValSSE
404     cycles for BinVal0
85      cycles for BinVal3
120     cycles for BinToDw   (drizz)
182     cycles for Bin2Dw    (drizz+JJ)
113     cycles for b2dw1 (Lingo)
47      cycles for simd_bin2byte
90      cycles for bin2byte_exLib

235     cycles for BinValSSE_bt
164     cycles for BinValSSE_wt
287     cycles for BinValSSE
500     cycles for BinVal0
85      cycles for BinVal3
119     cycles for BinToDw   (drizz)
179     cycles for Bin2Dw    (drizz+JJ)
141     cycles for b2dw1 (Lingo)
88      cycles for simd_bin2byte
92      cycles for bin2byte_exLib

224     cycles for BinValSSE_bt
266     cycles for BinValSSE_wt
312     cycles for BinValSSE
575     cycles for BinVal0
85      cycles for BinVal3
119     cycles for BinToDw   (drizz)
180     cycles for Bin2Dw    (drizz+JJ)
113     cycles for b2dw1 (Lingo)
47      cycles for simd_bin2byte
103     cycles for bin2byte_exLib


Testing Bin2Dw (2 lines must match):
0, 3, 12, 48, 192, 255, -1, 1431655765=1431655765
0, 3, 12, 48, 192, 255, -1, 1431655765=1431655765

Code size:
199 bytes BinValSSE_bt
39 bytes BinVal0
85 bytes BinVal3
50 bytes SSE2 qWord
25 bytes BinToDw
32 bytes Bin2Dw
133 bytes b2dw1
472 bytes bin2byte_ex (Masm32 library)


Edit: Celeron timings, a lot different....
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

94      cycles for BinValSSE_bt
88      cycles for BinValSSE_wt
134     cycles for BinValSSE
237     cycles for BinVal0
55      cycles for BinVal3
91      cycles for BinToDw   (drizz)
160     cycles for Bin2Dw    (drizz+JJ)
72      cycles for b2dw1 (Lingo)
47      cycles for simd_bin2byte
61      cycles for bin2byte_exLib

oex

#20
Hi JJ, I was looking at your function and changed
   mov edx, [esp+4]
   mov ebx, [edx+4]
   mov edx, [edx]

To Remove
   mov edx, [esp+4]
   mov edx, [edx+4]
With a massive reduction in speed..... but why?

Also for all functions 'align 16' Is Used, I almost get this (I think) but am unsure exactly what it applies to ie if I were to have
cmp eax, '0000' on align 4 would it have an effect? In BinVal3 you have align 16 but I see no obvious reason


xs = 1
align 16
BinVal3 proc    ; lpSrc
;   int 3
   mov edx, [esp+4]
   mov edx, [edx]
   xor eax, eax
   cmp dl, "1"
   jne @F
   add eax, 8
@@:   cmp dh, "1"
   jne @F
   add eax, 4
@@:   bswap edx
   cmp dl, "1"
   jne @F
   add eax, 2
@@:   cmp dh, "1"
   jne @F
   inc eax
@@:   shl eax, 4
   mov edx, [esp+4]
   mov edx, [edx+4]
   cmp dl, "1"
   jne @F
   add eax, 8
@@:   cmp dh, "1"
   jne @F
   add eax, 4
@@:   bswap edx
   cmp dl, "1"
   jne @F
   add eax, 2
@@:   cmp dh, "1"
   jne @F
   inc eax
@@:   ret 4
;bvpError:   or eax, -1
;   jmp @B
BinVal3 endp
BinVal3_END:
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

jj2007

Quote from: oex on November 20, 2009, 11:52:14 PM
Hi JJ, I was looking at your function and changed
   mov edx, [esp+4]
   mov ebx, [edx+4]
   mov edx, [edx]

To Remove
   mov edx, [esp+4]
   mov edx, [edx+4]
With a massive reduction in speed..... but why?
Search for Algo equ and put the name of the changed algo there. If it still yields the correct result, post your complete code :U

align 16 inserts some bytes before the proc, so that it is aligned to a 16-byte boundary. Many people believe this can speed up the code.

oex

Quote from: jj2007 on November 21, 2009, 08:24:34 AM
Search for Algo equ and put the name of the changed algo there. If it still yields the correct result, post your complete code :U

align 16 inserts some bytes before the proc, so that it is aligned to a 16-byte boundary. Many people believe this can speed up the code.

Ty JJ I did notice a speed increase, with align 16, my code was close to Lingos speed but 400 bytes, I dont have MasmBasic.inc so cant build it.

I believe mixing registers must also be speed detremental ie ecx, [edx+4] (which also makes me wonder if mov edx, [esp+4] is quicker)

We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

jj2007

Quote from: oex on November 21, 2009, 01:43:29 PM
I dont have MasmBasic.inc so cant build it.
Just change the first two lines as follows:
include \masm32\include\masm32rt.inc
; include \masm32\MasmBasic\MasmBasic.inc

oex

We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

hutch--

oex,

Mixed registers are not the problem, its regster dependency that causes stalls.


mov eax, [ecx+eax*4]


The read from the address [ecx+eax*4] is overwriting its own index which makes it difficult to pipeline instructions of this type. Its still a better option that a memory read/write but if you can avoid this type of dependency you do so by using another register.

Depending on the algorithm design that more registers you can keep free the easier it is to avoid problems like this.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

oex

My problem actually seemed to be the other way round...

mov edx, [esp+4]
mov ecx, [edx+4]
mov edx, [edx]

slower than

mov edx, [esp+4]
mov edx, [edx]
then later in script
mov edx, [esp+4]
mov edx, [edx+4]
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

jj2007

#27
I managed to squeeze out 5 cycles from the BinVal2 algo. It is now almost on par with qWord's simd_bin2byte.
Note that the first two are slower because they accept variable size input and delimiter, e.g. 10101b, 111000111000y etc.; all others require 8-bit input.

Note there is a strange problem with line 304, (edit) movsd xmm0, QWORD ptr [eax]: Both ml 6.15 and JWasm choke with Error A2084: Invalid operand size for instruction (ml 6.14 can't deal with SSE2). So this code assembles only with higher versions of ml.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

160     cycles for Bin2Dw       (drizz+JJ, variable size)
94      cycles for BinValSSE_bt (JJ, variable size)
50      cycles for BinVal2
55      cycles for BinVal3
47      cycles for simd_bin2byte
81      cycles for bin2byte_exLib

160     cycles for Bin2Dw       (drizz+JJ, variable size)
94      cycles for BinValSSE_bt (JJ, variable size)
50      cycles for BinVal2
55      cycles for BinVal3
47      cycles for simd_bin2byte
61      cycles for bin2byte_exLib

160     cycles for Bin2Dw       (drizz+JJ, variable size)
94      cycles for BinValSSE_bt (JJ, variable size)
51      cycles for BinVal2
55      cycles for BinVal3
47      cycles for simd_bin2byte
61      cycles for bin2byte_exLib

dedndave

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

179     cycles for Bin2Dw       (drizz+JJ, variable size)
187     cycles for BinValSSE_bt (JJ, variable size)
81      cycles for BinVal2
85      cycles for BinVal3
77      cycles for simd_bin2byte
103     cycles for bin2byte_exLib

179     cycles for Bin2Dw       (drizz+JJ, variable size)
190     cycles for BinValSSE_bt (JJ, variable size)
81      cycles for BinVal2
76      cycles for BinVal3
79      cycles for simd_bin2byte
95      cycles for bin2byte_exLib

180     cycles for Bin2Dw       (drizz+JJ, variable size)
188     cycles for BinValSSE_bt (JJ, variable size)
76      cycles for BinVal2
64      cycles for BinVal3
79      cycles for simd_bin2byte
110     cycles for bin2byte_exLib

hutch--

oex,


mov edx, [esp+4]    ; write to EDX
mov ecx, [edx+4]    ; 1st read after write stall
mov edx, [edx]      ; 2nd read after write stall
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php