I did this a while back and thought I would post it before it got lost in the shuffle.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include timers.asm
bin2dword PROTO :DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
b00 db "00000000000000000000000000000000",0
b01 db "00000000000000000000000000000001",0
b02 db "00000000000000000000000000000010",0
b03 db "00000000000000000000000000000111",0
b04 db "00000000000110100100000000000010",0
b05 db "10000000000000000000000000000000",0
b06 db "11000000000000000000000000000000",0
b07 db "01000000000000000000000000000001",0
b08 db "01010101010101010101010101010101",0
b09 db "10101010101010101010101010101010",0
b10 db "11111111111111111111111111111111",0
b11 db "0000000000000000000000000100",0
b12 db "00000000000000000000000001",0
b13 db "1010101010101010101010",0
b14 db "1010101010101010101",0
b15 db "1111111111111111",0
b16 db "0000000000111",0
b17 db "01100110",0
b18 db "100000",0
b19 db "10001",0
b20 db "100",0
b21 db "0",0
b22 db "1",0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
FOR arg,<b00,b01,b02,b03,b04,b05,b06,b07,b08,b09,b10,\
b11,b12,b13,b14,b15,b16,b17,b18,b19,b20,b21,b22>
invoke crt_strtoul,ADDR arg,NULL,2
print ustr$(eax),13,10
invoke bin2dword,ADDR arg
print ustr$(eax),13,10,13,10
ENDM
LOOP_COUNT EQU 10000000
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke bin2dword,ADDR b08
counter_end
print ustr$(eax)
print " cycles, bin2dword 32-bit input",13,10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke bin2dword,ADDR b17
counter_end
print ustr$(eax)
print " cycles, bin2dword 8-bit input",13,10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke bin2byte_ex,ADDR b17
counter_end
print ustr$(eax)
print " cycles, bin2byte_ex 8-bit input",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; Tried a byte table version, was slower and larger.
; Tried a cmov version, was slower (and less compatible).
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
bin2dword proc pszbinstr:DWORD
push esi
mov esi, [esp+8]
mov ecx, 1 SHL 31 ; load value for bit 31
xor edx, edx
xor eax, eax
align 4
digitLoop:
cmp BYTE PTR[esi+edx], '1'
jne @F
add eax, ecx ; add current bit value to total
@@:
add edx, 1
shr ecx, 1 ; adjust to bit value for next bit
cmp BYTE PTR[esi+edx-1], 0
jne digitLoop
mov ecx, 33 ; adjust result for < 32 digits
sub ecx, edx
shr eax, cl
pop esi
ret 4
bin2dword endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Timing result on a P3:
166 cycles, bin2dword 32-bit input
61 cycles, bin2dword 8-bit input
23 cycles, bin2byte_ex 8-bit input
AMD XP 2500+:
202 cycles, bin2dword 32-bit input
68 cycles, bin2dword 8-bit input
17 cycles, bin2byte_ex 8-bit input
AMD Athlon 1.00 GHz, XP SP2:
133 cycles, bin2dword 32-bit input
53 cycles, bin2dword 8-bit input
17 cycles, bin2byte_ex 8-bit input
Edit: I find it very interesting that Mark's machine and mine (which have very different CPUs in terms of speed) yield the same result for bin2byte_ex 8-bit input.
Paul
Cycle for cycle on AMD CPUs is not hard to understand: his clock period is shorter, so his ran faster than yours.
Because this piece of code is so short, it ran from the CPU cache and memory access was not a factor.
Regards, P1 :8)
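To make the point above concrete, here is a small C sketch that turns a cycle count into wall-clock time. The function name `cycles_to_ns` is made up for this example, and the 1833 MHz figure used in the note below is an assumption about the XP 2500+; the thread does not state its clock speed.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock in MHz.
   One cycle lasts 1000 / MHz nanoseconds. */
double cycles_to_ns(unsigned cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}
```

With the 17-cycle bin2byte_ex result, a 1000 MHz Athlon needs 17 ns per call, while a chip at an assumed 1833 MHz needs roughly 9.3 ns for the same 17 cycles.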
Michael,
So I guess what you are saying is that the results of these tests can be misleading?
Paul
Why not :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 8Dh,0A4h,24h,0,0,0,0,90h
b2dw proc lpszbinstr:dword
mov edx, [esp+4]
mov ecx, 1 SHL 31 ; load value for bit 31
xor eax, eax
cmp byte ptr [edx], '0'
lea edx, [edx+1]
je @f
lea eax, [eax+ecx] ; add current bit value to total
jc @1
@@:
shr ecx, 1 ; adjust to bit value for next bit
cmp byte ptr [edx], '0'
lea edx, [edx+1]
je @b
lea eax, [eax+ecx] ; add current bit value to total
ja @b
@1:
sub edx, [esp+4]
mov ecx, 33 ; adjust result for < 32 digits
sub ecx, edx
shr eax, cl
ret 4
b2dw endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Regards,
Lingo
I won't write an entire proc (too lazy for it now) but here is an idea:
mov al, '0'
@@:
cmp al, [edx]
ja @F
rcl ecx, 1
inc edx
jmp @B
@@:
Quote from: PBrennick on May 05, 2006, 10:08:18 PM
So I guess what you are saying is that the results of these tests can be misleading?
More like misunderstood, if you don't have all the relevant facts.
Regards, P1 :8)
Timings for P3:
166 cycles, bin2dword 32-bit input
153 cycles, b2dw 32-bit input
60 cycles, bin2dword 8-bit input
34 cycles, b2dw 8-bit input
24 cycles, bin2byte_ex 8-bit input
15 cycles, bin2dword 1-bit input
10 cycles, b2dw 1-bit input
tested on Intel(R) Pentium(R) CPU 1.40GHz
127 cycles, bin2dword 32-bit input
208 cycles, b2dw 32-bit input
35 cycles, bin2dword 8-bit input
28 cycles, b2dw 8-bit input
17 cycles, bin2byte_ex 8-bit input
10 cycles, bin2dword 1-bit input
8 cycles, b2dw 1-bit input
Press any key to exit...
These are the results on my PIV.
205 cycles, bin2dword 32-bit input
148 cycles, b2dw 32-bit input
55 cycles, bin2dword 8-bit input
34 cycles, b2dw 8-bit input
21 cycles, bin2byte_ex 8-bit input
7 cycles, bin2dword 1-bit input
2 cycles, b2dw 1-bit input
Press any key to exit...
AMD Athlon 1190 MHz, Windows XP SP2
135 cycles, bin2dword 32-bit input
129 cycles, b2dw 32-bit input
55 cycles, bin2dword 8-bit input
32 cycles, b2dw 8-bit input
22 cycles, bin2byte_ex 8-bit input
11 cycles, bin2dword 1-bit input
10 cycles, b2dw 1-bit input
Let's try:
136 cycles, bin2dword 32-bit input
114 cycles, b2dw 32-bit input
112 cycles, bin2dw 32-bit input
58 cycles, bin2dword 8-bit input
28 cycles, b2dw 8-bit input
26 cycles, bin2dw 8-bit input
15 cycles, bin2byte_ex 8-bit input
7 cycles, bin2dword 1-bit input
5 cycles, b2dw 1-bit input
5 cycles, bin2dw 1-bit input
Changing the unroll block value to 33:
136 cycles, bin2dword 32-bit input
114 cycles, b2dw 32-bit input
36 cycles, bin2dw 32-bit input
57 cycles, bin2dword 8-bit input
28 cycles, b2dw 8-bit input
13 cycles, bin2dw 8-bit input
15 cycles, bin2byte_ex 8-bit input
7 cycles, bin2dword 1-bit input
5 cycles, b2dw 1-bit input
4 cycles, bin2dw 1-bit input
bin2dword2
126 cycles, bin2dword 32-bit input
208 cycles, b2dw 32-bit input
130 cycles, bin2dw 32-bit input
35 cycles, bin2dword 8-bit input
28 cycles, b2dw 8-bit input
45 cycles, bin2dw 8-bit input
17 cycles, bin2byte_ex 8-bit input
10 cycles, bin2dword 1-bit input
8 cycles, b2dw 1-bit input
10 cycles, bin2dw 1-bit input
lingo, i'm disappointed ! :P
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 8
BinToDw proc pszbinstr:DWORD
mov edx,[esp+1*4]
xor eax,eax
jmp @F
.repeat
and ecx,1
lea eax,[eax*2+ecx]
@@: movzx ecx,byte ptr [edx]
inc edx
test ecx,ecx
.until zero?
ret 1*4
BinToDw endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Timings for P3:
164 cycles, bin2dword (MichaelW) 32-bit input
153 cycles, b2dw (lingo) 32-bit input
144 cycles, bin2dw (EduardoS) 32-bit input
107 cycles, BinToDw (drizz) 32-bit input
61 cycles, bin2dword (MichaelW) 8-bit input
34 cycles, b2dw (lingo) 8-bit input
47 cycles, bin2dw (EduardoS) 8-bit input
39 cycles, BinToDw (drizz) 8-bit input
23 cycles, bin2byte_ex (Hutch) 8-bit input
14 cycles, bin2dword (MichaelW) 1-bit input
10 cycles, b2dw (lingo) 1-bit input
8 cycles, bin2dw (EduardoS) 1-bit input
7 cycles, BinToDw (drizz) 1-bit input
I would like to see the results for a good compiler-optimized version (where the coder knows enough to produce an optimal C source, I don't).
Quote from: MichaelW on May 07, 2006, 09:56:27 PM
I would like to see the results for a good compiler-optimized version (where the coder knows enough to produce an optimal C source, I don't).
Maybe drizz's code is easy to convert:
unsigned int b2dw(char* ptr)
{
unsigned int ret = 0;
char tmp;
while(tmp = *(ptr++))
ret = ret * 2 + (tmp & 1);
return ret;
}
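One way to sanity-check the C version, in the spirit of the crt_strtoul comparison in the original test program, is to compare it against strtoul in base 2. The sketch below repeats the function so it compiles on its own (with a const parameter added); the helper name `agrees_with_strtoul` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* The conversion posted above, repeated so this sketch stands alone. */
static unsigned int b2dw(const char *ptr)
{
    unsigned int ret = 0;
    char tmp;
    while ((tmp = *ptr++) != '\0')
        ret = ret * 2 + (tmp & 1);   /* low bit of '0' is 0, of '1' is 1 */
    return ret;
}

/* Hypothetical helper: returns 1 when b2dw matches strtoul(s, NULL, 2).
   Assumes s holds only '0' and '1' and at most 32 digits. */
int agrees_with_strtoul(const char *s)
{
    assert(s != NULL);
    unsigned long expected = strtoul(s, NULL, 2);
    return b2dw(s) == expected;
}
```

Running the helper over the b00 to b22 strings from the first post would be a reasonable regression check before timing anything.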
Timings for Athlon 64:
137 cycles, bin2dword (MichaelW) 32-bit input
115 cycles, b2dw (lingo) 32-bit input
113 cycles, bin2dw (EduardoS) 32-bit input
100 cycles, BinToDw (drizz) 32-bit input
56 cycles, bin2dword (MichaelW) 8-bit input
28 cycles, b2dw (lingo) 8-bit input
26 cycles, bin2dw (EduardoS) 8-bit input
21 cycles, BinToDw (drizz) 8-bit input
15 cycles, bin2byte_ex (Hutch) 8-bit input
7 cycles, bin2dword (MichaelW) 1-bit input
5 cycles, b2dw (lingo) 1-bit input
5 cycles, bin2dw (EduardoS) 1-bit input
5 cycles, BinToDw (drizz) 1-bit input
EDIT: C++ Code fixed.
Using Visual C++ 2003 the generated assembly was:
?b2dw@@YAIPAD@Z PROC NEAR ; b2dw, COMDAT
; 14 : unsigned int ret = 0;
; 15 : char tmp;
; 16 : while(tmp = *(ptr++))
mov edx, DWORD PTR _ptr$[esp-4]
; 17 : ret = ret * 2 + (tmp & 1);
xor ecx, ecx
mov cl, BYTE PTR [edx]
xor eax, eax
test cl, cl
je SHORT $L9651
npad 2
$L9632:
and ecx, 1
inc edx
lea eax, DWORD PTR [ecx+eax*2]
mov cl, BYTE PTR [edx]
test cl, cl
jne SHORT $L9632
$L9651:
; 18 : return ret;
; 19 : }
The timings (I included an unrolled version of bin2dw to avoid C++ being faster than mine :bdg):
136 cycles, bin2dword (MichaelW) 32-bit input
114 cycles, b2dw (lingo) 32-bit input
112 cycles, bin2dw (EduardoS) 32-bit input
113 cycles, bin2dwc (Visual C++) 32-bit input
35 cycles, bin2dwu (C++ Killer) 32-bit input
99 cycles, BinToDw (drizz) 32-bit input
55 cycles, bin2dword (MichaelW) 8-bit input
28 cycles, b2dw (lingo) 8-bit input
26 cycles, bin2dw (EduardoS) 8-bit input
24 cycles, bin2dwc (Visual C++) 8-bit input
13 cycles, bin2dwu (C++ Killer) 8-bit input
21 cycles, BinToDw (drizz) 8-bit input
15 cycles, bin2byte_ex (Hutch) 8-bit input
7 cycles, bin2dword (MichaelW) 1-bit input
5 cycles, b2dw (lingo) 1-bit input
5 cycles, bin2dw (EduardoS) 1-bit input
2 cycles, bin2dwc (Visual C++) 1-bit input
5 cycles, bin2dwu (C++ Killer) 1-bit input
5 cycles, BinToDw (drizz) 1-bit input
“lingo, i'm disappointed !” :lol
OK, bigger and faster again …
P4 Prescott 3.6GHz – XP pro SP2
242 cycles, bin2dword (MichaelW) 32-bit input
149 cycles, b2dw (lingo) 32-bit input
203 cycles, bin2dw (EduardoS) 32-bit input
214 cycles, bin2dwc (Visual C++) 32-bit input
140 cycles, bin2dwu (C++ Killer) 32-bit input
157 cycles, BinToDw (drizz) 32-bit input
109 cycles, b2dw1 (lingo-fast) 32-bit input
67 cycles, bin2dword (MichaelW) 8-bit input
43 cycles, b2dw (lingo) 8-bit input
39 cycles, bin2dw (EduardoS) 8-bit input
47 cycles, bin2dwc (Visual C++) 8-bit input
31 cycles, bin2dwu (C++ Killer) 8-bit input
37 cycles, BinToDw (drizz) 8-bit input
31 cycles, bin2byte_ex (Hutch) 8-bit input
36 cycles, b2dw1 (lingo-fast) 8-bit input
17 cycles, bin2dword (MichaelW) 1-bit input
12 cycles, b2dw (lingo) 1-bit input
10 cycles, bin2dw (EduardoS) 1-bit input
13 cycles, bin2dwc (Visual C++) 1-bit input
8 cycles, bin2dwu (C++ Killer) 1-bit input
12 cycles, BinToDw (drizz) 1-bit input
8 cycles, b2dw1 (lingo-fast) 1-bit input
Press any key to exit...
AMD Turion 64 ML-30 processor (1 MB L2 cache, 1.6 GHz)
– XP pro SP2
136 cycles, bin2dword (MichaelW) 32-bit input
114 cycles, b2dw (lingo) 32-bit input
112 cycles, bin2dw (EduardoS) 32-bit input
113 cycles, bin2dwc (Visual C++) 32-bit input
35 cycles, bin2dwu (C++ Killer) 32-bit input
99 cycles, BinToDw (drizz) 32-bit input
42 cycles, b2dw1 (lingo-fast) 32-bit input
56 cycles, bin2dword (MichaelW) 8-bit input
28 cycles, b2dw (lingo) 8-bit input
26 cycles, bin2dw (EduardoS) 8-bit input
24 cycles, bin2dwc (Visual C++) 8-bit input
13 cycles, bin2dwu (C++ Killer) 8-bit input
21 cycles, BinToDw (drizz) 8-bit input
15 cycles, bin2byte_ex (Hutch) 8-bit input
12 cycles, b2dw1 (lingo-fast) 8-bit input
7 cycles, bin2dword (MichaelW) 1-bit input
5 cycles, b2dw (lingo) 1-bit input
5 cycles, bin2dw (EduardoS) 1-bit input
3 cycles, bin2dwc (Visual C++) 1-bit input
4 cycles, bin2dwu (C++ Killer) 1-bit input
5 cycles, BinToDw (drizz) 1-bit input
4 cycles, b2dw1 (lingo-fast) 1-bit input
Press any key to exit...
Regards,
Lingo
127 cycles, bin2dword (MichaelW) 32-bit input
209 cycles, b2dw (lingo) 32-bit input
130 cycles, bin2dw (EduardoS) 32-bit input
158 cycles, bin2dwc (Visual C++) 32-bit input
77 cycles, bin2dwu (C++ Killer) 32-bit input
97 cycles, BinToDw (drizz) 32-bit input
90 cycles, b2dw1 (lingo-fast) 32-bit input
35 cycles, bin2dword (MichaelW) 8-bit input
28 cycles, b2dw (lingo) 8-bit input
45 cycles, bin2dw (EduardoS) 8-bit input
38 cycles, bin2dwc (Visual C++) 8-bit input
22 cycles, bin2dwu (C++ Killer) 8-bit input
23 cycles, BinToDw (drizz) 8-bit input
34 cycles, bin2byte_ex (Hutch) 8-bit input
20 cycles, b2dw1 (lingo-fast) 8-bit input
10 cycles, bin2dword (MichaelW) 1-bit input
8 cycles, b2dw (lingo) 1-bit input
18 cycles, bin2dw (EduardoS) 1-bit input
4 cycles, bin2dwc (Visual C++) 1-bit input
5 cycles, bin2dwu (C++ Killer) 1-bit input
6 cycles, BinToDw (drizz) 1-bit input
4 cycles, b2dw1 (lingo-fast) 1-bit input
Press any key to exit...
AMD Turion 64/XP Home SP2
bin2dword2:
136 cycles, bin2dword (MichaelW) 32-bit input
114 cycles, b2dw (lingo) 32-bit input
112 cycles, bin2dw (EduardoS) 32-bit input
113 cycles, bin2dwc (Visual C++) 32-bit input
35 cycles, bin2dwu (C++ Killer) 32-bit input
99 cycles, BinToDw (drizz) 32-bit input
42 cycles, b2dw1 (lingo-fast) 32-bit input
55 cycles, bin2dword (MichaelW) 8-bit input
28 cycles, b2dw (lingo) 8-bit input
26 cycles, bin2dw (EduardoS) 8-bit input
24 cycles, bin2dwc (Visual C++) 8-bit input
13 cycles, bin2dwu (C++ Killer) 8-bit input
21 cycles, BinToDw (drizz) 8-bit input
15 cycles, bin2byte_ex (Hutch) 8-bit input
12 cycles, b2dw1 (lingo-fast) 8-bit input
7 cycles, bin2dword (MichaelW) 1-bit input
5 cycles, b2dw (lingo) 1-bit input
5 cycles, bin2dw (EduardoS) 1-bit input
2 cycles, bin2dwc (Visual C++) 1-bit input
5 cycles, bin2dwu (C++ Killer) 1-bit input
5 cycles, BinToDw (drizz) 1-bit input
3 cycles, b2dw1 (lingo-fast) 1-bit input
bin2dword2 P2 2.8 HT
Result1
198 cycles, bin2dword (MichaelW) 32-bit input
136 cycles, b2dw (lingo) 32-bit input
135 cycles, bin2dw (EduardoS) 32-bit input
150 cycles, bin2dwc (Visual C++) 32-bit input
80 cycles, bin2dwu (C++ Killer) 32-bit input
132 cycles, BinToDw (drizz) 32-bit input
65 cycles, b2dw1 (lingo-fast) 32-bit input
46 cycles, bin2dword (MichaelW) 8-bit input
33 cycles, b2dw (lingo) 8-bit input
23 cycles, bin2dw (EduardoS) 8-bit input
34 cycles, bin2dwc (Visual C++) 8-bit input
12 cycles, bin2dwu (C++ Killer) 8-bit input
24 cycles, BinToDw (drizz) 8-bit input
20 cycles, bin2byte_ex (Hutch) 8-bit input
10 cycles, b2dw1 (lingo-fast) 8-bit input
19 cycles, bin2dword (MichaelW) 1-bit input
9 cycles, b2dw (lingo) 1-bit input
0 cycles, bin2dw (EduardoS) 1-bit input <-- ??
1 cycles, bin2dwc (Visual C++) 1-bit input
4294967294 cycles, bin2dwu (C++ Killer) 1-bit input <-- ??
4 cycles, BinToDw (drizz) 1-bit input
4294967294 cycles, b2dw1 (lingo-fast) 1-bit input <---??
Result2:
188 cycles, bin2dword (MichaelW) 32-bit input
138 cycles, b2dw (lingo) 32-bit input
126 cycles, bin2dw (EduardoS) 32-bit input
158 cycles, bin2dwc (Visual C++) 32-bit input
79 cycles, bin2dwu (C++ Killer) 32-bit input
130 cycles, BinToDw (drizz) 32-bit input
70 cycles, b2dw1 (lingo-fast) 32-bit input
41 cycles, bin2dword (MichaelW) 8-bit input
34 cycles, b2dw (lingo) 8-bit input
28 cycles, bin2dw (EduardoS) 8-bit input
36 cycles, bin2dwc (Visual C++) 8-bit input
9 cycles, bin2dwu (C++ Killer) 8-bit input
34 cycles, BinToDw (drizz) 8-bit input
26 cycles, bin2byte_ex (Hutch) 8-bit input
13 cycles, b2dw1 (lingo-fast) 8-bit input
10 cycles, bin2dword (MichaelW) 1-bit input
2 cycles, b2dw (lingo) 1-bit input
10 cycles, bin2dw (EduardoS) 1-bit input
10 cycles, bin2dwc (Visual C++) 1-bit input
10 cycles, bin2dwu (C++ Killer) 1-bit input
3 cycles, BinToDw (drizz) 1-bit input
2 cycles, b2dw1 (lingo-fast) 1-bit input
I get Result1 on a normal run.
I get Result2 after a couple of runs, and then it goes back to Result1.
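A note on the 4294967294 entries in Result1: that is what -2 looks like when printed as an unsigned 32-bit value, which is how the test program prints its counts (ustr$). One plausible reading, and it is only an assumption since timers.asm is not shown in the thread, is that the measured count came out slightly below an overhead figure that the timing macros subtract. The function name below is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: net cycles after subtracting loop overhead,
   returned as the unsigned 32-bit value an unsigned print would show. */
uint32_t reported_cycles(int32_t measured, int32_t overhead)
{
    assert(overhead >= 0);
    int32_t net = measured - overhead;  /* can go negative for very short routines */
    return (uint32_t)net;               /* a net of -2 is reported as 4294967294 */
}
```

Under that assumption, the 0 and 1 cycle readings in the same run are likely just as unreliable as the wrapped ones.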
Guys,
if you want to run a test, OK, but do it correctly... there are align 4, align 8 and align 16 procs in the same test... all of them must be 16-byte aligned! Or you have to test the 2 possible align 8 placements, or the 4 possible align 4 placements, for all the procs... because there are differences... otherwise the results are completely useless...