Faster Bin2ASCII

NightWare · March 30, 2008, 09:29:05 PM

Quote from: lingo on March 30, 2008, 05:50:54 AM
Your easy ways are just to get lingo's decision of the problem... :lol

in fact, i speaking of HOW i've fixed the code :lol

123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
37 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
19 cycles, udwordl-lingo 3
Press any key to exit...

MichaelW · March 31, 2008, 10:01:29 PM

For what little it's worth, I tested a version that used the FPU to do the conversion to decimal, and while it worked OK, it was *very* slow. On my P3, just the necessary fild and fbstp instructions took 168 cycles.

NightWare · March 31, 2008, 11:52:23 PM

fild not really slow, but look at fbstp in agner fog's instructions tables... :lol

and it's not only on p3

MichaelW · April 01, 2008, 01:26:10 AM

I tested them together because I was guessing that I could not get a valid cycle count from a thousand iterations of fbstp. Now that I try I get 106 cycles for fbstp alone, and 106 cycles for fild alone, so I'll assume that my guess was right.

NightWare · April 01, 2008, 02:11:00 AM

:eek 106 cycles for fild alone ? for fbstp yes, it use the (slow) bcd technic on a ten byte value (it's why fpu is slow, everything is made with ten byte values...) but for fild it's abnormal, it should be between 2 and 12

MichaelW · April 01, 2008, 02:36:35 AM

I assume it's abnormal because for either one of the instructions used alone, 992 of the 1000 iterations generate an exception (in this case internal to the FPU).

thomas_remkus · April 01, 2008, 03:00:33 AM

1) Is there a collection of procs like this where the fastest get put inside a header for maybe the next version of MASM?

2) I see that MMX is mostly ignored from performance routines like this. Is this because of the challenge or because it's not practical to test MMX versions of these procs?

Oh, an my tests were:

Code Select

GenuineIntel Family 15 Model 4 Stepping 7
FPU TSC CX8 CMOV CLFSH FXSR HTT MXX SSE SSE2 SSE3
Pentium D 2.8, 3.5gb RAM

122 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo
64 cycles, bi2ascii-lingo 2
85 cycles, uDw2Ascii - NighwWare
48 cycles, udword-p.dixon
39 cycles, udwordl-lingo 3

Ian_B · April 01, 2008, 03:27:26 AM

Heh, I pulled out my own version of this from the old thread the other day to see if I could optimise it any more. That was a repeated-subtraction approach, hideous code length but had the advantage of being fastest for small numbers compared to Paul Dixon's approach, because it processes only the digits it needs to. To my chagrin, every trick I tried couldn't improve on the code speed at all and generally made it far worse! I tried passing params in registers to allow a "JMP proc ... JMP ebx" type of pseudo-call to remove the repeated loops, but this was terrible speedwise too. However bad the code looks, it was still beating Pauls' MUL-based approach for most values. I'll be interested to try the new Lingo code, tho.

FWIW, no-one has yet posted a QWORD-based solution (value to convert held in EDX:EAX). My old code, however bad, is still a benchmark there... I'd be interested to see if anyone can come up with a better algorithm for that.

NightWare · April 01, 2008, 10:06:25 PM

thomas remkus,
MMX or SSE+ are usefull for parallel ops, but here you need to know a value, depending of another value, depending of another value, etc... so it's inappropriate here, (it's possible if you cut the value in 2*5 digits, but) after that, you will have to test/shift the bytes to remove useless zeros, and all thoses ops on 64/128 bits registers... not really efficient... :(

michaelw,
if i understand well, you try to rework your timing macro ? if you ask me, you loose your time, because of cpu hardware features... i explain : on p4- we had to do things by ourself, but on p4 and + the automatic harware data prefetcher has been implemented, so all thoses +1 extra uops due to mem access are removed, for example, if you take an algo with 40 cycles (where there is +9 uops due to mem access) and 1000 iterations in loop test you will obtain 40 cycles on P3 and 31.009 cycles on P4+ (not exactly that, but you've understood the concept). the cache is heavilly used, and automatic, on P4+. it depend of the usage you make of your algo. if the algo is used once with many other algos, then you will prefere the algo with the best result on P3. in the other hand, if like me you group the tasks to use recent cpu features, you will prefere the algo with the best result on P4+... it really depend of the usage. now make your own opinion...

lingo · April 03, 2008, 01:57:12 PM

a bit cosmetics and new time again: :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Windows Vista with SP1  

123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

Jimg · April 03, 2008, 02:26:57 PM

lingo-

Your latest is consistently 5 cycles slower on my

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

Previous version:

Code Select

63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

New Version:

Code Select

62 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
40 cycles, udwordl-lingo 3
Press any key to exit...

Ossa · April 03, 2008, 07:29:44 PM

Code Select

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

52 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3

3 cycles faster than last time.

NightWare · April 03, 2008, 10:03:57 PM

Code Select

123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
36 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...

aproximatively same speed than the previous one, for me

lingo · April 11, 2008, 12:56:28 AM

"FWIW, no-one has yet posted a QWORD-based solution"

Here is P.Dixon's algo which combined 32 and 64 bit code with my modest attempt for improvement... :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Vista Ultimate SP1  

18446744073709551614
18446744073709551614
4294967294
4294967294
137 cycles, uqword-with 64 bit number - p.dixon
125 cycles, b2a3264-with 64 bit number - lingo
45 cycles, uqword-with 32 bit number - p.dixon
39 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

six_L · April 11, 2008, 03:14:41 AM

Quote18446744073709551614
18446744073709551614
4294967294
4294967294
87 cycles, uqword-with 64 bit number - p.dixon
77 cycles, b2a3264-with 64 bit number - lingo
43 cycles, uqword-with 32 bit number - p.dixon
44 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...

News:

Faster Bin2ASCII

thomas_remkus

Ian_B