The originals of these algos were intended to convert ascii/ansi strings to an unsigned DWORD. Here is a test piece that runs the old atodw against two later versions: a short version, and a long version that is unrolled to pick up some speed.
The test piece is to determine the relative speeds of the 3 algos. It tests the 3 algos for correctness, then benchmarks them for comparison purposes.
These are the results on this Core2 quad.
atodw Version
4294967295
987
0
9876
Short Version
4294967295
987
0
9876
Long Version
4294967295
987
0
9876
-------
Timings
-------
Timing atodw version
312
Timing short version
266
Timing long version
109
Press any key to continue ...
What are these measurements in? Clocks or (milli/nano)seconds?
If you need such measurements, feel free to write your own. This benchmark is specifically written to run in real time and to reference the earlier version that will be replaced.
prescott w/htt:
Timing atodw version
344
Timing short version
266
Timing long version
250
had to change because of throttling
Dave,
It looks like the Prescott does not like the paired LEA instructions which is probably normal.
Here are the timings for my P3:
Timing atodw version
2834
Timing short version
1482
Timing long version
981
I also tested all 32-bit values, using crt__ultoa to generate the strings, and found no problems.
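For anyone who wants to reproduce that check, here is a minimal sketch of the round-trip idea (the buffer name and loop layout are illustrative, not the original test code; atodw and crt__ultoa are the functions named above):

.data?
numbuf db 12 dup(?)                         ; room for "4294967295",0
.code
    xor esi, esi                            ; current test value (esi assumed free)
  tstlp:
    invoke crt__ultoa, esi, ADDR numbuf, 10 ; value -> decimal string
    invoke atodw, ADDR numbuf               ; string -> value, returned in eax
    cmp eax, esi
    jne badvalue                            ; any mismatch fails the test (handler not shown)
    add esi, 1
    jnz tstlp                               ; wraps to 0 after 4294967295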
Quote
Dave,
It looks like the Prescott does not like the paired LEA instructions which is probably normal.
the long version still has a pay-off for non-P4 machines
at least it isn't slower than the short version on the P4 :bg
Core 2 Duo 2.0 GHz
atodw Version
4294967295
987
0
9876
Short Version
4294967295
987
0
9876
Long Version
4294967295
987
0
9876
-------
Timings
-------
Timing atodw version
500
Timing short version
390
Timing long version
172
Press any key to continue ...
AMD Athlon II x2 215 2.7Ghz
atodw Version
4294967295
987
0
9876
Short Version
4294967295
987
0
9876
Long Version
4294967295
987
0
9876
-------
Timings
-------
Timing atodw version
296
Timing short version
187
Timing long version
171
Hutch, or anyone, I have a question about libraries.
When you statically or dynamically link to the library, doesn't the entire library become mapped into your virtual space? I mean, it has to be mapped, otherwise you could not "call" a mapped function, right? The loader does not load pieces of the library and skip the rest. It would be different if you had individual .obj files assembled and you linked only the ones that you needed.
If that is the case, is there any real reason to have two functions in the library that do the same thing? Since Hutch's sample has two functions that do the same conversion, and the unrolled one is the fastest, why use anything else? Why keep the short slow one in the library?
Hutch,
Several things about your sample.
1. I tried 17 different ways to do this faster using the decade table (my own code in my own library matched this algo: two leas, process the string once). I could beat atodw, but not the other two. OBTW, this same method of two leas can be used for hex conversions once the character has been converted to a numeric 0-15, by using lea eax,[eax*4] and lea eax,[ecx+eax*4] (see the sketch below).
2. You have no checking for invalid characters other than the null. If you checked and terminated the conversion when a non-numeric character was found, then the function could be used for processing a string containing multiple values, i.e., "1234 5678 9124", "12:30:05", "08/02/2010".
3. In the long version, you return an error condition in ecx. What about returning the converted value in eax, returning the pointer to the character that stopped the scan in edx, and using the following encoding for ecx:
-2 invalid numeric character > '9'
-1 invalid numeric character < '0'
0 valid conversion (edx points to the null)
+1 valid conversion but more than 9 characters (00000000000001)
+2 invalid conversion exceeded 32 bits (500000000)
By returning the error flag and the terminating character pointer, the caller can determine whether to continue on to the next piece of the string.
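In isolation, the two-LEA steps described above look like this (register roles as in the algos earlier in the thread: the running value in one register, the incoming character in the other; a sketch, not library code):

; decimal step: value = value*10 + (char - '0'), with no mul
lea ecx, [ecx+ecx*4]       ; ecx = value*5
lea ecx, [eax+ecx*2-30h]   ; ecx = value*10 + char - 30h

; the hex variant, once the char is already a numeric 0-15 in ecx:
; value = value*16 + digit, again just two LEAs
lea eax, [eax*4]           ; eax = value*4
lea eax, [ecx+eax*4]       ; eax = value*16 + digit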
Dave.
Here it is my test on Core 2 Duo 2.4 Ghz x64 bit:
atodw Version
4294967295
987
0
9876
Short Version
4294967295
987
0
9876
Long Version
4294967295
987
0
9876
-------
Timings
-------
Timing atodw version
390
Timing short version
296
Timing long version
109
Press any key to continue ...
Celeron M:
Timing atodw version
625
Timing short version
532
Timing long version
281
Quote from: KeepingRealBusy
When you statically or dynamically link to the library, doesn't the entire library become mapped into your virtual space? I mean, it has to be mapped, otherwise you could not "call" a mapped function, right? The loader does not load pieces of the library and skip the rest. It would be different if you had individual .obj files assembled and you linked only the ones that you needed.
If the library is built from individual .obj files for each procedure, which is how the masm32 libraries are done, then when you link you only get the individual procedures that you use mapped into your program space and none of the others.
Greg,
Thank you. That is good to know. What about DLL's, especially user and system?
If your program loads a DLL, you get the whole thing mapped into your program space.
Celeron D Prescott test
atodw Version
4294967295
987
0
9876
Short Version
4294967295
987
0
9876
Long Version
4294967295
987
0
9876
-------
Timings
-------
Timing atodw version
562
Timing short version
438
Timing long version
391
Press any key to continue ...
Alex
KeepingRealBusy,
The reason this pair of algos was posted in the Lab was to test them. While you have perfectly valid questions about some of the issues raised, this is not the place for them, or for the other unrelated questions in the previous post.
RE: the design choice of 2 algos, it is simple enough to provide the range of difference so that someone who wants to do an occasional conversion can use a shorter version, while someone who wants to stream data of this type can use a faster but larger one. It's the choice of the user being catered for here, not a theory about library design.
Excuse me,
Quote
The Laboratory
Algorithm and code design research laboratory. This is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful. Feel free to explain the optimisation methods used so that everyone can get a feel for the code design.
By your own words.
If this is not the place, then where is the place?
Dave.
Try the Campus. If you have an algo, post it in here.
Hutch
It seems that this proc has a bottleneck on the P4.
By my test, if you change this:
movzx eax, BYTE PTR [edx]
test eax, eax
jz quit
lea ecx, [ecx+ecx*4]
lea ecx, [eax+ecx*2-48]
movzx eax, BYTE PTR [edx+1]
test eax, eax
jz quit
to:
movzx eax, BYTE PTR [edx]
add eax,-48 ; <--- this
js quit ; <--- this
lea ecx, [ecx+ecx*4]
lea ecx, [eax+ecx*2] ; <--- this; if you remove the subtraction from the LEA, it works faster
movzx eax, BYTE PTR [edx+1]
add eax,-48 ; <--- this
js quit ; <--- this
..... etc .....
On my system the speed increases by ~14%.
Try this; maybe it has a positive effect on other hardware too.
Alex
Pentium Dual 1.8ghz
Timing atodw version
547
Timing short version
406
Timing long version
187
sorry, but those numbers mean nothing without code to go with them :P
Dave,
they are just the results of the attachment in the first post.
ahhhhhhhhhhh
good mornin Hutch :bg
Your Timing short version
265
My Timing short version
140
align algn
atodL proc String:DWORD
mov edx, [esp+1*4]
movzx eax, byte ptr [edx]
add edx, 1
test eax, eax
jz quit
movzx ecx, byte ptr [edx]
sub eax, 30h
test ecx, ecx
jz quit
@@:
inc edx
lea eax, [eax+eax*4]
cmp byte ptr [edx], 0
lea eax, [ecx+eax*2-30h]
movzx ecx, byte ptr [edx]
jnz @b
quit:
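; with no stack frame (the full source uses OPTION PROLOGUE:NONE, as
; shown later in the thread), [esp] holds the return address: pop it
; into ecx, pop the String argument to balance the stack, and jmp
; back - the same effect as "ret 4"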
pop ecx
pop edx
jmp ecx
atodL endp
What I like most about Lingo's code is that it can always be improved :green2
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
24 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ
24 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ
Code size:
44 atodL Lingo
24 atodJJ Jochen
:bg
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
17 cycles for atodL: Lingo's bloatware
16 cycles for atodJJ: improved by JJ
17 cycles for atodL: Lingo's bloatware
16 cycles for atodJJ: improved by JJ
Code size:
44 atodL Lingo
24 atodJJ Jochen
--- ok ---
AMD Athlon(tm) 64 Processor 3000+ (SSE3)
24 cycles for atodL: Lingo's bloatware
26 cycles for atodJJ: improved by JJ
22 cycles for atodL: Lingo's bloatware
26 cycles for atodJJ: improved by JJ
Code size:
44 atodL Lingo
24 atodJJ Jochen
--- ok ---
2-4 cycles slower, nice improvement :tdown
I put the new algos into a real time test bed and the results are very different. I played with this one for a while to ensure that they were reasonable comparisons, and with the short algos Lingo's is easily the fastest, JJ's is the smallest, and the timings all wander depending on code placement and algorithm alignment.
641 atou
250 atodL
593 atodJJ
235 atou_ex
640 atou
250 atodL
594 atodJJ
188 atou_ex
640 atou
250 atodL
594 atodJJ
203 atou_ex
641 atou
250 atodL
594 atodJJ
234 atou_ex
640 ms average atou
250 ms average atodL
593 ms average atodJJ
215 ms average atou_ex
Press any key to continue ...
"What I like most about Lingo's code is that it can always be improved
Code:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
24 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ
24 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ"
My algo was written for Hutch's test program, which is why you see before my original code: align algn
- If we try my original algo vs the "improved by JJ" version in Hutch's test program,
the result will be Timing short version 140 vs Timing short version 359 (with my CPU).
The thief is always a liar, but ours doesn't understand the algo's details either (not enough A. Fog)... Why?
- In his test program he transferred my algo with align 16 ONLY! Hence my loop label is not aligned properly....
If we have align 16 PLUS db 8 dup(0) after it, the results will be different.
- In my test versions I have similar short code, but it is slower (the "bigger is faster" law). Why?
One reason is the first (empty) iteration: when eax=0, the JJ "improved" algo spends time for nothing (we execute some empty instructions like lea eax, [eax+eax*4] and lea eax, [ecx+eax*2-30h]).
So, "Lingo's code can always be improved".
I agree, but not from everyone... :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
17 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ
17 cycles for atodL: Lingo's bloatware
23 cycles for atodJJ: improved by JJ
Code size:
52 atodL Lingo
24 atodJJ Jochen
--- ok ---
Quote from: lingo on August 07, 2010, 02:29:54 PM
If we have align 16 PLUS db 8 dup(0) after it, the results will be diferent.
My most sincere apologies, young friend! Indeed, with only 8 bytes more, you gain half a cycle on my modern CPU :U
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
24 cycles for atodL: Lingo's superbloatware
24 cycles for atodJJ: improved by JJ
24 cycles for atodL: Lingo's superbloatware
23 cycles for atodJJ: improved by JJ
24 cycles for atodL: Lingo's superbloatware
24 cycles for atodJJ: improved by JJ
24 cycles for atodL: Lingo's superbloatware
23 cycles for atodJJ: improved by JJ
Code size:
52 atodL Lingo
24 atodJJ Jochen
I just mercilessly hacked JJ's test bed to try out the changes to atou. I tried an unroll by 2, which brought its size up to a bit bigger than Lingo's algo but still within a reasonable size for a small general purpose algo, and it's now averaging faster than the long version I have been playing with.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
17 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
17 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
Code size:
44 atodL Lingo
52 atou Hutch
--- ok ---
:toothy
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
Code size:
144 atodL Lingo
52 atou Hutch
--- ok ---
Lingo,
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
Code size:
144 atodL Lingo
52 atou Hutch
--- ok ---
Have you got a version of that algo that is unrolled by 2? I have done the same with JJ's but I don't know enough about yours to unroll it properly. I tried this on the last one but it did not get any faster.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
REPEAT ncnt
padd
ENDM
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 8 dup(0)
atodL proc String:DWORD
mov edx, [esp+1*4]
movzx eax, byte ptr [edx]
add edx, 1
test eax, eax
jz quit
movzx ecx, byte ptr [edx]
sub eax, 30h
test ecx, ecx
jz quit
@@:
; inc edx
; lea eax, [eax+eax*4]
; cmp byte ptr [edx], 0
; lea eax, [ecx+eax*2-30h]
; movzx ecx, byte ptr [edx]
; jz quit
inc edx
lea eax, [eax+eax*4]
cmp byte ptr [edx], 0
lea eax, [ecx+eax*2-30h]
movzx ecx, byte ptr [edx]
jnz @b
quit:
pop ecx
pop edx
jmp ecx
atodL endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
a2b3
AMD Athlon(tm) 4 Processor (SSE1)
31 cycles for atodL: Lingo's replacement
37 cycles for atou: improved by hutch
Kinda sucks that it's 3 times as big.
Queue
Here is the latest test bed I have been using for these algos. I just added Lingo's unroll by 4 to it, so I have JJ's and mine as short versions, and Lingo's and my long one, all timed. I added a suggestion of Dave's to set the affinity mask to stabilise the timings, and this seems to have helped get all of the times closer to their minimum.
187 atou
203 atodL
235 atodJJ
218 atou_ex
188 atou
203 atodL
219 atodJJ
219 atou_ex
187 atou
203 atodL
219 atodJJ
203 atou_ex
188 atou
203 atodL
218 atodJJ
250 atou_ex
187 ms average atou
203 ms average atodL
222 ms average atodJJ
222 ms average atou_ex
Press any key to continue ...
Queue,
"Kinda sucks that it's 3 times as big."
144 atodL Lingo
144 is the total length of the new and old versions, because I was too lazy to delete the old one, sorry. :wink
It is crazy to write optimized code for different testing programs when, for one algo, they compute different results.
So, test program writers should obey some standard rules, and/or this site must have just one "best"
and mandatory testing program. Otherwise I have to write one algo for Hutch's testing program, another for
A. Fog's testing program, the next for MichaelW's testing program, the next for the jj testing program, etc... :lol
:bg
It's crazy to write optimised code that only works in one test bed and not in another. I test algos in real time when I need to, as algos run in real time when they are being used.
Show us your test program, Lingo. :bg
"that only works in one test bed and not the another"
this is a nonsense.
We optimize algos according the rules in the Intel Optimization manual rather than "rules" of your testing program.
Where are your "rules" because I can't see them? :lol
"I test algos in real time"
It is just your opinion, but without rules...
What about other's testing programs?
For example: try to optimize an algo to be fastest, with the same times, in both your and the jj testing programs. :lol
:bg
There is fantasy involved here; you optimise code to the CLOCK. It either works or it does not, and that is what you can get from the manual.
Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not producing the faster algorithm.
If a piece of code is only a superstar in the test bed and not in general purpose use, then it's useless.
I think that lingo doesn't understand the principles of profiling and optimization.
The only profiling that matters is profiling of real programs. Real production code. The purpose of testbeds (plural) is just to get a general idea of the issues you will face in those aforementioned real production programs. If you restrict yourself to a single testbed then you are just specializing, not using testbeds for the purpose that they are supposed to serve.
it is hard to optimize code while running on a specific processor
in fact, i am P4-bound, so if i speed up a routine on my machine,
it is not likely to be optimal on any of the newer processors :P
as time goes by, it hardly makes sense to use a P4 machine for writing code, although it's nice to test it on one :bg
hmmm - maybe i can use that as an excuse, of some sort - lol
"Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not produce the faster algorithm.
If a piece of code is only a superstar in the test bed and not in general purpose, then its useless."
All this is nonsense again. I will try to explain to you the latest case with YOUR SUPERSTAR ALGO CODE (I like it).
YOUR ALGO CODE from YOUR a2b2.zip file, measured with the jj testing program which YOU use inside it, finished in 12 clocks. It is unrolled just 2 times.
If you unroll it 4 times it will finish in 12 clocks again in the jj testing program. Because you know the difference between your testing program and the jj testing program (TP), you stopped using the jj TP and started using your test program bm4.asm. In it you unrolled YOUR ALGO CODE 4 times and it became the fastest (but just in your testing program).
Of course in it you included others' algos which are just 2 times unrolled, but that is another story (bad gaming).
So, YOUR ALGO CODE unrolled 4 times runs differently in your test program and in the jj test program.
If YOUR ALGO CODE is a superstar, try it in the jj test program against my code to see the results. You can unroll it many times until your face becomes red, but your end result will be 12 clocks or more... so, put your code in the recycle bin, because according to you: :lol
"If a piece of code is only a superstar in the test bed and not in general purpose, then its useless."
"There is fantasy involved here"
Everyone can investigate this to see who is wrong... :lol
I will repeat my offer from my previous post again and again, because I still have no answer:
For example: try to optimize an algo to be fastest, with the same times, in both your and the jj testing programs. :lol
About "gaming"...
You are the "gamer" because you use your testing program to have better results. Try to get them in the jj test program... :lol
Rockoon,
"I think that lingo doesn't understand a principle of profiling and optimization."
What you think, and why, is not important. It would be better to post your code rather than blah, blah, blah...
Dave,
We are not talking about different processors. We are talking about different testing programs with different testing results for the same algo and CPU.
You can test them with Hutch's superstar code... :lol
Or maybe I have to write "My algo is fastest with the JJ test program, Hutch's algo is fastest with his test program"... :lol
You are failing to realize the immense mistake in your thinking. There is no such thing as the fastest code.
Testbeds don't tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed. You still cannot determine if you have found the fastest code even in terms of that testbed, let alone a real production product.
Testbeds are not marketable products. Users do not clamor for testbeds. Nobody is saying "Gee, I wish I had JJ's testbed" or "Gee, I wish I had Hutch's testbed," because testbeds are USELESS things that serve no purpose other than investigation by people writing code.
What production products is this function a bottleneck in? HMM? HMMMMM??? Once you determine that, THEN you have some RELEVANT testbeds: THE PRODUCTION PRODUCTS.
No code again, just blah, blah, blah... :(
Do you think that asking for a code handout makes you right?
It doesn't. It makes you just as wrong as before you asked.
"It makes you just as wrong as before you asked."
Why am I wrong or guilty? Why? The SUPERSTAR ALGO CODE is from Hutch,
the superstar "real time" testing program is from Hutch too, and the other testing programs are from JJ or from MichaelW.
It is like measuring voltage with two voltmeters: one reads 24 V, the other 33 V, from the same source. Which one is wrong?
Am I guilty because I discovered the different results between these testing programs with the SUPERSTAR ALGO CODE from hutch, and asked about it? Or should I close my eyes and say that everything is always super and there are no differences?
"You are failing to realize the immense mistake in your thinking."
I think about how to improve some algos, rather than about how to use the superstar "real time" testing program or other tools "properly". So, when I see a problem with the tools, I just ask their creators about it.
"Do you think that asking for a code..."
If you have some optimized code, my advice is to test it with these two testing programs, and after that we will continue our very interesting conversation. :(
How are you wrong?
Let's see...
"It is crazy to write optimized code for different testing programs when, for one algo, they compute different results."
That's wrong. It's not "crazy" to write multiple versions of an algorithm, all "better" in different cases. It's important that different cases have different optimization criteria.
"So, test program writers should obey some standard rules, and/or this site must have just one "best" and mandatory testing program."
That's wrong. There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm.
"Otherwise I have to write one algo for Hutch's testing program, another for A. Fog's testing program, the next for MichaelW's testing program, the next for the jj testing program, etc..."
That's wrong. You don't "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances.
Quote from: Rockoon on August 08, 2010, 06:00:50 PMTestbeds dont tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed.
Rockoon,
The algos posted here for teasing Lingo are not the best example. Their results differ by CPU because they are
the fastest code for a particular CPU: 20 cycles, not a single one less, because that CPU has only this limited set of instructions and this limited set of possible combinations. In this sense, yes, "there is no fastest code" that fits all CPUs, but there is one for each CPU. And of course the algos can behave differently in real life apps.
However, the majority of threads in the Lab start with the idea "hey, we could speed up lstrwhatever a little bit". And wow, you get a factor of 10 with some simple tricks from Agner's or Lingo's or NightWare's code kitchen. On all CPUs, in all real life apps. That is where testbeds have a positive role. It turns absurd when later on we start squeezing out the last cycle and discover that we are running different hardware. That is less useful, but it is the fun part :bg
Hutch,
add this code to your test:
align 16
Axa2l proc STDCALL lpszStr:DWORD
mov edx,[esp+4]
xor eax,eax
xor ecx,ecx ; try to comment this, and add two nops before proc
movzx ecx,byte ptr [edx]
add ecx,-30h
js @done
@mainloop:
lea eax,[eax+eax*4]
lea eax,[eax*2+ecx]
movzx ecx,byte ptr [edx+1]
add edx,1 ; try to change this to "inc edx" (to put main loop in 16bytes long)
add ecx,-30h
jns @mainloop
@done:
ret 4
Axa2l endp
I wrote some comments about what is useful for playing with.
Alex
Let's see... again blah, blah, blah... without any practical value, without any code examples and/or sources for your "knowledge".
"It's not "crazy" to write multiple versions of an algorithm, all "better" in different cases."
and
"It's important that different cases have different optimization criteria."
and
"It's crazy to write optimised code that only works in one test bed and not in another" by Hutch
For me, all OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.
If you or Hutch know more sources of optimization criteria, please let us know, and write CODE EXAMPLES to show how to do that...
"There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm."
Again: they are in the INTEL (AMD) Optimization Reference Manual, plus some tips in the help file of the VTune software performance analyzer.
"You don't "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances."
Try to do this with some CODE EXAMPLES to see how to do it in practice...
And please stop filling this thread with empty words, blah, blah, blah, and killing our time reading them... Your code examples or nothing... :(
You can start with your version of the atou algo too... or teach Hutch how to make his atou faster with the JJ testing program... :lol
Quote from: lingo on August 08, 2010, 11:15:52 PM
For me All OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.
No, not all of them. Tell me, what do those optimization manuals say about the wildly differing performance of your routine in the other thread, which is either great or really crappy depending on what other code is also included in the program?
Quote from: lingo on August 08, 2010, 11:15:52 PM
Again: they are in the INTEL (AMD) Optimization Reference Manual and some tips in the help file of VTune soft performance analyzer
You requested a standard test harness. I was writing about a standard test harness and the flaw in the idea. What you are writing here has nothing to do with that. Why don't you address the topic you brought up?
You seem to think that usage patterns don't matter.
Lingo,
I know you can do better than this. I have had my misgivings about the popular test bed that has been used recently, due to its granularity with very low result counts, so I have done the obvious: I have made a test bed that times a number of algorithms in real time, runs longer, and can be made to run longer again.
Now your theory is flawed in that by your own admission you want to write a different algo for each different test bed rather than trying to make the algo faster over the range of conditions that it is designed to operate under. Specialisation on one processor and one test bed is a mistake as they are both variables, different hardware produces different results and different test beds serve different purposes.
The common test bed is based off Michael's timing macros and while the technique is well suited for testing short sections of code that are almost impossible to test in other ways, the technique becomes less useful on complete algorithms that have a very low timing count.
Long ago I learnt that you tailor a test bed to the task you are performing, and the task changes from one algorithm to another. If I am testing an algo that needs a large piece of data, I build a test bed that has data on this scale; I have commonly tested algos on a gigabyte of data if that form of streaming is where it will be used. At the other end, if the algo is very short and its "take off" time is the critical factor, I write a high loop count, short data source test bed to test that feature.
The current test bed for these algos is adjustable in real time, uses variable length data input, variable duration based off the loop count, and variable inter-algo padding to test code isolation; it is controlled to run on a single core, each timing is real time isolated using SleepEx(), and the instruction queue is flushed with a serialised instruction before each timing.
The criterion for an algorithm is not whether it uses instructions from either the Intel or AMD optimisation manual; it is how fast it is under the conditions where it will be used, and this is a variable depending on the processor it is run on.
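In outline, one timing pass in such a test bed can look like this (a minimal sketch of the description above; the loop count, data name and algo name are placeholders, not the actual bm source):

    invoke SleepEx, 100, 0      ; real time isolation: start on a fresh time slice
    xor eax, eax
    cpuid                       ; serialised instruction flushes the queue
                                ; (cpuid also clobbers ebx/ecx/edx)
    invoke GetTickCount
    push eax                    ; start time in milliseconds
    mov esi, 10000000           ; loop count (placeholder)
  tmlp:
    invoke atou, ADDR src       ; algo under test, src = test string
    sub esi, 1
    jnz tmlp
    invoke GetTickCount
    pop ecx
    sub eax, ecx                ; eax = elapsed milliseconds for this algo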
Alex,
I added your algo to the test bed. As a short algo it was slow and messed up the timings of the other algos, but I unrolled it by 2 and its time came down to near the rest. Even with the layout of this testbed, algo placement affects the timing of other algos, so I have ordered them to try to get the best timings for each algo, but note that there are fluctuations in the timing of all of them.
172 atou
203 atodL
203 Axa2l
203 atodJJ
219 atou_ex
203 atou
203 atodL
235 Axa2l
218 atodJJ
235 atou_ex
203 atou
203 atodL
203 Axa2l
203 atodJJ
250 atou_ex
219 atou
203 atodL
281 Axa2l
204 atodJJ
234 atou_ex
199 ms average atou
203 ms average atodL
207 ms average atodJJ
234 ms average atou_ex
230 ms average Axa2l
Press any key to continue ...
Core MACRO corenum
mov eax, 1
.if eax == 1
Always core 1 :bg
that selects core 0 :P
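For the record, such a macro only needs the mask shifted by the core number. A sketch (GetCurrentThread and SetThreadAffinityMask are the real Win32 calls; the macro name and body here are an illustration, not the testbed's actual code):

SetCore MACRO corenum
    invoke GetCurrentThread
    invoke SetThreadAffinityMask, eax, 1 SHL corenum
    ;; bit n of the mask selects core n, so a mask of 1 pins to core 0
    ;; and 1 SHL 1 = 2 pins to core 1
ENDM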
"I know you can do better than this"
'Or maybe I have to write "My algo is fastest with the JJ test program, Hutch's algo is fastest with his test program"'
No, it should be "My algo is fastest with the JJ testing program and with Hutch's testing program" :lol
You are like Rockoon, who has a big need for empty speech... but...
Since I have a Master's Degree in Electrical Engineering, to argue with me you should, theoretically, have at least a Ph.D.... :lol
So, try to concentrate on finding a faster solution for your atou - in other words, catch me if you can... :lol
New results from your test program:
C:\7>bm6
171 atou
156 atodL ->Lingo
187 Axa2l
203 atodJJ
234 atou_ex
172 atou
156 atodL ->Lingo
187 Axa2l
187 atodJJ
172 atou_ex
171 atou
156 atodL ->Lingo
187 Axa2l
203 atodJJ
187 atou_ex
188 atou
156 atodL ->Lingo
187 Axa2l
187 atodJJ
203 atou_ex
175 ms average atou
156 ms average atodL ->Lingo
195 ms average atodJJ
199 ms average atou_ex
187 ms average Axa2l
Press any key to continue ...
:bg
No great need, I am pleased you are catching up instead of talking.
Timing is on a 3 gig Core2 quad with 1333 memory.
172 atou
172 atodL
203 Axa2l
203 atodJJ
172 atou_ex
172 atou
171 atodL
204 Axa2l
203 atodJJ
218 atou_ex
172 atou
172 atodL
203 Axa2l
219 atodJJ
219 atou_ex
172 atou
172 atodL
203 Axa2l
203 atodJJ
187 atou_ex
172 ms average atou
171 ms average atodL
207 ms average atodJJ
199 ms average atou_ex
203 ms average Axa2l
Press any key to continue ...
"I am pleased you are catching up"
But I'm not, due to the big differences between the values for your algos atou and atou_ex in your last and previous results... Somebody else please check...
prev. last
172 172 atou
172 atodL
203 Axa2l
203 atodJJ
219 172 atou_ex
203 172 atou
171 atodL
204 Axa2l
203 atodJJ
235 218 atou_ex
203 172 atou
172 atodL
203 Axa2l
219 atodJJ
250 219 atou_ex
219 172 atou
172 atodL
203 Axa2l
203 atodJJ
234 187 atou_ex
199 172 ms average atou
171 ms average atodL
207 ms average atodJJ
234 199 ms average atou_ex
203 ms average Axa2l
Press any key to continue ...
Quote from: jj2007 on August 08, 2010, 09:22:20 PM
In this sense, yes "there is no fastest code" that fits to all CPUs but there is one for each CPU. And of course the algos can behave different in real life apps.
The term TANSTATFC, "There Ain't No Such Thing As The Fastest Code", has nothing to do with differing architectures. It's the acceptance that there is a limitless number of ways to solve a problem and that you will never be able to try them all, so the chance that you have found the fastest way is vanishingly small. Further, you accept that "the algorithm" is the entire program, not just some subset of it.
The only reason I tackled this problem is because I knew that your claim of a 20 cycle limit was horseshit.
align 16
atou_rock proc String:DWORD
mov ecx, [esp + 4]
push ebx
; ---
movzx eax, byte ptr [ecx + 0]
test eax, eax
jz @@done
sub eax, 48
movzx edx, byte ptr [ecx + 1]
test edx, edx
jz @@done
movzx ebx, byte ptr [ecx + 2]
lea eax, [4*eax + eax]
test ebx, ebx
lea eax, [2*eax + edx - 48]
jz @@done
movzx edx, byte ptr [ecx + 3]
lea eax, [4*eax + eax]
test edx, edx
lea eax, [2*eax + ebx - 48]
jz @@done
movzx ebx, byte ptr [ecx + 4]
lea eax, [4*eax + eax]
test ebx, ebx
lea eax, [2*eax + edx - 48]
jz @@done
movzx edx, byte ptr [ecx + 5]
lea eax, [4*eax + eax]
test edx, edx
lea eax, [2*eax + ebx - 48]
jz @@done
movzx ebx, byte ptr [ecx + 6]
lea eax, [4*eax + eax]
test ebx, ebx
lea eax, [2*eax + edx - 48]
jz @@done
movzx edx, byte ptr [ecx + 7]
lea eax, [4*eax + eax]
test edx, edx
lea eax, [2*eax + ebx - 48]
jz @@done
movzx ebx, byte ptr [ecx + 8]
lea eax, [4*eax + eax]
test ebx, ebx
lea eax, [2*eax + edx - 48]
jz @@done
movzx edx, byte ptr [ecx + 9]
lea eax, [4*eax + eax]
test edx, edx
lea eax, [2*eax + ebx - 48]
jz @@done
lea eax, [4*eax + eax]
lea eax, [2*eax + edx - 48]
@@done:
pop ebx
ret 4
atou_rock endp
280 ms average atou
499 ms average atodL
265 ms average atodJJ
265 ms average atou_ex
292 ms average Axa2l
218 ms average atou_rock
Phenom II x6 1055T
Lingo's previous versions performed much better (within a point or two of atodJJ)
Edited: Made a small change, and included project files.
The short version "atou" has been faster than the longer version ever since I unrolled it by 2. What is the big deal ?
I unrolled both JJs and Alex's to do a fair comparison, I left it up to you to do an unroll by 2 on your own.
Just goes to show it's all a waste of time wankfest :bdg
245 ms average atou
218 ms average atodL
257 ms average atodJJ
277 ms average atou_ex
257 ms average Axa2l
276 ms average atou_rock
Q6600 2.4GHz
Here is the good news and the bad news. I added a code loop between each test to load the processor core between tests; that slowed down my atou algo on the Core2 quad and made Lingo's faster. The bad news is it's the only Core that it's faster on. Here are the 5 boxes I have handy. The unrolled algo clearly has the legs on the i7, Lingo has it on the Core2 quad, Alex has it on the Prescott P4, Lingo has it on the Northwood by a tiny amount, and the unrolled version has it on the ancient Celeron. Kruel ain't it.
2.8 gig i7 quad
171 ms average atou
171 ms average atodL
202 ms average atodJJ
156 ms average atou_ex *
191 ms average Axa2l
3 gig Core2 Quad
191 ms average atou
171 ms average atodL *
227 ms average atodJJ
191 ms average atou_ex
215 ms average Axa2l
3.8 gig Prescott P4
410 ms average atou
386 ms average atodL
511 ms average atodJJ
363 ms average atou_ex
355 ms average Axa2l *
2.8 gig Northwood P4
633 ms average atou
566 ms average atodL *
719 ms average atodJJ
570 ms average atou_ex
593 ms average Axa2l
1.2 gig Celeron
1071 ms average atou
1171 ms average atodL
966 ms average atodJJ
871 ms average atou_ex *
1041 ms average Axa2l
Sinsi,
Digital manual self delusion may be misconstrued as something else. :P
well - run that on a couple fairly new AMD machines
then, add all the times together and see who is fastest overall - lol
another approach might be to load a different library depending on the processor being run :bg
it seems like this is where the fastest apps will be
with 4 OS's, we need the equiv of 28 test machines :red
then, we can develop the 7 libraries
Thanks, sinsi :U
"Here is the good news and the bad news"
Thanks, but your info is old because we have a new algo... :lol
Rockoon,
I added your algo to the test bed and results are:
C:\7>bm7
172 atou
156 atodL
234 Axa2l
202 atodJJ
188 atou_ex
218 atou_rock
172 atou
156 atodL
187 Axa2l
187 atodJJ
203 atou_ex
218 atou_rock
172 atou
156 atodL
203 Axa2l
202 atodJJ
219 atou_ex
218 atou_rock
172 atou
156 atodL
234 Axa2l
187 atodJJ
172 atou_ex
218 atou_rock
172 ms average atou
156 ms average atodL
194 ms average atodJJ
195 ms average atou_ex
214 ms average Axa2l
218 ms average atou_rock
Press any key to continue ...
with JJ testing program:
C:\7>asc2bin3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
14 cycles for atou_rock: by Rockoon
11 cycles for atodL: Lingo's replacement
12 cycles for atou: improved by hutch
14 cycles for atou_rock: by Rockoon
Code size:
100 atodL Lingo
52 atou Hutch
158 atou_rock Rockoon
--- ok ---
Lingo,
I don't get any timing difference on the quad but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine.
This is fun!
277 ms average atou
218 ms average atodL
265 ms average atodJJ
288 ms average atou_ex
273 ms average Axa2l
288 ms average atou_rock
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
11 cycles for atodL: Lingo's replacement
10 cycles for atou: improved by hutch
10 cycles for atou_rock: by Rockoon
11 cycles for atodL: Lingo's replacement
10 cycles for atou: improved by hutch
10 cycles for atou_rock: by Rockoon
Here is more bad news for you.
171 atodL
187 Axa2l
203 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
171 atou_rock
171 ms average atou
171 ms average atodL
191 ms average atodJJ
156 ms average atou_ex
187 ms average Axa2l
171 ms average atou_rock
Press any key to continue ...
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
9 cycles for atodL: Lingo's replacement
8 cycles for atou: improved by hutch
8 cycles for atou_rock: by Rockoon
9 cycles for atodL: Lingo's replacement
8 cycles for atou: improved by hutch
8 cycles for atou_rock: by Rockoon
Code size:
100 atodL Lingo
210 atou Hutch
158 atou_rock Rockoon
--- ok ---
PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.
Quote from: hutch-- on August 09, 2010, 07:33:49 AM
PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.
The labels look OK, but a bit of additional code was inserted.
The attachment makes testing a bit more straightforward. The Smooth macro is for discussion on code cache and alignment issues. It seems to render fairly stable P4 cycle counts.
Usage:
testcorrect atou
testme atou
codesize atou
The codesize macro requires the presence of two labels:
Quote
MyAlgo_s:
MyAlgo proc
...
MyAlgo endp
MyAlgo_endp:
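A sketch of how such a macro can get the size at assembly time from those two labels (this is a guess at the mechanism, not jj's actual macro; print and str$ are the masm32 macros):

codesize MACRO algo
    ;; the difference between two labels in the same segment is a
    ;; constant, so the byte count costs nothing at run time
    print str$(algo&_endp - algo&_s), " bytes for &algo&", 13, 10
ENDM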
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
797 cycles for 10*Lingo's replacement
445 cycles for 10*atou
442 cycles for 10*atou_rock
858 cycles for 10*atodJJ
784 cycles for 10*Lingo's replacement
443 cycles for 10*atou
444 cycles for 10*atou_rock
829 cycles for 10*atodJJ
Code size:
100 bytes for atodL
158 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
"but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine."
and
"PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long."
Hutch,
I apologize. The error is mine. I pasted Rockoon's algo two times... and the 2nd time over your algo. Sorry...
Corrected...
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
148 cycles for 10*Lingo's replacement
185 cycles for 10*atou
176 cycles for 10*atou_rock
259 cycles for 10*atodJJ
149 cycles for 10*Lingo's replacement
187 cycles for 10*atou
175 cycles for 10*atou_rock
259 cycles for 10*atodJJ
Code size:
180 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
--- ok ---
Hi,
From jj2007's code in Reply #69.
Regards,
Steve
pre-P4 (SSE1)
453 cycles for 10*Lingo's replacement
314 cycles for 10*atou
313 cycles for 10*atou_rock
492 cycles for 10*atodJJ
383 cycles for 10*Lingo's replacement
313 cycles for 10*atou
314 cycles for 10*atou_rock
490 cycles for 10*atodJJ
Code size:
100 bytes for atodL
158 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
--- ok ---
Quote from: lingo on August 09, 2010, 02:03:58 PM
Please re-download the last asc2bin_testbed1.zip file and test it again. Thanks :wink
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
697 cycles for 10*Lingo's replacement
474 cycles for 10*atou
444 cycles for 10*atou_rock
845 cycles for 10*atodJJ
684 cycles for 10*Lingo's replacement
494 cycles for 10*atou
448 cycles for 10*atou_rock
842 cycles for 10*atodJJ
Code size:
171 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
And one more - but note Lingo's code size, the version has changed again ::)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
268 cycles for 10*Lingo's replacement
295 cycles for 10*atou
286 cycles for 10*atou_rock
335 cycles for 10*atodJJ
266 cycles for 10*Lingo's replacement
295 cycles for 10*atou
286 cycles for 10*atou_rock
315 cycles for 10*atodJJ
Code size:
180 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
Please re-download the last asc2bin_testbed1.zip file and test it again. Thanks :wink
I swapped the short atou to the longer version, unrolled JJ's algo by 8, changed the algo order, and got this result on the i7.
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
99 cycles for 10*atou
83 cycles for 10*atou_rock
113 cycles for 10*Lingo's replacement
104 cycles for 10*atodJJ
99 cycles for 10*atou
84 cycles for 10*atou_rock
82 cycles for 10*Lingo's replacement
103 cycles for 10*atodJJ
Code size:
210 bytes for atou
158 bytes for atou_rock
171 bytes for atodL
129 bytes for atodJJ
--- ok ---
"I swapped the short atou to the longer version, unrolled JJs algo by 8 and changed the algo order and got this result on the i7."
Thank you Hutch, but you used my old file. Sorry...
Real code size of my algo is: 180 bytes for atodL
So, please, re-download last asc2bin_testbed1.zip file and test it again. Thanks. :wink
I had used the new version but in any case here is another benchmark that includes your new unrolled version and Rockoon's unrolled version. Interestingly enough the i7 runs all of the full unrolls in the same timing. Your algo is faster on the 2 P4s but a bit slower on the Core2 quad.
i7 quad.
=======
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock
171 atou
156 atodL
188 Axa2l
202 atodJJ
156 atou_ex
156 atou_rock
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock
171 ms average atou
156 ms average atodL
202 ms average atodJJ
156 ms average atou_ex
187 ms average Axa2l
156 ms average atou_rock
Press any key to continue ...
Core2 Quad.
==========
188 atou
203 atodL
203 Axa2l
219 atodJJ
218 atou_ex
188 atou_rock
187 atou
203 atodL
204 Axa2l
203 atodJJ
234 atou_ex
203 atou_rock
188 atou
203 atodL
219 Axa2l
218 atodJJ
203 atou_ex
219 atou_rock
188 atou
203 atodL
203 Axa2l
219 atodJJ
203 atou_ex
187 atou_rock
187 ms average atou
203 ms average atodL
214 ms average atodJJ
214 ms average atou_ex
207 ms average Axa2l
199 ms average atou_rock
Press any key to continue ...
Prescott P4.
===========
359 atou
297 atodL
344 Axa2l
500 atodJJ
359 atou_ex
329 atou_rock
375 atou
296 atodL
344 Axa2l
516 atodJJ
359 atou_ex
328 atou_rock
375 atou
297 atodL
344 Axa2l
484 atodJJ
360 atou_ex
328 atou_rock
375 atou
297 atodL
328 Axa2l
500 atodJJ
359 atou_ex
328 atou_rock
371 ms average atou
296 ms average atodL
500 ms average atodJJ
359 ms average atou_ex
340 ms average Axa2l
328 ms average atou_rock
Press any key to continue ...
Northwood P4.
============
625 atou
438 atodL
484 Axa2l
703 atodJJ
563 atou_ex
484 atou_rock
641 atou
453 atodL
516 Axa2l
703 atodJJ
562 atou_ex
469 atou_rock
641 atou
437 atodL
563 Axa2l
672 atodJJ
562 atou_ex
484 atou_rock
563 atou
453 atodL
531 Axa2l
703 atodJJ
563 atou_ex
484 atou_rock
617 ms average atou
445 ms average atodL
695 ms average atodJJ
562 ms average atou_ex
523 ms average Axa2l
480 ms average atou_rock
Press any key to continue ...
Antique Celeron.
===============
1102 atou
1092 atodL
1042 Axa2l
972 atodJJ
851 atou_ex
791 atou_rock
1112 atou
1092 atodL
1062 Axa2l
952 atodJJ
871 atou_ex
781 atou_rock
1092 atou
1122 atodL
1042 Axa2l
972 atodJJ
851 atou_ex
791 atou_rock
1082 atou
1112 atodL
1041 Axa2l
982 atodJJ
842 atou_ex
791 atou_rock
1097 ms average atou
1104 ms average atodL
969 ms average atodJJ
853 ms average atou_ex
1046 ms average Axa2l
788 ms average atou_rock
Press any key to continue ...
"I had used the new version but"
Thank you Hutch, but you used my new version of the algo, the one meant for the JJ testing program, in your testing program, and that is wrong. Sorry...
I have a second version for your testing program; it is in my last bm7.zip file.
Note: these are just for Intel users.
Soon, I will post two similar algos (bmx.exe and asc2binx) for the AMD users.
"I am pleased you are catching up instead of talking."
Are you pleased now? :lol
Hutch, here is a new version of my proc.
It seems that unrolling by 4 is just as good (on my machine).
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
repeat 12
ret
endm
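; (the 12 ret bytes emitted after the align 16 shift the proc entry point,
; the same entry-padding trick as Lingo's db 8 dup(0) earlier in the thread)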
Axa2l proc lpszStr:DWORD
;int 3
mov edx,[esp+4]
mov eax,0
movzx ecx,byte ptr [edx]
add ecx,-30h
js @done
add edx,1
@mainloop:
lea eax,[eax+eax*4]
lea eax,[eax*2+ecx]
movzx ecx,byte ptr [edx]
add ecx,-30h
js @done
lea eax,[eax+eax*4]
lea eax,[eax*2+ecx]
movzx ecx,byte ptr [edx+1]
add ecx,-30h
js @done
lea eax,[eax+eax*4]
lea eax,[eax*2+ecx]
movzx ecx,byte ptr [edx+2]
add ecx,-30h
js @done
lea eax,[eax+eax*4]
lea eax,[eax*2+ecx]
movzx ecx,byte ptr [edx+3]
add edx,4
add ecx,-30h
jns @mainloop
@done:
ret 4
Axa2l endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Alex
I reordered something in my algo and now it works well enough with AMD too. Please try it so we can get the final results. Thank you.
Note: Because Hutch's testing program is very sensitive to memory usage, please close all other programs and run the test a minimum of 3 times. Take the best values. Thanks. :toothy
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
C:\8>bmu6
172 atou
140 atodL
219 Axa2l
203 atodJJ
171 atou_ex
172 atou_rock
171 atou
141 atodL
218 Axa2l
203 atodJJ
172 atou_ex
156 atou_rock
171 atou
141 atodL
202 Axa2l
188 atodJJ
171 atou_ex
172 atou_rock
234 atou
140 atodL
219 Axa2l
202 atodJJ
188 atou_ex
156 atou_rock
187 ms average atou
140 ms average atodL
199 ms average atodJJ
175 ms average atou_ex
214 ms average Axa2l
164 ms average atou_rock
Press any key to continue ...
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
C:\8>bmu6
703 atou
437 atodL
812 Axa2l
672 atodJJ
532 atou_ex
484 atou_rock
703 atou
438 atodL
812 Axa2l
672 atodJJ
531 atou_ex
485 atou_rock
703 atou
437 atodL
813 Axa2l
672 atodJJ
531 atou_ex
484 atou_rock
703 atou
438 atodL
812 Axa2l
672 atodJJ
531 atou_ex
485 atou_rock
703 ms average atou
437 ms average atodL
672 ms average atodJJ
531 ms average atou_ex
812 ms average Axa2l
484 ms average atou_rock
Press any key to continue ...
C:\8>asc2bin_testbed2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
158 cycles for 10*Lingo's replacement
184 cycles for 10*atou
175 cycles for 10*atou_rock
258 cycles for 10*atodJJ
159 cycles for 10*Lingo's replacement
186 cycles for 10*atou
177 cycles for 10*atou_rock
260 cycles for 10*atodJJ
Code size:
118 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
--- ok ---
C:\8>asc2bin_testbed2
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
240 cycles for 10*Lingo's replacement
304 cycles for 10*atou
314 cycles for 10*atou_rock
312 cycles for 10*atodJJ
240 cycles for 10*Lingo's replacement
299 cycles for 10*atou
306 cycles for 10*atou_rock
293 cycles for 10*atodJJ
Code size:
118 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
--- ok ---
Lingo, you used another version of my proc; see 2 posts above for the newer version. Thanks.
719 atou
641 atodL
656 Axa2l
937 atodJJ
672 atou_ex
625 atou_rock
688 atou
625 atodL
625 Axa2l
953 atodJJ
672 atou_ex
625 atou_rock
687 atou
625 atodL
641 Axa2l
906 atodJJ
672 atou_ex
609 atou_rock
688 atou
625 atodL
641 Axa2l
921 atodJJ
672 atou_ex
610 atou_rock
695 ms average atou
629 ms average atodL
929 ms average atodJJ
672 ms average atou_ex
640 ms average Axa2l
617 ms average atou_rock
Press any key to continue ...
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
491 cycles for 10*Lingo's replacement
489 cycles for 10*atou
458 cycles for 10*atou_rock
882 cycles for 10*atodJJ
487 cycles for 10*Lingo's replacement
490 cycles for 10*atou
458 cycles for 10*atou_rock
888 cycles for 10*atodJJ
Code size:
118 bytes for atodL
52 bytes for atou
158 bytes for atou_rock
24 bytes for atodJJ
--- ok ---
Those are the best values, for MY old proc, of course :toothy
Alex
Ignore this posting, I made a mess of the algos that should have been used.
AMD Phenom(tm) II X6 1055T Processor
1420 atodw library
406 atou_rock
452 Alex
344 Lingo
405 atou_ex
Note there is something wrong with the results in terms of correctness testing in the last benchmark.
you are right, read the above posts. :dazzled:
Hutch, see this new Axa2l algo.
This is a truly unrolled and data-controlled algo. It works in the OOP paradigm: "data controls the execution process". And it works slowly on my machine... so, one more proof of the slowness of OOP-style algos :)
But it is interesting. See and test it, please. It checks the string length and returns 0 in ecx if the length is wrong (bigger than 10 bytes). If ecx is not 0, then the length is OK.
I attach an archive. This is the old test bed, but you may copy the proc into your newest and improved test beds.
It is interesting which timings it gets on the newest CPUs...
Alex
P.S. This algo is just for fun :)
It would be nice to have one where the size is known a priori, as well as one that terminates on any non-digit rather than on a null.
I think in most common large string processing scenarios you are dealing with a large file that just isn't going to have nice convenient nulls in it, so you are either processing it as-is (size not known, but the terminator is any non-digit) or you have indexed the data already (size is known).
Rockoon,
If all of the data is in the correct form (IE does not need to be parsed from other test) its easy enough to tokenise it in one pass.
Quote from: Rockoon on August 13, 2010, 12:35:14 PM
It would be nice to have one where size is known priori, as well as one that terminates on any non-digit rather than on null.
"If all of the data is in the correct form..."
That is ambitious. For inspiration, attached is a list of examples of valid number formats. MasmBasic reads them all correctly, but it took me a while :green
JJ,
No doubt but the parser is another animal, not just a conversion.
Quote from: hutch-- on August 13, 2010, 02:48:49 PM
Rockoon,
If all of the data is in the correct form (IE does not need to be parsed from other test) its easy enough to tokenise it in one pass.
I am arguing with your "correct form" idea. This idea that the input is in "correct form" only when it's nicely null terminated is wrong.
I'm thinking along the lines of working with large CSV files (comma separated values) and other large textual data sets (why else would performance be a concern?)
Sure, it's easy enough to tokenize it in one extra pass. It's still an extra pass, and one that's particularly memory-bandwidth sensitive.
Hi!
Guys, with your disputes you forgot to test this: "http://www.masm32.com/board/index.php?action=dlattach;topic=14522.0;id=7946" (link to archive). The new Axa2l, truly unrolled.
This is an interesting algo, and it is very interesting how it works on different machines.
Rockoon, Dave (aka KeepingRealBusy) wrote a good proc for hex2dword conversion. In it he uses a look-up table holding the converted value for each char that is a hex digit, and a special "signal" value if the char is not hex. So the code does the conversion and the checking in one pass - I think this is an equally good solution. What do you think?
For a slightly changed version of Dave's proc, go to "http://www.masm32.com/board/index.php?topic=14540.msg117994#msg117994". In the archive, krbhtodw is his proc.
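The idea, roughly (Dave's actual table and proc are in the linked archive; this reconstruction of the mechanism is only a sketch):

.data
; 256 entries: the value 0-15 for each hex digit, 0FFh as the "not hex" signal
hextab db 48 dup(0FFh)
       db 0,1,2,3,4,5,6,7,8,9  ; '0'..'9'
       db 7 dup(0FFh)
       db 10,11,12,13,14,15    ; 'A'..'F'
       db 26 dup(0FFh)
       db 10,11,12,13,14,15    ; 'a'..'f'
       db 153 dup(0FFh)
.code
    ; per character: one load both converts and classifies
    movzx eax, byte ptr [edx]
    add edx, 1
    movzx eax, byte ptr hextab[eax]
    cmp eax, 0FFh
    je done                    ; signal value: not a hex digit, stop the scan
    shl ecx, 4
    add ecx, eax               ; ecx = ecx*16 + digit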
Alex
Rockoon,
I agree that if your target is a large CSV file then you need to handle the data conversion in a different manner. I would be looking at a one pass tokeniser that did the conversion on the fly, but I would make the comment that an algo that is both a file parser and a conversion utility would need to be timed against a 2 pass system, where you tokenise it as fast as you can and then convert sequentially.
The single pass version has advantages in memory bandwidth, but it would be a slower algo because of the amount of work it would have to do. Timing both would tell you the difference. I would imagine that in a one pass parser/converter the conversion algo would have to be short and have a faster attack time than the unrolled linear conversions, and code cache issues would come into play.
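As a rough illustration of the one pass idea (buffer and label names are placeholders, and the store of each finished value is left as a comment):

    mov edx, OFFSET csvbuf     ; source text, e.g. "123,456,789",0
  nextval:
    xor ecx, ecx               ; accumulator for the current value
  digit:
    movzx eax, byte ptr [edx]
    add edx, 1
    sub eax, 30h               ; '0'..'9' -> 0..9
    cmp eax, 9
    ja  valdone                ; any non-digit ends the current value
    lea ecx, [ecx+ecx*4]
    lea ecx, [eax+ecx*2]       ; ecx = ecx*10 + digit
    jmp digit
  valdone:
    ; store ecx here, then continue unless the terminator was the null
    ; (a real parser would also skip runs of separators)
    cmp byte ptr [edx-1], 0
    jne nextval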
And, Rockoon, one more note. If you compare the fetched char not with zero but with 30h, and go to the end if the value is smaller than that code, then a big part of the punctuation chars will be handled correctly - as the end of a number string.
Alex
Hi, Hutch!
Please test the archive in the post 4 posts above (edited).
Alex
:bg
Alex,
I will get there but I am a bit tired of writing test pieces, I only have a million other things to do.
Hutch, I understand you.
But you may run the already compiled test from the archive; it takes ~30 secs. It is your old test bed - I only replaced my proc in the source and recompiled it. So you have no need to write test pieces.
Only the timing of this algo on the newest CPUs interests me; there is no "racing" with others' algos.
Alex
P.S. Sorry if I'm pestering you!
Alex,
This is the result from BM10. Timed on the Core2 3 gig Quad I work on.
188 atou
203 atodL
250 Axa2l
218 atodJJ
204 atou_ex
234 atou_rock
187 atou
204 atodL
250 Axa2l
203 atodJJ
218 atou_ex
235 atou_rock
187 atou
203 atodL
250 Axa2l
219 atodJJ
219 atou_ex
234 atou_rock
188 atou
203 atodL
250 Axa2l
203 atodJJ
172 atou_ex
234 atou_rock
187 ms average atou
203 ms average atodL
210 ms average atodJJ
203 ms average atou_ex
250 ms average Axa2l
234 ms average atou_rock
Press any key to continue ...
Thanks, Hutch!
So, the results are not as bad as I expected :) On my CPU this proc does not work well, because it is hard for the CPU to predict: the path of execution depends on the data, and that is very bad for speedy programming, contrary to the OOP paradigms. This proc implements something like virtual methods :), and these methods do not work fast.
But this algo is for fun - it is interesting enough, and shows some good info about design. So, it is not bad :)
Alex
align 16
mineiro proc String:DWORD
; ------------------------------------------------
; Convert decimal string into UNSIGNED DWORD value
; ------------------------------------------------
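; returns: eax = converted value
;          ecx = validity flag: 0 means invalid input (e.g. more than
;          10 digits), non-zero means a valid conversion (see the
;          expected results in the test data below)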
xor eax,eax
mov edx,[esp+4] ;comment this out if you don't use the calling convention
movzx ecx,byte ptr [edx]
test ecx,ecx
je @fora
movzx eax,byte ptr [edx+1]
test eax,eax
je @ssub30
lea ecx, [ecx+ecx*4-30h*5]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+2]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+3]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+4]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+5]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+6]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+7]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+8]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+9]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
cmp byte ptr [edx+10],0
je @F
mov eax,0
mov ecx,0
jmp @fora
@@:
mov eax,ecx
mov ecx,1
@fora:
ret 4
@ssub30:
mov eax,ecx
xor eax,30h
ret 4
mineiro endp
Some of the procedures in these tests return wrong values in ecx or eax.
.data
align dalg
itm1 db "4294967295",0 ;eax==0ffffffffh ecx!=0
align dalg
itm2 db "42949672952",0 ;eax==0 ecx=0
align dalg
itm3 db "0",0 ;eax==0 ecx!=0
align dalg
itm4 db "0000",0 ;eax==0 ecx!=0
regards