The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: hutch-- on July 31, 2010, 11:24:17 AM

Title: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on July 31, 2010, 11:24:17 AM
The originals of these algos were intended to be ascii/ansi to unsigned DWORD. Here is a test piece that tests the old atodw against two later versions, a short version and a long version that is unrolled to pick up some speed.

The test piece is to determine relative speeds of the 3 algos. It tests the 3 algos for correctness then benchmarks them for comparison purposes.

These are the results on this Core2 quad.


atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
312

Timing short version
266

Timing long version
109

Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Twister on August 01, 2010, 07:41:48 AM
What are these measurements in? Clocks or (milli/nano)seconds?
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 01, 2010, 09:45:57 AM
if you need such measurements feel free to write your own. his benchmark is specifically written to run in real time and reference the earlier version that will be replaced.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 01, 2010, 01:27:19 PM
prescott w/htt:
Timing atodw version
344

Timing short version
266

Timing long version
250

had to change because of throttle
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 01, 2010, 02:47:06 PM
Dave,

It looks like the Prescott does not like the paired LEA instructions which is probably normal.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: MichaelW on August 01, 2010, 07:23:30 PM
Here are the timings for my P3:

Timing atodw version
2834

Timing short version
1482

Timing long version
981


I also tested all 32-bit values, using crt__ultoa to generate the strings, and found no problems.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 02, 2010, 01:12:40 AM
QuoteDave,

It looks like the Prescott does not like the paired LEA instructions which is probably normal.

the long version still has a pay-off for non-P4 machines
at least it isn't slower than the short version on the P4   :bg
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: GregL on August 02, 2010, 03:56:16 AM
Core 2 Duo  2.0 GHz

atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
500

Timing short version
390

Timing long version
172

Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: rags on August 02, 2010, 10:25:19 AM
AMD Athlon II x2 215 2.7Ghz


atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
296

Timing short version
187

Timing long version
171

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: KeepingRealBusy on August 02, 2010, 02:55:05 PM
Hutch, or anyone, I have a question about libraries.

When you statically or dynamically link to the library, doesn't the entire library become mapped into your virtual space? I mean, it has to be mapped, otherwise you could not "call" and mapped function, right? The loader does not load pieces of the library and skip the rest.  It would be different if you had individual .obj files assembled and you linked only the ones that you needed.

If that is the case, is there any real reason two have two functions in the library that do the same thing? Since Hutch's sample has two functions that do the same conversion, and the unrolled one is the fastest, why use anything else? Why keep the short slow one in the library?

Hutch,

Several things about your sample.

1.    I tried 17 different ways to do this faster using the decade table (my own code in my own library matched this algo, two leas, process the string once). I could beat atodw, but not the other two. OBTW, this same method of two leas can be used for hex conversions once the character has been converted to a numeric 0-15 by using lea eax,[eax*4] and lea eax,[ecx+eax*4].

2.    You have no checking for invalid characters other than the null. If you checked and terminated the conversion when a non numeric was found, then the function could be used for processing a string containing multiple values, i.e., "1234 5678 9124", "12:30:05" "08/02/2010".

3.    In the long version, you return an error condition in ecx. What about returning the converted value in eax, and returning  in  the pointer to the character that stopped the scan in edx, and use the following encoding for ecx:

        -2    invalid numeric character > '9'
        -1    invalid numeric character < '0'
         0    valid conversion (edx points to the null)
        +1   valid conversion but more than 9 characters (00000000000001)
        +2   invalid conversion exceeded 32 bits (500000000)

By returning the error flag and the terminating character pointer, the caller can determine whether to continue on to the next piece of the string.

Dave.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: frktons on August 02, 2010, 06:46:45 PM
Here it is my test on Core 2 Duo 2.4 Ghz x64 bit:

atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
390

Timing short version
296

Timing long version
109

Press any key to continue ...



Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 02, 2010, 07:19:54 PM
Celeron M:
Timing atodw version
625

Timing short version
532

Timing long version
281
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: GregL on August 02, 2010, 08:50:02 PM
Quote from: KeepingRealBusyWhen you statically or dynamically link to the library, doesn't the entire library become mapped into your virtual space? I mean, it has to be mapped, otherwise you could not "call" and mapped function, right? The loader does not load pieces of the library and skip the rest.  It would be different if you had individual .obj files assembled and you linked only the ones that you needed.

If the library is built from individual .obj files for each procedure, which is how the masm32 libraries are done,  then when you link you only get the individual procedures that you use mapped into your program space and none of the others.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: KeepingRealBusy on August 02, 2010, 08:53:30 PM
Greg,

Thank you. That is good to know. What about DLL's, especially user and system?
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: GregL on August 02, 2010, 09:24:27 PM
If your program loads a DLL, you get the whole thing mapped into your program space.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 02, 2010, 10:37:56 PM
Celeron D Prescott test

atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
562

Timing short version
438

Timing long version
391

Press any key to continue ...




Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 03, 2010, 03:01:22 AM
KeepingRealBusy,

The reason why this pair of algos were posted in the lab was to test them, while you have perfectly valid questions about some of the issues raised, this is not the place for them or other unrelated questions in the previous post.

RE: The design choice for 2 algos, it is simple enough to provide the range difference so that someone who wants to do an occasional conversion can use a shorter version where someone who want to stream data of this type can use a faster but larger one. Its the choice of the user being catered for here, not a theory about library design.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: KeepingRealBusy on August 03, 2010, 03:06:46 AM
Excuse me,

Quote
The Laboratory
Algorithm and code design research laboratory. This is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful. Feel free to explain the optimisation methods used so that everyone can get a feel for the code design.

By your own words.

If this is not the place, then where is the place?

Dave.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 03, 2010, 03:20:50 AM
Try the Campus. If you have an algo post it in here.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 03, 2010, 11:02:38 PM
Hutch

It seems, what this proc have bottleneck on PIV.
By my test, if you change this:

    movzx eax, BYTE PTR [edx]
    test eax, eax
    jz quit

    lea ecx, [ecx+ecx*4]
    lea ecx, [eax+ecx*2-48]
    movzx eax, BYTE PTR [edx+1]
    test eax, eax
    jz quit


to:


    movzx eax, BYTE PTR [edx]
    add eax,-48 ; <--- this
    js quit ; <--- this

    lea ecx, [ecx+ecx*4]
    lea ecx, [eax+ecx*2] ; <--- this, if remove substraction from LEA, it works faster
    movzx eax, BYTE PTR [edx+1]
    add eax,-48 ; <--- this
    js quit ; <--- this

..... etc .....


On my system speed increase by ~14%
Try this, may be on any hardware this have positive advantages.


Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: mineiro on August 06, 2010, 02:52:59 AM
Pentium Dual 1.8ghz
Timing atodw version
547
Timing short version
406
Timing long version
187
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 06, 2010, 11:00:37 AM
sorry, but those numbers mean nothing without code to go with them   :P
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 06, 2010, 11:03:25 AM
Dave,

they are just results of the attachment in the first post.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 06, 2010, 11:08:57 AM
ahhhhhhhhhhh
good mornin Hutch  :bg
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 07, 2010, 04:35:17 AM
Your Timing short version
265
My Timing short version
140
align algn

atodL proc String:DWORD
    mov    edx, [esp+1*4]
    movzx  eax, byte ptr [edx]
    add    edx, 1
    test   eax, eax
    jz    quit
    movzx  ecx, byte ptr [edx]
    sub    eax, 30h
    test   ecx, ecx
    jz    quit
@@:
    inc    edx
    lea    eax, [eax+eax*4]
    cmp    byte ptr [edx], 0
    lea    eax, [ecx+eax*2-30h]
    movzx  ecx, byte ptr [edx]
    jnz    @b
  quit:
    pop ecx
    pop edx
    jmp ecx
atodL   endp

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 07, 2010, 11:48:43 AM
What I like most about Lingo's code is that it can always be improved :green2

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen


Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 07, 2010, 12:12:26 PM
 :bg


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
17      cycles for atodL: Lingo's bloatware
16      cycles for atodJJ: improved by JJ

17      cycles for atodL: Lingo's bloatware
16      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: ecube on August 07, 2010, 12:29:22 PM

AMD Athlon(tm) 64 Processor 3000+ (SSE3)
24      cycles for atodL: Lingo's bloatware
26      cycles for atodJJ: improved by JJ

22      cycles for atodL: Lingo's bloatware
26      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen

--- ok ---


2-4 cycles slower, nice improvement  :tdown
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 07, 2010, 02:22:40 PM
I put the new algos into a real time test bed and the results are very different. I played with this one for a while to ensure that they were reasonable comparisons and with the short algos Lingos is easily the fastest, JJs is the smallest and the timings all wander depending on code placement and algorithm alighment.


641 atou
250 atodL
593 atodJJ
235 atou_ex
640 atou
250 atodL
594 atodJJ
188 atou_ex
640 atou
250 atodL
594 atodJJ
203 atou_ex
641 atou
250 atodL
594 atodJJ
234 atou_ex


640 ms average atou
250 ms average atodL
593 ms average atodJJ
215 ms average atou_ex
Press any key to continue ...


Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 07, 2010, 02:29:54 PM
"What I like most about Lingo's code is that it can always be improved 
Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ"



My algo was for Hutch's test program , so we can see before my original code: align algn
-   If we try my original algo vs "improved by JJ" in the Hutch's test program
the result will be Timing short version 140 vs Timing short version 359 (with my CPU)

The thief is always a liar but our is stupid too due to he don't understand the algo's details (lack of A.Fog)...Why?
-   In his test program he transferred my algo with align 16 ONLY! Hence, after that my loop label is not aligned properly....
If we have align 16 PLUS db 8 dup(0) after it,  the results will be diferent.   
-   In my test versions I have similar short code but it is slower (biger is faster law). Why?
One reason is in the first(empty) looping when eax=0  JJ "improved" algo  spend time for nothing ( we have some empty instructions like  lea    eax, [eax+eax*4] and   lea  eax, [ecx+eax*2-30h])

So, "Lingo's code is that it can always be improved".
I agree, but not from everyone... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
17      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

17      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

Code size:
52      atodL Lingo
24      atodJJ Jochen

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 07, 2010, 05:50:55 PM
Quote from: lingo on August 07, 2010, 02:29:54 PM
If we have align 16 PLUS db 8 dup(0) after it,  the results will be diferent.

My most sincere apologies, young friend! Indeed, with only 8 bytes more, you gain half a cycle on my modern CPU :U

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's superbloatware
24      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
24      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
23      cycles for atodJJ: improved by JJ

Code size:
52      atodL Lingo
24      atodJJ Jochen
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 07, 2010, 11:39:57 PM
I just mercilessly hacked JJs test bed to try out the changes to atou. I tried an unroll by 2 which brought its size up to a bit bigger than Lingo's algo but still within a reasonable size for a small general purpose algo and its now averaging faster than the long version I have been playing with.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
17      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

17      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
44      atodL Lingo
52      atou Hutch

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 06:01:11 AM
 :toothy
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
144     atodL Lingo
52      atou Hutch

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 08, 2010, 06:19:33 AM
Lingo,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
144     atodL Lingo
52      atou Hutch

--- ok ---


Have you got a version of that algo that is an unroll by 2, I have done the same with JJs but I don't know enough about yours to unroll it properly. I tried this on the last one but it did not get any faster.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    REPEAT ncnt
      padd
    ENDM

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

db 8 dup(0)

atodL proc String:DWORD
    mov    edx, [esp+1*4]
    movzx  eax, byte ptr [edx]
    add    edx, 1
    test   eax, eax
    jz    quit
    movzx  ecx, byte ptr [edx]
    sub    eax, 30h
    test   ecx, ecx
    jz    quit
@@:
;     inc    edx
;     lea    eax, [eax+eax*4]
;     cmp    byte ptr [edx], 0
;     lea    eax, [ecx+eax*2-30h]
;     movzx  ecx, byte ptr [edx]
;     jz quit

    inc    edx
    lea    eax, [eax+eax*4]
    cmp    byte ptr [edx], 0
    lea    eax, [ecx+eax*2-30h]
    movzx  ecx, byte ptr [edx]
    jnz    @b

  quit:
    pop ecx
    pop edx
    jmp ecx
atodL   endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Queue on August 08, 2010, 06:23:55 AM
a2b3
AMD Athlon(tm) 4 Processor (SSE1)
31 cycles for atodL: Lingo's replacement
37 cycles for atou: improved by hutch


Kinda sucks that it's 3 times as big.

Queue
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 08, 2010, 06:33:48 AM
Here is the latest test bed I have been using for these algos, I just added Lingo's unroll by 4 to it so I have JJs and mine as short versions and Lingo's and my long one all timed. I added a suggestion of Daves to set the affinity mask to stabilise the timings and this seems to have helped get all of the times closest to their minumin.


187 atou
203 atodL
235 atodJJ
218 atou_ex
188 atou
203 atodL
219 atodJJ
219 atou_ex
187 atou
203 atodL
219 atodJJ
203 atou_ex
188 atou
203 atodL
218 atodJJ
250 atou_ex


187 ms average atou
203 ms average atodL
222 ms average atodJJ
222 ms average atou_ex
Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 12:42:08 PM
Queue,
"Kinda sucks that it's 3 times as big."

144     atodL Lingo

144 is a total length of new and old versions due to I'm lazy to delete them, sorry. :wink
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 01:58:40 PM
It is crazy to write optimized code for different testing programs due to for one algo they compute different results.
So, test programs writers should obey some standard rules and/or for this site must have just one "the best"
and mandatory testing program. Otherwise I have to write one algo for Hutch's testing program other for
A.Fog's testing program, next for  MichaelW's testing program, next for jj testing program, etc... :lol
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 08, 2010, 02:21:30 PM
 :bg

Its crazy to write optimised code that only works in one test bed and not the another. I test algos in real time when I need to as algos run in real time whan they are being used.

Show us you Lingo test program.  :bg
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 03:48:35 PM
"that only works in one test bed and not the another"
this is a nonsense.
We optimize algos according the rules in the Intel Optimization manual rather than "rules" of your testing program.
Where are your "rules" because I can't see them? :lol

"I test algos in real time"
It is just your opinion but without rules...
What about other's testing programs?
For example: try to optimize an algo to be fastest with same times in your and jj testing programs. :lol
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 08, 2010, 04:13:48 PM
 :bg

There is fantasy involved here, you optimise code to the CLOCK. It either works or it does not, that what you can get from the manual.

Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not produce the faster algorithm.

If a piece of code is only a superstar in the test bed and not in general purpose, then its useless.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 08, 2010, 04:47:57 PM
I think that lingo doesnt understand a principle of profiling and optimization.

The only profiling that matters is profiling of real programs. Real production code. The purpose of testbeds (plural) is just to get a general idea of the issues you will face in those aforementioned real production programs. If you restrict yourself to a single testbed then you are just specializing, not using testbeds for the purpose that they are supposed to serve.

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 08, 2010, 04:49:34 PM
it is hard to optimize code while running on a specific processor
in fact, i am P4-bound, so if i speed up a routine on my machine,
it is not likely to be optimal on any of the newer processors   :P
as time goes by, it hardly makes sense to use a P4 machine for writing code, although it's nice to test it on one   :bg

hmmm - maybe i can use that as an excuse, of some sort - lol
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 05:11:12 PM
"Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not produce the faster algorithm.
If a piece of code is only a superstar in the test bed and not in general purpose, then its useless."


All this is nonsense again. I will try to explain to you last case with YOUR SUPERSTAR ALGO CODE(I like it). 
The YOUR ALGO CODE   from YOUR  a2b2.zip file measured with the jj testing program which YOU use inside finished for 12 clocks. It is Unrolled just 2 times.
If you unroll it 4 times it will finished for 12 clocks again in the jj testing program. Due to you know the difference between your testing program and jj testing program (TP) you stoped using jj TP  and start using your test program bm4.asm. In it you unroll  YOUR ALGO CODE 4 times and it become fastest(but just in your testing program).
Of course in it you included other's algos which are just 2 times unrolled but it is other story (bad gaming).
So, YOUR ALGO CODE unrolled 4 times run different on your test program and on jj test program. 
If YOUR ALGO CODE is superstar try it in jj test program against my code to see the results. You can unroll many times until your face become red but your end result will be 12 clocks or more...so, put your code in the recycle bin because according to you: :lol
"If a piece of code is only a superstar in the test bed and not in general purpose, then its useless."


"There is fantasy involved here"
Everyone can investigate this to see who is wrong...  :lol

I will repeat again and again my offer from my previous post because still I have no answer
For example: try to optimize an algo to be fastest with same times in your and jj testing programs.   :lol


About "gaming"...
You are the "gamer" because you use your testing program to have better results. Try to get them in the jj test program... :lol

Rockoon,
"I think that lingo doesn't understand a principle of profiling and optimization."
What and why you think so is not important. Will be better to post your code rather than blah,bla,bla...

Dave,
We don't talk for different processors. We talk about different testing programs with different testing results for the same algo and CPU..
You can test them with Hutch's superstar code... :lol
Or may be I have to write "My algo is fastest with JJ test program, Hutch's algo is fastest with his test program"... :lol

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 08, 2010, 06:00:50 PM
You are failing to realize the immense mistake in your thinking. There is no such thing as the fastest code.

Testbeds dont tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed. You still cannot determine if you have found the fastest code even in terms of that testbed, let alone a real production product.

Testbeds are not marketable products. Users do not clamor for testbeds. Nobody is saying "Gee, I wish I had JJ's testbed" or "Gee, I wish I had Hutches testbed," because testbeds are USELESS things that serve no purpose other than investigation by people writing code.

What production products is this function a bottleneck in? HMM? HMMMMM??? Once you determine that, THEN you have some RELEVANT testbeds: THE PRODUCTION PRODUCTS.

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 06:04:26 PM
No code again, just blah,bla,bla... :(
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 08, 2010, 07:02:01 PM
Do you think that asking for a code handout makes you right?

It doesn't. It makes you just as wrong as before you asked.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 07:32:47 PM
"It makes you just as wrong as before you asked."
Why I'm wrong or guilty? Why? The SUPERSTAR ALGO CODE is from Hutch,
the superstar "real time" testing program is from Hutch too, other testing program is from JJ or from  MichaelW.
It is like to measure voltage with two voltmeters. One draws 24 V other 33V from the same source. Which is wrong?
Am I guilty that I invent the different results between these testing programs and  SUPERSTAR ALGO CODE from hutch and ask about it? Or to close my eyes and say that always is super and there are no differences?

"You are failing to realize the immense mistake in your thinking."
I think how to improve some algos rather than how to use "properly" superstar "real time" testing program or other tools. So, when I see a problem with the tools I just ask their creators about it.

"Do you think that asking for a code..."
If you have some optimized code my advice will be to test it with these two testing programs and after that we will continue our very interesting conversation. :(
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 08, 2010, 08:53:37 PM
How are you wrong?

Lets see...

"It is crazy to write optimized code for different testing programs due to for one algo they compute different results."

Thats wrong. Its not "crazy" to write multiple versions of an algorithm, all "better" in different cases. Its important that different cases have different optimization criteria.

So, test programs writers should obey some standard rules and/or for this site must have just one "the best" and mandatory testing program.

Thats wrong. There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm.

Otherwise I have to write one algo for Hutch's testing program other for A.Fog's testing program, next for  MichaelW's testing program, next for jj testing program, etc...

Thats wrong. You dont "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances.

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 08, 2010, 09:22:20 PM
Quote from: Rockoon on August 08, 2010, 06:00:50 PMTestbeds dont tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed.

Rockoon,

The algos posted here for teasing Lingo are not the best example. Their results differ by CPU because they are the fastest code for a particular CPU: 20 cycles, not a single one less because that CPU has only this limited set of instructions and this limited set of possible combinations. In this sense, yes "there is no fastest code" that fits to all CPUs but there is one for each CPU. And of course the algos can behave different in real life apps.

However, the majority of threads in the Lab starts with the idea "hey, we could speed up lstrwhatever a little bit. And wow, you get a factor 10 with some simple tricks from Agner's or Lingo's or NightWare's code kitchen. On all CPUs, in all real life apps. That is where testbeds have a positive role. It turns into the absurd when later on we start squeezing out the last cycle and discover that we are running different hardware. That is less useful but it is the fun part :bg
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 08, 2010, 11:11:13 PM

Hutch,
add to yours test this code:


align 16
Axa2l proc STDCALL lpszStr:DWORD
    mov edx,[esp+4]

        xor eax,eax
        xor ecx,ecx ; try to comment this, and add two nops before proc
        movzx ecx,byte ptr [edx]

        add ecx,-30h
        js @done

        @mainloop:
        lea eax,[eax+eax*4]
        lea eax,[eax*2+ecx]
        movzx ecx,byte ptr [edx+1]
        add edx,1 ; try to change this to "inc edx" (to put main loop in 16bytes long)
        add ecx,-30h
        jns @mainloop


    @done:       

    ret 4
Axa2l endp


I write some comments, what is useful to playing.


Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 08, 2010, 11:15:52 PM
Let see.... again blah bla bla... without any value in the practice, without any code example or/and sources of your "knowledge"

Its not "crazy" to write multiple versions of an algorithm, all "better" in different cases."
and
"Its important that different cases have different optimization criteria."
and
"Its crazy to write optimized code that only works in one test bed and not the another" by Hutch

For me All OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.
If you or Hutch know more sources of optimization criteria please let us know and write a CODE EXAMPLES to see how to do that...

"There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm."
Again: they are in the INTEL (AMD) Optimization Reference Manual and some tips in the help file of VTune soft performance analyzer

"You don't "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances."
Try to do this with some CODE EXAMPLES to see how to do that in the practice...

And please stop to fill this thread with empty words Blah, bla, bla, bla and to kill our time to read them...Your code examples or nothing... :(

You can start with your version of atou algo too...or teach Hutch how to make his atou faster with JJ testing program... :lol
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 08, 2010, 11:46:35 PM
Quote from: lingo on August 08, 2010, 11:15:52 PM
For me All OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.

No, not all of them. Tell me, what do those optimization manuals say about the wildly differing performance of your routine in the other thread that is either great, or really crappy, depending on what other code is also included in the program.

Quote from: lingo on August 08, 2010, 11:15:52 PM
Again: they are in the INTEL (AMD) Optimization Reference Manual and some tips in the help file of VTune soft performance analyzer

You requested a standard test harness. I was writing about a standard test harness and the flaw in the idea. What you are writing here has nothing to do with that. Why dont you address the topic you brought up?

You seem to think that usage patterns dont matter.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 12:42:29 AM
Lingo,

I know you can do better than this. I have had my misgivings about the popular test bed that has been used recently due to granularity with very low result counts so I have done the obvious, I have made a test bed that times a number of algorithms in real time that runs longer and can be made to run longer again.

Now your theory is flawed in that by your own admission you want to write a different algo for each different test bed rather than trying to make the algo faster over the range of conditions that it is designed to operate under. Specialisation on one processor and one test bed is a mistake as they are both variables, different hardware produces different results and different test beds serve different purposes.

The common test bed is based off Michael's timing macros and while the technique is well suited for testing short sections of code that are almost impossible to test in other ways, the technique becomes less useful on complete algorithms that have a very low timing count.

Long ago I have learnt that you tailor a test bed to the task you are performing and the task change from one algorithm to another. If I am testing an algo that needs a large piece of data I build a test bed that has data on this scale, I have commonly tested algos on a gigabyte of data if that form of streaming is where it will be used. At the other end if the algo is very short and its "take off" time is the critical factor, I write a high loop count short data source test bed to test that feature.

The current test bed for these algos is ajustable in real time, uses varible length data input, variable duration based off the loop count, variable inter-algo padding to test code isolation, is controlled to run on a single core, each timing is real time isolated using SleepEx() and the instruction queue is flushed with a serialised instruction before each timing.

The criterion for an algorithm is not if it uses instructions from either the Intel or AMD optimisation manual, its how fast it is in the conditions where it will be used and this is a variable depending on the processor it is used on.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 01:23:57 AM
Alex,

I added your algo to the test bed. As a short algo it was slow and messed up the timings of the other algos but I unrolled it by 2 and it came down in its time to near the rest. Even with the layout of this testbed algo placement effects the timing of other algos so I have ordered them to try and get the bets timings for each algo but note that there are fluctuations in the timing of all of them.


172 atou
203 atodL
203 Axa2l
203 atodJJ
219 atou_ex
203 atou
203 atodL
235 Axa2l
218 atodJJ
235 atou_ex
203 atou
203 atodL
203 Axa2l
203 atodJJ
250 atou_ex
219 atou
203 atodL
281 Axa2l
204 atodJJ
234 atou_ex


199 ms average atou
203 ms average atodL
207 ms average atodJJ
234 ms average atou_ex
230 ms average Axa2l
Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: sinsi on August 09, 2010, 01:40:11 AM

    Core MACRO corenum
      mov eax, 1
      .if eax == 1

Always core 1  :bg
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 09, 2010, 02:18:58 AM
that selects core 0   :P
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 02:51:28 AM
"I know you can do better than this"

'Or may be I have to write "My algo is fastest with JJ test program, Hutch's algo is fastest with his test program"
No, should be "My algo is fastest with JJ testing program and with Hutch's testing program"  :lol

You are like Rockoon who has a big need from empty speech...but...
Due to I have a Master Degree in Electrical Engineering to argue with me, theoretically, you should have a Ph.D. at least... :lol
So, try to concentrate yourself to find a faster solution for your atou - with other words catch me if you can... :lol
New results from your test program:
C:\7>bm6
171 atou
156 atodL ->Lingo
187 Axa2l
203 atodJJ
234 atou_ex
172 atou
156 atodL ->Lingo 
187 Axa2l
187 atodJJ
172 atou_ex
171 atou
156 atodL  ->Lingo
187 Axa2l
203 atodJJ
187 atou_ex
188 atou
156 atodL  ->Lingo
187 Axa2l
187 atodJJ
203 atou_ex


175 ms average atou
156 ms average atodL ->Lingo
195 ms average atodJJ
199 ms average atou_ex
187 ms average Axa2l
Press any key to continue ...


Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 03:42:20 AM
 :bg

No great need, I am pleased you are catching up instead of talking.

Timing is on a 3 gig Core2 quad with 1333 memory.

172 atou
172 atodL
203 Axa2l
203 atodJJ
172 atou_ex
172 atou
171 atodL
204 Axa2l
203 atodJJ
218 atou_ex
172 atou
172 atodL
203 Axa2l
219 atodJJ
219 atou_ex
172 atou
172 atodL
203 Axa2l
203 atodJJ
187 atou_ex


172 ms average atou
171 ms average atodL
207 ms average atodJJ
199 ms average atou_ex
203 ms average Axa2l
Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 04:27:32 AM
"I am pleased you are catching up"

But I'm not due to big  differences between values for your algos atou and atou_exe in your last and previous results...Somebody else pls..
prev. last
172  172 atou
     172 atodL
     203 Axa2l
     203 atodJJ
219  172 atou_ex
203  172 atou
     171 atodL
     204 Axa2l
     203 atodJJ
235  218 atou_ex
203  172 atou
     172 atodL
     203 Axa2l
     219 atodJJ
250  219 atou_ex
219  172 atou
     172 atodL
     203 Axa2l
     203 atodJJ
234  187 atou_ex


199  172 ms average atou
     171 ms average atodL
     207 ms average atodJJ
234  199 ms average atou_ex
     203 ms average Axa2l
Press any key to continue ...

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 09, 2010, 04:59:07 AM
Quote from: jj2007 on August 08, 2010, 09:22:20 PM
In this sense, yes "there is no fastest code" that fits to all CPUs but there is one for each CPU. And of course the algos can behave different in real life apps.

The term TANSTATFC, "There Aint No Such Thing As The Fastest Code" has nothing to do with differing architectures. Its the acceptance that there is a limitless number of ways to solve a problem and that you will never be able to try them all, that the chance that you have found the fastest way is vanishingly small. That still further you accept that "the algorithm" is the entire program, not just some subset of it.

The only reason I tackled this problem is because I knew that your claim of a 20 cycle limit was horseshit.


align 16
atou_rock proc String:DWORD

    mov ecx, [esp + 4]
    push ebx

    ; ---

    movzx eax, byte ptr [ecx + 0]
    test eax, eax
    jz @@done

    sub eax, 48

    movzx edx, byte ptr [ecx + 1]
    test edx, edx
    jz @@done

    movzx ebx, byte ptr [ecx + 2]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 3]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 4]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 5]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 6]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 7]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 8]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 9]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    lea eax, [4*eax + eax]
    lea eax, [2*eax + edx - 48]

@@done:
    pop ebx
    ret 4

atou_rock endp


280 ms average atou
499 ms average atodL
265 ms average atodJJ
265 ms average atou_ex
292 ms average Axa2l
218 ms average atou_rock

Phenom II x6 1055T

Lingo's previous versions performed much better (within a point or two of atodJJ)

Edited: Made a small change, and including project files.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 05:06:40 AM
The short version "atou" has been faster than the longer version ever since I unrolled it by 2. What is the big deal ?

I unrolled both JJs and Alex's to do a fair comparison, I left it up to you to do an unroll by 2 on your own.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: sinsi on August 09, 2010, 05:29:46 AM
Just goes to show it's all a waste of time wankfest  :bdg

245 ms average atou
218 ms average atodL
257 ms average atodJJ
277 ms average atou_ex
257 ms average Axa2l
276 ms average atou_rock

Q6600 2.4GHz
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 06:19:17 AM
Here is the good news and the bad news. I added a code loop between each test to load the processor core between tests, that slowed down my atou algo on the Core2 quad and made lingo's faster. The bad news is its the only Core that its faster on. here are the 5 boxes I have handy. The unrolled algo clarly has the legs on the i7, Lingo has it on the COre2 quad, Alex has it on the Prescott P4, Lingo has it on the Northwood by a tiny amount and the unrolled version has it on the ancient Celeron. Kruel ain't it.


2.8 gig i7 quad
171 ms average atou
171 ms average atodL
202 ms average atodJJ
156 ms average atou_ex  *
191 ms average Axa2l

3 gig Core2 Quad
191 ms average atou
171 ms average atodL    *
227 ms average atodJJ
191 ms average atou_ex
215 ms average Axa2l

3.8 gig Prescott P4
410 ms average atou
386 ms average atodL
511 ms average atodJJ
363 ms average atou_ex
355 ms average Axa2l    *

2.8 gig Northwood P4
633 ms average atou
566 ms average atodL    *
719 ms average atodJJ
570 ms average atou_ex
593 ms average Axa2l

1.2 gig Celeron
1071 ms average atou
1171 ms average atodL
966 ms average atodJJ
871 ms average atou_ex  *
1041 ms average Axa2l


Sinsi,

Digital manual self delusion may be misconstrued as something else.  :P
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: dedndave on August 09, 2010, 06:24:18 AM
well - run that on a couple fairly new AMD machines
then, add all the times together and see who is fastest overall - lol

another approach might be to load a different library depending on the processor being run   :bg
it seems like this is where the fastest apps will be

with 4 OS's, we need the equiv of 28 test machines   :red
then, we can develop the 7 libraries
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 06:28:51 AM
Thanks, sinsi  :U
"Here is the good news and the bad news"
Thanks, but your info is old because we have a new algo... :lol

Rockoon,
I added your algo to the test bed and results are:
C:\7>bm7
172 atou
156 atodL
234 Axa2l
202 atodJJ
188 atou_ex
218 atou_rock
172 atou
156 atodL
187 Axa2l
187 atodJJ
203 atou_ex
218 atou_rock
172 atou
156 atodL
203 Axa2l
202 atodJJ
219 atou_ex
218 atou_rock
172 atou
156 atodL
234 Axa2l
187 atodJJ
172 atou_ex
218 atou_rock


172 ms average atou
156 ms average atodL
194 ms average atodJJ
195 ms average atou_ex
214 ms average Axa2l
218 ms average atou_rock
Press any key to continue ...

with JJ testing program:

C:\7>asc2bin3
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch
14      cycles for atou_rock: by  Rockoon

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch
14      cycles for atou_rock: by  Rockoon

Code size:
100     atodL Lingo
52      atou Hutch
158     atou_rock Rockoon

--- ok ---


Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 06:52:23 AM
Lingo,

I don't get any timing difference on the quad but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: sinsi on August 09, 2010, 06:53:17 AM
This is fun!

277 ms average atou
218 ms average atodL
265 ms average atodJJ
288 ms average atou_ex
273 ms average Axa2l
288 ms average atou_rock

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
11      cycles for atodL: Lingo's replacement
10      cycles for atou: improved by hutch
10      cycles for atou_rock: by  Rockoon

11      cycles for atodL: Lingo's replacement
10      cycles for atou: improved by hutch
10      cycles for atou_rock: by  Rockoon

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 07:33:49 AM
Here is more bad news for you.


171 atodL
187 Axa2l
203 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
171 atou_rock


171 ms average atou
171 ms average atodL
191 ms average atodJJ
156 ms average atou_ex
187 ms average Axa2l
171 ms average atou_rock
Press any key to continue ...


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
9       cycles for atodL: Lingo's replacement
8       cycles for atou: improved by hutch
8       cycles for atou_rock: by  Rockoon

9       cycles for atodL: Lingo's replacement
8       cycles for atou: improved by hutch
8       cycles for atou_rock: by  Rockoon

Code size:
100     atodL Lingo
210     atou Hutch
158     atou_rock Rockoon

--- ok ---


PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 09, 2010, 09:38:16 AM
Quote from: hutch-- on August 09, 2010, 07:33:49 AM
PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.
The labels look OK, but a bit of additional code was inserted.

The attachment makes testing a bit more straightforward. The Smooth macro is for discussion on code cache and alignment issues. It seems to render fairly stable P4 cycle counts.

Usage:
testcorrect atou
testme atou
codesize atou


The codesize macro requires the presence of two labels:
QuoteMyAlgo_s:
MyAlgo proc
...
MyAlgo endp
MyAlgo_endp:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
797     cycles for 10*Lingo's replacement
445     cycles for 10*atou
442     cycles for 10*atou_rock
858     cycles for 10*atodJJ

784     cycles for 10*Lingo's replacement
443     cycles for 10*atou
444     cycles for 10*atou_rock
829     cycles for 10*atodJJ

Code size:
100      bytes for atodL
158      bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 12:46:02 PM
"but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine."
and
"PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long."

Hutch,
I apologize. The error is mine. I pasted Rockoon's algo two times...and 2nd time on your algo. Sorry...
Corrected...
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
148     cycles for 10*Lingo's replacement
185     cycles for 10*atou
176     cycles for 10*atou_rock
259     cycles for 10*atodJJ

149     cycles for 10*Lingo's replacement
187     cycles for 10*atou
175     cycles for 10*atou_rock
259     cycles for 10*atodJJ

Code size:
180      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: FORTRANS on August 09, 2010, 01:18:18 PM
Hi,

   From jj2007's code in Reply #69.

Regards,

Steve

pre-P4 (SSE1)
453     cycles for 10*Lingo's replacement
314     cycles for 10*atou
313     cycles for 10*atou_rock
492     cycles for 10*atodJJ

383     cycles for 10*Lingo's replacement
313     cycles for 10*atou
314     cycles for 10*atou_rock
490     cycles for 10*atodJJ

Code size:
100      bytes for atodL
158      bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 09, 2010, 02:15:47 PM
Quote from: lingo on August 09, 2010, 02:03:58 PM
Please, redownload last asc2bin_testbed1.zip file and test it again.Thanks :wink

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
697     cycles for 10*Lingo's replacement
474     cycles for 10*atou
444     cycles for 10*atou_rock
845     cycles for 10*atodJJ

684     cycles for 10*Lingo's replacement
494     cycles for 10*atou
448     cycles for 10*atou_rock
842     cycles for 10*atodJJ

Code size:
171      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ


And one more - but note Lingo's code size, the version has changed again ::)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
268     cycles for 10*Lingo's replacement
295     cycles for 10*atou
286     cycles for 10*atou_rock
335     cycles for 10*atodJJ

266     cycles for 10*Lingo's replacement
295     cycles for 10*atou
286     cycles for 10*atou_rock
315     cycles for 10*atodJJ

Code size:
180      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 02:37:29 PM
Please, redownload last asc2bin_testbed1.zip file and test it again.Thanks :wink

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 03:05:25 PM
I swapped the short atou to the longer version, unrolled JJs algo by 8 and changed the algo order and got this result on the i7.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
99      cycles for 10*atou
83      cycles for 10*atou_rock
113     cycles for 10*Lingo's replacement
104     cycles for 10*atodJJ

99      cycles for 10*atou
84      cycles for 10*atou_rock
82      cycles for 10*Lingo's replacement
103     cycles for 10*atodJJ

Code size:
210      bytes for atou
158      bytes for atou_rock
171      bytes for atodL
129      bytes for atodJJ

--- ok ---
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 03:31:12 PM
"I swapped the short atou to the longer version, unrolled JJs algo by 8 and changed the algo order and got this result on the i7."

Thank you Hutch, but you used my old file. Sorry...
Real code size of my algo is:  180 bytes for atodL
So, please, re-download last asc2bin_testbed1.zip file and test it again. Thanks. :wink
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 09, 2010, 04:09:19 PM
I had used the new version but in any case here is another benchmark that includes your new unrolled version and Rockoon's unrolled version. Interestingly enough the i7 runs all of the full unrolls in the same timing. Your algo is faster on the 2 P4s but a bit slower on the Core2 quad.


i7 quad.
=======
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock
171 atou
156 atodL
188 Axa2l
202 atodJJ
156 atou_ex
156 atou_rock
172 atou
156 atodL
187 Axa2l
203 atodJJ
156 atou_ex
156 atou_rock


171 ms average atou
156 ms average atodL
202 ms average atodJJ
156 ms average atou_ex
187 ms average Axa2l
156 ms average atou_rock
Press any key to continue ...

Core2 Quad.
==========
188 atou
203 atodL
203 Axa2l
219 atodJJ
218 atou_ex
188 atou_rock
187 atou
203 atodL
204 Axa2l
203 atodJJ
234 atou_ex
203 atou_rock
188 atou
203 atodL
219 Axa2l
218 atodJJ
203 atou_ex
219 atou_rock
188 atou
203 atodL
203 Axa2l
219 atodJJ
203 atou_ex
187 atou_rock

187 ms average atou
203 ms average atodL
214 ms average atodJJ
214 ms average atou_ex
207 ms average Axa2l
199 ms average atou_rock
Press any key to continue ...

Prescott P4.
===========
359 atou
297 atodL
344 Axa2l
500 atodJJ
359 atou_ex
329 atou_rock
375 atou
296 atodL
344 Axa2l
516 atodJJ
359 atou_ex
328 atou_rock
375 atou
297 atodL
344 Axa2l
484 atodJJ
360 atou_ex
328 atou_rock
375 atou
297 atodL
328 Axa2l
500 atodJJ
359 atou_ex
328 atou_rock

371 ms average atou
296 ms average atodL
500 ms average atodJJ
359 ms average atou_ex
340 ms average Axa2l
328 ms average atou_rock
Press any key to continue ...

Northwood P4.
============
625 atou
438 atodL
484 Axa2l
703 atodJJ
563 atou_ex
484 atou_rock
641 atou
453 atodL
516 Axa2l
703 atodJJ
562 atou_ex
469 atou_rock
641 atou
437 atodL
563 Axa2l
672 atodJJ
562 atou_ex
484 atou_rock
563 atou
453 atodL
531 Axa2l
703 atodJJ
563 atou_ex
484 atou_rock


617 ms average atou
445 ms average atodL
695 ms average atodJJ
562 ms average atou_ex
523 ms average Axa2l
480 ms average atou_rock
Press any key to continue ...

Antique Celeron.
===============
1102 atou
1092 atodL
1042 Axa2l
972 atodJJ
851 atou_ex
791 atou_rock
1112 atou
1092 atodL
1062 Axa2l
952 atodJJ
871 atou_ex
781 atou_rock
1092 atou
1122 atodL
1042 Axa2l
972 atodJJ
851 atou_ex
791 atou_rock
1082 atou
1112 atodL
1041 Axa2l
982 atodJJ
842 atou_ex
791 atou_rock


1097 ms average atou
1104 ms average atodL
969 ms average atodJJ
853 ms average atou_ex
1046 ms average Axa2l
788 ms average atou_rock
Press any key to continue ...


Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 09, 2010, 04:37:25 PM
"I had used the new version but"

Thank you Hutch, but you used my new version algo for JJ testing program in your testing program and it is
wrong.  Sorry...
I have second version for your testing program and it is in the my last bm7.zip file
Note: these are just for Intel users.

Soon, I will post similar two algos (bmx.exe and asc2binx) for the AMD users.

"I am pleased you are catching up instead of talking."
Are you pleased now?  :lol
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 09, 2010, 10:52:05 PM
Hutch, here new version of my proc.
It seems, that unrolling by 4 - the same good (on my machine).


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16
    repeat 12
    ret
    endm
Axa2l proc lpszStr:DWORD
;int 3
    mov edx,[esp+4]
        mov eax,0
       
        movzx ecx,byte ptr [edx]
        add ecx,-30h
       
        js @done

        add edx,1

      @mainloop:
        lea eax,[eax+eax*4]
        lea eax,[eax*2+ecx]
        movzx ecx,byte ptr [edx]
        add ecx,-30h
        js @done
       
        lea eax,[eax+eax*4]
        lea eax,[eax*2+ecx]
        movzx ecx,byte ptr [edx+1]
        add ecx,-30h
        js @done
       
        lea eax,[eax+eax*4]
        lea eax,[eax*2+ecx]
        movzx ecx,byte ptr [edx+2]
        add ecx,-30h
        js @done

        lea eax,[eax+eax*4]
        lea eax,[eax*2+ecx]
        movzx ecx,byte ptr [edx+3]
        add edx,4
        add ecx,-30h
        jns @mainloop

      @done:       

    ret 4
Axa2l endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef




Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: lingo on August 11, 2010, 04:50:41 PM
I reordered something in my algo and now it works well enough with AMD too. Please, try it to receive the final results. Thank you.
Note:Due to Hutch's testing program is very memory usage sensitive please close all other programs and start the test minimum 3 times. Get the best values. Thanks. :toothy



Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
C:\8>bmu6
172 atou
140 atodL
219 Axa2l
203 atodJJ
171 atou_ex
172 atou_rock
171 atou
141 atodL
218 Axa2l
203 atodJJ
172 atou_ex
156 atou_rock
171 atou
141 atodL
202 Axa2l
188 atodJJ
171 atou_ex
172 atou_rock
234 atou
140 atodL
219 Axa2l
202 atodJJ
188 atou_ex
156 atou_rock

187 ms average atou
140 ms average atodL
199 ms average atodJJ
175 ms average atou_ex
214 ms average Axa2l
164 ms average atou_rock
Press any key to continue ...

AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
C:\8>bmu6
703 atou
437 atodL
812 Axa2l
672 atodJJ
532 atou_ex
484 atou_rock
703 atou
438 atodL
812 Axa2l
672 atodJJ
531 atou_ex
485 atou_rock
703 atou
437 atodL
813 Axa2l
672 atodJJ
531 atou_ex
484 atou_rock
703 atou
438 atodL
812 Axa2l
672 atodJJ
531 atou_ex
485 atou_rock

703 ms average atou
437 ms average atodL
672 ms average atodJJ
531 ms average atou_ex
812 ms average Axa2l
484 ms average atou_rock
Press any key to continue ...

C:\8>asc2bin_testbed2
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
158     cycles for 10*Lingo's replacement
184     cycles for 10*atou
175     cycles for 10*atou_rock
258     cycles for 10*atodJJ

159     cycles for 10*Lingo's replacement
186     cycles for 10*atou
177     cycles for 10*atou_rock
260     cycles for 10*atodJJ

Code size:
118      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---

C:\8>asc2bin_testbed2
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
240     cycles for 10*Lingo's replacement
304     cycles for 10*atou
314     cycles for 10*atou_rock
312     cycles for 10*atodJJ

240     cycles for 10*Lingo's replacement
299     cycles for 10*atou
306     cycles for 10*atou_rock
293     cycles for 10*atodJJ

Code size:
118      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---



Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 11, 2010, 10:49:44 PM
Lingo, you use other version of my proc, see 2 posts about for newer version. Thanks.


719 atou
641 atodL
656 Axa2l
937 atodJJ
672 atou_ex
625 atou_rock
688 atou
625 atodL
625 Axa2l
953 atodJJ
672 atou_ex
625 atou_rock
687 atou
625 atodL
641 Axa2l
906 atodJJ
672 atou_ex
609 atou_rock
688 atou
625 atodL
641 Axa2l
921 atodJJ
672 atou_ex
610 atou_rock


695 ms average atou
629 ms average atodL
929 ms average atodJJ
672 ms average atou_ex
640 ms average Axa2l
617 ms average atou_rock
Press any key to continue ...




Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
491     cycles for 10*Lingo's replacement
489     cycles for 10*atou
458     cycles for 10*atou_rock
882     cycles for 10*atodJJ

487     cycles for 10*Lingo's replacement
490     cycles for 10*atou
458     cycles for 10*atou_rock
888     cycles for 10*atodJJ

Code size:
118      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---


There is best values, for MY old proc, of course  :toothy



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 11, 2010, 11:14:09 PM
Ignore this posting, I made a mess of the algos that should have been used.

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 12, 2010, 12:12:31 AM
AMD Phenom(tm) II X6 1055T Processor
1420 atodw library
406 atou_rock
452 Alex
344 Lingo
405 atou_ex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 12, 2010, 12:22:38 AM
Note there is something wrong with the results in terms of correctness testing in the last benchmark.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 12, 2010, 01:41:38 AM
you are right, read the above posts.  :dazzled:
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 12, 2010, 09:52:12 PM
Hutch, see this new Axa2l algo.
This is true unrolled and data-controlled algo. It work in paradigm of OOP: "Data controls execution process". And it works slow in my machine... So, one more provement of slowness of OOP algos :)

But it interesting, See and test it, please. It check for string length, and return 0 in ecx, if length is wrong (biggest than 10bytes). If ecx not 0 - then length is OK.
I attach archive. This is old-test bed, but you may copy proc to yours newest and improved test-beds.
Interesting, which timings it have on newest CPUs...



Alex
P.S. This algo is just for fun :)
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 13, 2010, 12:35:14 PM
It would be nice to have one where size is known priori, as well as one that terminates on any non-digit rather than on null.

I think in most common large string processing scenarios you are dealing with a large file that just isnt going to have nice convenient nulls in it, so you are either processing it as-is (size not known, but terminator is any non-digit) or have indexed the data already (size is known)
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 13, 2010, 02:48:49 PM
Rockoon,

If all of the data is in the correct form (IE does not need to be parsed from other test) its easy enough to tokenise it in one pass.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: jj2007 on August 13, 2010, 02:51:43 PM
Quote from: Rockoon on August 13, 2010, 12:35:14 PM
It would be nice to have one where size is known priori, as well as one that terminates on any non-digit rather than on null.

"If all of the data is in the correct form..."

That is ambitious. For inspiration, attached a list of examples for valid number formats. MasmBasic reads them all correctly, but it took me a while :green
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 13, 2010, 02:55:48 PM
JJ,

No doubt but the parser is another animal, not just a conversion.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Rockoon on August 13, 2010, 03:20:06 PM
Quote from: hutch-- on August 13, 2010, 02:48:49 PM
Rockoon,

If all of the data is in the correct form (IE does not need to be parsed from other test) its easy enough to tokenise it in one pass.

I am arguing about your "correct form" idea. This idea that the input is "correct form" only when its nicely null terminated is wrong.

I'm thinking along the lines of working with large CSV files (comma separated values) and other large textual data sets (why else would performance be a concern?)

Sure, its easy enough to tokenize it in one extra pass. Its still an extra pass, and one thats particularly memory bandwidth sensitive.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 10:26:04 PM
Hi!

Guys, with yours disputes you forgot to test this "http://www.masm32.com/board/index.php?action=dlattach;topic=14522.0;id=7946" (link to archive). New Axa2l, true unrolled.

This is interesting algo, and very interesting, how it works on different machines.



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 10:35:49 PM
Rockoon, Dave (aka KeepingRealBusy) write one good proc, for conversion hex2dword. In it he use look-up table, with values of converted char if this char is hex, and special "signal" value, if char is not hex. So, code make conversion and checking in one pass - I think, this is same good solution. As you thinks?
For slightly changed version of Dave's proc you may goto "http://www.masm32.com/board/index.php?topic=14540.msg117994#msg117994". In archive - krbhtodw - his proc.



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 13, 2010, 10:39:34 PM
Rockoon,

I agree that if your target is a large CSV file then you need to handle the data conversion in a different manner. I would be looking at a one pass tokeniser that did the conversion on the fly but I would make the comment that an algo that is both a file parser and a conversion utility would need to be timed against a 2 pass system where you tokenise it as fast as you can then convert it sequentially.

The single pass version has advantages in memory bandwidth but it would be a slower algo because of the amount of work it would have to do. Timing both would tel you the difference. I would imagine that in a one pass parser/converter that the conversion algo would have to be short and have a faster attack time than unrolled linear conversions and code cache issues would come into play.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 10:40:54 PM
And, Rockoon, one note more. If compare getting char not with zero, but with 30h, and go to end if value is smallest than this code, then big part of punctuation chars will be processed correctly - as end of string with number.



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 10:42:44 PM
Hi, Hutch!

Test, please, archive in post by 4 (edited) post above.



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 13, 2010, 10:57:58 PM
 :bg

Alex,

I will get there but I am a bit tired of writing test pieces, I only have a million other things to do.
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 11:04:43 PM
Hutch, I understand you.
But you may run already compiled test from archive. This is take ~30secs. This is your old test-bed. I only replace in source my proc, and compile source. So, you no need to writing test pieces.
Only time of working this algo on newest CPUs interesting to me, no "racing" with other's algos.



Alex
P.S. Sorry me, if I'm pester to you!
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: hutch-- on August 13, 2010, 11:13:13 PM
Alex,

This is the result from BM10. Timed on the Core2 3 gig Quad I work on.


188 atou
203 atodL
250 Axa2l
218 atodJJ
204 atou_ex
234 atou_rock
187 atou
204 atodL
250 Axa2l
203 atodJJ
218 atou_ex
235 atou_rock
187 atou
203 atodL
250 Axa2l
219 atodJJ
219 atou_ex
234 atou_rock
188 atou
203 atodL
250 Axa2l
203 atodJJ
172 atou_ex
234 atou_rock


187 ms average atou
203 ms average atodL
210 ms average atodJJ
203 ms average atou_ex
250 ms average Axa2l
234 ms average atou_rock
Press any key to continue ...
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: Antariy on August 13, 2010, 11:27:30 PM
Thanks, Hutch!

So, results no so bad, as I expect :) On my CPU this proc works not well. Because it hard to prediction of CPU - path of execution is depended from data, and this is very not well in speedy programming, contrary to OOP paradigms. This proc implement something like virtual-methods :), and this methods work not fastly.

But, this algo is for fun - it interesting enought, and show some good info about designing. So, it is no bad :)



Alex
Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: mineiro on August 31, 2010, 05:38:30 AM

align 16
mineiro proc String:DWORD

  ; ------------------------------------------------
  ; Convert decimal string into UNSIGNED DWORD value
  ; ------------------------------------------------

xor eax,eax
mov edx,[esp+4] ;coment this you dont use calling convention
movzx ecx,byte ptr [edx]
test ecx,ecx
je @fora
movzx eax,byte ptr [edx+1]
test eax,eax
je @ssub30
lea ecx, [ecx+ecx*4-30h*5]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+2]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+3]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+4]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+5]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+6]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+7]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+8]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
movzx eax, byte ptr [edx+9]
test eax,eax
je @F
lea ecx,[ecx+ecx*4]
lea ecx, [eax+ecx*2-30h]
cmp byte ptr [edx+10],0
je @F
mov eax,0
mov ecx,0
jmp @fora
@@:
mov eax,ecx
mov ecx,1
@fora:
ret 4
@ssub30:
mov eax,ecx
xor eax,30h
ret 4
mineiro   endp

Title: Re: Replacement for atodw and atodw_ex test pieces.
Post by: mineiro on September 01, 2010, 06:25:52 PM
In some procedures in these test return wrong values in ecx or eax.

.data
align dalg
itm1 db "4294967295",0 ;eax==0ffffffffh ecx!=0
align dalg
itm2 db "42949672952",0 ;eax==0 ecx=0
align dalg
itm3 db "0",0 ;eax==0 ecx!=0
align dalg
itm4 db "0000",0 ;eax==0 ecx!=0

regards