Replacement for atodw and atodw_ex test pieces.

Started by hutch--, July 31, 2010, 11:24:17 AM

lingo


Rockoon

Do you think that asking for a code handout makes you right?

It doesn't. It makes you just as wrong as before you asked.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

lingo

"It makes you just as wrong as before you asked."
Why am I wrong or guilty? Why? The SUPERSTAR ALGO CODE is from Hutch,
the superstar "real time" testing program is from Hutch too, and the other testing programs are from JJ or from MichaelW.
It is like measuring voltage with two voltmeters: one reads 24 V, the other 33 V from the same source. Which one is wrong?
Am I guilty because I found different results between these testing programs for the SUPERSTAR ALGO CODE from hutch and asked about it? Or should I close my eyes and say that everything is always super and there are no differences?

"You are failing to realize the immense mistake in your thinking."
I think about how to improve some algos rather than how to "properly" use the superstar "real time" testing program or other tools. So, when I see a problem with the tools I just ask their creators about it.

"Do you think that asking for a code..."
If you have some optimized code, my advice is to test it with these two testing programs, and after that we can continue our very interesting conversation. :(

Rockoon

How are you wrong?

Let's see...

"It is crazy to write optimized code for different testing programs due to for one algo they compute different results."

That's wrong. It's not "crazy" to write multiple versions of an algorithm, all "better" in different cases. It's important that different cases have different optimization criteria.

"So, test programs writers should obey some standard rules and/or for this site must have just one "the best" and mandatory testing program."

That's wrong. There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm.

"Otherwise I have to write one algo for Hutch's testing program other for A.Fog's testing program, next for  MichaelW's testing program, next for jj testing program, etc..."

That's wrong. You don't "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances.


jj2007

Quote from: Rockoon on August 08, 2010, 06:00:50 PM
Testbeds don't tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed.

Rockoon,

The algos posted here for teasing Lingo are not the best example. Their results differ by CPU because they are the fastest code for a particular CPU: 20 cycles, not a single one less, because that CPU has only this limited set of instructions and this limited set of possible combinations. In this sense, yes, "there is no fastest code" that fits all CPUs, but there is one for each CPU. And of course the algos can behave differently in real life apps.

However, the majority of threads in the Lab start with the idea "hey, we could speed up lstrwhatever a little bit". And wow, you get a factor 10 with some simple tricks from Agner's or Lingo's or NightWare's code kitchen. On all CPUs, in all real life apps. That is where testbeds have a positive role. It turns into the absurd when later on we start squeezing out the last cycle and discover that we are running different hardware. That is less useful but it is the fun part :bg

Antariy


Hutch,
add this code to your test:


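; Assumption: the automatic stack frame is switched off, so that
; [esp+4] addresses lpszStr and "ret 4" is the complete epilogue.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16
Axa2l proc STDCALL lpszStr:DWORD
    mov edx,[esp+4]             ; edx -> string

    xor eax,eax                 ; result = 0
    xor ecx,ecx ; try commenting this out and adding two nops before the proc
    movzx ecx,byte ptr [edx]    ; first character

    add ecx,-30h                ; character - "0"
    js @done                    ; below "0" (e.g. the terminator): finished

  @mainloop:
    lea eax,[eax+eax*4]         ; eax = eax*5
    lea eax,[eax*2+ecx]         ; eax = eax*10 + digit
    movzx ecx,byte ptr [edx+1]  ; next character
    add edx,1 ; try changing this to "inc edx" (to make the main loop 16 bytes long)
    add ecx,-30h
    jns @mainloop

  @done:
    ret 4
Axa2l endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef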


I wrote some comments that are useful for playing with it.


Alex
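
For readers who want to follow the logic of Axa2l without an assembler at hand, here is a rough C rendering. It is only a sketch: the name axa2l_c is hypothetical, and it mirrors the behaviour visible in the listing, including the fact that the loop ends at the first byte below "0" and does not reject bytes above "9".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of Axa2l. Stops at the first byte below '0'
   (which includes the terminating NUL); bytes above '9' are not
   rejected, just as in the assembly listing. */
static unsigned int axa2l_c(const char *s)
{
    unsigned int value = 0;
    int digit;

    assert(s != NULL);

    digit = (unsigned char)*s - '0';        /* add ecx,-30h */
    while (digit >= 0) {                    /* js / jns test the sign */
        value = value * 5u;                 /* lea eax,[eax+eax*4] */
        value = value * 2u + (unsigned int)digit; /* lea eax,[eax*2+ecx] */
        ++s;                                /* add edx,1 */
        digit = (unsigned char)*s - '0';    /* next character */
    }
    return value;
}
```

Calling axa2l_c("12345") returns 12345, and an empty string returns 0. The two lea instructions together form the multiply by ten, which is why the C version splits it into a times-five and a times-two step.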

lingo

Let's see.... again blah bla bla... without any value in practice, without any code example or/and sources of your "knowledge"

"It's not "crazy" to write multiple versions of an algorithm, all "better" in different cases."
and
"It's important that different cases have different optimization criteria."
and
"Its crazy to write optimized code that only works in one test bed and not the another" by Hutch

For me All OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.
If you or Hutch know more sources of optimization criteria, please let us know and write some CODE EXAMPLES to show how to do that...

"There are no standard rules that will apply successfully to arbitrary real world applications of the algorithm."
Again: they are in the INTEL (AMD) Optimization Reference Manual and some tips in the help file of VTune soft performance analyzer

"You don't "have" to write multiple algorithms. You *get* to write multiple algorithms and *get* to have superior performance in multiple circumstances."
Try to do this with some CODE EXAMPLES to show how to do that in practice...

And please stop filling this thread with empty words, blah, bla, bla, bla, and wasting our time reading them... Your code examples or nothing... :(

You can start with your version of the atou algo too... or teach Hutch how to make his atou faster with JJ's testing program... :lol

Rockoon

Quote from: lingo on August 08, 2010, 11:15:52 PM
For me All OPTIMIZATION criteria and rules are in the INTEL (AMD) Optimization Reference Manual.

No, not all of them. Tell me, what do those optimization manuals say about the wildly differing performance of your routine in the other thread, which is either great or really crappy depending on what other code is also included in the program?

Quote from: lingo on August 08, 2010, 11:15:52 PM
Again: they are in the INTEL (AMD) Optimization Reference Manual and some tips in the help file of VTune soft performance analyzer

You requested a standard test harness. I was writing about a standard test harness and the flaw in the idea. What you are writing here has nothing to do with that. Why don't you address the topic you brought up?

You seem to think that usage patterns don't matter.

hutch--

Lingo,

I know you can do better than this. I have had my misgivings about the popular test bed that has been used recently, due to granularity with very low result counts, so I have done the obvious: I have made a test bed that times a number of algorithms in real time, runs longer and can be made to run longer again.

Now your theory is flawed in that by your own admission you want to write a different algo for each different test bed rather than trying to make the algo faster over the range of conditions that it is designed to operate under. Specialisation on one processor and one test bed is a mistake as they are both variables; different hardware produces different results and different test beds serve different purposes.

The common test bed is based off Michael's timing macros and while the technique is well suited for testing short sections of code that are almost impossible to test in other ways, the technique becomes less useful on complete algorithms that have a very low timing count.

Long ago I learnt that you tailor a test bed to the task you are performing, and the task changes from one algorithm to another. If I am testing an algo that needs a large piece of data I build a test bed that has data on this scale; I have commonly tested algos on a gigabyte of data if that form of streaming is where it will be used. At the other end, if the algo is very short and its "take off" time is the critical factor, I write a high loop count, short data source test bed to test that feature.

The current test bed for these algos is adjustable in real time, uses variable length data input, variable duration based off the loop count and variable inter-algo padding to test code isolation, and is controlled to run on a single core. Each timing is isolated in real time using SleepEx(), and the instruction queue is flushed with a serialising instruction before each timing.
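
The core idea of a loop-count-driven timing harness like the one described above can be sketched in portable C. This is an illustration under assumptions, not the actual test bed: the names atou_fn, atou_ref and time_algo are hypothetical, clock() measures processor time rather than real time, and the single-core pinning, SleepEx() isolation, inter-algo padding and serialising instruction are platform-specific and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by the conversion routines being compared. */
typedef unsigned int (*atou_fn)(const char *);

/* Plain reference conversion, used here only as something to time. */
static unsigned int atou_ref(const char *s)
{
    unsigned int v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10u + (unsigned int)(*s++ - '0');
    return v;
}

/* Runs `loops` passes over `count` input strings and returns the
   elapsed processor time in milliseconds. The sum of all results is
   written to *sink so the timed work has an observable result. */
static double time_algo(atou_fn fn, const char *const *inputs, size_t count,
                        unsigned long loops, unsigned int *sink)
{
    volatile unsigned int sum = 0;
    clock_t start, stop;
    unsigned long i;
    size_t j;

    assert(fn != NULL && inputs != NULL && sink != NULL && count > 0);

    start = clock();
    for (i = 0; i < loops; ++i)
        for (j = 0; j < count; ++j)
            sum += fn(inputs[j]);
    stop = clock();

    *sink = sum;
    return 1000.0 * (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

Raising `loops` lengthens the run in the same spirit as the adjustable duration, and varying the strings in `inputs` stands in for the variable length data input.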

The criterion for an algorithm is not whether it uses instructions from either the Intel or AMD optimisation manual; it's how fast it is in the conditions where it will be used, and this is a variable depending on the processor it is used on.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

Alex,

I added your algo to the test bed. As a short algo it was slow and messed up the timings of the other algos, but I unrolled it by 2 and its time came down to near the rest. Even with the layout of this testbed, algo placement affects the timing of other algos, so I have ordered them to try and get the best timings for each algo, but note that there are fluctuations in the timing of all of them.
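
The unrolled version itself is not posted in the thread, so the following is only a guess at what "unrolled by 2" means for this kind of loop, written in C rather than MASM. The name axa2l_unrolled2 is hypothetical. Two digits are handled per pass, so the loop branch is taken half as often.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical digit loop unrolled by 2. Like the original routine it
   stops at the first byte below '0', so it never reads past the
   terminating NUL. */
static unsigned int axa2l_unrolled2(const char *s)
{
    unsigned int value = 0;
    int digit;

    assert(s != NULL);

    for (;;) {
        digit = (unsigned char)s[0] - '0';  /* first digit of the pair */
        if (digit < 0)
            break;
        value = value * 10u + (unsigned int)digit;

        digit = (unsigned char)s[1] - '0';  /* second digit of the pair */
        if (digit < 0)
            break;
        value = value * 10u + (unsigned int)digit;

        s += 2;                             /* one pointer update per pair */
    }
    return value;
}
```

Both odd and even digit counts are covered, because each half of the pair has its own exit test.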


172 atou
203 atodL
203 Axa2l
203 atodJJ
219 atou_ex
203 atou
203 atodL
235 Axa2l
218 atodJJ
235 atou_ex
203 atou
203 atodL
203 Axa2l
203 atodJJ
250 atou_ex
219 atou
203 atodL
281 Axa2l
204 atodJJ
234 atou_ex


199 ms average atou
203 ms average atodL
207 ms average atodJJ
234 ms average atou_ex
230 ms average Axa2l
Press any key to continue ...

sinsi


    Core MACRO corenum
      mov eax, 1
      .if eax == 1

Always core 1  :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave


lingo

"I know you can do better than this"

Or maybe I have to write "My algo is fastest with JJ's test program, Hutch's algo is fastest with his test program"?
No, it should be "My algo is fastest with JJ's testing program and with Hutch's testing program"  :lol

You are like Rockoon, who has a big need for empty speech... but...
Since I have a Master's Degree in Electrical Engineering, to argue with me, theoretically, you should have a Ph.D. at least... :lol
So, try to concentrate on finding a faster solution for your atou - in other words, catch me if you can... :lol
New results from your test program:
C:\7>bm6
171 atou
156 atodL ->Lingo
187 Axa2l
203 atodJJ
234 atou_ex
172 atou
156 atodL ->Lingo 
187 Axa2l
187 atodJJ
172 atou_ex
171 atou
156 atodL  ->Lingo
187 Axa2l
203 atodJJ
187 atou_ex
188 atou
156 atodL  ->Lingo
187 Axa2l
187 atodJJ
203 atou_ex


175 ms average atou
156 ms average atodL ->Lingo
195 ms average atodJJ
199 ms average atou_ex
187 ms average Axa2l
Press any key to continue ...



hutch--

 :bg

No great need, I am pleased you are catching up instead of talking.

Timing is on a 3 gig Core2 quad with 1333 memory.

172 atou
172 atodL
203 Axa2l
203 atodJJ
172 atou_ex
172 atou
171 atodL
204 Axa2l
203 atodJJ
218 atou_ex
172 atou
172 atodL
203 Axa2l
219 atodJJ
219 atou_ex
172 atou
172 atodL
203 Axa2l
203 atodJJ
187 atou_ex


172 ms average atou
171 ms average atodL
207 ms average atodJJ
199 ms average atou_ex
203 ms average Axa2l
Press any key to continue ...

lingo

"I am pleased you are catching up"

But I'm not, due to the big differences between the values for your algos atou and atou_ex in your last and previous results... Somebody else, pls..
prev. last
172  172 atou
     172 atodL
     203 Axa2l
     203 atodJJ
219  172 atou_ex
203  172 atou
     171 atodL
     204 Axa2l
     203 atodJJ
235  218 atou_ex
203  172 atou
     172 atodL
     203 Axa2l
     219 atodJJ
250  219 atou_ex
219  172 atou
     172 atodL
     203 Axa2l
     203 atodJJ
234  187 atou_ex


199  172 ms average atou
     171 ms average atodL
     207 ms average atodJJ
234  199 ms average atou_ex
     203 ms average Axa2l
Press any key to continue ...
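
To put a number on the run-to-run differences in a table like the one above, the minimum, maximum and mean of each algo's timings can be computed with a small helper. This is a sketch with hypothetical names (timing_summary, summarize). The truncating integer mean is an assumption that happens to be consistent with the 199 and 234 averages in the "prev." column.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest, largest and mean timing of one algo, all in milliseconds. */
struct timing_summary {
    unsigned int min;
    unsigned int max;
    unsigned int mean;
};

/* Summarises `count` timings; the mean uses truncating integer division. */
static struct timing_summary summarize(const unsigned int *ms, size_t count)
{
    struct timing_summary s;
    unsigned long total = 0;
    size_t i;

    assert(ms != NULL && count > 0);

    s.min = s.max = ms[0];
    for (i = 0; i < count; ++i) {
        if (ms[i] < s.min)
            s.min = ms[i];
        if (ms[i] > s.max)
            s.max = ms[i];
        total += ms[i];
    }
    s.mean = (unsigned int)(total / count);
    return s;
}
```

For the "prev." atou_ex runs (219, 235, 250, 234) this gives min 219, max 250 and mean 234, a spread of 31 ms within a single session.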