Replacement for atodw and atodw_ex test pieces.

Started by hutch--, July 31, 2010, 11:24:17 AM


jj2007

Quote from: lingo on August 07, 2010, 02:29:54 PM
If we have align 16 PLUS db 8 dup(0) after it, the results will be different.

My most sincere apologies, young friend! Indeed, with only 8 bytes more, you gain half a cycle on my modern CPU :U

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's superbloatware
24      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
24      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's superbloatware
23      cycles for atodJJ: improved by JJ

Code size:
52      atodL Lingo
24      atodJJ Jochen

hutch--

I just mercilessly hacked JJ's test bed to try out the changes to atou. I tried an unroll by 2, which brought its size up to a bit bigger than Lingo's algo but still within a reasonable size for a small general-purpose algo, and it's now averaging faster than the long version I have been playing with.
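For readers following along, the pattern is the usual multiply-by-ten loop with the body duplicated so two digits are handled per pass. The sketch below is only an illustration of that idea in the same no-prologue style used elsewhere in this thread, not the actual atou source; the name atou2 and the register choices are mine.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

atou2 proc pstr:DWORD
    mov    edx, [esp+4]              ; pointer to the zero terminated digit string
    xor    eax, eax                  ; running result
  @@:
    movzx  ecx, byte ptr [edx]       ; first digit of the pair
    test   ecx, ecx
    jz     done
    lea    eax, [eax+eax*4]          ; eax = eax * 10 + (ecx - "0")
    lea    eax, [ecx+eax*2-30h]
    movzx  ecx, byte ptr [edx+1]     ; second digit of the pair
    test   ecx, ecx
    jz     done
    lea    eax, [eax+eax*4]
    lea    eax, [ecx+eax*2-30h]
    add    edx, 2
    jmp    @b
  done:
    ret    4                         ; stdcall balance for the single argument
atou2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef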


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
17      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

17      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
44      atodL Lingo
52      atou Hutch

--- ok ---

lingo

 :toothy
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
144     atodL Lingo
52      atou Hutch

--- ok ---

hutch--

Lingo,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch

Code size:
144     atodL Lingo
52      atou Hutch

--- ok ---


Have you got a version of that algo that is an unroll by 2? I have done the same with JJ's, but I don't know enough about yours to unroll it properly. I tried this on the last one but it did not get any faster.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    REPEAT ncnt
      padd
    ENDM

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

db 8 dup(0)                          ; 8 extra bytes after the align (see lingo's earlier note)

atodL proc String:DWORD
    mov    edx, [esp+1*4]            ; String = pointer to the zero terminated digit string
    movzx  eax, byte ptr [edx]       ; first character
    add    edx, 1
    test   eax, eax
    jz     quit                      ; empty string, return 0
    movzx  ecx, byte ptr [edx]       ; preload the second character
    sub    eax, 30h                  ; convert the first character to a digit value
    test   ecx, ecx
    jz     quit                      ; single digit, return it
@@:
; first copy of an unroll by 2, left commented out
;     inc    edx
;     lea    eax, [eax+eax*4]
;     cmp    byte ptr [edx], 0
;     lea    eax, [ecx+eax*2-30h]
;     movzx  ecx, byte ptr [edx]
;     jz quit

    inc    edx
    lea    eax, [eax+eax*4]          ; eax = eax * 5
    cmp    byte ptr [edx], 0         ; peek at the character after the one in ecx
    lea    eax, [ecx+eax*2-30h]      ; eax = eax * 10 + (ecx - "0")
    movzx  ecx, byte ptr [edx]       ; preload the next character
    jnz    @b                        ; loop until the terminator

  quit:
    pop ecx                          ; return address
    pop edx                          ; discard the String argument (stdcall balance)
    jmp ecx                          ; return with the result in eax
atodL   endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
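If anyone wants to poke at it outside the timing loop, calling it is just the normal stdcall invoke; the txt label below is only an illustration.

.data
  txt db "1234567890", 0
.code
  invoke atodL, ADDR txt             ; unsigned result is returned in eax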

Queue

a2b3
AMD Athlon(tm) 4 Processor (SSE1)
31 cycles for atodL: Lingo's replacement
37 cycles for atou: improved by hutch


Kinda sucks that it's 3 times as big.

Queue

hutch--

Here is the latest test bed I have been using for these algos. I just added Lingo's unroll by 4 to it, so I have JJ's and mine as the short versions, and Lingo's and my long one, all timed. I added a suggestion of Dave's to set the affinity mask to stabilise the timings, and this seems to have helped get all of the times closest to their minimum.
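For anyone reproducing the timings, the affinity call is a one liner; something along these lines (a sketch of the idea only, not necessarily Dave's exact suggestion, which may have used the thread-level call instead):

    ; with the usual \masm32\include\windows.inc / kernel32 includes and libs
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1    ; pin the test to CPU 0 so the
                                             ; readings do not wander between cores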


187 atou
203 atodL
235 atodJJ
218 atou_ex
188 atou
203 atodL
219 atodJJ
219 atou_ex
187 atou
203 atodL
219 atodJJ
203 atou_ex
188 atou
203 atodL
218 atodJJ
250 atou_ex


187 ms average atou
203 ms average atodL
222 ms average atodJJ
222 ms average atou_ex
Press any key to continue ...

lingo

Queue,
"Kinda sucks that it's 3 times as big."

144     atodL Lingo

144 is the total length of the new and old versions together, because I was too lazy to delete the old one, sorry. :wink

lingo

It is crazy to write optimized code for different testing programs, because for one algo they compute different results.
So test program writers should obey some standard rules, and/or this site should have just one "best"
and mandatory testing program. Otherwise I have to write one algo for Hutch's testing program, another for
A.Fog's testing program, the next for MichaelW's testing program, the next for the jj testing program, etc... :lol

hutch--

 :bg

It's crazy to write optimised code that only works in one test bed and not in another. I test algos in real time when I need to, as algos run in real time when they are being used.

Show us your test program, Lingo.  :bg

lingo

"that only works in one test bed and not the another"
this is a nonsense.
We optimize algos according the rules in the Intel Optimization manual rather than "rules" of your testing program.
Where are your "rules" because I can't see them? :lol

"I test algos in real time"
That is just your opinion, but without rules...
What about others' testing programs?
For example: try to optimize an algo to be fastest, with the same times, in both your and jj's testing programs. :lol

hutch--

 :bg

There is fantasy involved here; you optimise code to the CLOCK. It either works or it does not, and that is what you can get from the manual.

Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not to produce the faster algorithm.

If a piece of code is only a superstar in the test bed and not in general-purpose use, then it's useless.

Rockoon

I think that lingo doesn't understand a principle of profiling and optimization.

The only profiling that matters is profiling of real programs. Real production code. The purpose of testbeds (plural) is just to get a general idea of the issues you will face in those aforementioned real production programs. If you restrict yourself to a single testbed then you are just specializing, not using testbeds for the purpose that they are supposed to serve.


dedndave

it is hard to optimize code while running on a specific processor
in fact, i am P4-bound, so if i speed up a routine on my machine,
it is not likely to be optimal on any of the newer processors   :P
as time goes by, it hardly makes sense to use a P4 machine for writing code, although it's nice to test it on one   :bg

hmmm - maybe i can use that as an excuse, of some sort - lol

lingo

"Your dependence on one format of test bed is based on "gaming" the test bed to get results in that test bed, not produce the faster algorithm.
If a piece of code is only a superstar in the test bed and not in general purpose, then its useless."


All this is nonsense again. I will try to explain the latest case to you with YOUR SUPERSTAR ALGO CODE (I like it).
YOUR ALGO CODE from YOUR a2b2.zip file, measured with the jj testing program which YOU use inside it, finished in 12 clocks. It is unrolled just 2 times.
If you unroll it 4 times, it will finish in 12 clocks again in the jj testing program. Because you know the difference between your testing program and the jj testing program (TP), you stopped using the jj TP and started using your own test program, bm4.asm. In it you unroll YOUR ALGO CODE 4 times and it becomes the fastest (but only in your testing program).
Of course, in it you included others' algos which are unrolled only 2 times, but that is another story (bad gaming).
So, YOUR ALGO CODE unrolled 4 times runs differently in your test program and in the jj test program.
If YOUR ALGO CODE is a superstar, try it in the jj test program against my code to see the results. You can unroll it many times, until your face becomes red, but your end result will still be 12 clocks or more... so put your code in the recycle bin, because according to you: :lol
"If a piece of code is only a superstar in the test bed and not in general-purpose use, then it's useless."


"There is fantasy involved here"
Everyone can investigate this to see who is wrong...  :lol

I will repeat my offer from my previous post again and again, because I still have no answer.
For example: try to optimize an algo to be fastest, with the same times, in both your and jj's testing programs.   :lol


About "gaming"...
You are the "gamer", because you use your testing program to get better results. Try to get them in the jj test program... :lol

Rockoon,
"I think that lingo doesn't understand a principle of profiling and optimization."
What you think and why is not important. It would be better to post your code rather than blah, blah, blah...

Dave,
We are not talking about different processors. We are talking about different testing programs with different testing results for the same algo and CPU.
You can test them with Hutch's superstar code... :lol
Or maybe I have to write "My algo is fastest with the jj test program, Hutch's algo is fastest with his test program"... :lol


Rockoon

You are failing to realize the immense mistake in your thinking. There is no such thing as the fastest code.

Testbeds don't tell you what the fastest code is. There is no fastest code. Testbeds only allow you to compare code within the testbed, and can only tell you which one is faster within the testbed. You still cannot determine if you have found the fastest code even in terms of that testbed, let alone a real production product.

Testbeds are not marketable products. Users do not clamor for testbeds. Nobody is saying "Gee, I wish I had JJ's testbed" or "Gee, I wish I had Hutch's testbed," because testbeds are USELESS things that serve no purpose other than investigation by people writing code.

What production products is this function a bottleneck in? HMM? HMMMMM??? Once you determine that, THEN you have some RELEVANT testbeds: THE PRODUCTION PRODUCTS.
