Replacement for atodw and atodw_ex test pieces.

Started by hutch--, July 31, 2010, 11:24:17 AM

Antariy

Celeron D Prescott test

atodw Version

4294967295
987
0
9876

Short Version

4294967295
987
0
9876

Long Version

4294967295
987
0
9876
-------
Timings
-------

Timing atodw version
562

Timing short version
438

Timing long version
391

Press any key to continue ...




Alex

hutch--

KeepingRealBusy,

This pair of algos was posted in the Lab to be tested. While you have perfectly valid questions about some of the issues raised, this is not the place for them or for the other unrelated questions in the previous post.

RE: the design choice of 2 algos: it is simple enough to provide the range so that someone who wants to do an occasional conversion can use the shorter version, while someone who wants to stream data of this type can use the faster but larger one. It's the choice of the user that is being catered for here, not a theory about library design.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

KeepingRealBusy

Excuse me,

Quote
The Laboratory
Algorithm and code design research laboratory. This is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful. Feel free to explain the optimisation methods used so that everyone can get a feel for the code design.

By your own words.

If this is not the place, then where is the place?

Dave.

hutch--

Try the Campus. If you have an algo, post it in here.

Antariy

Hutch

It seems that this proc has a bottleneck on the PIV.
By my test, if you change this:

    movzx eax, BYTE PTR [edx]
    test eax, eax
    jz quit

    lea ecx, [ecx+ecx*4]
    lea ecx, [eax+ecx*2-48]
    movzx eax, BYTE PTR [edx+1]
    test eax, eax
    jz quit


to:


    movzx eax, BYTE PTR [edx]
    add eax,-48 ; <--- this
    js quit ; <--- this

    lea ecx, [ecx+ecx*4]
    lea ecx, [eax+ecx*2] ; <--- this, with the subtraction removed from the LEA it works faster
    movzx eax, BYTE PTR [edx+1]
    add eax,-48 ; <--- this
    js quit ; <--- this

..... etc .....


On my system the speed increases by ~14%.
Try this; it may have an advantage on other hardware too.
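
A minimal sketch of the same idea as a rolled-up loop, for anyone who wants to try it outside the unrolled test piece. The name atou_sketch, the test string and the surrounding program are assumptions for illustration, not code from the attachment. It assumes the MASM32 SDK is installed in \masm32 and that the string holds only the digits '0' to '9' followed by a zero terminator.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 runtime: model, includes, macros

.data
  numstr db "9876",0                    ; hypothetical test input

.code

; atou_sketch is a hypothetical name, not part of the MASM32 library.
; Input:  [esp+4] = address of a zero terminated string of decimal digits
; Output: eax     = converted value
atou_sketch proc
    mov   edx, [esp+4]                  ; edx = address of the string
    xor   ecx, ecx                      ; ecx = running result
  nxt:
    movzx eax, BYTE PTR [edx]           ; load the next character
    add   eax, -48                      ; ASCII to value, sign flag set if char < '0'
    js    quit                          ; the terminator (0 - 48 < 0) ends the loop
    lea   ecx, [ecx+ecx*4]              ; ecx = ecx * 5
    lea   ecx, [eax+ecx*2]              ; ecx = ecx * 10 + digit, no -48 inside the LEA
    add   edx, 1                        ; advance to the next character
    jmp   nxt
  quit:
    mov   eax, ecx                      ; return the result in eax
    ret   4                             ; remove the one DWORD argument
atou_sketch endp

start:
    push  OFFSET numstr
    call  atou_sketch
    print str$(eax),13,10               ; display the result (9876 for this input)
    invoke ExitProcess, 0
end start
```

Note that with add/js any byte below '0' ends the conversion, not only the zero terminator, and bytes above '9' are not rejected.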


Alex

mineiro

Pentium Dual 1.8 GHz
Timing atodw version
547
Timing short version
406
Timing long version
187

dedndave

sorry, but those numbers mean nothing without code to go with them   :P

hutch--

Dave,

they are just the results from the attachment in the first post.

lingo

Your Timing short version
265
My Timing short version
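140

align algn

atodL proc String:DWORD
    mov    edx, [esp+1*4]           ; edx = address of the string
    movzx  eax, byte ptr [edx]      ; eax = first character
    add    edx, 1
    test   eax, eax
    jz    quit                      ; empty string: return 0
    movzx  ecx, byte ptr [edx]      ; ecx = second character
    sub    eax, 30h                 ; first digit: ASCII to value
    test   ecx, ecx
    jz    quit                      ; single digit: done
@@:
    inc    edx
    lea    eax, [eax+eax*4]         ; eax = eax * 5
    cmp    byte ptr [edx], 0        ; sets the flags for jnz: is the next character the terminator?
    lea    eax, [ecx+eax*2-30h]     ; eax = eax * 10 + digit in ecx (lea leaves the flags intact)
    movzx  ecx, byte ptr [edx]      ; preload the next character
    jnz    @b
  quit:
    pop ecx                         ; return address
    pop edx                         ; discard the argument
    jmp ecx                         ; return to the caller
atodL   endp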


jj2007

What I like most about Lingo's code is that it can always be improved :green2

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen



hutch--

 :bg


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
17      cycles for atodL: Lingo's bloatware
16      cycles for atodJJ: improved by JJ

17      cycles for atodL: Lingo's bloatware
16      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen

--- ok ---

ecube


AMD Athlon(tm) 64 Processor 3000+ (SSE3)
24      cycles for atodL: Lingo's bloatware
26      cycles for atodJJ: improved by JJ

22      cycles for atodL: Lingo's bloatware
26      cycles for atodJJ: improved by JJ

Code size:
44      atodL Lingo
24      atodJJ Jochen

--- ok ---


2-4 cycles slower, nice improvement  :tdown

hutch--

I put the new algos into a real-time test bed and the results are very different. I played with this one for a while to ensure that the comparisons were reasonable. Among the short algos Lingo's is easily the fastest and JJ's is the smallest, and the timings all wander depending on code placement and algorithm alignment.


641 atou
250 atodL
593 atodJJ
235 atou_ex
640 atou
250 atodL
594 atodJJ
188 atou_ex
640 atou
250 atodL
594 atodJJ
203 atou_ex
641 atou
250 atodL
594 atodJJ
234 atou_ex


640 ms average atou
250 ms average atodL
593 ms average atodJJ
215 ms average atou_ex
Press any key to continue ...



lingo

"What I like most about Lingo's code is that it can always be improved
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

24      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ"



My algo was written for Hutch's test program, which is why my original code is preceded by: align algn
-   If we try my original algo vs "improved by JJ" in Hutch's test program,
the result is Timing short version 140 vs Timing short version 359 (on my CPU).

The thief is always a liar, but ours is stupid too, because he doesn't understand the details of the algo (lack of A. Fog)... Why?
-   In his test program he transferred my algo with align 16 ONLY! Hence my loop label is no longer aligned properly....
With align 16 PLUS db 8 dup(0) after it, the results will be different.
-   In my test versions I have similar short code, but it is slower (the "bigger is faster" law). Why?
One reason is that on the first (empty) pass through the loop, when eax=0, JJ's "improved" algo spends time on nothing (we have some empty instructions like  lea    eax, [eax+eax*4] and   lea  eax, [ecx+eax*2-30h])
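
The padding trick can be sketched as below. The byte counts in the comments are assumptions based on the usual 32-bit encodings with short jumps (24 bytes from the entry point to the loop label, which is consistent with the 44 byte size reported for atodL in this thread); check a listing file before relying on them. OPTION PROLOGUE:NONE, the test string and the surrounding program are also assumptions for illustration, and the sketch assumes the MASM32 SDK is installed in \masm32.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 runtime: model, includes, macros

.data
  numstr db "9876",0                    ; hypothetical test input

.code

OPTION PROLOGUE:NONE                    ; no stack frame: the proc reads its argument via esp
OPTION EPILOGUE:NONE

align 16                                ; the next byte sits on a 16 byte boundary
db 8 dup(0)                             ; 8 bytes of padding: entry point at boundary + 8

atodL proc String:DWORD
    mov    edx, [esp+1*4]               ; 4 bytes
    movzx  eax, byte ptr [edx]          ; 3
    add    edx, 1                       ; 3
    test   eax, eax                     ; 2
    jz     quit                         ; 2 (short jump)
    movzx  ecx, byte ptr [edx]          ; 3
    sub    eax, 30h                     ; 3
    test   ecx, ecx                     ; 2
    jz     quit                         ; 2, which makes 24 bytes so far
@@:                                     ; 8 + 24 = 32: the loop label is on a 16 byte boundary
    inc    edx
    lea    eax, [eax+eax*4]
    cmp    byte ptr [edx], 0
    lea    eax, [ecx+eax*2-30h]
    movzx  ecx, byte ptr [edx]
    jnz    @b
  quit:
    pop    ecx                          ; return address
    pop    edx                          ; discard the argument
    jmp    ecx                          ; return to the caller
atodL endp

OPTION PROLOGUE:PrologueDef             ; restore the default prologue and epilogue
OPTION EPILOGUE:EpilogueDef

start:
    invoke atodL, ADDR numstr
    print str$(eax),13,10               ; display the converted value
    invoke ExitProcess, 0
end start
```

With align 16 alone the entry point is on the boundary and the loop label falls 24 bytes later, which is 8 bytes past a boundary. The 8 bytes of padding move it to the next boundary.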

So, "Lingo's code can always be improved".
I agree, but not by everyone... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
17      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

17      cycles for atodL: Lingo's bloatware
23      cycles for atodJJ: improved by JJ

Code size:
52      atodL Lingo
24      atodJJ Jochen

--- ok ---