Replacement for atodw and atodw_ex test pieces.

Started by hutch--, July 31, 2010, 11:24:17 AM

Rockoon

Quote from: jj2007 on August 08, 2010, 09:22:20 PM
In this sense, yes, "there is no fastest code" that fits all CPUs, but there is one for each CPU. And of course the algos can behave differently in real-life apps.

The term TANSTATFC, "There Ain't No Such Thing As The Fastest Code", has nothing to do with differing architectures. It's the acceptance that there is a limitless number of ways to solve a problem, that you will never be able to try them all, and that the chance you have found the fastest way is therefore vanishingly small. Further still, you accept that "the algorithm" is the entire program, not just some subset of it.

The only reason I tackled this problem is because I knew that your claim of a 20 cycle limit was horseshit.


align 16
atou_rock proc String:DWORD

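    mov ecx, [esp + 4]              ; ecx = String (read before the push, so still at esp+4)
    push ebx

    ; ---

    movzx eax, byte ptr [ecx + 0]   ; first character
    test eax, eax
    jz @@done                       ; empty string: return 0

    sub eax, 48                     ; ASCII digit -> value

    movzx edx, byte ptr [ecx + 1]
    test edx, edx
    jz @@done

    movzx ebx, byte ptr [ecx + 2]   ; load the next character early
    lea eax, [4*eax + eax]          ; eax = eax * 5
    test ebx, ebx                   ; sets ZF for the jz below; lea leaves flags untouched
    lea eax, [2*eax + edx - 48]     ; eax = (old eax) * 10 + (edx - '0')
    jz @@done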

    movzx edx, byte ptr [ecx + 3]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 4]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 5]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 6]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 7]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    movzx ebx, byte ptr [ecx + 8]
    lea eax, [4*eax + eax]
    test ebx, ebx
    lea eax, [2*eax + edx - 48]
    jz @@done

    movzx edx, byte ptr [ecx + 9]
    lea eax, [4*eax + eax]
    test edx, edx
    lea eax, [2*eax + ebx - 48]
    jz @@done

    lea eax, [4*eax + eax]
    lea eax, [2*eax + edx - 48]

@@done:
    pop ebx
    ret 4

atou_rock endp
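
For readers who do not follow the lea pairs at a glance, here is a rough C rendering of the same arithmetic. It is only a sketch written for this note, not Rockoon's code: the name atou_sketch is made up, and it assumes the same contract the assembly appears to rely on (a zero-terminated string of at most ten decimal digits, no validation, 32-bit wraparound on overflow).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the unrolled digit accumulation.
   Assumes: zero-terminated input, decimal digits only, no validation. */
uint32_t atou_sketch(const char *s)
{
    uint32_t acc = 0;

    assert(s != NULL);

    /* The assembly unrolls ten loads by hand; a bounded loop stands in here. */
    for (int i = 0; i < 10 && s[i] != '\0'; ++i) {
        /* lea eax, [4*eax + eax]        -> acc * 5 */
        uint32_t times5 = 4u * acc + acc;
        /* lea eax, [2*eax + digit - 48] -> acc * 10 + (digit - '0') */
        acc = 2u * times5 + (uint32_t)(unsigned char)s[i] - 48u;
    }
    return acc;
}
```

With that definition, atou_sketch("42") gives 42 and atou_sketch("4294967295") gives 4294967295. Characters past the tenth are ignored, matching the ten loads in the assembly.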


280 ms average atou
499 ms average atodL
265 ms average atodJJ
265 ms average atou_ex
292 ms average Axa2l
218 ms average atou_rock

Phenom II x6 1055T

Lingo's previous versions performed much better (within a point or two of atodJJ).

Edited: Made a small change and included the project files.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

hutch--

The short version "atou" has been faster than the longer version ever since I unrolled it by 2. What is the big deal?

I unrolled both JJ's and Alex's to do a fair comparison; I left it up to you to do an unroll by 2 on your own.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

Just goes to show it's all a waste of time wankfest  :bdg

245 ms average atou
218 ms average atodL
257 ms average atodJJ
277 ms average atou_ex
257 ms average Axa2l
276 ms average atou_rock

Q6600 2.4GHz
Light travels faster than sound, that's why some people seem bright until you hear them.

hutch--

Here is the good news and the bad news. I added a code loop between each test to load the processor core between tests; that slowed down my atou algo on the Core2 quad and made Lingo's faster. The bad news is it's the only core that it's faster on. Here are the 5 boxes I have handy. The unrolled algo clearly has the legs on the i7, Lingo has it on the Core2 quad, Alex has it on the Prescott P4, Lingo has it on the Northwood by a tiny amount and the unrolled version has it on the ancient Celeron. Kruel, ain't it.


2.8 gig i7 quad
171 ms average atou
171 ms average atodL
202 ms average atodJJ
156 ms average atou_ex  *
191 ms average Axa2l

3 gig Core2 Quad
191 ms average atou
171 ms average atodL    *
227 ms average atodJJ
191 ms average atou_ex
215 ms average Axa2l

3.8 gig Prescott P4
410 ms average atou
386 ms average atodL
511 ms average atodJJ
363 ms average atou_ex
355 ms average Axa2l    *

2.8 gig Northwood P4
633 ms average atou
566 ms average atodL    *
719 ms average atodJJ
570 ms average atou_ex
593 ms average Axa2l

1.2 gig Celeron
1071 ms average atou
1171 ms average atodL
966 ms average atodJJ
871 ms average atou_ex  *
1041 ms average Axa2l
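
The "code loop between each test" idea can be sketched in C roughly as follows. This is not hutch's test bed, which is MASM and is not shown in the thread. Every name here (load_core, time_ms, bench_all, atou_ref) is hypothetical, the spin count is arbitrary, and clock() is a much coarser timer than a real benchmark would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*atou_fn)(const char *);

/* Plain reference conversion, used only as something to time. */
uint32_t atou_ref(const char *s)
{
    uint32_t acc = 0;
    while (*s != '\0')
        acc = acc * 10u + (uint32_t)(*s++ - '0');
    return acc;
}

static volatile uint32_t sink;  /* keeps the busy loop from being optimized away */

/* Busy work run between timed tests so each test starts on a loaded core. */
void load_core(uint32_t spins)
{
    uint32_t x = 1;
    for (uint32_t i = 0; i < spins; ++i)
        x = x * 1664525u + 1013904223u;
    sink = x;
}

/* Time reps calls of fn on text; returns elapsed CPU time in milliseconds.
   The checksum is written out so the calls cannot be dropped entirely. */
double time_ms(atou_fn fn, const char *text, uint32_t reps, uint32_t *checksum)
{
    uint32_t sum = 0;

    assert(fn != NULL && text != NULL && checksum != NULL);

    clock_t t0 = clock();
    for (uint32_t i = 0; i < reps; ++i)
        sum += fn(text);
    clock_t t1 = clock();

    *checksum = sum;
    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
}

/* Run every candidate, loading the core before each timed run. */
void bench_all(const atou_fn *fns, size_t n, const char *text,
               uint32_t reps, double *out_ms)
{
    for (size_t i = 0; i < n; ++i) {
        uint32_t checksum;
        load_core(100000u);  /* arbitrary amount of filler work */
        out_ms[i] = time_ms(fns[i], text, reps, &checksum);
    }
}
```

The point of the filler loop is that no algorithm is timed straight after another one has warmed things up in its favour. Whether that makes the numbers more representative is exactly what the tables above are arguing about.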


Sinsi,

Digital manual self delusion may be misconstrued as something else.  :P

dedndave

well - run that on a couple fairly new AMD machines
then, add all the times together and see who is fastest overall - lol
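
As a worked example of "add all the times together", here is a small C sketch that sums the five tables hutch posted above. No AMD machines are included because none appear in those tables. The function names are made up for the illustration, and a raw sum weights the slow boxes (the Celeron especially) far more than the fast ones.

```c
#include <assert.h>
#include <stddef.h>

enum { N_ALGOS = 5, N_BOXES = 5 };

static const char *const algo_names[N_ALGOS] = {
    "atou", "atodL", "atodJJ", "atou_ex", "Axa2l"
};

/* ms averages copied from hutch's post.
   Rows: i7, Core2 Quad, Prescott P4, Northwood P4, Celeron. */
static const unsigned ms[N_BOXES][N_ALGOS] = {
    {  171,  171, 202, 156,  191 },
    {  191,  171, 227, 191,  215 },
    {  410,  386, 511, 363,  355 },
    {  633,  566, 719, 570,  593 },
    { 1071, 1171, 966, 871, 1041 },
};

const char *algo_name(size_t algo)
{
    assert(algo < N_ALGOS);
    return algo_names[algo];
}

/* Sum of one algorithm's averages across all five machines. */
unsigned total_ms(size_t algo)
{
    unsigned sum = 0;

    assert(algo < N_ALGOS);
    for (size_t b = 0; b < N_BOXES; ++b)
        sum += ms[b][algo];
    return sum;
}

/* Index of the algorithm with the smallest total. */
size_t fastest_overall(void)
{
    size_t best = 0;

    for (size_t a = 1; a < N_ALGOS; ++a)
        if (total_ms(a) < total_ms(best))
            best = a;
    return best;
}
```

On those five boxes the totals come out as atou 2476, atodL 2465, atodJJ 2625, atou_ex 2151 and Axa2l 2395, so atou_ex has the lowest sum.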

another approach might be to load a different library depending on the processor being run   :bg
it seems like this is where the fastest apps will be

with 4 OS's, we need the equiv of 28 test machines   :red
then, we can develop the 7 libraries
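
A lighter variant of the per-processor library idea is to pick one routine at startup through a function pointer. The C sketch below shows that swapped-in technique, not actual library loading. The cpu_kind values, the mapping from CPU to routine and both routine names are invented for the example; a real program would fill in cpu_kind from CPUID and choose the mapping from measurements like the ones in this thread.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*atou_fn)(const char *);

/* Hypothetical CPU classes; a real build would derive these from CPUID. */
typedef enum { CPU_GENERIC, CPU_FAMILY_A, CPU_FAMILY_B } cpu_kind;

/* Stand-in implementation 1: one digit per iteration. */
static uint32_t atou_loop(const char *s)
{
    uint32_t acc = 0;
    while (*s != '\0')
        acc = acc * 10u + (uint32_t)(*s++ - '0');
    return acc;
}

/* Stand-in implementation 2: unrolled by 2. */
static uint32_t atou_unrolled2(const char *s)
{
    uint32_t acc = 0;
    for (;;) {
        if (s[0] == '\0') break;
        acc = acc * 10u + (uint32_t)(s[0] - '0');
        if (s[1] == '\0') break;
        acc = acc * 10u + (uint32_t)(s[1] - '0');
        s += 2;
    }
    return acc;
}

/* Pick a routine once; the mapping below is purely illustrative. */
atou_fn select_atou(cpu_kind cpu)
{
    switch (cpu) {
    case CPU_FAMILY_A: return atou_unrolled2;
    case CPU_FAMILY_B: return atou_loop;
    default:           return atou_loop;
    }
}
```

Callers would do something like `atou_fn atou = select_atou(cpu);` once and then call `atou(text)` everywhere, so the dispatch cost is a single indirect call per conversion.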

lingo

#65
Thanks, sinsi  :U
"Here is the good news and the bad news"
Thanks, but your info is old because we have a new algo... :lol

Rockoon,
I added your algo to the test bed and results are:
C:\7>bm7
172 atou
156 atodL
234 Axa2l
202 atodJJ
188 atou_ex
218 atou_rock
172 atou
156 atodL
187 Axa2l
187 atodJJ
203 atou_ex
218 atou_rock
172 atou
156 atodL
203 Axa2l
202 atodJJ
219 atou_ex
218 atou_rock
172 atou
156 atodL
234 Axa2l
187 atodJJ
172 atou_ex
218 atou_rock


172 ms average atou
156 ms average atodL
194 ms average atodJJ
195 ms average atou_ex
214 ms average Axa2l
218 ms average atou_rock
Press any key to continue ...

With JJ's testing program:

C:\7>asc2bin3
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch
14      cycles for atou_rock: by  Rockoon

11      cycles for atodL: Lingo's replacement
12      cycles for atou: improved by hutch
14      cycles for atou_rock: by  Rockoon

Code size:
100     atodL Lingo
52      atou Hutch
158     atou_rock Rockoon

--- ok ---



hutch--

Lingo,

I don't get any timing difference on the quad but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine.

sinsi

This is fun!

277 ms average atou
218 ms average atodL
265 ms average atodJJ
288 ms average atou_ex
273 ms average Axa2l
288 ms average atou_rock

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
11      cycles for atodL: Lingo's replacement
10      cycles for atou: improved by hutch
10      cycles for atou_rock: by  Rockoon

11      cycles for atodL: Lingo's replacement
10      cycles for atou: improved by hutch
10      cycles for atou_rock: by  Rockoon


hutch--

Here is more bad news for you.


171 atodL
187 Axa2l
203 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
172 atou_rock
171 atou
172 atodL
187 Axa2l
187 atodJJ
156 atou_ex
171 atou_rock


171 ms average atou
171 ms average atodL
191 ms average atodJJ
156 ms average atou_ex
187 ms average Axa2l
171 ms average atou_rock
Press any key to continue ...


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
9       cycles for atodL: Lingo's replacement
8       cycles for atou: improved by hutch
8       cycles for atou_rock: by  Rockoon

9       cycles for atodL: Lingo's replacement
8       cycles for atou: improved by hutch
8       cycles for atou_rock: by  Rockoon

Code size:
100     atodL Lingo
210     atou Hutch
158     atou_rock Rockoon

--- ok ---


PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.

jj2007

Quote from: hutch-- on August 09, 2010, 07:33:49 AM
PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long.
The labels look OK, but a bit of additional code was inserted.

The attachment makes testing a bit more straightforward. The Smooth macro is for discussion on code cache and alignment issues. It seems to render fairly stable P4 cycle counts.

Usage:
testcorrect atou
testme atou
codesize atou


The codesize macro requires the presence of two labels:
MyAlgo_s:
MyAlgo proc
...
MyAlgo endp
MyAlgo_endp:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
797     cycles for 10*Lingo's replacement
445     cycles for 10*atou
442     cycles for 10*atou_rock
858     cycles for 10*atodJJ

784     cycles for 10*Lingo's replacement
443     cycles for 10*atou
444     cycles for 10*atou_rock
829     cycles for 10*atodJJ

Code size:
100      bytes for atodL
158      bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

lingo

#70
"but it looks like the algo you attributed to Rockoon is almost the same as the unrolled version of mine."
and
"PS: You must have messed up the code size labels, the atou algo is a bit over 50 bytes long."

Hutch,
I apologize. The error is mine. I pasted Rockoon's algo twice, the second time over your algo. Sorry...
Corrected...
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
148     cycles for 10*Lingo's replacement
185     cycles for 10*atou
176     cycles for 10*atou_rock
259     cycles for 10*atodJJ

149     cycles for 10*Lingo's replacement
187     cycles for 10*atou
175     cycles for 10*atou_rock
259     cycles for 10*atodJJ

Code size:
180      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---


FORTRANS

Hi,

   From jj2007's code in Reply #69.

Regards,

Steve

pre-P4 (SSE1)
453     cycles for 10*Lingo's replacement
314     cycles for 10*atou
313     cycles for 10*atou_rock
492     cycles for 10*atodJJ

383     cycles for 10*Lingo's replacement
313     cycles for 10*atou
314     cycles for 10*atou_rock
490     cycles for 10*atodJJ

Code size:
100      bytes for atodL
158      bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

--- ok ---

jj2007

#72
Quote from: lingo on August 09, 2010, 02:03:58 PM
Please, redownload last asc2bin_testbed1.zip file and test it again. Thanks :wink

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
697     cycles for 10*Lingo's replacement
474     cycles for 10*atou
444     cycles for 10*atou_rock
845     cycles for 10*atodJJ

684     cycles for 10*Lingo's replacement
494     cycles for 10*atou
448     cycles for 10*atou_rock
842     cycles for 10*atodJJ

Code size:
171      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ


And one more - but note Lingo's code size, the version has changed again ::)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
268     cycles for 10*Lingo's replacement
295     cycles for 10*atou
286     cycles for 10*atou_rock
335     cycles for 10*atodJJ

266     cycles for 10*Lingo's replacement
295     cycles for 10*atou
286     cycles for 10*atou_rock
315     cycles for 10*atodJJ

Code size:
180      bytes for atodL
52       bytes for atou
158      bytes for atou_rock
24       bytes for atodJJ

lingo

Please, redownload last asc2bin_testbed1.zip file and test it again. Thanks :wink


hutch--

I swapped the short atou for the longer version, unrolled JJ's algo by 8, changed the algo order and got this result on the i7.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
99      cycles for 10*atou
83      cycles for 10*atou_rock
113     cycles for 10*Lingo's replacement
104     cycles for 10*atodJJ

99      cycles for 10*atou
84      cycles for 10*atou_rock
82      cycles for 10*Lingo's replacement
103     cycles for 10*atodJJ

Code size:
210      bytes for atou
158      bytes for atou_rock
171      bytes for atodL
129      bytes for atodJJ

--- ok ---