Timings for AMD, P4, Core Duo

Started by jj2007, February 15, 2009, 10:19:07 AM

jj2007

Quote from: PBrennick on February 15, 2009, 03:42:44 PM
JJ,
Thanks for the explanation, my CPU is, indeed, a Celeron, 1.70 GHz. It actually clocks at 1.69 GHz, though. The difference between the spec and the actual speed is so slight that I doubt it has any significant impact on any testing I may choose to do.
Probably not. Cycles shouldn't change anyway.

Quote
Do my results look okay to you?

Paul


They look almost identical to rags' P4. I suspect you would get the same dramatic factor 5 improvement for the DestC case (where source and destination are aligned 16).

For the curious: s1 and d1 are SSE2 algos, c1 stands for crt_strcpy, and m1 means Masm32 library szCopy ;-)
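
The "factor 5" figure can be checked against rags' P4 numbers quoted further down the thread. A minimal sketch in Python; the dictionary layout and the helper name `aligned_speedup` are only for illustration, the clock counts are copied from the 2.53 GHz P4 run at source len=4096.

```python
# Clock counts for rags' 2.53 GHz P4 at source len=4096, copied from the
# timings quoted later in this thread.
p4_clocks = {
    "s1": {"DestA": 9562, "DestB": 9159, "DestC": 1611},
    "d1": {"DestA": 11937, "DestB": 8450, "DestC": 1630},
}


def aligned_speedup(clocks):
    """Return how many times faster DestC is than DestA and than DestB."""
    return {dest: clocks[dest] / clocks["DestC"] for dest in ("DestA", "DestB")}


for mode, clocks in p4_clocks.items():
    for dest, factor in aligned_speedup(clocks).items():
        print(f"{mode}: DestC is {factor:.1f}x faster than {dest}")
```

On that machine every misaligned case comes out more than five times slower than the aligned DestC case.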

jj2007

Quote from: FORTRANS on February 15, 2009, 02:33:42 PM
Hi,

   Not sure if you want some older CPUs, but here goes.

Thanks, Steve. Looks fine. By the way: How did you convince the exe to display the long version of the CPU description? I thought I had coded the short version only ;-)


Mark Jones

From the latest assemble,

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Source len=4096

1677     clocks, mode s1, DestA
1693     clocks, mode s1, DestB
1371     clocks, mode s1, DestC

2246     clocks, mode d1, DestA
1902     clocks, mode d1, DestB
1367     clocks, mode d1, DestC

3863     clocks, mode c1, DestA
3841     clocks, mode c1, DestB
3850     clocks, mode c1, DestC

12520    clocks, mode m1, DestA
12623    clocks, mode m1, DestB
12503    clocks, mode m1, DestC

Source len=128

89       clocks, mode s1, DestA
90       clocks, mode s1, DestB
85       clocks, mode s1, DestC

104      clocks, mode d1, DestA
101      clocks, mode d1, DestB
86       clocks, mode d1, DestC

149      clocks, mode c1, DestA
150      clocks, mode c1, DestB
149      clocks, mode c1, DestC

409      clocks, mode m1, DestA
416      clocks, mode m1, DestB
421      clocks, mode m1, DestC
         --- OK ---
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

sinsi


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)

Source len=4096

2942     clocks, mode s1, DestA
3687     clocks, mode s1, DestB
1447     clocks, mode s1, DestC

2801     clocks, mode d1, DestA
2705     clocks, mode d1, DestB
1442     clocks, mode d1, DestC

3115     clocks, mode c1, DestA
3113     clocks, mode c1, DestB
3114     clocks, mode c1, DestC

4155     clocks, mode m1, DestA
4157     clocks, mode m1, DestB
4159     clocks, mode m1, DestC

Source len=128

122      clocks, mode s1, DestA
132      clocks, mode s1, DestB
85       clocks, mode s1, DestC

136      clocks, mode d1, DestA
132      clocks, mode d1, DestB
94       clocks, mode d1, DestC

134      clocks, mode c1, DestA
135      clocks, mode c1, DestB
134      clocks, mode c1, DestC

169      clocks, mode m1, DestA
175      clocks, mode m1, DestB
168      clocks, mode m1, DestC
         --- OK ---

Mode m1 doesn't seem to like AMD, does it?

You seem to like numbers, jj... Later this week I'll be building a 'new' dev box (P3 1000), so even more numbers for you  :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on February 16, 2009, 04:52:39 AM


Mode m1 doesn't seem to like AMD, does it?

Indeed, Mark's figures look incredibly slow for the Masm32lib szCopy algo. Among the standard ones, crt_strcpy (c1) is clearly the best - I threw lstrcpy out because it was too bad in all tests.
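
To put a number on "incredibly slow": a small Python sketch that parses the testbed's output lines and compares m1 against c1. The regular expression and the helper names (`mean_clocks`, `m1_over_c1`) are assumptions made for this example; the timing lines are copied from Mark's and sinsi's posts above (source len=4096).

```python
import re
from statistics import mean

# Timing lines for source len=4096, copied from the two posts above.
REPORTS = {
    "Athlon 64 X2 4000+": """
3863     clocks, mode c1, DestA
3841     clocks, mode c1, DestB
3850     clocks, mode c1, DestC
12520    clocks, mode m1, DestA
12623    clocks, mode m1, DestB
12503    clocks, mode m1, DestC
""",
    "Core 2 Quad Q6600": """
3115     clocks, mode c1, DestA
3113     clocks, mode c1, DestB
3114     clocks, mode c1, DestC
4155     clocks, mode m1, DestA
4157     clocks, mode m1, DestB
4159     clocks, mode m1, DestC
""",
}

# One output line: "<clocks>  clocks, mode <mode>, <destination>"
LINE = re.compile(r"^\s*(\d+)\s+clocks, mode (\w+), (Dest\w)\s*$")


def mean_clocks(report):
    """Average the clock counts per mode over all destinations."""
    by_mode = {}
    for line in report.splitlines():
        match = LINE.match(line)
        if match:
            by_mode.setdefault(match.group(2), []).append(int(match.group(1)))
    return {mode: mean(values) for mode, values in by_mode.items()}


def m1_over_c1(report):
    """Return how many times slower szCopy (m1) is than crt_strcpy (c1)."""
    means = mean_clocks(report)
    return means["m1"] / means["c1"]


for cpu, report in REPORTS.items():
    print(f"{cpu}: m1 takes {m1_over_c1(report):.2f}x the clocks of c1")
```

With these figures szCopy needs roughly 3.3 times the clocks of crt_strcpy on the Athlon 64 X2, but only about 1.3 times on the Q6600.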

sinsi

For the more curious, what are DestA etc.? There is a fair bit of difference in the numbers.

jj2007

Quote from: sinsi on February 16, 2009, 07:37:23 AM
For the more curious, what are DestA etc.? There is a fair bit of difference in the numbers.

Different degrees of misalignment relative to a 16-byte boundary. SSE2 can work with non-aligned data, but it gets slow, so the algo checks whether source and destination can both be aligned; if yes, it goes for movaps etc.; if not, it has to decide whether to use movups for the source and movaps for the destination, or vice versa. The problem is that some processors are faster with source alignment, others with destination alignment...
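
The decision described above can be sketched roughly as follows. This is an illustration in Python, not the actual testbed code: the function name `copy_plan`, the `prefer` parameter and the returned dictionary are all assumptions made for the example. Only the rule itself (both sides can be aligned exactly when the source and destination addresses differ by a multiple of 16) follows from the explanation.

```python
def copy_plan(src, dst, prefer="dest"):
    """Pick a 16-byte copy strategy for two byte addresses.

    prefer says which side to align when both cannot be aligned at the
    same time: "dest" (aligned stores) or "source" (aligned loads).
    """
    if prefer not in ("dest", "source"):
        raise ValueError("prefer must be 'dest' or 'source'")

    diff = (src - dst) % 16
    if diff == 0:
        # Same offset inside a 16-byte block: after copying a few leading
        # bytes one by one, both pointers sit on a 16-byte boundary.
        return {"prologue": (-src) % 16, "load": "movaps", "store": "movaps"}
    if prefer == "dest":
        # Align the destination; the source stays misaligned.
        return {"prologue": (-dst) % 16, "load": "movups", "store": "movaps"}
    # Align the source; the destination stays misaligned.
    return {"prologue": (-src) % 16, "load": "movaps", "store": "movups"}


# Hypothetical addresses that differ by n*16+4, like the DestA case.
print(copy_plan(0x1004, 0x2000))
# Hypothetical addresses that differ by n*16+0, like the DestC case.
print(copy_plan(0x1004, 0x2004))
```

Which value of `prefer` wins is exactly what the s1 and d1 timings are meant to find out for each processor.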

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Data (mis-)alignment:
diff src-DestA:        n*16+4
diff src-DestB:        n*16+12
diff src-DestC:        n*16+0

Source len=4096

2762     clocks, mode s1, DestA
3096     clocks, mode s1, DestB
1488     clocks, mode s1, DestC

2776     clocks, mode d1, DestA
2534     clocks, mode d1, DestB
1501     clocks, mode d1, DestC

5160     clocks, mode c1, DestA
5186     clocks, mode c1, DestB
5634     clocks, mode c1, DestC

8278     clocks, mode m1, DestA
8308     clocks, mode m1, DestB
8299     clocks, mode m1, DestC

jj2007

And one more for the really curious. A P4 is a P4...  :dazzled: ??
Quote from: rags on February 15, 2009, 01:03:49 PM

              Intel(R) Pentium(R) 4 CPU 2.53GHz (SSE2)

Source len=4096

9562     clocks, mode s1, DestA
9159     clocks, mode s1, DestB
1611     clocks, mode s1, DestC

11937    clocks, mode d1, DestA
8450     clocks, mode d1, DestB
1630     clocks, mode d1, DestC

4456     clocks, mode c1, DestA  = crt_strcpy
4189     clocks, mode c1, DestB
4612     clocks, mode c1, DestC

              Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Source len=4096

8587     clocks, mode s1, DestA
9119     clocks, mode s1, DestB
3505     clocks, mode s1, DestC

3692     clocks, mode d1, DestA
4752     clocks, mode d1, DestB
2096     clocks, mode d1, DestC

9255     clocks, mode c1, DestA  = crt_strcpy
8544     clocks, mode c1, DestB
5532     clocks, mode c1, DestC

rags

JJ, I ran testbed2 again:

              Intel(R) Pentium(R) 4 CPU 2.53GHz (SSE2)

Source len=4096

9628     clocks, mode s1, DestA
9091     clocks, mode s1, DestB
1617     clocks, mode s1, DestC

11898    clocks, mode d1, DestA
8331     clocks, mode d1, DestB
1619     clocks, mode d1, DestC

4535     clocks, mode c1, DestA
4485     clocks, mode c1, DestB
4156     clocks, mode c1, DestC

Could different amounts of onboard cache or system RAM account for the differences?
I'm not sure how much cache I have; I bought this P4 used from a friend.
I have 2 GB of system RAM.
God made Man, but the monkey applied the glue -DEVO

jj2007

Quote from: rags on February 16, 2009, 11:59:45 AM
JJ, I ran testbed2 again:
...
Could different amounts of onboard cache or system RAM account for the differences?
I'm not sure how much cache I have; I bought this P4 used from a friend.
I have 2 GB of system RAM.

Don't know what the exact reason is. It's interesting, though, that the "brand strings" for our processors are identical, while yours is an SSE2 and mine is an SSE3. Note also that crt_strcpy runs a lot faster on your (older) processor, as if Microsoft had optimised this algo for early P4s...

EDIT: It seems you have a Northwood, while I have a Prescott P4 (Wiki):

Northwood
... A 2.4 GHz P4 was released in April 2002, and the bus speed increased from 400 MT/s to 533 MT/s for a 2.26 GHz, 2.4 GHz, and 2.53 GHz part in May, 2.66 GHz and 2.8 GHz parts in August

Prescott
On February 1, 2004, Intel introduced a new core codenamed "Prescott". ...  Some programs benefitted from Prescott's doubled cache and SSE3 instructions, whereas others were more crippled by its long, inefficient pipeline.

So the lesson is: Don't rely on the CPUID brand string...
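
Instead of the brand string, the feature bits returned by CPUID leaf 1 (EAX=1) tell the two P4s apart. The bit positions below are the documented ones; everything else is a sketch: Python cannot issue CPUID itself, so `highest_sse` is a hypothetical helper that only decodes ECX/EDX values obtained elsewhere (for example from the assembly testbed), and the register values in the example are made up to show the two cases.

```python
# Feature bits reported by CPUID leaf 1 (EAX=1), newest extension first.
FEATURE_BITS = [
    ("SSE4.2", "ecx", 20),
    ("SSE4.1", "ecx", 19),
    ("SSSE3", "ecx", 9),
    ("SSE3", "ecx", 0),
    ("SSE2", "edx", 26),
    ("SSE", "edx", 25),
]


def highest_sse(ecx, edx):
    """Return the newest SSE level whose feature bit is set, or None."""
    regs = {"ecx": ecx, "edx": edx}
    for name, reg, bit in FEATURE_BITS:
        if (regs[reg] >> bit) & 1:
            return name
    return None


# Hypothetical register values: same SSE/SSE2 bits in EDX, differing only
# in the SSE3 bit of ECX, as for an SSE2-only P4 and an SSE3-capable P4.
sse_and_sse2 = (1 << 25) | (1 << 26)
northwood_like = highest_sse(ecx=0, edx=sse_and_sse2)
prescott_like = highest_sse(ecx=1, edx=sse_and_sse2)
print(northwood_like, prescott_like)
```

Two CPUs with the same brand string can thus still be told apart by a single bit in ECX.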


dsouza123


Athlon Thunderbird 1170 MHz

---TestBed1---
AMD Athlon(tm) Processor

Source len=4096

5190     clocks, mode s1, DestA
5215     clocks, mode s1, DestB

5191     clocks, mode d1, DestA
5214     clocks, mode d1, DestB

5174     clocks, mode c1, DestA
5195     clocks, mode c1, DestB

16521    clocks, mode m1, DestA
16558    clocks, mode m1, DestB

Source len=128

200      clocks, mode s1, DestA
200      clocks, mode s1, DestB

201      clocks, mode d1, DestA
201      clocks, mode d1, DestB

184      clocks, mode c1, DestA
184      clocks, mode c1, DestB

535      clocks, mode m1, DestA
536      clocks, mode m1, DestB

Source len=16

43       clocks, mode s1, DestA
47       clocks, mode s1, DestB

45       clocks, mode d1, DestA
45       clocks, mode d1, DestB

30       clocks, mode c1, DestA
30       clocks, mode c1, DestB
         --- OK ---

---TestBed2---
AMD Athlon(tm) Processor

Source len=4096

5193     clocks, mode s1, DestA
5210     clocks, mode s1, DestB
5201     clocks, mode s1, DestC

5209     clocks, mode d1, DestA
5190     clocks, mode d1, DestB
5213     clocks, mode d1, DestC

5175     clocks, mode c1, DestA
5197     clocks, mode c1, DestB
5175     clocks, mode c1, DestC

16557    clocks, mode m1, DestA
16543    clocks, mode m1, DestB
17011    clocks, mode m1, DestC

Source len=128

198      clocks, mode s1, DestA
198      clocks, mode s1, DestB
198      clocks, mode s1, DestC

201      clocks, mode d1, DestA
200      clocks, mode d1, DestB
200      clocks, mode d1, DestC

184      clocks, mode c1, DestA
184      clocks, mode c1, DestB
184      clocks, mode c1, DestC

541      clocks, mode m1, DestA
536      clocks, mode m1, DestB
542      clocks, mode m1, DestC
         --- OK ---