StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

jj2007

Hi folks,
Could I please have some timings on non-Celerons?
Thanks, jj

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

29      cycles for MbStrLen1
34      cycles for MbStrLen2
34      cycles for MbStrLen3
31      cycles for MbStrLen4a
35      cycles for MbStrLen4b
38      cycles for MbStrLen5

ecube

AMD Athlon(tm) 64 Processor 3000+ (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

47      cycles for MbStrLen1
48      cycles for MbStrLen2
51      cycles for MbStrLen3
47      cycles for MbStrLen4a
51      cycles for MbStrLen4b
55      cycles for MbStrLen5

47      cycles for MbStrLen1
54      cycles for MbStrLen2
54      cycles for MbStrLen3
52      cycles for MbStrLen4a
58      cycles for MbStrLen4b
53      cycles for MbStrLen5

47      cycles for MbStrLen1
48      cycles for MbStrLen2
52      cycles for MbStrLen3
47      cycles for MbStrLen4a
50      cycles for MbStrLen4b
54      cycles for MbStrLen5

48      cycles for MbStrLen1
54      cycles for MbStrLen2
54      cycles for MbStrLen3
52      cycles for MbStrLen4a
57      cycles for MbStrLen4b
53      cycles for MbStrLen5

48      cycles for MbStrLen1
48      cycles for MbStrLen2
50      cycles for MbStrLen3
47      cycles for MbStrLen4a
52      cycles for MbStrLen4b
55      cycles for MbStrLen5


--- ok ---

MichaelW

P3:

pre-P4 (SSE1)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
51      cycles for MbStrLen4a
47      cycles for MbStrLen4b
59      cycles for MbStrLen5

45      cycles for MbStrLen1
62      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
56      cycles for MbStrLen5

46      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
56      cycles for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
51      cycles for MbStrLen4a
47      cycles for MbStrLen4b
55      cycles for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
55      cycles for MbStrLen5
eschew obfuscation

jj2007

Thanks. For the curious: I am testing the Intel recommendation for movxxx xmm, mem:
Quote from: Intel, generic optimization of memcpy()
movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them. The Barcelona architecture prefers movaps for stores. movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding.

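if 1  ; 4a: save xmm0 as two 8-byte halves
movlps qword ptr [esp], xmm0
movhps qword ptr [esp+8], xmm0
else  ; 4b: save xmm0 with one unaligned 16-byte store
movdqu [esp], xmm0
endif
...
if 1  ; restore xmm0 from the two 8-byte halves
movlps xmm0, qword ptr [esp]
movhps xmm0, qword ptr [esp+8]
else  ; restore xmm0 with one unaligned 16-byte load
movups xmm0, [esp]
endif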


At least for the Celeron and E^cube's AMD, this does not seem to hold: the partial movlps/movhps moves (4a) are faster than the single movdqu (4b).

(obviously the code does other things, too - the purpose is to efficiently preserve the xmm0 register in a bread-and-butter stringlen algo)
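
For anyone who wants to play with the two save/restore variants outside MASM, here is a minimal sketch in C with SSE intrinsics. It only checks that both variants move the same 16 bytes; it says nothing about speed. The function names (save_halves, save_unaligned, restore_halves, restore_unaligned, variants_agree) are made up for illustration and are not taken from the test program, and a compiler is free to pick different instructions than the ones named in the comments. It assumes an x86 target with SSE2.

```c
#include <xmmintrin.h>  /* SSE:  _mm_storel_pi, _mm_storeh_pi, _mm_loadl_pi, _mm_loadh_pi */
#include <emmintrin.h>  /* SSE2: _mm_storeu_si128, _mm_castps_si128 */
#include <string.h>     /* memcmp */

/* Like variant 4a: save the register as two 8-byte halves. */
static void save_halves(unsigned char *buf, __m128 v)
{
    _mm_storel_pi((__m64 *)buf, v);        /* corresponds to movlps [buf], xmm   */
    _mm_storeh_pi((__m64 *)(buf + 8), v);  /* corresponds to movhps [buf+8], xmm */
}

/* Like variant 4b: save the register with one unaligned 16-byte store. */
static void save_unaligned(unsigned char *buf, __m128 v)
{
    _mm_storeu_si128((__m128i *)buf, _mm_castps_si128(v));  /* movdqu */
}

/* Restore from two 8-byte halves. */
static __m128 restore_halves(const unsigned char *buf)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)buf);        /* movlps xmm, [buf]   */
    v = _mm_loadh_pi(v, (const __m64 *)(buf + 8));  /* movhps xmm, [buf+8] */
    return v;
}

/* Restore with one unaligned 16-byte load. */
static __m128 restore_unaligned(const unsigned char *buf)
{
    return _mm_loadu_ps((const float *)buf);        /* movups */
}

/* Returns 1 if both variants save and restore exactly the 16 bytes of src. */
int variants_agree(const unsigned char src[16])
{
    unsigned char a[16], b[16], ra[16], rb[16];
    __m128 v = _mm_loadu_ps((const float *)src);

    save_halves(a, v);
    save_unaligned(b, v);
    _mm_storeu_ps((float *)ra, restore_halves(a));
    _mm_storeu_ps((float *)rb, restore_unaligned(b));

    return memcmp(a, src, 16) == 0 && memcmp(b, src, 16) == 0 &&
           memcmp(ra, src, 16) == 0 && memcmp(rb, src, 16) == 0;
}
```

Calling variants_agree() on any 16-byte buffer should return 1, since both paths are plain bit copies. Which one is faster is exactly what the timings in this thread are meant to show, and that depends on the CPU.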

ecube

Off topic, but MichaelW, my CPU is 10+ years old now, I believe, so yours must be ancient. I'm just curious: is that your main one? Also, jj2007, I'm not sure what your plans are, but feel free to take notes on the optimization techniques you discover  :U While a lot of stuff is floating around this board, I know people enjoy having a single place to read up on such things.

MichaelW

I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...
eschew obfuscation

ecube

Quote from: MichaelW on August 15, 2010, 10:04:47 PM
I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...


Heh, wow, what OS? I can't imagine that thing being able to handle Vista, which is a resource pig. I'd be surprised if you said Windows 2k; I myself wanted to stick with it but was forced to upgrade because so much software is XP+ only.

KeepingRealBusy

JJ,

Here is my P4:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

38      cycles for MbStrLen1
41      cycles for MbStrLen2
46      cycles for MbStrLen3
37      cycles for MbStrLen4a
45      cycles for MbStrLen4b
41      cycles for MbStrLen5

35      cycles for MbStrLen1
40      cycles for MbStrLen2
41      cycles for MbStrLen3
37      cycles for MbStrLen4a
51      cycles for MbStrLen4b
40      cycles for MbStrLen5

34      cycles for MbStrLen1
45      cycles for MbStrLen2
49      cycles for MbStrLen3
36      cycles for MbStrLen4a
51      cycles for MbStrLen4b
40      cycles for MbStrLen5

33      cycles for MbStrLen1
39      cycles for MbStrLen2
40      cycles for MbStrLen3
39      cycles for MbStrLen4a
43      cycles for MbStrLen4b
47      cycles for MbStrLen5

34      cycles for MbStrLen1
40      cycles for MbStrLen2
65      cycles for MbStrLen3
37      cycles for MbStrLen4a
40      cycles for MbStrLen4b
40      cycles for MbStrLen5


--- ok ---

KeepingRealBusy

JJ,

Here are my AMD timings:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

68      cycles for MbStrLen1
47      cycles for MbStrLen2
66      cycles for MbStrLen3
47      cycles for MbStrLen4a
58      cycles for MbStrLen4b
56      cycles for MbStrLen5

47      cycles for MbStrLen1
37      cycles for MbStrLen2
56      cycles for MbStrLen3
57      cycles for MbStrLen4a
43      cycles for MbStrLen4b
84      cycles for MbStrLen5

47      cycles for MbStrLen1
53      cycles for MbStrLen2
73      cycles for MbStrLen3
47      cycles for MbStrLen4a
52      cycles for MbStrLen4b
70      cycles for MbStrLen5

36      cycles for MbStrLen1
57      cycles for MbStrLen2
54      cycles for MbStrLen3
51      cycles for MbStrLen4a
58      cycles for MbStrLen4b
58      cycles for MbStrLen5

63      cycles for MbStrLen1
47      cycles for MbStrLen2
51      cycles for MbStrLen3
51      cycles for MbStrLen4a
56      cycles for MbStrLen4b
55      cycles for MbStrLen5


--- ok ---

Rockoon

AMD Phenom(tm) II X6 1055T Processo
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
36      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
39      cycles for MbStrLen5

31      cycles for MbStrLen1
37      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
39      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
21      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
26      cycles for MbStrLen2
29      cycles for MbStrLen3
23      cycles for MbStrLen4a
32      cycles for MbStrLen4b
24      cycles for MbStrLen5

16      cycles for MbStrLen1
23      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
26      cycles for MbStrLen2
29      cycles for MbStrLen3
23      cycles for MbStrLen4a
32      cycles for MbStrLen4b
24      cycles for MbStrLen5

16      cycles for MbStrLen1
21      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---

mineiro

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---

dancho


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
26      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---

Vortex

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

65      cycles for MbStrLen1
66      cycles for MbStrLen2
83      cycles for MbStrLen3
67      cycles for MbStrLen4a
66      cycles for MbStrLen4b
77      cycles for MbStrLen5

64      cycles for MbStrLen1
68      cycles for MbStrLen2
71      cycles for MbStrLen3
66      cycles for MbStrLen4a
66      cycles for MbStrLen4b
73      cycles for MbStrLen5

66      cycles for MbStrLen1
66      cycles for MbStrLen2
72      cycles for MbStrLen3
67      cycles for MbStrLen4a
66      cycles for MbStrLen4b
74      cycles for MbStrLen5

72      cycles for MbStrLen1
66      cycles for MbStrLen2
81      cycles for MbStrLen3
66      cycles for MbStrLen4a
74      cycles for MbStrLen4b
86      cycles for MbStrLen5

64      cycles for MbStrLen1
66      cycles for MbStrLen2
74      cycles for MbStrLen3
66      cycles for MbStrLen4a
66      cycles for MbStrLen4b
79      cycles for MbStrLen5

jj2007

Thanks to all of you, that should be enough info :U