StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM


Antariy

Quote from: hutch-- on August 17, 2010, 12:02:21 AM
Alex,

This one ?

Yes, Hutch!

The word "ERROR" is not accurate: I just didn't change the CodeSize macro.

Thanks!



Alex

Antariy

Hi!

Here is a 34-byte MMX StrLen, and a 90-byte SSE1 version (57 bytes smaller than before) that is 2 clocks faster.

Everyone, please test this!


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
90       bytes for AxStrLenSSE1
113      bytes for StrLen

92      cycles for MbStrLen1
95      cycles for MbStrLen2
89      cycles for MbStrLen3
96      cycles for MbStrLen4a
71      cycles for MbStrLen4b
97      cycles for MbStrLen5

109     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

165     cycles for StrLen

100     cycles for MbStrLen1
97      cycles for MbStrLen2
89      cycles for MbStrLen3
90      cycles for MbStrLen4a
100     cycles for MbStrLen4b
95      cycles for MbStrLen5

110     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

87      cycles for MbStrLen1
90      cycles for MbStrLen2
91      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
72      cycles for MbStrLen4b
94      cycles for MbStrLen5

110     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
90      cycles for MbStrLen2
87      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

111     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen


--- ok ---



Note: the test was made with unaligned (not 16-byte aligned) strings.



Alex

Antariy

Quote from: jj2007 on August 17, 2010, 06:34:22 AM
Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, test my version, please. Jochen made some changes, but the algo does NOT work optimally after them. (I have looked at the changes Jochen made.)
Test mine, please.

Alex,

Sorry, I should have split it but was too tired yesterday night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 9 16-byte instruction cache slots.

:bg

Other people's code is never entirely clear, I understand.



Alex
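Jochen's point about code size can be checked with a ceiling division. Here is a small C sketch, assuming the routine starts on a 16-byte boundary and is fetched in 16-byte blocks; `slots16` is a hypothetical name, not something from the thread.

```c
#include <assert.h>

/* Hypothetical helper: number of 16-byte blocks needed to hold `bytes`
   bytes of code, ASSUMING the code starts on a 16-byte boundary. */
static unsigned slots16(unsigned bytes)
{
    return (bytes + 15u) / 16u;   /* ceiling division by 16 */
}
```

Under these assumptions `slots16(128)` is 8, while `slots16(147)` comes out as 10 rather than 9, since 9 * 16 = 144 is three bytes short. Either way, the shorter version saves at least one block.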

KeepingRealBusy

Alex,

Here is my P4:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
90       bytes for AxStrLenSSE1
113      bytes for StrLen

48      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
56      cycles for MbStrLen4a
51      cycles for MbStrLen4b
50      cycles for MbStrLen5

70      cycles for AxStrLenMMX

43      cycles for AxStrLenSSE1

135     cycles for StrLen

79      cycles for MbStrLen1
54      cycles for MbStrLen2
53      cycles for MbStrLen3
47      cycles for MbStrLen4a
48      cycles for MbStrLen4b
51      cycles for MbStrLen5

72      cycles for AxStrLenMMX

45      cycles for AxStrLenSSE1

124     cycles for StrLen

40      cycles for MbStrLen1
49      cycles for MbStrLen2
45      cycles for MbStrLen3
55      cycles for MbStrLen4a
43      cycles for MbStrLen4b
51      cycles for MbStrLen5

69      cycles for AxStrLenMMX

43      cycles for AxStrLenSSE1

126     cycles for StrLen

40      cycles for MbStrLen1
47      cycles for MbStrLen2
45      cycles for MbStrLen3
44      cycles for MbStrLen4a
45      cycles for MbStrLen4b
56      cycles for MbStrLen5

70      cycles for AxStrLenMMX

45      cycles for AxStrLenSSE1

125     cycles for StrLen

59      cycles for MbStrLen1
47      cycles for MbStrLen2
48      cycles for MbStrLen3
42      cycles for MbStrLen4a
44      cycles for MbStrLen4b
53      cycles for MbStrLen5

73      cycles for AxStrLenMMX

50      cycles for AxStrLenSSE1

145     cycles for StrLen


--- ok ---


Dave.

Antariy


Farabi

Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
24      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
24      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
24      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

jj2007

5Test:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
90       bytes for AxStrLenSSE1
113      bytes for StrLen

46      cycles for MbStrLen1
47      cycles for MbStrLen2
50      cycles for MbStrLen3
48      cycles for MbStrLen4a
53      cycles for MbStrLen4b
51      cycles for MbStrLen5

60      cycles for AxStrLenMMX

38      cycles for AxStrLenSSE1


Quote
   mov eax,[esp+4]      ; Jochen, if you remove this line again :), then the
   add esp,-10h      ; algo will almost always treat any string
   movups [esp],xmm7   ; as unaligned, because I do the
   and eax,0fh      ; alignment check in THIS line! ;-)
   jz @F
OOPS :red

Antariy

Quote from: jj2007 on August 18, 2010, 11:11:31 PM
OOPS :red

No, this is because I didn't write comments (to save time).



Alex

jj2007

Tried this?
Quote
   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this line again :), then the
   add esp, -10h   ; algo will almost always treat any string
   movups [esp],xmm7   ; as unaligned, because I do the
   test dl, 15
;   and eax, 0fh   ; alignment check in THIS line!      ;-)
   jz @F
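
The check being argued over asks whether the low four bits of the string address are zero. That is what `and eax, 0fh` or `test dl, 15` followed by `jz` decides. Here is a rough C sketch of the same test; both helper names are hypothetical and do not come from the thread's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is 16-byte aligned.
   Mirrors `test dl, 15` / `jz`: only the low four address bits matter. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: distance of p past the previous 16-byte
   boundary (0..15), the value `and eax, 0fh` leaves in eax. */
static unsigned misalignment16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}
```

`test` only sets the flags, while `and` also overwrites the register with the misalignment. That difference is why the two snippets keep the address in different registers.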

mineiro

Another dual test.

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
bytes    algo            cycles (5 runs)
58       MbStrLen1       33 33 33 33 33
84       MbStrLen2       34 42 34 42 34
73       MbStrLen3       36 44 36 44 36
80       MbStrLen4a      41 41 41 41 41
71       MbStrLen4b      36 45 36 45 36
78       MbStrLen5       46 46 46 46 48
34       AxStrLenMMX     58 62 62 68 64
90       AxStrLenSSE1    21 27 21 27 21
113      StrLen          67 67 67 67 67
--- ok ---

Antariy


Hi!

This is a slightly changed version of my SSE1 algo. I fixed the stub load (I had written it by analogy with the MMX version, but it is not the same).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
86       bytes for AxStrLenSSE1
113      bytes for StrLen

83      cycles for MbStrLen1
92      cycles for MbStrLen2
80      cycles for MbStrLen3
78      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

111     cycles for AxStrLenMMX

60      cycles for AxStrLenSSE1

183     cycles for StrLen

110     cycles for MbStrLen1
90      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

199     cycles for StrLen

92      cycles for MbStrLen1
102     cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
149     cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

235     cycles for StrLen

100     cycles for MbStrLen1
113     cycles for MbStrLen2
74      cycles for MbStrLen3
103     cycles for MbStrLen4a
71      cycles for MbStrLen4b
77      cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

200     cycles for StrLen

92      cycles for MbStrLen1
99      cycles for MbStrLen2
91      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

115     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

197     cycles for StrLen


--- ok ---



A big request to everyone: please test this!



Alex

Antariy

Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this line again :), then the
   add esp, -10h   ; algo will almost always treat any string
   movups [esp],xmm7   ; as unaligned, because I do the
   test dl, 15
;   and eax, 0fh   ; alignment check in THIS line!      ;-)
   jz @F


Jochen, on my CPU, if I use edx for the check, the proc is slower by 2 clocks. Using a partial register gains nothing either (I know about this; how it behaves is very hardware-dependent, and on modern CPUs it is very slow).



Alex

mineiro

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)


bytes    algo            cycles (5 runs)
58       MbStrLen1       33 33 33 33 33
84       MbStrLen2       34 42 34 42 34
73       MbStrLen3       36 44 36 44 36
80       MbStrLen4a      41 41 41 41 37
71       MbStrLen4b      36 45 36 45 36
78       MbStrLen5       46 46 46 48 46
34       AxStrLenMMX     59 65 58 68 65
86       AxStrLenSSE1    19 24 19 25 19
113      StrLen          67 67 67 67 67
--- ok ---


bytes    algo            cycles (5 runs)
==       MbStrLen1       == == == == ==
==       MbStrLen2       == == == == ==
==       MbStrLen3       == == == == ==
==       MbStrLen4a      == 37 == == ==
==       MbStrLen4b      == == == == ==
==       MbStrLen5       == == == 46 ==
==       AxStrLenMMX     62 == 61 == 62
==       AxStrLenSSE1    == == == == ==
===      StrLen          == == == == ==
--- ok ---