StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

KeepingRealBusy

Alex,

Here is my P4 for 80StrLen.zip.

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
236     cycles for szLen
52      cycles for MbStrLen4a
44      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
32      cycles for StrLenLingo

227     cycles for szLen
46      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
31      cycles for StrLenLingo

230     cycles for szLen
45      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
32      cycles for StrLenLingo

227     cycles for szLen
46      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
39      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo


--- ok ---

Dave.

Antariy

Quote from: donkey on August 22, 2010, 10:59:39 PM
Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...

No, Edgar. Don't forget that my MMX version is only very SLIGHTLY faster, because I omit the prologue/epilogue code. All the SSE versions are faster because they "eat" twice as much data per loop (Lingo's eats four times as much per loop). These are normal results, not a pig.
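
To make the "data per loop" point concrete, here is a minimal C sketch with SSE2 intrinsics. It is not any of the procs timed in this thread, and the name sketch_strlen_sse2 is made up for illustration. It shows the usual pcmpeqb/pmovmskb scan: each iteration checks 16 bytes, where an MMX loop checks 8. It starts from the 16-byte-aligned block that contains the start of the string, which is the common assembly idiom but formally outside what ISO C guarantees.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only. */
static size_t sketch_strlen_sse2(const char *s)
{
    /* Round down to a 16-byte boundary so every load is aligned. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned skip = (unsigned)((uintptr_t)s & 15);
    const __m128i zero = _mm_setzero_si128();

    /* First block: one mask bit per byte that equals zero, then clear
       the bits of the bytes that lie before the start of the string. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= 0xFFFFu << skip;

    /* Main loop: 16 bytes per iteration. */
    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }

    /* The lowest set bit gives the offset of the terminator in the block. */
    unsigned bit = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++bit;
    }
    return (size_t)((p + bit) - s);
}
```

Calling sketch_strlen_sse2 on a pointer with any alignment should give the same result as strlen; an unrolled variant that tests two or four such blocks per iteration trades code size for fewer loop branches.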



Alex

jj2007

One more for the night - I shaved off a cycle and six bytes of codesize:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxStrLenSSE1j     78      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       90      total bytes for StrLenLingo
------- timings, misaligned -------
132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

KeepingRealBusy

JJ,

Here is my P4 for 80bStrLen.zip.

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxStrLenSSE1j     78      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       90      total bytes for StrLenLingo
------- timings, misaligned -------
242     cycles for szLen
56      cycles for MbStrLen4a
40      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo (unsafe)

228     cycles for szLen
52      cycles for MbStrLen4a
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo (unsafe)

228     cycles for szLen
43      cycles for MbStrLen4a
38      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
31      cycles for StrLenLingo (unsafe)

230     cycles for szLen
54      cycles for MbStrLen4a
38      cycles for AxStrLenSSE1
52      cycles for AxStrLenSSE1j
33      cycles for StrLenLingo (unsafe)


--- ok ---

Dave

jj2007

Quote from: KeepingRealBusy on August 22, 2010, 10:35:58 PM
Alex,

91Test_StrLenSaveXmm.exe crashes on my P4

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)


Hi Dave & Alex,

I found the "bug": It's lddqu - the instruction requires SSE3.
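
For anyone who wants to guard against this, a minimal sketch of an SSE3 check in C follows. It assumes a GCC or Clang toolchain (the `<cpuid.h>` helper); the function names are made up for illustration and are not part of the testbed. SSE3 is reported in CPUID leaf 1, ECX bit 0.

```c
#include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

/* SSE3 support is signalled by bit 0 of ECX after CPUID leaf 1. */
static int ecx_reports_sse3(unsigned ecx)
{
    return (ecx & 1u) != 0;
}

/* Hypothetical helper: returns 1 if the running CPU reports SSE3. */
static int cpu_has_sse3(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;  /* leaf 1 not available: assume no SSE3 */
    return ecx_reports_sse3(ecx);
}
```

A testbed could call cpu_has_sse3 once at startup and fall back to a movdqu-based variant when it returns 0, instead of letting lddqu fault on an SSE2-only P4.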

Attached is a new testbed with two AxJJ variants that behave similarly on a P4 but very differently on my Celeron. Timings?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 100 byte string -------
272     cycles for szLen
89      cycles for MbStrLen4a
63      cycles for AxStrLenSSE1
70      cycles for AxJJStrLen1
68      cycles for AxJJStrLen2

lingo

Again idiotic vain efforts... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
112     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen1
26      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
21      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
32      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo

7       cycles for szLen
30      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo


--- ok ---

jj2007

Quote from: lingo on August 23, 2010, 05:30:46 PM
Again idiotic vain efforts... :lol

Lingo,
While you have a fast CPU, and stolen a lot from Alex and my code, your algo still crashes. Give up.

jj2007

Version d, Celeron M timings:
------- timings, misaligned, 5 byte string -------
12      cycles for szLen
15      cycles for AxStrLenSSE1
11      cycles for AxJJStrLen1
22      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
12      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen1
11      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
16      cycles for AxStrLenSSE1
11      cycles for AxJJStrLen1
22      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
12      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen1
11      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

The "jumping" is most probably caused by the movups [esp+xxx], xmm0: in the REPEAT loop the stack pointer is gradually decreased (push eax), so every 4 loops one of the algos is lucky enough to get 16-byte alignment.
To eliminate this effect, AxJJStrLen3 uses a global aligned variable and movdqa. Results look convincing.
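
The "every 4 loops" arithmetic can be checked with a small C sketch. The helper name and the sample stack addresses are made up for illustration; the only assumption taken from the post is that each push eax lowers the stack pointer by 4 bytes.

```c
#include <stdint.h>

/* Count how many of `iterations` stack positions are 16-byte aligned
   when the stack pointer drops by `step` bytes per iteration
   (push eax corresponds to step = 4). */
static unsigned count_aligned_slots(uintptr_t esp, unsigned step,
                                    unsigned iterations)
{
    unsigned aligned = 0;
    for (unsigned i = 0; i < iterations; ++i) {
        if ((esp & 15u) == 0)
            ++aligned;  /* a store at this address would be aligned */
        esp -= step;
    }
    return aligned;
}
```

With step = 4, exactly one position in every four is 16-byte aligned, whatever the starting address, which would fit the pattern of the fast timing moving from one algo to the next between runs.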

lingo

"and stolen a lot from Alex and my code"

Wow, the thief crying "catch the thief"; see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
I can't steal anything from you or from the asian lamer, simply because you have no ideas in assembly, hence your code will always be very ugly and slow... So, get your pills and take it easy... :lol

"your algo still crashes"

For some lamers in programming, maybe, but for advanced programmers who test the end of their buffer after every call to VirtualAlloc there is just no way to crash. :lol

"Give up."

Due to some sick idiotic lamers in the forum :lol....Never!
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
112     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen1
26      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
21      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
32      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo

7       cycles for szLen
30      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo


--- ok ---



jj2007

Quote from: lingo on August 23, 2010, 06:56:17 PM
For some lamers in programming, maybe, but for advanced programmers who test the end of their buffer after every call to VirtualAlloc there is just no way to crash. :lol

Line 3:
CrashIt =   1   ; overrides MisAlign - the "SSE1" algos will bang their head against the VirtualAlloc boundary

Line 153:
      if 0   ; CrashIt
         print "No result for Lingo's algo, it crashes", 13, 10
      else
         cycles Src, StrLenLingo  ; ok, so let it crash
      endif
:bg
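
On the VirtualAlloc boundary itself: a 16-byte load can only fault at the end of a buffer if it straddles a page boundary into an uncommitted page. A minimal C sketch of that arithmetic follows, assuming 4096-byte pages; the names are made up for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed x86 page size */

/* Returns 1 if a load of `len` bytes starting at `addr` touches two
   different pages, 0 if it stays inside one page. */
static int load_crosses_page(uintptr_t addr, unsigned len)
{
    uintptr_t first_page = addr / SKETCH_PAGE_SIZE;
    uintptr_t last_page  = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return first_page != last_page;
}
```

A 16-byte load from a 16-byte-aligned address never crosses a page, so an algo that only uses aligned loads cannot run past the last committed page as long as the terminating zero lies inside the buffer; an unaligned 16-byte load near the end of the last page can.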

> downloaded 8 times
Nice trick, Lingo. The original is in reply #98, though

lingo

"The thief is always a liar" -> just see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
Just put your lame macro in... you know where... :lol
I can explain VirtualAlloc to every normal man, but it seems you forgot your pills again... :lol
Take care, or the next step will be "electroconvulsive therapy"... :lol

ecube

I see the inferior are trying to take on the champ again, with little success  :U

Antariy

Hi, this is a new old version, for JJ and some other explainers of VirtualAlloc.


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
85      code size for AxStrLenSSE1a     88      total bytes for AxStrLenSSE1a
83      code size for StrLenLingo       88      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
------- timings -------
251     cycles for szLen
105     cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
119     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

251     cycles for szLen
77      cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
118     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

258     cycles for szLen
105     cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
115     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

251     cycles for szLen
78      cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
119     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J


--- ok ---




Alex

Antariy

Quote from: E^cube on August 23, 2010, 09:30:19 PM
I see the inferior are trying to take on the champ again, with little success  :U


Why does anybody think that "Beat Lingo!" is some big need or big deal? Wow! There is no such need.

E^cube, you underestimate yourself if you think that everybody has only one goal: beating Lingo.

This is funny :)

His proc eats twice as much data per loop, and it has half the functionality (it crashes and does not preserve registers, which is needed for a fair comparison with Jochen's procs).
And his proc has only a ~45% performance gain, and on HIS CPU only. This is your "champ"? This is a bad programmer, who pins his hopes on other software to make his procs reliable.
That he made the "fastest" proc because of something, etc. - this is an excuse for the inability to make a proc with the same functionality and greater speed.

And Jochen fixed his proc; otherwise it crashes on short strings.
So your "champ" does not deserve any respect - he produces bad, unreliable code (maybe fast, but NOT working).



Alex