Benchmark and test for htodw algos.

Started by hutch--, August 03, 2010, 07:05:52 AM

dedndave

yes - but it's a P4
some might say it is obsolete   :P
i prefer to say it is becoming obsolete
that way, i don't have to go out and buy a new computer   :lol
i am just now getting the hang of properly building this one

Antariy

What timings does this version give (5-byte-long string, same sources)?



Alex

(Edited) I am not posting my timings, because I have many apps running that put a considerable load on the CPU...

Antariy

Quote from: dedndave on August 14, 2010, 11:27:06 PM
yes - but it's a P4
some might say it is obsolete   :P
i prefer to say it is becoming obsolete
that way, i don't have to go out and buy a new computer   :lol
i am just now getting the hang of properly building this one

Why "tongue"? :)
I have a Prescott Celeron, with a trimmed cache, without DEP, without HT... And I don't say that it is obsolete :)
I'm not a gamer, and I don't need terahertz with liquid cooling :)



Alex

dedndave

Prescott w/htt:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

21      cycles for Fast version
21      cycles for Fast version under AMD
33      cycles for Small 1
33      cycles for Small 2
33      cycles for Small 3
31      cycles for Small 3.1
33      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
36      cycles for SSE1

Other's Versions:
34      cycles for Axhex2dw improved by Hutch (1)
53      cycles for Axhex2dw improved by Hutch (2)

43      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)

24      cycles for Fast version
21      cycles for Fast version under AMD
33      cycles for Small 1
33      cycles for Small 2
33      cycles for Small 3
33      cycles for Small 3.1
35      cycles for Small 4
28      cycles for MMX 1
60      cycles for MMX 2
56      cycles for SSE1

Other's Versions:
34      cycles for Axhex2dw improved by Hutch (1)
53      cycles for Axhex2dw improved by Hutch (2)

31      cycles for Lingo's SSE version
23      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
24      cycles for Dave's version (with minor changes)

23      cycles for Fast version
21      cycles for Fast version under AMD
33      cycles for Small 1
33      cycles for Small 2
30      cycles for Small 3
33      cycles for Small 3.1
33      cycles for Small 4
26      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
33      cycles for Axhex2dw improved by Hutch (1)
53      cycles for Axhex2dw improved by Hutch (2)

31      cycles for Lingo's SSE version
21      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
24      cycles for Dave's version (with minor changes)

jj2007

5bytes:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
15      cycles for Fast version
15      cycles for Fast version under AMD
26      cycles for Small 1
27      cycles for Small 2
27      cycles for Small 3
27      cycles for Small 3.1
27      cycles for Small 4
14      cycles for MMX 1
14      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
27      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

13      cycles for Lingo's SSE version
12      cycles for Lingo's BIG integer version
14      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)

dedndave

well - "modern" would be a core duo or i7 - or one of the more recent AMD's
to be honest, i am pleased with the performance of this machine
of course, i know how to set it up to be fast
i can see where, if i were a lay-person, it might not be so wonderful

Antariy

Quote from: dedndave on August 14, 2010, 11:34:18 PM
well - "modern" would be a core duo or i7 - or one of the more recent AMD's
to be honest, i am pleased with the performance of this machine
of course, i know how to set it up to be fast
i can see where, if i were a lay-person, it might not be so wonderful

I agree - the most important thing is correct tuning and maintenance of the computer.
I agree all the more, because this is my speciality :)



Alex

Antariy

Quote from: jj2007 on August 14, 2010, 11:32:38 PM
5bytes:

Jochen, this is because your proc runs without branching. But in real-world testing it would be the champion. And if it were ported to 64-bit, it would be the fastest, because the code would hardly change, unlike the other unrolled versions, whose timings would grow almost linearly to about twice as large.



Alex
P.S. What timings does your proc give with full-notation strings?
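
For readers following the "runs without branching" remark, here is a minimal C sketch of a WORD-indexed lookup conversion. It is only an illustration of the general idea, not Jochen's actual proc: the names `pair_table`, `init_pair_table` and `htodw8` are hypothetical, the input is assumed to be exactly 8 valid hex digits, and invalid pairs simply map to 0.

```c
#include <stdint.h>

/* Hypothetical table: indexed by two ASCII chars taken together as one
   16-bit word (first char in the low byte), yields the decoded byte. */
static uint8_t pair_table[65536];

/* Helper used only while building the table; returns -1 for non-hex chars. */
static int hex_digit(unsigned c) {
    if (c >= '0' && c <= '9') return (int)(c - '0');
    c |= 0x20;                              /* fold 'A'..'F' to 'a'..'f' */
    if (c >= 'a' && c <= 'f') return (int)(c - 'a' + 10);
    return -1;
}

/* Fill the 64 KB table once, before the first conversion. */
void init_pair_table(void) {
    for (unsigned first = 0; first < 256; ++first) {
        for (unsigned second = 0; second < 256; ++second) {
            int hi = hex_digit(first);
            int lo = hex_digit(second);
            pair_table[first | (second << 8)] =
                (hi >= 0 && lo >= 0) ? (uint8_t)((hi << 4) | lo) : 0;
        }
    }
}

/* Convert exactly 8 hex digits. The loop has a fixed trip count, so
   nothing in it branches on the characters themselves. */
uint32_t htodw8(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    uint32_t result = 0;
    for (int i = 0; i < 8; i += 2) {
        unsigned index = p[i] | ((unsigned)p[i + 1] << 8);
        result = (result << 8) | pair_table[index];
    }
    return result;
}
```

Call `init_pair_table()` once, then `htodw8("DEADBEEF")`. The trade-off is the one discussed in this thread: very little code, but a large table that has to stay in cache to be fast.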

MichaelW

Since Dave mentioned obsolete, here are the timings for a P3:

[unreadable CPU brand string] (SSE1)

23      cycles for Fast version
28      cycles for Fast version under AMD
60      cycles for Small 1
60      cycles for Small 2
59      cycles for Small 3
59      cycles for Small 3.1
63      cycles for Small 4
18      cycles for MMX 1
17      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
59      cycles for Axhex2dw improved by Hutch (2)

17      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
34      cycles for Jochen's WORD-Indexed version
12      cycles for Dave's version (with minor changes)


23      cycles for Fast version
28      cycles for Fast version under AMD
59      cycles for Small 1
60      cycles for Small 2
59      cycles for Small 3
59      cycles for Small 3.1
60      cycles for Small 4
18      cycles for MMX 1
17      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
59      cycles for Axhex2dw improved by Hutch (2)

19      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
34      cycles for Jochen's WORD-Indexed version
11      cycles for Dave's version (with minor changes)


23      cycles for Fast version
28      cycles for Fast version under AMD
59      cycles for Small 1
60      cycles for Small 2
60      cycles for Small 3
59      cycles for Small 3.1
60      cycles for Small 4
18      cycles for MMX 1
17      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
59      cycles for Axhex2dw improved by Hutch (2)

16      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
34      cycles for Jochen's WORD-Indexed version
12      cycles for Dave's version (with minor changes)


In terms of cycle counts a P3 looks good against the P4s, but not against the more recent processors.
eschew obfuscation

Antariy

Quote from: MichaelW on August 14, 2010, 11:42:23 PM
Since Dave mentioned obsolete, here are the timings for a P3:


Thanks, Michael!

The PIII is a good CPU, but I see that procs with a big lookup table do not run very well on PIIIs (judging by FORTRANS's CPU and your CPU timings).
On my CPU MMX/SSE does not work very well. Is this a feature of Celerons only, or of every PIV?



Alex

hutch--

I would not take too much notice of what is deemed to be obsolete and what is not, if it does the job, it does the job. I particularly liked the Northwood P4 I used to develop on and after endless pissing around to get some legacy boards that stayed working, the two P4s I have running are both fast and useful machines. Yes you can do faster parallel processing on late model quads but they are laggier than a single core and not always faster.

Michael, you are lucky to be able to keep a PIII going, I had hell's own problems getting later P4s reliable once my old board died. My only old timer is a fluke, someone gave me a 1200 Celeron 8 years ago and when I tested it a few months ago it booted straight up so I shoved it into a box.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

i have a p1/mmx running, but it isn't convenient to hook it up to the internet
if i had thought about it, i could have run an ether line to the router when i installed it
but, getting a modern wireless adapter to work under win98 isn't very likely

hutch--

Now you know why I still run a fully wired network with a couple of gigabit hubs, you can plonk a gigabit adapter into just about any PCI slot and it will work fine. I have one of the spares in the ancient Celeron box and it's probably faster than the bus on the board but in performance terms it runs fine.

What I am waiting for is much faster fully optical networking as you can then start to gang machines in very interesting ways. CAT6 will handle 10 gigabit if routed properly but full optical has the potential to be much faster again. I have seen AOE data where you can RAID stripe multiple connections to get massive data transfer rates and this will start to be possible if you get really high speed optical networking for PC networks.

Antariy

Hi!

This is a new test, in which I changed only Dave's (aka KeepingRealBusy) proc.
This proc can report in ecx whether the conversion succeeded. I added support for different string terminators to the code (a new subproc was added). That subproc can set or reset any char_code/char/char_sequence to be treated as a valid terminator or not.

So, see the comments; I am going offline, sorry.
A big request to all: please test this. Look at Dave's timings (because only his proc changed). I decreased the timings by one clock on my CPU.
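
To make the success flag and the configurable terminators easier to picture, here is a rough C sketch of the same idea. It is not the code from the attached test: the names `is_terminator`, `set_terminator` and `htodw_checked` are hypothetical, the success flag that the proc reports in ecx is modelled as the return value, and only single-character terminators are handled (no char sequences).

```c
#include <stdint.h>

/* Hypothetical terminator map: nonzero means "this byte may end the number".
   Only index 0 (the NUL byte) is enabled by default. */
static unsigned char is_terminator[256] = { 1 };

/* Enable or disable one byte value as a valid terminator. */
void set_terminator(unsigned char c, int enabled) {
    is_terminator[c] = enabled ? 1 : 0;
}

/* Convert 1 to 8 hex digits followed by a valid terminator.
   Returns 1 and stores the result in *value on success, 0 otherwise. */
int htodw_checked(const char *s, uint32_t *value) {
    const unsigned char *p = (const unsigned char *)s;
    uint32_t result = 0;
    int digits = 0;

    for (;; ++p) {
        unsigned c = *p;
        unsigned d;
        if (c >= '0' && c <= '9') {
            d = c - '0';
        } else if ((c | 0x20) >= 'a' && (c | 0x20) <= 'f') {
            d = (c | 0x20) - 'a' + 10;      /* accepts both cases */
        } else {
            break;                          /* first non-hex char */
        }
        if (++digits > 8) return 0;         /* would overflow a DWORD */
        result = (result << 4) | d;
    }

    if (digits == 0 || !is_terminator[*p]) return 0;
    *value = result;
    return 1;
}
```

With the defaults, `htodw_checked("1F", &v)` succeeds and `htodw_checked("1Fh", &v)` fails; after `set_terminator('h', 1)` the second call succeeds as well.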

My timings are these:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
164     cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


26      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
31      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

29      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)


25      cycles for Fast version
25      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

43      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
27      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---




Alex

KeepingRealBusy

Alex,

Here are my P4 timings:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)



14      cycles for Fast version
19      cycles for Fast version under AMD
32      cycles for Small 1
35      cycles for Small 2
35      cycles for Small 3
44      cycles for Small 3.1
35      cycles for Small 4
20      cycles for MMX 1
18      cycles for MMX 2
25      cycles for SSE1

Other's Versions:
34      cycles for Axhex2dw improved by Hutch (1)
61      cycles for Axhex2dw improved by Hutch (2)

14      cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
15      cycles for Dave's version (with minor changes)


14      cycles for Fast version
18      cycles for Fast version under AMD
21      cycles for Small 1
45      cycles for Small 2
35      cycles for Small 3
35      cycles for Small 3.1
36      cycles for Small 4
20      cycles for MMX 1
18      cycles for MMX 2
23      cycles for SSE1

Other's Versions:
34      cycles for Axhex2dw improved by Hutch (1)
61      cycles for Axhex2dw improved by Hutch (2)

14      cycles for Lingo's SSE version
12      cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
15      cycles for Dave's version (with minor changes)


14      cycles for Fast version
14      cycles for Fast version under AMD
32      cycles for Small 1
43      cycles for Small 2
35      cycles for Small 3
34      cycles for Small 3.1
37      cycles for Small 4
20      cycles for MMX 1
18      cycles for MMX 2
23      cycles for SSE1

Other's Versions:
97      cycles for Axhex2dw improved by Hutch (1)
85      cycles for Axhex2dw improved by Hutch (2)

18      cycles for Lingo's SSE version
12      cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
15      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    174
krbhtodw:       547
--- ok ---