Benchmark and test for htodw algos.

Started by hutch--, August 03, 2010, 07:05:52 AM

Antariy

Hutch, please test this in your old test-bed, or add the ax_jj_htodw algo to your test-bed.
This is mostly Jochen's word-indexed table look-up algo, but with support for short strings.
It should not be slow... Please check it. Because the algo uses a big look-up table, it works better on CPUs newer than mine.
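
For readers without the attached archive, the word-indexed look-up idea can be sketched in C roughly as follows. This is an illustration only, not Jochen's or Alex's actual MASM code: the names `pair_table`, `build_pair_table` and `htodw_pairs`, the 64 KB table layout, and the way odd-length (short) strings are handled are all assumptions. The sketch does no input validation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: value of two hex characters, indexed by the 16-bit pair. */
static uint8_t pair_table[65536];
static int pair_table_ready = 0;

/* Value of one hex digit; any other character counts as 0 (no validation). */
static unsigned nibble(unsigned c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c |= 0x20;                              /* fold upper case to lower case */
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return 0;
}

static void build_pair_table(void)
{
    for (unsigned hi = 0; hi < 256; hi++)
        for (unsigned lo = 0; lo < 256; lo++)
            pair_table[(hi << 8) | lo] = (uint8_t)((nibble(hi) << 4) | nibble(lo));
    pair_table_ready = 1;
}

/* Convert a zero-terminated hex string, two characters per table look-up. */
uint32_t htodw_pairs(const char *s)
{
    assert(s != NULL);
    if (!pair_table_ready) build_pair_table();

    size_t len = strlen(s);
    size_t i = 0;
    uint32_t result = 0;

    if (len & 1) {                          /* odd length: take the first digit alone */
        result = nibble((unsigned char)s[0]);
        i = 1;
    }
    for (; i < len; i += 2) {
        /* The index is built byte by byte so the sketch does not depend on endianness. */
        unsigned idx = ((unsigned)(unsigned char)s[i] << 8) | (unsigned char)s[i + 1];
        result = (result << 8) | pair_table[idx];
    }
    return result;
}
```

A call such as `htodw_pairs("fff")` takes the single-digit path for the first character and then needs one table look-up for the remaining pair.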



Alex
P.S. Copyright (c) - Jochen, aka JJ, aka jj2007
I only added support for short strings and reordered some things in the algo.

jj2007

Very nice, Alex :U
But it is really Copyright (c) Alex, with some inspiration from Jochen :wink

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz
625 htodw JJ short (124 bytes)
3015 atodw library
1468 Alex short
1204 Lingo long
781 Alex long
1484 clive short


hutch--

Very good result.


Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
297 htodw JJ short (124 bytes)
1453 atodw library
563 Alex short
281 Lingo long
329 Alex long
515 clive short

297 htodw JJ short (124 bytes)
1344 atodw library
547 Alex short
281 Lingo long
328 Alex long
516 clive short

Press any key to continue ...
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

FORTRANS

Hi,

   Pentium III results.

Regards,

Steve N.


G:\WORK>2test_ax
Pentium Pro, II or Celeron Processor
2764 htodw JJ short (124 bytes)
12348 atodw library
4276 Alex short
2032 Lingo long
1792 Alex long
4466 clive short

2774 htodw JJ short (124 bytes)
12348 atodw library
3866 Alex short
2023 Lingo long
1773 Alex long
4466 clive short

Press any key to continue ...

Antariy

Hi!


A big request to all: please run the test-bed in the attached archive!

This continues the development and testing of the hex2dword procs.

The test includes Lingo's latest (yesterday's) proc (the one with "...reordered something...").
It also includes Jochen's perfect WORD-indexed lookup table algo, which now supports short strings.
And it includes ALL versions of my hex2dword procs: 5 small versions with different algo implementations, some of which also work with strings that are not zero-terminated but end with a space, CRLF, etc. (any character code below 30h), plus 2 versions tweaked by Hutch.
Also included are my fast GPR, MMX and SSE1 versions of the algos. Of these, the MMX/SSE versions also support strings that are not zero-terminated.
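
The "terminated by any character below 30h" behaviour can be illustrated with a small C loop. This is a sketch under assumptions, not one of the Axhex2dw procs: the name `hex2dw_small` is hypothetical, and the sketch does not validate its input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert hex digits until the first character below 0x30 ('0').
   That covers the terminating zero as well as space, CR, LF and tab. */
uint32_t hex2dw_small(const char *s)
{
    assert(s != NULL);
    uint32_t result = 0;
    unsigned c;

    while ((c = (unsigned char)*s++) >= 0x30) {
        if (c > '9')
            c = (c | 0x20) - ('a' - 10);    /* 'A'..'F' and 'a'..'f' become 10..15 */
        else
            c -= '0';                       /* '0'..'9' become 0..9 */
        result = (result << 4) | (c & 0xF); /* non-hex input gives garbage, not an error */
    }
    return result;
}
```

With this loop, `hex2dw_small("1F\r\n")` stops at the carriage return and `hex2dw_small("abc def")` stops at the space.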

These are my timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
28      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)


28      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
46      cycles for Small 2
43      cycles for Small 3
47      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
29      cycles for MMX 2
31      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

29      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
28      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)


25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
55      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

29      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
28      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  120
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    166
krbhtodw:       468
--- ok ---



Optimizing integer algos on my CPU is not a very easy task (Hutch knows what it is like: coding for a Celeron Prescott). But MMX/SSE code works really poorly on my system. Reordering, or using different regs and instructions, gives no advantage at all. I think this is because my L2 cache is small.
So any test reports and suggestions are welcome.



Dave's (aka KeepingRealBusy) version is the first version he posted, with some minor changes of mine applied. Sorry, Dave, if you did not want to take part in the tests. I posted the first version of this tweak here: "http://www.masm32.com/board/index.php?topic=14438.msg116559#msg116559". Maybe you missed that post. But on my machine this tweak is the fastest: your tweak using ROL and SHL takes 34 clocks, my tweak takes 28 clocks (on my CPU). The other tweaks have timings of no less than 42 clocks. So I selected this revision for testing.



Jochen, sorry for "stealing" your algo :). As I already said, I had also come up with a similar proc, but you were not as lazy as I was and implemented it first. So the copyright is yours, because you wrote this algo *first*. And my implementation of short-string support is not the best; it is a quick-and-dirty solution.



Hutch, please test this if you have time.
My point of view is this: Jochen's algo may be the fastest with long strings, if a fast solution for short-string support is found. His proc could also be reliable if a checking table is added.
Dave's algo is the most reliable of all the tested algos, because it tests its input and could report errors with very little extra work. With that it would be reliable, fast and relatively small.
Both Jochen's and Dave's algos have no (or very little) sensitivity to code/data placement, because their look-up tables are byte tables. This has two advantages: alignment is not needed and does not matter, and a cache line holds 4 times more table data than it would with a DWORD table.
Note: by "reliability" I mean checking the data processed by the code: whether it is hex or not.
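
The byte-table approach with input checking can be sketched in C as below. This is not Dave's krbhtodw: the names `digit_table` and `htodw_checked`, the 0FFh "invalid" marker and the 8-digit limit are assumptions made for illustration. A 256-entry byte table occupies 256 bytes, a quarter of the 1024 bytes the same table would need with DWORD entries, which is where the cache-line advantage comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte table: digit value for each character, 0xFF = not a hex digit. */
static uint8_t digit_table[256];
static bool digit_table_ready = false;

static void build_digit_table(void)
{
    for (int i = 0; i < 256; i++) digit_table[i] = 0xFF;
    for (int i = 0; i < 10; i++)  digit_table['0' + i] = (uint8_t)i;
    for (int i = 0; i < 6; i++) {
        digit_table['A' + i] = (uint8_t)(10 + i);
        digit_table['a' + i] = (uint8_t)(10 + i);
    }
    digit_table_ready = true;
}

/* Returns true and stores the value in *out only if s holds 1 to 8 hex digits. */
bool htodw_checked(const char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (!digit_table_ready) build_digit_table();
    if (*s == '\0') return false;           /* empty string is an error */

    uint32_t result = 0;
    int count = 0;
    for (; *s != '\0'; s++) {
        uint8_t d = digit_table[(unsigned char)*s];
        if (d == 0xFF || ++count > 8)       /* bad character or too many digits */
            return false;
        result = (result << 4) | d;
    }
    *out = result;
    return true;
}
```

The error check costs one compare per character on top of the table look-up, so the checked version stays close to an unchecked byte-table loop.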

All MMX/SSE versions are most useful with long strings, i.e. 8-byte strings (full notation). That is because the timings of the MMX/SSE versions either do not depend on string length (my procs) or are only slightly longer for strings that are not 8 bytes (Lingo's proc).
So for occasional conversions the short versions of the procs are, in my view, the most usable.
For fast conversion the integer versions are the most usable, if the string length may be other than 8 bytes.
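
Why a fixed-width version suits full 8-character strings can be seen in a plain-integer (SWAR) analogue written in C. This is not the MMX/SSE code from the archive: `hex8_swar` is a hypothetical name, the function assumes exactly 8 valid hex characters, and it does no validation. All eight digits are converted with the same handful of operations and no per-character branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert exactly 8 hex characters to a 32-bit value, all digits at once. */
uint32_t hex8_swar(const char *s)
{
    assert(s != NULL);

    /* Gather the 8 characters with s[0] in the most significant byte. */
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | (unsigned char)s[i];

    /* Per byte: low nibble of the character, plus 9 if bit 6 is set (a letter). */
    uint64_t low    = v & 0x0F0F0F0F0F0F0F0FULL;
    uint64_t letter = (v >> 6) & 0x0101010101010101ULL;
    uint64_t n      = low + letter * 9;

    /* Squeeze the eight 4-bit values together: 8 bytes -> 4 bytes. */
    n = (n | (n >> 4))  & 0x00FF00FF00FF00FFULL;
    n = (n | (n >> 8))  & 0x0000FFFF0000FFFFULL;
    n = (n | (n >> 16)) & 0x00000000FFFFFFFFULL;
    return (uint32_t)n;
}
```

A shorter string would first have to be padded or shifted into place, which is the extra work a variable-length caller pays for this style of conversion.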


These are my thoughts on the usability of the algos; other people may have other opinions.



Alex

hutch--

Timings on Core2 quad.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)



16      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
27      cycles for Small 2
27      cycles for Small 3
29      cycles for Small 3.1
27      cycles for Small 4
10      cycles for MMX 1
11      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
28      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
11      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)


17      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
27      cycles for Small 2
45      cycles for Small 3
27      cycles for Small 3.1
27      cycles for Small 4
11      cycles for MMX 1
11      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
46      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
11      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)


16      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
47      cycles for Small 2
27      cycles for Small 3
27      cycles for Small 3.1
27      cycles for Small 4
10      cycles for MMX 1
11      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
28      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
11      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  120
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    166
krbhtodw:       468
--- ok ---

Antariy

Thanks, Hutch!

It seems that Jochen's integer algo is the fastest of the integer versions. But I use MMX in it to compute the string length, so it is not a purely integer version... Still, it can work on a PI-MMX. Not a very new CPU :)

Lingo's SSE version is very fast, but what timing does it have with a 7-byte string, for example?
I cannot write very good MMX/SSE code, because none of the techniques I use work well :( I cannot choose the right way to implement it. My MMX/SSE versions are mostly for fun (a SIMD remake of my short Axhex2dw, as you can see), but on my CPU they are faster by 1 clock. So I will use them :)


Dave's algo is very fast, considering the look-up table, the input checking, and some dependencies in the code.



Alex

Antariy

Hutch, it seems that the very first of the optimized versions of Axhex2dw is the fastest on the newest CPUs.


16      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1 *** THIS ***
27      cycles for Small 2



Interesting... An optimization that drops 5 clocks on a PIV turns into an anti-optimization that adds 2 clocks on Core.



Alex

jj2007

Good job, Alex :U
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

20      cycles for Fast version
20      cycles for Fast version under AMD
41      cycles for Small 1
41      cycles for Small 2
41      cycles for Small 3
57      cycles for Small 3.1
41      cycles for Small 4
14      cycles for MMX 1
15      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
41      cycles for Axhex2dw improved by Hutch (1)
58      cycles for Axhex2dw improved by Hutch (2)

9       cycles for Lingo's SSE version
35      cycles for Lingo's BIG integer version
14      cycles for Jochen's WORD-Indexed version
24      cycles for Dave's version (with minor changes)

FORTRANS

Hi,

   PIII, Dave's looks good here.

Steve


G:\WORK>12alex's
☺☺☻♥ (SSE1)



23      cycles for Fast version
28      cycles for Fast version under AMD
63      cycles for Small 1
62      cycles for Small 2
59      cycles for Small 3
59      cycles for Small 3.1
60      cycles for Small 4
18      cycles for MMX 1
19      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
59      cycles for Axhex2dw improved by Hutch (2)

17      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
36      cycles for Jochen's WORD-Indexed version
10      cycles for Dave's version (with minor changes)


25      cycles for Fast version
30      cycles for Fast version under AMD
64      cycles for Small 1
60      cycles for Small 2
59      cycles for Small 3
59      cycles for Small 3.1
60      cycles for Small 4
18      cycles for MMX 1
19      cycles for MMX 2
21      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
59      cycles for Axhex2dw improved by Hutch (2)

17      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
36      cycles for Jochen's WORD-Indexed version
11      cycles for Dave's version (with minor changes)


24      cycles for Fast version
30      cycles for Fast version under AMD
60      cycles for Small 1
60      cycles for Small 2
59      cycles for Small 3
60      cycles for Small 3.1
60      cycles for Small 4
18      cycles for MMX 1
19      cycles for MMX 2
23      cycles for SSE1

Other's Versions:
59      cycles for Axhex2dw improved by Hutch (1)
60      cycles for Axhex2dw improved by Hutch (2)

15      cycles for Lingo's SSE version
27      cycles for Lingo's BIG integer version
36      cycles for Jochen's WORD-Indexed version
10      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  120
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    166
krbhtodw:       468
--- ok ---

Antariy

Hi!

A very BIG thanks to all testers!!!

Here is new code. I improved Jochen's proc, which is now 5 clocks faster on my CPU. Jochen's proc is now the FASTEST of ALL the procs on my CPU.
I also rewrote my second MMX version. It now contains more ALU-cluster instructions. It MAY be faster. But on my CPU it is not (as I already said, this is the usual behaviour).

My timings:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)



25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
72      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)


27      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
26      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

29      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)


25      cycles for Fast version
30      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
43      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    182
krbhtodw:       468
--- ok ---


A big request to all: please test this.



Alex

hutch--

Here is the timing off my Core2 box.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)



16      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
27      cycles for Small 2
27      cycles for Small 3
29      cycles for Small 3.1
27      cycles for Small 4
10      cycles for MMX 1
10      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
28      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
12      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)


17      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
27      cycles for Small 2
45      cycles for Small 3
27      cycles for Small 3.1
27      cycles for Small 4
11      cycles for MMX 1
10      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
28      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
12      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)


16      cycles for Fast version
19      cycles for Fast version under AMD
25      cycles for Small 1
28      cycles for Small 2
27      cycles for Small 3
27      cycles for Small 3.1
27      cycles for Small 4
10      cycles for MMX 1
10      cycles for MMX 2
11      cycles for SSE1

Other's Versions:
28      cycles for Axhex2dw improved by Hutch (1)
27      cycles for Axhex2dw improved by Hutch (2)

5       cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
12      cycles for Jochen's WORD-Indexed version
18      cycles for Dave's version (with minor changes)

==========
Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw1 - 1:  70
Axhex2dw2 - 2:  69
Axhex2dw3 - 3:  57
Axhex2dw3_1 - 3.1:      56
Axhex2dw3 - 4:  61
Axhex2dw_MMX:   128
Axhex2dw_MMX2:  160
Axhex2dw_SSE:   160
Alex_Short_Hutch:       59
Axhex2dw_Hutch2:        54
Hex2dwLingoSSE: 160
lingo_htodw:    1950
ax_jj_htodw:    182
krbhtodw:       468
--- ok ---

dedndave

Prescott w/htt:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
45      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
30      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

30      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)

27      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
55      cycles for Small 2
45      cycles for Small 3
54      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
30      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
99      cycles for Axhex2dw improved by Hutch (2)

28      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
36      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)

25      cycles for Fast version
27      cycles for Fast version under AMD
48      cycles for Small 1
44      cycles for Small 2
45      cycles for Small 3
46      cycles for Small 3.1
43      cycles for Small 4
28      cycles for MMX 1
28      cycles for MMX 2
31      cycles for SSE1

Other's Versions:
48      cycles for Axhex2dw improved by Hutch (1)
83      cycles for Axhex2dw improved by Hutch (2)

27      cycles for Lingo's SSE version
24      cycles for Lingo's BIG integer version
23      cycles for Jochen's WORD-Indexed version
28      cycles for Dave's version (with minor changes)

Antariy

Quote from: hutch-- on August 14, 2010, 11:16:53 PM
Here is the timing off my Core2 box.


Thanks, Hutch!

Jochen's proc is the fastest. It is ~11 TIMES smaller than Lingo's proc, and faster by 1 clock.

As I expected, the MMX version does not do well :(


Alex

Antariy

Quote from: dedndave on August 14, 2010, 11:19:37 PM
Prescott w/htt:

Thanks, Dave!

On your CPU Jochen's proc is also the smallest/fastest.



Alex