I have on and off followed the algo development for replacing the htodw that Alex Yakubchik wrote years ago, and without claiming to do justice to the full range of algos developed, I have written a benchmark to test the algos in real time. The fastest non-SSE version is Lingo's long version, but for reasons I cannot accurately track down it is highly sensitive to code placement and to the code before and after it, even with controlled padding between the algos. Alex's long version is very consistent over a wide range of tests and on different hardware.
For the short versions, Alex's appears to be the most consistent across different processors; Clive's is faster on this quad but a lot slower on the P4s.
What I have in mind with these algos is to select a short and a long version, naming them "htodw" and "htodw_ex" respectively, so that the user has the choice of size or speed where they think it matters.
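For anyone following along without the attachments, the job all of these procs do is the same: parse an ASCII hex string into a 32-bit value. A minimal, unoptimised C sketch of the operation (the function name `hex_to_dword` is mine, not any of the posted procs):

```c
#include <stdint.h>

/* Convert an ASCII hex string (1..8 digits, upper or lower case,
   NUL-terminated) to a 32-bit value. Reference version only; the
   asm algos in this thread trade table size against arithmetic to
   do exactly this job faster. */
uint32_t hex_to_dword(const char *s)
{
    uint32_t result = 0;
    for (; *s; s++) {
        char c = *s;
        uint32_t digit;
        if (c >= '0' && c <= '9')
            digit = (uint32_t)(c - '0');
        else
            digit = (uint32_t)((c | 0x20) - 'a') + 10; /* fold to lower case */
        result = (result << 4) | digit;
    }
    return result;
}
```

The asm versions under test replace the per-digit branch with look-up tables of various sizes, which is exactly where the size/speed trade-off discussed below comes from.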
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
250 atodw library
109 Alex short
47 Lingo long
63 Alex long
94 clive short
Press any key to continue ...
Prescott P4
Genuine Intel(R) CPU 3.80GHz
312 atodw library
141 Alex short
62 Lingo long
78 Alex long
219 clive short
Press any key to continue ...
Northwood P4
Intel(R) Pentium(R) 4 CPU 2.80GHz
390 atodw library
172 Alex short
78 Lingo long
94 Alex long
266 clive short
Press any key to continue ...
Intel(R) Celeron(TM) CPU 1200MHz
1452 atodw library
471 Alex short
241 Lingo long
211 Alex long
541 clive short
Press any key to continue ...
I could not be bothered turning the i7 on, as it's too much messing around to get the data in and out, but it behaves similarly to the Core2 quad.
On my Core 2 Duo:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
468 atodw library
125 Alex short
94 Lingo long
94 Alex long
140 clive short
Press any key to continue ...
:U
Cannot Identify x86 Processor
265 atodw library
124 Alex short
63 Lingo long
78 Alex long
109 clive short
Press any key to continue ...
That's a Phenom II X6 1055T.
(What's up with your CPU detection?)
Poor showing here :(
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
483 atodw library
124 Alex short
78 Lingo long
78 Alex long
140 clive short
Note that both frktons and sinsi benched the same CPU but got significantly different timings. It's all down to the memory timings, it seems.
Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.
Commenting out the 'rcnt' macro is a bit better...
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short
Alignment is such a finicky thing eh?
Quote from: sinsi on August 03, 2010, 02:24:21 PM
Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.
Commenting out the 'rcnt' macro is a bit better...
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short
Alignment is such a finicky thing eh?
Well, the difference is only quad vs duo. I have DDR2 1066 as well.
To assemble the test I had to use the
.686 directive; it won't build without it.
Like sinsi, I commented out the rcnt macro, and the timings are a bit better:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
328 atodw library
140 Alex short
78 Lingo long
93 Alex long
140 clive short
Press any key to continue ...
I tried with .486 - .586 but it didn't compile. ::)
In the first test I just used the executable.
Trying it many times I get different results, like:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
327 atodw library
140 Alex short
78 Lingo long
78 Alex long
156 clive short
Press any key to continue ...
and sometimes Lingo is slower than Alex as well. ::)
People, please test my original version.
Link to archive: "http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
This links to an archive already attached to another thread.
The archive contains my original versions of the procs. Hutch tuned my short version of the algo, and on some hardware (on my CPU) the tuned algo seems to be slower than my version. On most non-high-end CPUs my version is faster.
Please test this, so we can see clearly on which hardware the short proc gives the most satisfactory results.
Thanks!
Alex
Hutch,
at this link "http://www.masm32.com/board/index.php?topic=14438.msg117187#msg117187" I posted some of my thoughts on the probable causes of the inconsistency of Lingo's long algo under changes to the placement of the algo, data, padding, etc.
And read the post below that one; it contains some further thoughts as well.
Alex
Guys, you forgot to add these lines to the sources of the test:
invoke GetCurrentThread
invoke SetThreadAffinityMask,eax,1
invoke Sleep,0 ; <--- force a mode switch by dropping the current time slice
Maybe this is the reason for the inconsistent timings on Core2 with one core versus four cores. Usually this doesn't change anything, but in tests on a quad-core Core i7 I get similar results - the timings are very inconsistent, differing by tens of clocks.
Alex
Quote from: sinsi on August 03, 2010, 02:24:21 PM
Commenting out the 'rcnt' macro is a bit better...
...............skipped............
Alignment is such a finicky thing eh?
Yes, it seems that Lingo's algo, with its big look-up table, is very dependent on alignment and on placement in the file.
It needs some "dancing with a tambourine" to make it work faster :)
But it has one advantage: his algo uses a pre-calculated table with all the needed values (a very BIG table), so in the best case it *may* be the fastest algo.
My algo has a "trimmed" look-up table and uses some simple math to produce the valid values. So the algo is slower in theory, but under multitasking Windows, with the limited (big, but *limited*) size of the caches, my algo (and any algo with a small look-up table) has some advantages.
Alex
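To make the trade-off Alex describes concrete, here is a hedged C sketch of the two table strategies for decoding hex digits. The names and table layout are mine, not Lingo's or Alex's actual code: strategy A uses one large precomputed table indexed by a whole two-character pair (one lookup per byte, but 64K of table competing for cache), while strategy B uses a small per-digit table plus shifts (more work per digit, but the table fits easily in L1 cache).

```c
#include <stdint.h>

/* Strategy A ("big table", Lingo-style in spirit): 64K entries indexed
   by a 2-character ASCII pair, so a whole byte decodes in one lookup. */
static uint8_t pair_tab[65536];

/* Strategy B ("trimmed table", Alex-style in spirit): 256 entries
   giving the value of a single hex digit. */
static uint8_t digit_tab[256];

static void init_tables(void)
{
    static const char hex[]   = "0123456789ABCDEFabcdef";
    static const uint8_t val[] = {0,1,2,3,4,5,6,7,8,9,
                                  10,11,12,13,14,15,10,11,12,13,14,15};
    for (int i = 0; i < 22; i++)
        digit_tab[(unsigned char)hex[i]] = val[i];
    for (int i = 0; i < 22; i++)
        for (int j = 0; j < 22; j++)
            pair_tab[((unsigned char)hex[i] << 8) | (unsigned char)hex[j]]
                = (uint8_t)((val[i] << 4) | val[j]);
}

/* Decode 8 hex chars with the big table: 4 lookups. */
uint32_t htodw_bigtab(const char *s)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i += 2)
        r = (r << 8) | pair_tab[((unsigned char)s[i] << 8)
                                | (unsigned char)s[i + 1]];
    return r;
}

/* Decode 8 hex chars with the small table: 8 lookups plus shifts. */
uint32_t htodw_smalltab(const char *s)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 4) | digit_tab[(unsigned char)s[i]];
    return r;
}
```

Both produce identical results; which one wins in practice depends on cache pressure from the surrounding program, which is exactly Alex's point about multitasking Windows and limited caches.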
Frank - you might try this to see if it helps get more consistent timings...
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001
take several readings prior to applying the change
then, take several more afterward to verify that it helps
it makes a big difference on my P4 Prescott w/HTT
Just be careful how you modify the test conditions for algos of this type. For all of its irritations, ring3-access real-time testing is the closest you will get to the conditions the algorithms are actually used under, and narrowing those conditions can alter the results so that they are no longer referential.
Lingo's algo has been faster on most of the recent hardware, but it is highly sensitive to code placement both before and after it, and this can slow it down by over 50%. I added the variable padding between algos for a reason: it is a way of testing how sensitive the code is to code placement. You can vary the iteration count and the style of padding (nop) (db 0) (int 3) to test whether these make any difference on particular hardware. I also found that Lingo's algo was slower if it had not first been run, while the rest were effectively indifferent to this problem.
For general use the take-off time of an algo is important, as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimised pass at the data.
Just as a reference for the results I originally posted, the quad Core2 I use has 1333 memory, which may affect the results.
"I have a question for Lingo. I have been testing your unrolled algo and in isolation it is the fastest of the non SSE/MMX versions by a reasonable amount but as soon as I add other algos to test against it slows down by over 50%. I have tried various things including putting the tables in the .DATA section which helped a little but when I added another algo for testing it slowed down by a long way. Any idea why its so sensitive to code placement before and after it ?"
But here we see you got the idea and found the way: :lol
"it is a way of testing how sensitive the code is to code placement."
How did you do it? Are there any external links about your method? :lol
"I also found that Lingo's algo was slower if it had not first been run while the rest were effectively indifferent to this problem."
Is it possible to post two different asm and exe files to prove it? :lol
And the epilog:
"For general use the take off time of an algo is important as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimized pass at the data."
It is the same as:
"Yes, it seems, which Lingo's algo with big look-up table is very dependent from alignment and placement in the file. Need some "dancing with tambourine" to do it works faster :)" by the asian lamer with archaic CPU.
So, in other words, Lingo is some kind of liar who tries to manipulate people, and Lingo's algos' results are achieved by fraud...
:bg
I am not sure what you are after here. I bothered to do the testing, then kept fiddling with the test piece so that your algo ran at its highest tested speed, but I have addressed the problems in doing so. Simply by adding another algo AFTER IT, the speed dropped by more than 50%, so I moved it around until it ran at its highest speed again.
Nothing like a test piece to prove the result. Include or disallow some of the other algos and it slows down by about half.
312 atodw library
110 Alex short
93 Lingo long
110 clive short
Press any key to continue ...
Now this is a speed difference of 47 to 93 ms, about twice as slow depending on code location.
Instead of flapping your mouth off at me, try addressing the problem. Your code is fast, but it is sensitive to where it sits in the executable, where none of the others are.
"Instead of flapping your mouth off at me, try addressing the problem"
The "problem", as I expected, is in the fluctuations of the testing program.
1. I used Hutch's h2dtimings.zip testing program from the beginning of the thread
2. In the 1st file, fileh2dt1.asm, I moved the lingo_htodw proc to the 1st position and ran fileh2dt1.exe 5 times
3. In the 2nd file, fileh2dt2.asm, I moved the lingo_htodw proc to the 2nd position and ran fileh2dt2.exe 5 times
4. In the 3rd file, fileh2dt3.asm, I moved the lingo_htodw proc to the 3rd position and ran fileh2dt3.exe 5 times
5. In the 4th file, fileh2dt4.asm, I moved the lingo_htodw proc to the 4th position and ran fileh2dt4.exe 5 times
Results:
lingo_htodw in 1st place ->fileh2dt1.exe
Results:
C:\5>h2dt1
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
109 Alex short
62 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt1
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
110 Alex short
47 Lingo long
62 Alex long
110 clive short
Press any key to continue ...
C:\5>h2dt1
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
110 Alex short
62 Lingo long
63 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt1
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
109 Alex short
47 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt1
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
108 Alex short
47 Lingo long
94 Alex long
94 clive short
Press any key to continue ...
End Results for 1st place:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
108 Alex short
47 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
lingo_htodw in 2nd place ->fileh2dt2.exe
Results:
C:\5>h2dt2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
110 Alex short
47 Lingo long
62 Alex long
110 clive short
Press any key to continue ...
C:\5>h2dt2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
109 Alex short
62 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
109 Alex short
63 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
C:\5>h2dt2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
109 Alex short
62 Lingo long
63 Alex long
93 clive short
Press any key to continue ...
C:\5>h2dt2
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
281 atodw library
109 Alex short
47 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
End Results for 2nd place:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
281 atodw library
108 Alex short
47 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
lingo_htodw in 3rd place ->fileh2dt3.exe
Results:
C:\5>h2dt3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
109 Alex short
62 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
281 atodw library
109 Alex short
63 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
C:\5>h2dt3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
109 Alex short
46 Lingo long
78 Alex long
93 clive short
Press any key to continue ...
C:\5>h2dt3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
110 Alex short
47 Lingo long
63 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt3
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
110 Alex short
62 Lingo long
63 Alex long
94 clive short
Press any key to continue ...
End Results for 3rd place:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
281 atodw library
109 Alex short
46 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
lingo_htodw in 4th place ->fileh2dt4.exe
Results:
C:\5>h2dt4
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
297 atodw library
109 Alex short
63 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
C:\5>h2dt4
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
312 atodw library
110 Alex short
47 Lingo long
62 Alex long
110 clive short
Press any key to continue ...
C:\5>h2dt4
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
343 atodw library
109 Alex short
47 Lingo long
63 Alex long
109 clive short
Press any key to continue ...
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
297 atodw library
109 Alex short
63 Lingo long
62 Alex long
94 clive short
Press any key to continue ...
C:\5>h2dt4
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
296 atodw library
109 Alex short
47 Lingo long
62 Alex long
109 clive short
Press any key to continue ...
End Results for 4th place:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
297 atodw library
109 Alex short
47 Lingo long
62 Alex long
93 clive short
Press any key to continue ...
If someone wants to test my files, please run every exe file 5 times consecutively and take the best results for every algo. Thanks!
This does not help you after flapping your mouth off. I originally reported that your algo was the fastest but was subject to slowdowns depending on the placement of the code.
I posted a second test piece which is proof that your algo is unreliable in terms of timing due to code placement.
Here are your 5 consecutive runs with the SECOND test piece on the 3 GHz Core2 quad I am using. Memory is 1333.
313 atodw library
109 Alex short
94 Lingo long
93 clive short
Press any key to continue ...
312 atodw library
109 Alex short
157 Lingo long
93 clive short
Press any key to continue ...
297 atodw library
109 Alex short
110 Lingo long
109 clive short
Press any key to continue ...
313 atodw library
109 Alex short
94 Lingo long
109 clive short
Press any key to continue ...
313 atodw library
109 Alex short
94 Lingo long
94 clive short
Press any key to continue ...
You used the test file from lingo_slow.zip rather than the files from h2dt.zip... Anybody else? :lol
:bg
Like it or lump it, your algo is unreliable due to code placement, and I posted the example to prove it.
"algo is unreliable due to code placement"
Quickly send your invention to Intel and get a big prize :lol
Have you tried different MASM/JWasm versions, Hutch? Maybe just the one you're using is bugged, because code not working consistently due to placement, beyond slight differences, seems like a serious assembler bug to me.
It has absolutely nothing to do with "assembler bugs". See the code location sensitivity thread (http://www.masm32.com/board/index.php?topic=11454.0).
Quote from: jj2007 on August 04, 2010, 06:52:24 PM
It has absolutely nothing to do with "assembler bugs". See the code location sensitivity thread (http://www.masm32.com/board/index.php?topic=11454.0).
alignment and code location aren't the same thing; if it were a simple alignment issue, I'm sure Hutch would have tweaked it by now.
You are the expert :toothy
Quote from: jj2007 on August 04, 2010, 08:54:39 PM
You are the expert :toothy
you may post quite a bit of code, but that doesn't make you an expert either. What I'm saying is common sense: if I have 1000 different functions, all aligned by 16, and one happens to be Lingo's, which gives drastically different speed results simply by making it the first function rather than the last, then that's a problem, and as I said it is not the same as a simple alignment issue.
If you are not too tired, just read the sensitivity thread - it has very little to do with alignment.
Lingo,
Quote
"Yes, it seems, which Lingo's algo with big look-up table is very dependent from alignment and placement in the file. Need some "dancing with tambourine" to do it works faster :)" by the asian lamer with archaic CPU.
"When the arguments run out, the insults begin." An English proverb, for your info.
So, you are not an adequate man who might say something interesting to civilization, because you NEVER back up your claims with arguments.
You don't understand hardware and you cannot generate useful code; you may do something "great" only on the newest hardware, which is tolerant of a lamer's techniques and can run your BLOATED algos satisfactorily (but with creaking :).
And, the epilog:
Lingo, your Russian is as good as my English :P
Alex
Hutch, remember your benchmark, the one you made for the "String-to-DWORD conversion procs, and one Bug of atodw proc" thread (http://www.masm32.com/board/index.php?board=6;topic=14438.0)? That benchmark is in the bmk2.zip archive.
The benchmark contains a generator for a hex-text file, which is used in real-world testing.
I think that, for really fair results, the benchmark needs to run on lines that are not only 8 bytes long.
I wrote a program similar to yours which generates a text file with variable-length strings: from 1 byte to 8, cyclically.
I did not change your testing algorithm; I only substituted my procs and changed the console output text. In the .BAT file I changed hextxt.exe to hextxt2.exe, to generate the variable-length strings file.
I also added code to print the size of each proc.
So, with variable-length strings, the results are different (my PC test):
1000000 = item count in file
1797 ms Lingo 1
1610 ms Lingo 2
1390 ms Alex Unrolled
1610 ms Alex Unrolled (AMD)
1765 ms Lingo 1
1594 ms Lingo 2
1406 ms Alex Unrolled
1594 ms Alex Unrolled (AMD)
1766 ms Lingo 1
1593 ms Lingo 2
1438 ms Alex Unrolled
1594 ms Alex Unrolled (AMD)
1765 ms Lingo 1
1610 ms Lingo 2
1421 ms Alex Unrolled
1610 ms Alex Unrolled (AMD)
1781 ms Lingo 1
1594 ms Lingo 2
1406 ms Alex Unrolled
1594 ms Alex Unrolled (AMD)
1781 ms Lingo 1
1594 ms Lingo 2
1437 ms Alex Unrolled
1594 ms Alex Unrolled (AMD)
1766 ms Lingo 1
1593 ms Lingo 2
1407 ms Alex Unrolled
1593 ms Alex Unrolled (AMD)
1766 ms Lingo 1
1609 ms Lingo 2
1422 ms Alex Unrolled
1594 ms Alex Unrolled (AMD)
1773 MS Lingo average
1599 MS Lingo2 average
1415 MS Alex average
1597 MS Alex (AMD) average
Size of code:
171 Lingo 1 proc
2076 Lingo 2 proc
396 Alex proc
396 Alex proc (AMD)
Press any key to continue ...
Considering that these versions are universal - i.e. they can work with short-notation strings - all the benchmarks need to be run on variable-length (short-notation) hex strings.
I also have an SSE2 version of the proc, which is limited to 8-byte strings and is about 67% faster than my unrolled version, but I don't use it in the testing because it is not universal and cannot run on CPUs older than the P4.
My "AMD" version should run faster on AMD CPUs, but this is not guaranteed, because it has only been tested on E^cube's and Dave's AMD CPUs.
So, a request to everyone: please run this benchmark if you have a spare minute. This is a more real-world benchmark; the idea for it belongs to Hutch.
I am not trying to prove anything - this is just very interesting info about different hardware.
Alex
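A sketch of what a variable-length generator like the hextxt2.exe Alex describes might look like. This is a guess at the behaviour he describes (string lengths cycling from 1 to 8 digits), written in C for illustration; the file name, value derivation, and function name are all mine, not his actual program:

```c
#include <stdio.h>

/* Write `count` lines of hex text, cycling the string length from
   1 through 8 digits (1,2,...,8,1,...), similar in spirit to the
   variable-length test-file generator described in the thread. */
int write_hex_file(const char *path, unsigned count)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    unsigned len = 1;
    for (unsigned i = 0; i < count; i++) {
        unsigned value = i * 2654435761u;  /* cheap value mixing */
        /* mask to at most `len` hex digits, zero-pad to exactly `len` */
        fprintf(f, "%0*X\n", (int)len, value & (0xFFFFFFFFu >> (32 - 4 * len)));
        len = (len % 8) + 1;
    }
    fclose(f);
    return 0;
}
```

The point of such a file is that a parser tuned only for fixed 8-digit strings (like the SSE2 version Alex mentions) cannot be benchmarked fairly against the universal versions on it.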
Hutch's test
Quote from: hutch-- on August 04, 2010, 01:20:18 PM
Instead of flapping your mouth off at me, try addressing the problem, your code is fast but it is sensitive to where it is in the executable where none of the others are.
Results:
594 atodw library
250 Alex short
235 Lingo long
406 clive short
Press any key to continue ...
Wow! Lingo gets 6% more performance for 3000% bigger code size! This is great, really! :bdg
Alex
Quote from: jj2007 on August 04, 2010, 09:33:58 PM
If you are not too tired, just read the sensitivity thread - it has very little to do with alignment.
the word "align" is said 103 times in that thread...
Lingo's latest test (with many executables):
exe1
Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
141 Lingo long
140 Alex long
407 clive short
Press any key to continue ...
Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
141 Lingo long
140 Alex long
407 clive short
Press any key to continue ...
exe2
Intel(R) Celeron(R) CPU 2.13GHz
578 atodw library
250 Alex short
141 Lingo long
140 Alex long
391 clive short
Press any key to continue ...
Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...
exe3
Intel(R) Celeron(R) CPU 2.13GHz
579 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...
Intel(R) Celeron(R) CPU 2.13GHz
579 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...
exe4
Intel(R) Celeron(R) CPU 2.13GHz
593 atodw library
250 Alex short
141 Lingo long
141 Alex long
422 clive short
Press any key to continue ...
Intel(R) Celeron(R) CPU 2.13GHz
578 atodw library
250 Alex short
141 Lingo long
140 Alex long
406 clive short
Press any key to continue ...
So, these are the results we have. Lingo's proc (probably) doesn't depend on code placement here, but his code isn't faster in this test (which still isn't really real-world, without flexible string lengths). Considering its VERY BIG size, this code is not very useful. But some people may have different opinions (especially the author of the BIG algo).
Alex
E^cube, please test this:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
and this:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14540.0;id=7903"
If it is not unpleasant for you, of course.
Alex
I ran into the problem with Lingo's algo while writing the first test piece. I could get it to clock 47 ms with no problems, but as I added extra algos to compare it against, its timing altered back and forth as code was added.
The first test piece was fiddled to make sure Lingo's algo was located within the exe so that it ran at its full speed, and I reported the code placement problem.
In response to Lingo flapping his mouth off, I posted his identical algorithm in another test piece where his algo's speed dropped by half simply due to code placement, and it is this speed fluctuation that makes it unreliable.
For an algorithm to be general purpose it needs to be callable from anywhere, anytime, and not depend on obscure conditions to work properly and at full speed.
I will add the two most reliable algos to the library, but it will be based on real-time testing across a wide range of hardware, not a test piece cooked to make one algo look good.
You are right, Hutch.
Try testing this: "http://www.masm32.com/board/index.php?action=dlattach;topic=14540.0;id=7903"
This is your testing program, Hutch, but I added flexible-length strings for the test. Real-world testing needs to cover more than fixed-length strings, doesn't it?
This is a real-world real-time test, not a "clocks" test. Please take a look; maybe you can add some improvements.
Alex
Quote from: Antariy on August 04, 2010, 11:01:36 PM
E^cube, test this, please:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
and this:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14540.0;id=7903"
If I don't unpleasant to you, of course.
Alex
the second test did unpleasant me because it took awhile, but here you go.
AMD Athlon(tm) 64 Processor 3000+ (SSE3)
Alex's versions
All algos work on i386+ CPUs
ABCDEF01 Result of Unrolled
ABCDEF01 Result of Unrolled (AMD)
ABCDEF01 Result of Short
35 cycles for Fast version
22 cycles for Fast version under AMD
40 cycles for Small version in faster compilation
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw - Small: 69
--- ok ---
Generating test file...
1000000 = item count in file
2172 ms Lingo 1
1500 ms Lingo 2
2016 ms Alex Unrolled
1796 ms Alex Unrolled (AMD)
2172 ms Lingo 1
1500 ms Lingo 2
2032 ms Alex Unrolled
1796 ms Alex Unrolled (AMD)
2172 ms Lingo 1
1500 ms Lingo 2
2000 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2172 ms Lingo 1
1500 ms Lingo 2
2000 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2172 ms Lingo 1
1484 ms Lingo 2
2016 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2156 ms Lingo 1
1484 ms Lingo 2
2016 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2172 ms Lingo 1
1500 ms Lingo 2
2031 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2156 ms Lingo 1
1516 ms Lingo 2
2015 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)
2168 MS Lingo average
1498 MS Lingo2 average
2015 MS Alex average
1796 MS Alex (AMD) average
Size of code:
171 Lingo 1 proc
2076 Lingo 2 proc
396 Alex proc
396 Alex proc (AMD)
Press any key to continue ...
"I could get it to clock 47 ms with no problems..."
me too.. :toothy
"...but as I added extra algos to do the comparison against it its timing altered back and forth as code was added."
Maybe that is the "problem"... :lol
"it ran at its full speed and I reported the problem of code placement."
So, it would be interesting to know where the "right" code placement is to make it run faster?
Or maybe how to write speed-optimized code and place it in the right place,
or what bad code placement is and how to avoid it, etc...
It will be a newly invented chapter of rules in Intel's Optimization Reference Manual... :lol
"I will add the two most reliable algos to the library..."
as usual, so that every stupid lamer can fix and improve them, but that is your human right and responsibility as the owner... :toothy
Example (http://www.masm32.com/board/index.php?topic=14438.0)
:bg
I could also get it to clock twice as slow, and that was the problem: with no algorithm modification at all, the algo slowed to the speed of the small versions because it has a problem with code placement.
Now, while the rest were reasonably consistent in their timings (apart from the old library version "htodw" that I placed first), the problem is that you cannot predict its performance without individually benchmarking an application after each piece of code is added to it, and that renders the algorithm unworkable in its present form.
For Lingo flapping his mouth off, the question is "Does the clock tell lies?" I suggest it does not. In motor racing there is an expression, "When the flag drops the bullsh*t stops", and while having an algo that is fast in some places may be good for flapping your mouth off, when the end result is that it's unreliable in terms of timing, then it's no use. An F1 car may win a race, but if it can't win a series it's a loser. :bdg
I have got a bit more consistency out of Lingo's algo by setting up the data to be converted for testing in this form. All of the other test algos do not change timings.
.data
hword db "C0000000",0
align 16
pword dd hword
Quote from: E^cube on August 04, 2010, 10:48:47 PM
Quote from: jj2007 on August 04, 2010, 09:33:58 PM
If you are not too tired, just read the sensitivity thread - it has very little to do with alignment.
the word align is said 103 times in that thread...
You are good at counting words. If you had understood the contents of the thread, you would know that alignment by itself does not explain the code location sensitivity. As was demonstrated in Hutch's example.
Quote from: jj2007 on August 05, 2010, 07:53:20 AM
You are good at counting words. If you had understood the contents of the thread, you would know that alignment by itself does not explain the code location sensitivity. As was demonstrated in Hutch's example.
Quote from: E^cube on August 04, 2010, 06:57:55 PM
alignment and code location aren't the same thing, if it were a simple alignment issue i'm sure hutch would of tweaked it by now.
Reading and actually comprehending is the key here, which you're lacking. And you're right, I am good at counting: I said the above on the second page (that's 2) and you repeated what I said on this page, which is 3. And since 2 comes before 3, you do the math :thumbu
One thing no-one has mentioned is things like ASLR and DEP, which depend on the version of Windows you are using as well as the CPU.
There are way too many variables when timing code; little things like RAM size can play a part.
Maybe we need a 'Windows version' as well as a 'CPU version'
Quote from: sinsi on August 05, 2010, 08:21:14 AM
One thing no-one has mentioned is things like ASLR and DEP which depend on the version of windows you are using as well as the CPU.
There are way too many variables when timing code, little things like RAM size can play a part.
Maybe we need a 'windows version' as well as a 'cpu version'
I think that's a good idea. It'll be the template for all future speed tests, and it can include CPU usage (for all cores), detailed Windows info, DEP settings, amount of RAM, etc...
I would not hold out any great hope of higher-level factors like DEP playing much of a part in intense algorithm timings. Apart from a processor's capacity to schedule instructions depending on their order, the only other factor I can think of that does affect timing is memory speed, but it has never made that big a difference. The later P4 generation was running DDR400, followed by DDR2 533, a couple of faster versions, and DDR3 at 1333 (I have seen 1666 somewhere). Faster memory does speed up memory operations, but it has limited impact on raw timings, as the processor is still a lot faster than memory.
Now, processor task-switching intervals can have an effect, but they tend to have been reasonably uniform across versions as old as Win95 OEM through to the current Win7 OS versions. From memory, intensive algos get faster with longer time slices, so if you are using an OS version that allows you to adjust the time slice, you can alter it to speed algos up. Something that really did mess up timings was a hyperthreading processor running on Win2000: you had to turn it off in the BIOS, as Win2000 did not properly support hyperthreading and saw it as 2 processors.
Quote from: hutch-- on August 05, 2010, 03:53:31 AM
I have got a bit more consistency out of ingo's algo by setting up the data to be converted for test in this form. All of the other test algos do not change timings.
.data
hword db "C0000000",0
align 16
pword dd hword
Hutch,
I tested that one using this macro:
hutchtrick = 0
fnMAC MACRO algo, ptr ; second arg ignored
LOCAL tmp$
if hutchtrick
tmp$ CATSTR <fn >, <algo>, <, pword>
else
tmp$ CATSTR <fn >, <algo>, <, offset hword>
endif
tmp$
ENDM
However, Lingo's algo is consistent with and without the "hutchtrick":
Intel(R) Pentium(R) 4 CPU 3.40GHz
328 htodw JJ short (124 bytes)
1813 atodw library
781 Alex short
438 Lingo long
468 Alex long
1265 clive short
On the contrary, when using the simple fn algo with "c0000", the first run was consistently much slower for Lingo's algo (and only for Lingo's algo).
Given that fn algo, "c0000" creates a new string every time, one explanation could indeed be the size of his code - see Rockoon's last post in the "sensitivity" thread.
For comparison (consistent timings):
Celeron M, icnt = 50000000:
453 htodw JJ short (124 bytes)
3032 atodw library
1484 Alex short
781 Lingo long
797 Alex long
1500 clive short
Guys,
I'm wondering why you continue with these emotions?
It is clear that Hutch has invented a new complex phenomenon (wrongly named by him the problem of "code placement", link (http://portal.acm.org/citation.cfm?id=268469)) that occurs in Computer Science. So he will soon provide a fundamentally new paradigm for modeling, analyzing and speed-optimizing this phenomenon, and his first step was what JJ named the hutchtrick. :toothy
JJ,
that is what the mod was for: to reduce the variation by making every algo use the same string for testing. It still does not solve the problem that, with the identical string assigned to the data section for each algo, some code arrangements affected Lingo's algo but not the others.
I found all of this stuff while building the original benchmark. Lingo's was the fastest, but it was the only one that slowed down to this extent when code was added either before OR after it. The problem is consistency for a freestanding algo; it may be ego massaging to be at the top of the food chain with timing test pieces, but it's of little use with an algorithm that is sensitive to code placement, as that renders it inconsistent in general purpose use.
Now for Lingo continuing to flap off his mouth,
> So, he will provide soon a fundamentally new paradigm for modeling, analysis and speed optimizing of this phenomena and his first step was so named from JJ hutchtrick
It is in fact a fundamentally OLD paradigm for modelling, analysis and speed optimisation called REAL TIME TESTING, as against theoretical test frameworks, for no matter how ego massaging the test bed results are, the real time test is the one that matters.
In the first benchmark I posted, after a lot of fiddling, Lingo's algo was the fastest on my Core2 hardware, but it was inconsistent with code placement (big word for LINGO = OFFSET), its time fluctuated where the others did not, and it was prone to be slow on its first call, which renders it useless for general purpose data processing.
Now, given that Lingo is too lazy to try to fix the problem so that a fast algo is available in general purpose terms, there are enough hex to DWORD conversion algos around that are reliable and consistent not to bother with his if he is more interested in flapping his mouth off than coding something more reliable.
It ain't like I get paid for fixing other people's code, so it's no great loss.
"but its little use with an algorithm that is sensitive to code placement as it renders it inconsistent in general purpose use."
and
" which renders it as it is useless for general purpose data processing."
Wrong again...Why?
According to you, in the worst case my algo's time is equal to or faster than the other algos', but not SLOWER,
and in the best "code placement" case it is times faster too...
Hence, it is FASTER than the other algos ALWAYS, independent of the case!
'he is more interested in flapping his mouth off than coding something more reliable.'
Why continue to do that? You have seen over the years what I receive for it... :wink
This is the problem in terms of benchmarking: I do most of my work on a Core2 Quad, but with a range of other machines to test on I get wildly different results across all of them. Now, with your algo running at its fastest, it is faster on all of them except the antique Celeron, but with fluctuations in its timing depending on code placement it runs anywhere from about as fast as a short version down to slower.
What I am after is an algo that is faster on most of the processors most of the time, and while this one at its best is fast enough, when it is not at its best it performs poorly. I have tried to track the problem down with a number of methods: code location (OFFSET), leading and trailing code, inter-algo padding, different code and table alignments, and even changing the table order (ascending, descending and interleaved). I got it faster in its worst case but still not running at its full speed.
Real time testing brings out all of these types of problems, and they are the hardest ones to solve, so flapping off at me when I have written none of the algos is a sure fire way for me to stop wasting my time and just pick something reliable, as converting 1 to 8 character hex to DWORD is hardly a high usage requirement.
i am curious...
wouldn't it make sense to take the l_tbl_n tables out of the code segment and place them in the data segment ?
it seems to me that the data cache would work more efficiently that way
particularly if the tables were somewhat close to the string being operated on
Thanks for the entertainment guys :bg
"and place them in the data segment?"
But he has no data segment in his "test" file.
Ok! I inserted .data segment and some data in it:
.data
align 16
mask39h dq 3939393939393939h
Recompiled the file and oops.... :clap:
C:\5>lingoslow
312 atodw library
109 Alex short
47 Lingo long
110 clive short
Press any key to continue ...
C:\5>lingoslow
297 atodw library
109 Alex short
47 Lingo long
109 clive short
Press any key to continue ...
C:\5>lingoslow
296 atodw library
109 Alex short
47 Lingo long
93 clive short
Press any key to continue ...
C:\5>lingoslow
312 atodw library
109 Alex short
46 Lingo long
94 clive short
Press any key to continue ...
C:\5>lingoslow
312 atodw library
109 Alex short
47 Lingo long
94 clive short
Press any key to continue ...
E^cube,
You are welcome.. :toothy
Jochen's h2dtimings.zip is in the last post on the 3rd page:
Intel(R) Celeron(R) CPU 2.13GHz
468 htodw JJ short (124 bytes)
3047 atodw library
1282 Alex short
703 Lingo long
766 Alex long
2078 clive short
469 htodw JJ short (124 bytes)
3000 atodw library
1281 Alex short
703 Lingo long
750 Alex long
2078 clive short
Press any key to continue ...
Bravo, Jochen!
If it not only works fast but also correctly, this is a very good algo!
If you remember, I talked to you about a 40MB exe? It must implement similar things, but with a bigger "span" :), and it works in about ~10 instructions :)
Bravo!
Alex
Note to community: I really did think up an algo similar to Jochen's. But I did not implement it... So, Jochen has made this very nicely, and I do not accuse Jochen of theft, NOTE this :)
Jochen, BRAVO, you were not too lazy to make this, and this is great!
The next step would be a table with all DWORDs in the DWORD range :), but that is impossible (even on 64-bit systems)
Alex
Quote from: Antariy on August 05, 2010, 11:00:49 PM
Intel(R) Celeron(R) CPU 2.13GHz
468 htodw JJ short (124 bytes)
703 Lingo long
Bravo, Jochen!
Thanks, Alex :bg
However, I did not advertise this one because it is limited to 8-byte strings. If your file has fixed length strings, it will be fine (and it's not even SSE...)
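For readers who want to see the word-indexed idea in a higher-level form, here is a hedged C sketch of the technique (identifiers and table layout are my own guesses, not Jochen's actual source): a 64 KB table indexed by the 16-bit word formed by two adjacent ASCII hex digits yields the byte they encode, so a fixed 8-digit string needs only four look-ups. It assumes a little-endian host, as on x86.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a word-indexed hex -> DWORD conversion in the spirit of JJ's
   algo. The 64 KB pair_table is built once; after that, each pair of
   ASCII digits converts with a single table load. */

static uint8_t pair_table[65536];

static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

static void init_pair_table(void)      /* one-off cost, negligible */
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            int hi = hexval(a), lo = hexval(b);
            if (hi >= 0 && lo >= 0)
                /* the first digit of the pair lands in the LOW byte of
                   the word when the string is read little-endian */
                pair_table[(uint16_t)(b << 8 | a)] = (uint8_t)(hi << 4 | lo);
        }
}

static uint32_t htodw8(const char *s)  /* s must be exactly 8 hex digits */
{
    uint16_t w[4];
    memcpy(w, s, 8);                   /* four digit pairs as words */
    return (uint32_t)pair_table[w[0]] << 24 |
           (uint32_t)pair_table[w[1]] << 16 |
           (uint32_t)pair_table[w[2]] <<  8 |
           (uint32_t)pair_table[w[3]];
}
```

Note the same limitation JJ states: this version only handles fixed 8-digit strings; shorter inputs need the length handling that Alex adds later in the thread.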
Dave,
it was easy enough to put the tables at the end of the algo into the .data section; it just takes adding the .data and .code tags. This is one of the mods I did in early testing when the times started to wander, and there was no timing difference either way. I changed the alignment of the table to 4 instead of 16 with no change in timing, changed the main algo alignment to 4 with no change, and tried misaligning the lead-in to the algo, again with no change.
I have another benchmark testing 1 million random length hex strings loaded in dynamic memory, where the algo is close to consistent and just barely faster than Alex's long version, by a couple of percent, on this Core2 Quad, but I have yet to test it on other hardware.
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz
656 atodw library
172 Alex short
93 Lingo long
110 Alex long
172 clive short
Press any key to continue ...
Thanks, mineiro, but try it 5 times consecutively and take the best results for every algo. :toothy
Next, please download my file h2dt.zip from page 2 and try every exe file 5 times consecutively, taking the best results for every algo. Thanks! :U
"Intel(R) Celeron(R) CPU 2.13GHz
468 htodw JJ short (124 bytes)->wrong
703 Lingo long
Bravo, Jochen!
Thanks, Alex ...bla..bla..blah.."
It is a new attempt by the two liars to manipulate people again, because JJ didn't include the creation time of his table... :lol
Here is the later benchmark. It tests the algos on 1 million random hex strings of variable length. As with an earlier benchmark, run the batch file first to build the test file of hex numbers. Once it is built you can run BM without recreating the test file.
Here are the times I am getting: Lingo's algo is slightly faster on the Core2 and i7, whereas Alex's long algo is clearly faster on the 2 generations of P4 and the antique Celeron.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
2055 MS library htodw average
750 MS Alex_Short average
629 MS lingo_htodw average
664 MS Alex_Long average
734 MS clive_htodw average
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
2039 MS library htodw average
733 MS Alex_Short average
581 MS lingo_htodw average
592 MS Alex_Long average
733 MS clive_htodw average
Prescott Core P4
Genuine Intel(R) CPU 3.80GHz
3121 MS library htodw average
1218 MS Alex_Short average
1047 MS lingo_htodw average
984 MS Alex_Long average
1265 MS clive_htodw average
Northwood Core P4
Intel(R) Pentium(R) 4 CPU 2.80GHz
3511 MS library htodw average
1219 MS Alex_Short average
1140 MS lingo_htodw average
968 MS Alex_Long average
1355 MS clive_htodw average
Intel(R) Celeron(TM) CPU 1200MHz
8372 MS library htodw average
4737 MS Alex_Short average
4444 MS lingo_htodw average
4196 MS Alex_Long average
4977 MS clive_htodw average
Writing 1000000 HEX strings to file
..............................................
...................Done
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz
1000000 = item count in file
3421 ms library htodw
1079 ms Alex_Long
1203 ms Alex_Short
1062 ms lingo_htodw
1172 ms clive_htodw
3438 ms library htodw
1093 ms Alex_Long
1235 ms Alex_Short
1063 ms lingo_htodw
1172 ms clive_htodw
3468 ms library htodw
1079 ms Alex_Long
1203 ms Alex_Short
1062 ms lingo_htodw
1172 ms clive_htodw
3422 ms library htodw
1094 ms Alex_Long
1250 ms Alex_Short
1062 ms lingo_htodw
1172 ms clive_htodw
3437 MS library htodw average
1222 MS Alex_Short average
1062 MS lingo_htodw average
1086 MS Alex_Long average
1172 MS clive_htodw average
Press any key to continue ...
mineiro, These results are invalid due to:
Invalid bm.exe file -> no .data section in it... :lol
Can't create new bm.exe file from bm.asm ->error
"Assembling: bm.asm
bm.asm(106) : error A2006:undefined symbol : ltok"
Sorry...
Writing 1000000 HEX strings to file
................................................................................
...................Done
Cannot Identify x86 Processor
1000000 = item count in file
1653 ms library htodw
577 ms Alex_Long
702 ms Alex_Short
499 ms lingo_htodw
702 ms clive_htodw
1669 ms library htodw
562 ms Alex_Long
702 ms Alex_Short
515 ms lingo_htodw
702 ms clive_htodw
1653 ms library htodw
578 ms Alex_Long
702 ms Alex_Short
499 ms lingo_htodw
702 ms clive_htodw
1669 ms library htodw
577 ms Alex_Long
687 ms Alex_Short
515 ms lingo_htodw
702 ms clive_htodw
1661 MS library htodw average
698 MS Alex_Short average
507 MS lingo_htodw average
573 MS Alex_Long average
702 MS clive_htodw average
Press any key to continue ...
Again, whats up with the CPU detection algorithm?
Its an AMD Phenom II x6 1055T @ 3.36GHz
sorry about what I posted Sr lingo.
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz
h2dt1.exe | 516 atodw library | 171 Alex short | 110 Lingo long | 109 Alex long | 172 clive short |
h2dt2.exe | 515 atodw library | 172 Alex short | 94 Lingo long | 125 Alex long | 172 clive short |
h2dt3.exe | 516 atodw library | 187 Alex short | 94 Lingo long | 109 Alex long | 172 clive short |
h2dt4.exe | 484 atodw library | 172 Alex short | 203 Lingo long | 110 Alex long | 171 clive short |
Press any key to continue ...
:bg
Poor Lingo, can't read the contents of an EXE file yet and does not have up to date libraries.
Section Table
-------------
01 .text Virtual Address 00001000
Virtual Size 00001454
Raw Data Offset 00000400
Raw Data Size 00001600
Relocation Offset 00000000
Relocation Count 0000
Line Number Offset 00000000
Line Number Count 0000
Characteristics 60000020
Code
Executable
Readable
02 .rdata Virtual Address 00003000
Virtual Size 00000210
Raw Data Offset 00001A00
Raw Data Size 00000400
Relocation Offset 00000000
Relocation Count 0000
Line Number Offset 00000000
Line Number Count 0000
Characteristics 40000040
Initialized Data
Readable
03 .data Virtual Address 00004000
Virtual Size 00000318
Raw Data Offset 00001E00
Raw Data Size 00000400
Relocation Offset 00000000
Relocation Count 0000
Line Number Offset 00000000
Line Number Count 0000
Characteristics C0000040
Initialized Data
Readable
Writeable
In case you have missed it, the DumpPE result shows the EXE file's .DATA section.
"ltok" has been part of the masm32 library for years.
Come on Lingo, you can do better than that.
Rockoon,
Sorry but I don't have a late AMD to test CPUID algos on.
Quote from: lingo on August 06, 2010, 03:28:37 AM
"Intel(R) Celeron(R) CPU 2.13GHz
468 htodw JJ short (124 bytes)->wrong
703 Lingo long
Bravo, Jochen!
Thanks, Alex ...bla..bla..blah.."
It is a new attempt by the two liars to manipulate people again, because JJ didn't include the creation time of his table... :lol
The table has to be created once, which costs a few nanoseconds. It has no influence on average timings, and that's the only thing that counts in real life. You have that strange belief that benchmarks are meant to win a prize for the fastest algo ever under the most peculiar constraints. Nope, they serve to improve code for libraries, and real life conditions determine the design of algo and benchmarks.
P.S. Calling other members liars gives you the image of an immature person.
Quote from: hutch-- on August 06, 2010, 05:25:48 AM
Rockoon,
Sorry but I don't have a late AMD to test CPUID algos on.
I am not quite sure that I understand.
Are you not using the 48-byte processor name string reported by CPUID?
(CPUID functions 80000002h, 80000003h, and 80000004h)
Rockoon,
I take your point, but the code is near the end of the test piece. If I had access to a late AMD it would be easy to fix, but it at least works on all of the Intel hardware I have available.
I see the problem. Your CPU detection code has a bug that will bite you in the ass on Intels as well (if not now, then in the future)
Specifically, you are testing the highest extended function number and only allowing string collection when it is exactly 4 or exactly 8. What you actually want to do is to collect the string whenever the highest extended function number is greater than or equal to 4 (because extended function 4 is the highest extended function number that you are calling)
For more information, check either Intel's or AMD's CPUID specifications.
From Intel's manual: http://www.intel.com/Assets/PDF/appnote/241618.pdf
Quote
2.2.1 Largest Extended Function # (Function 8000_0000h)
When EAX is initialized to a value of 8000_0000h, the CPUID instruction returns the
largest extended function number supported by the processor in register EAX.
It almost seems like you reverse engineered the magic values of 4 and 8 by looking at specific processor output, rather than checked the specs!
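Rockoon's fix can be sketched in C using GCC/Clang's <cpuid.h> wrappers (a stand-in for the assembly in hutch's test piece, not his actual code): compare the reported maximum extended function against 8000_0004h with >=, rather than testing for exact magic values.

```c
#include <string.h>
#include <cpuid.h>     /* GCC/Clang; MSVC would use the __cpuid intrinsic */

/* Collect the 48-byte brand string whenever the largest extended CPUID
   function is >= 8000_0004h, per the Intel/AMD specs quoted above,
   instead of testing for exact values like 4 or 8.
   Returns 1 on success, 0 if the brand string is unsupported. */
static int get_brand_string(char out[49])
{
    unsigned int max_ext = __get_cpuid_max(0x80000000u, 0);
    if (max_ext < 0x80000004u)
        return 0;                       /* brand string not available */

    unsigned int regs[12];
    for (unsigned int leaf = 0; leaf < 3; leaf++)
        __get_cpuid(0x80000002u + leaf,
                    &regs[leaf * 4 + 0], &regs[leaf * 4 + 1],
                    &regs[leaf * 4 + 2], &regs[leaf * 4 + 3]);

    memcpy(out, regs, 48);              /* EAX..EDX in order, 3 leaves */
    out[48] = '\0';
    return 1;
}
```

This would also explain the "Cannot Identify x86 Processor" output on the Phenom II earlier in the thread: if its maximum extended function is neither exactly 4 nor exactly 8, an exact-match test rejects it even though the brand string is present.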
:bg
You could be right, but the specs are all over the place like a mad woman's sewerage. I can test on Intel hardware but have no AMD machines to test with.
Quote from: hutch-- on August 06, 2010, 10:13:55 AM
:bg
You could be right, but the specs are all over the place like a mad woman's sewerage. I can test on Intel hardware but have no AMD machines to test with.
I took the first specs I could find from both Intel and AMD and they agree. I'm not sure what source of information you are using. Maybe you should stop using them.
MadWomansSewerage.pdf (http://www.intel.com/Assets/PDF/appnote/241618.pdf)
:lol
Hi,
FWIW a PIII.
G:\WORK>runme
Writing 1000000 HEX strings to file
................................................................................
...................Done
Pentium Pro, II or Celeron Processor
1000000 = item count in file
11787 ms library htodw
4757 ms Alex_Long
5598 ms Alex_Short
5187 ms lingo_htodw
5918 ms clive_htodw
11717 ms library htodw
4786 ms Alex_Long
5638 ms Alex_Short
5187 ms lingo_htodw
5919 ms clive_htodw
11697 ms library htodw
4737 ms Alex_Long
5599 ms Alex_Short
5188 ms lingo_htodw
5919 ms clive_htodw
11696 ms library htodw
4737 ms Alex_Long
5598 ms Alex_Short
5187 ms lingo_htodw
5918 ms clive_htodw
11724 MS library htodw average
5608 MS Alex_Short average
5187 MS lingo_htodw average
4754 MS Alex_Long average
5918 MS clive_htodw average
Press any key to continue ...
Needs comma separators. <g>
Regards,
Steve N.
"Poor Lingo, can't read the contents of an EXE file yet...
For many years I have used HIEW32 and IDA rather than DumpPE, but I have no time or interest to investigate and use your "testing" program.
...and does not have up to date libraries."
This is true, because for many years I have not used your ancient code libraries.
They are slow, with C-like code, without SSE, etc... or in other words they smell of old age... Sorry :(
They are useful for newbies to start with, but advanced users have nothing to learn from them, and that is the reason most of them use GoAsm or other stuff.
JJ,
'The table has to be created once,...
By you, or by Hutch as the publisher of your algo? :lol
Because you don't post the file with it...
...which costs a few nanoseconds"
How do you know, when you don't have a ready-to-use table in your file?
Why are these "few nanoseconds" not included in the 468 for htodw JJ short?
"It has no influence on average timings, and that's the only thing that counts in real life"
I'm sure that Hutch's "testing" program will have the same "problem" of "code placement" with it... :lol
"P.S. Calling other members liars gives you the image of an immature person."
I'm mature enough to know that a thief is always a liar... :lol
:bg
> For many years I use HIEW32 and IDA
Congratulations but if you cannot find the .DATA section in an EXE file you aren't doing it right.
> This is true because for many years I do not use your ancient code libraries
That's no excuse for not having a tokeniser handy.
> They are slow, with C-like code; without SSE, etc... or with other words smell of old age...Sorry
Gee, SSE4.2 runs badly on an earlier processor. Tried SSE3 on a PIII lately? How about i7 opcodes on your Core series processor?
Come on Lingo, you can do better than that.
What fascinates me is that after I bothered to write another test piece showing that your algo, when it's not fluctuating, is fastest by a small amount on an i7/Core2 Quad, you still want to lose the same argument. ::)
Quote from: lingo on August 06, 2010, 01:50:50 PM
JJ,
'The table has to be created once,...
By you or by Hutch as a publisher of your algo? :lol
Because you don't post the file with it...
See attachment under reply #44:
.data?
hxjTable dd 65536/4 dup(?) ; bytes
.code
align 16
jj_htodw_s:
... table creation code...
Quote
...which costs a few nanoseconds"
How do you know when you haven't a ready to use table in your file?
Why these "few nanoseconds" are not included in the: 468 htodw JJ short?
They are included.
Quote
"It has no influence on average timings, and that's the only thing that counts in real life"
I'm sure that Hutch's "testing" program will have the same "problem" of "code placement" with it... :lol
There is no such problem because my code, including the creation of the table, is 124 bytes short.
Quote
"P.S. Calling other members liars gives you the image of an immature person."
I'm mature enough to know that a thief is always a liar... :lol
When did you lose your last friend?
"How about i7 opcodes on your Core series processor."
You will lose this argument soon, because my wife will receive a gift for her birthday: I think it will be a new laptop with an i7-620M CPU in it. link (http://store.shopfujitsu.com/fpc/Ecommerce/buildseriesbean.do?series=NH570#BVRRWidgetID)
"..you still want to lose the same argument."
I don't give a damn :lol
:bg
> You will lose this argument soon, because my wife will receive a gift for her birthday: I think it will be a new laptop with an i7-620M CPU in it.
I am pleased to hear she is getting a new computer, but I doubt that will help you get i7 instructions to run on your Core series processor. :P
he has a wife ? :eek
Quote from: dedndave on August 06, 2010, 07:22:18 PM
he has a wife ? :eek
heh looks like you're not the only one married on here dave :P
yah - but, i'm a nice guy :bg
Nah,
We have plenty of guys in here who have a better half: Jack is bashed around the ears regularly by a lovely lady, Donna; Van posted a good photo of his better half; and I know there are many others.
We do not talk about the girls because, for most of us, our girls couldn't code their way out of a paper bag.
:bg
Thats what they have you for. :P
Quote from: jj2007 on August 06, 2010, 12:41:05 AM
Quote from: Antariy on August 05, 2010, 11:00:49 PM
Intel(R) Celeron(R) CPU 2.13GHz
468 htodw JJ short (124 bytes)
703 Lingo long
Bravo, Jochen!
Thanks, Alex :bg
However, I did not advertise this one because it is limited to 8-byte strings. If your file has fixed length strings, it will be fine (and it's not even SSE...)
I know,
Jochen.
This text was written right after testing; I saw the sources a bit later. All the same, this algo is very good. It only needs support for short strings added.
Alex
hmm... i probably shouldn't enter these discussions, but :
to lingo :
Quote from: lingo on August 04, 2010, 11:49:49 AM
"Yes, it seems, which Lingo's algo with big look-up table is very dependent from alignment and placement in the file. Need some "dancing with tambourine" to do it works faster :)" by the asian lamer with archaic CPU.
So, with other words Lingo is some kind of liar who try to manipulate the people and Lingo's algos are achieved by fraud...
hmm... not by fraud, but you use artifice/knowledge to defeat the others in the speed tests proposed.
ex: "jmp ecx" CAN'T be faster than "ret" (in most cases), because the cost of reading the code is absorbed by the test loop.
plus, "ret" never needs 2 memory clusters to be decoded (that is the case for "jmp ecx", with a big slowdown depending on the code placement).
"jmp ecx" CAN be faster ONLY in the situation of a speed test or the like... and that's rarely the case in real life...
to hutch :
your speed test has no more value than the others, because the reading of the code is also absorbed by the loop, and you can't see the cost of cache filling/no cache/cache effects on the other algos.
to the others, (ALL THE OTHERS):
lingo rarely initiates a speed test; he just proposes code for the test proposed (like every coder should do). if you don't understand the shortcomings of the test you've initiated, it's not lingo's fault; if you only count on the result obtained without understanding what's really done/hidden, it's not lingo's fault.
to make it short, if you're idiots, it's NOT lingo's fault.
personally i use and consider that MichaelW's macros give a good evaluation of the code written. i don't blindly follow the results. for the same reason explained before, i know that stuff like "add reg,1" CAN'T be faster than "inc reg" except in special circumstances, rarely encountered in real life. THINK ABOUT YOUR CODE! AND UNDERSTAND WHAT'S BEHIND THE RESULTS! :P
Hutch,
I reordered something in my algo Lingo-long in your file lingoslow and now I get 46 easily.
Please try it. Thank you.
Note: Because Hutch's testing program is very sensitive to memory usage, please close all other programs and run the test a minimum of 3 times. Take the best values. Thanks. :toothy
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
C:\8>lingoslow
250 atodw library
109 Alex short
46 Lingo long
94 clive short
Press any key to continue ...
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
640 atodw library
265 Alex short
125 Lingo long
282 clive short
Press any key to continue ...
532 atodw library
203 Alex short
125 Lingo long
218 clive short
Press any key to continue ...
641 atodw library
343 Alex short
141 Lingo long
281 clive short
Hutch, test this in your old test-bed, or include the ax_jj_htodw algo in your test-bed.
This is mostly Jochen's word-indexed table look-up algo, but with support for short strings.
It should not be slow... Check this, please. Because the algo uses a big look-up table, it works better on newer CPUs than mine does.
Alex
P.S. Copyright (c) - Jochen, aka JJ, aka jj2007
I only added support for short strings and reordered some things in the algo.
Very nice, Alex :U
But it is really Copyright (c) Alex, with some inspiration from Jochen :wink
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz
625 htodw JJ short (124 bytes)
3015 atodw library
1468 Alex short
1204 Lingo long
781 Alex long
1484 clive short
Very good result.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
297 htodw JJ short (124 bytes)
1453 atodw library
563 Alex short
281 Lingo long
329 Alex long
515 clive short
297 htodw JJ short (124 bytes)
1344 atodw library
547 Alex short
281 Lingo long
328 Alex long
516 clive short
Press any key to continue ...
Hi,
Pentium III results.
Regards,
Steve N.
G:\WORK>2test_ax
Pentium Pro, II or Celeron Processor
2764 htodw JJ short (124 bytes)
12348 atodw library
4276 Alex short
2032 Lingo long
1792 Alex long
4466 clive short
2774 htodw JJ short (124 bytes)
12348 atodw library
3866 Alex short
2023 Lingo long
1773 Alex long
4466 clive short
Press any key to continue ...
Hi!
A big ask to all: please run the test-bed in the attached archive!
This continues the hex2dword proc development and testing.
The test includes Lingo's latest (yesterday's) proc (in which he "...reordered something...").
It also includes Jochen's excellent WORD-indexed lookup table algo, which now supports short strings.
And it includes ALL versions of my hex2dword procs: 5 of my small versions with different algo implementations, some of which work with strings terminated not by zero but by space, CRLF etc. (any code less than 30h), plus 2 tweaks of Hutch's versions.
Also included are my fast GPR, MMX and SSE1 versions of the algos. Of these, the MMX/SSE versions also support non-zero-terminated strings.
These are my timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
47 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
28 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
28 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
28 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
46 cycles for Small 2
43 cycles for Small 3
47 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
29 cycles for MMX 2
31 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
29 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
28 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
46 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
55 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
29 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
28 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 120
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 166
krbhtodw: 468
--- ok ---
Optimizing integer algos on my CPU is not a very easy task (Hutch knows how this is: Prescott Celeron coding). But MMX/SSE code works NOT very well on my system; reordering, or using different regs and instructions, gives no advantage. I think this is because my L2 cache is small.
So, any test reports and suggestions are welcome.
Dave's (aka KeepingRealBusy) version is the first version he posted, with some minor changes of mine. Sorry, Dave, if you did not want to join the tests. I posted the first version of this tweak here: "http://www.masm32.com/board/index.php?topic=14438.msg116559#msg116559". Maybe you missed that post. But on my machine this tweak is faster: your tweak using ROL and SHL takes 34 clocks, my tweak 28 clocks (on my CPU). Other tweaks have timings of not less than 42 clocks, so for testing I selected this revision.
Jochen, sorry for "thefting" your algo :). As I said already, I also thought up a similar proc, but you were not lazy like me and implemented it first. So the copyright is yours, because you wrote this algo *first*. And my implementation of short-string support is not the best; it is a quick "on fast hand" solution.
Hutch, test this please, if you have time.
My point of view is that Jochen's algo may well be the fastest with long strings if a fast solution for short-string support is made. His proc could also be made reliable by adding a checking table.
Dave's algo is the most reliable of all the tested algos, because it tests its input and can report errors with very little extra code. With this it is reliable/fast/relatively small.
Both Jochen's and Dave's algos have no (or very small) sensitivity to "code/data placement", because their look-up tables are byte tables. This has two advantages: alignment is not needed and has no significance, and a cache line holds 4 times more table entries.
Note: by "reliability" I mean checking whether the data being processed really is hex or not.
All the MMX/SSE versions are most useful with long, i.e. 8-byte (fully notated), strings, because the timings of the MMX/SSE versions either do not depend on string length (my procs) or are only slightly longer with non-8-byte strings (Lingo's proc).
So, for occasional conversion the short versions of the procs are, in my view, the most usable.
For fast conversion the integer versions are most usable, if the string length may be other than 8 bytes.
These are my thoughts about the usability of the algos; other people may have other opinions.
Alex
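Alex's "reliability" criterion (check the input while converting) can be sketched in C: a 256-entry byte table does the digit conversion and flags non-hex bytes in the same load, so validation costs almost nothing. This is my own illustration of the style he attributes to Dave's algo, not Dave's actual code.

```c
#include <stdint.h>

/* Byte-table hex -> DWORD conversion with built-in validation. Each table
   entry is either the digit's value (0..15) or 0xFF for "not hex". */

static uint8_t digit_val[256];

static void init_digit_table(void)
{
    for (int i = 0; i < 256; i++) digit_val[i] = 0xFF;       /* invalid */
    for (int i = 0; i < 10; i++)  digit_val['0' + i] = (uint8_t)i;
    for (int i = 0; i < 6; i++) {
        digit_val['a' + i] = (uint8_t)(10 + i);
        digit_val['A' + i] = (uint8_t)(10 + i);
    }
}

/* Convert 1..8 hex digits; returns 0 on success, -1 on a non-hex byte
   or an empty string. Reject and report rather than guess. */
static int htodw_checked(const char *s, uint32_t *result)
{
    uint32_t acc = 0;
    int n = 0;
    for (; *s && n < 8; s++, n++) {
        uint8_t v = digit_val[(uint8_t)*s];
        if (v == 0xFF) return -1;
        acc = acc << 4 | v;
    }
    if (n == 0) return -1;
    *result = acc;
    return 0;
}
```

A byte table like this also has the placement advantage Alex notes: it needs no alignment, and a cache line covers four times as many entries as a dword table would.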
Timings on Core2 quad.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
16 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
27 cycles for Small 2
27 cycles for Small 3
29 cycles for Small 3.1
27 cycles for Small 4
10 cycles for MMX 1
11 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
28 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
11 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
17 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
27 cycles for Small 2
45 cycles for Small 3
27 cycles for Small 3.1
27 cycles for Small 4
11 cycles for MMX 1
11 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
46 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
11 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
16 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
47 cycles for Small 2
27 cycles for Small 3
27 cycles for Small 3.1
27 cycles for Small 4
10 cycles for MMX 1
11 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
28 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
11 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 120
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 166
krbhtodw: 468
--- ok ---
Thanks, Hutch!
It seems Jochen's integer algo is the fastest of the integer versions. But I use MMX in it to compute the string length, so it is not a purely integer-only version... Still, it can work on a PI-MMX. Not a very new CPU :)
Lingo's SSE version is very fast, but what timing does it get with a 7-byte string, for example?
I cannot make very good MMX/SSE code, because none of the techniques I try work well :( I cannot pick the right way to implement it. My MMX/SSE versions are mostly for fun (a SIMD remake of my short Axhex2dw, as you see), but on my CPU they are faster by 1 clock, so I will use them :)
Dave's algo is very fast, considering the look-up table, the input checking, and some dependencies in the code.
Alex
Hutch, it seems the first of the optimized versions of Axhex2dw is just as fast on the newest CPUs.
16 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1 *** THIS ***
27 cycles for Small 2
Interesting... An optimization that drops 5 clocks on the PIV turns into an anti-optimization that adds 2 clocks on the Core.
Alex
Good job, Alex :U
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
20 cycles for Fast version
20 cycles for Fast version under AMD
41 cycles for Small 1
41 cycles for Small 2
41 cycles for Small 3
57 cycles for Small 3.1
41 cycles for Small 4
14 cycles for MMX 1
15 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
41 cycles for Axhex2dw improved by Hutch (1)
58 cycles for Axhex2dw improved by Hutch (2)
9 cycles for Lingo's SSE version
35 cycles for Lingo's BIG integer version
14 cycles for Jochen's WORD-Indexed version
24 cycles for Dave's version (with minor changes)
Hi,
PIII, Dave's looks good here.
Steve
G:\WORK>12alex's
☺☺☻♥ (SSE1)
23 cycles for Fast version
28 cycles for Fast version under AMD
63 cycles for Small 1
62 cycles for Small 2
59 cycles for Small 3
59 cycles for Small 3.1
60 cycles for Small 4
18 cycles for MMX 1
19 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
59 cycles for Axhex2dw improved by Hutch (2)
17 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
36 cycles for Jochen's WORD-Indexed version
10 cycles for Dave's version (with minor changes)
25 cycles for Fast version
30 cycles for Fast version under AMD
64 cycles for Small 1
60 cycles for Small 2
59 cycles for Small 3
59 cycles for Small 3.1
60 cycles for Small 4
18 cycles for MMX 1
19 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
59 cycles for Axhex2dw improved by Hutch (2)
17 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
36 cycles for Jochen's WORD-Indexed version
11 cycles for Dave's version (with minor changes)
24 cycles for Fast version
30 cycles for Fast version under AMD
60 cycles for Small 1
60 cycles for Small 2
59 cycles for Small 3
60 cycles for Small 3.1
60 cycles for Small 4
18 cycles for MMX 1
19 cycles for MMX 2
23 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
60 cycles for Axhex2dw improved by Hutch (2)
15 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
36 cycles for Jochen's WORD-Indexed version
10 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 120
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 166
krbhtodw: 468
--- ok ---
Hi!
Very BIG Thanks to all testers!!!
This is new code. I improved Jochen's proc; its timings are 5 clocks lower on my CPU. Now Jochen's proc is the fastest of ALL the procs on my CPU.
I also rewrote my second MMX version. It now contains more instructions for the ALU cluster. It MAY be faster, but not on my CPU (as I said already, this is the usual behaviour).
My timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
72 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
28 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
27 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
46 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
26 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
29 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
25 cycles for Fast version
30 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
46 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
28 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 160
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 182
krbhtodw: 468
--- ok ---
A big request to everyone: please test this.
Alex
Here are the timings from my Core2 box.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
16 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
27 cycles for Small 2
27 cycles for Small 3
29 cycles for Small 3.1
27 cycles for Small 4
10 cycles for MMX 1
10 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
28 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
12 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
17 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
27 cycles for Small 2
45 cycles for Small 3
27 cycles for Small 3.1
27 cycles for Small 4
11 cycles for MMX 1
10 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
28 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
12 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
16 cycles for Fast version
19 cycles for Fast version under AMD
25 cycles for Small 1
28 cycles for Small 2
27 cycles for Small 3
27 cycles for Small 3.1
27 cycles for Small 4
10 cycles for MMX 1
10 cycles for MMX 2
11 cycles for SSE1
Other's Versions:
28 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
5 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
12 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 160
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 182
krbhtodw: 468
--- ok ---
Prescott w/htt:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
45 cycles for Small 3
46 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
30 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
30 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
27 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
55 cycles for Small 2
45 cycles for Small 3
54 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
99 cycles for Axhex2dw improved by Hutch (2)
28 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
36 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
45 cycles for Small 3
46 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
31 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
27 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
28 cycles for Dave's version (with minor changes)
Quote from: hutch-- on August 14, 2010, 11:16:53 PM
Here are the timings from my Core2 box.
Thanks,
Hutch!
Jochen's proc is just as fast. It is ~11 TIMES shorter than lingo's proc, and 1 clock faster.
As I expected, the MMX version does not do well :(
Alex
Quote from: dedndave on August 14, 2010, 11:19:37 PM
Prescott w/htt:
Thanks,
Dave!
On your CPU Jochen's proc is equally small and fast as well.
Alex
yes - but it's a P4
some might say it is obsolete :P
i prefer to say it is becoming obsolete
that way, i don't have to go out and buy a new computer :lol
i am just now getting the hang of properly building this one
What timings does this version get (5-byte string, same sources)?
Alex
(Edited) I am not posting my own timings, because I have many apps running, putting no small load on the CPU...
Quote from: dedndave on August 14, 2010, 11:27:06 PM
yes - but it's a P4
some might say it is obsolete :P
i prefer to say it is becoming obsolete
that way, i don't have to go out and buy a new computer :lol
i am just now getting the hang of properly building this one
Why the "tongue"? :)
I have a Prescott Celeron, with trimmed cache, without DEP, without HT... And I don't say it is obsolete :)
I'm not a gamer, and I have no need for terahertz with liquid cooling :)
Alex
Prescott w/htt:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
21 cycles for Fast version
21 cycles for Fast version under AMD
33 cycles for Small 1
33 cycles for Small 2
33 cycles for Small 3
31 cycles for Small 3.1
33 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
36 cycles for SSE1
Other's Versions:
34 cycles for Axhex2dw improved by Hutch (1)
53 cycles for Axhex2dw improved by Hutch (2)
43 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
27 cycles for Dave's version (with minor changes)
24 cycles for Fast version
21 cycles for Fast version under AMD
33 cycles for Small 1
33 cycles for Small 2
33 cycles for Small 3
33 cycles for Small 3.1
35 cycles for Small 4
28 cycles for MMX 1
60 cycles for MMX 2
56 cycles for SSE1
Other's Versions:
34 cycles for Axhex2dw improved by Hutch (1)
53 cycles for Axhex2dw improved by Hutch (2)
31 cycles for Lingo's SSE version
23 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
24 cycles for Dave's version (with minor changes)
23 cycles for Fast version
21 cycles for Fast version under AMD
33 cycles for Small 1
33 cycles for Small 2
30 cycles for Small 3
33 cycles for Small 3.1
33 cycles for Small 4
26 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
33 cycles for Axhex2dw improved by Hutch (1)
53 cycles for Axhex2dw improved by Hutch (2)
31 cycles for Lingo's SSE version
21 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
24 cycles for Dave's version (with minor changes)
5bytes:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
15 cycles for Fast version
15 cycles for Fast version under AMD
26 cycles for Small 1
27 cycles for Small 2
27 cycles for Small 3
27 cycles for Small 3.1
27 cycles for Small 4
14 cycles for MMX 1
14 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
27 cycles for Axhex2dw improved by Hutch (1)
27 cycles for Axhex2dw improved by Hutch (2)
13 cycles for Lingo's SSE version
12 cycles for Lingo's BIG integer version
14 cycles for Jochen's WORD-Indexed version
18 cycles for Dave's version (with minor changes)
well - "modern" would be a core duo or i7 - or one of the more recent AMD's
to be honest, i am pleased with the performance of this machine
of course, i know how to set it up to be fast
i can see where, if i were a lay-person, it might not be so wonderful
Quote from: dedndave on August 14, 2010, 11:34:18 PM
well - "modern" would be a core duo or i7 - or one of the more recent AMD's
to be honest, i am pleased with the performance of this machine
of course, i know how to set it up to be fast
i can see where, if i were a lay-person, it might not be so wonderful
I agree - the best thing is correct tuning and maintenance of the computer.
And I agree all the more, because this is my speciality :)
Alex
Quote from: jj2007 on August 14, 2010, 11:32:38 PM
5bytes:
Jochen, this is because your proc runs without branching. In real-world testing it would be the champion. And if it were ported to 64-bit it would be the fastest, because the code would barely change, contrary to the other unrolled versions, whose timings would grow almost linearly to twice as much.
Alex
P.S. What timings does your proc get with full-length (8-digit) strings?
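For readers who have not seen the technique: as I understand Jochen's WORD-indexed idea, each pair of ASCII hex characters, read as a 16-bit word, indexes a 64K table that yields the converted byte directly, so an 8-digit string needs only four table loads and no per-nibble branching. A hypothetical C sketch under those assumptions (uppercase digits only, little-endian host like the x86 originals; all names are mine, not Jochen's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64K table: a little-endian pair of uppercase ASCII hex chars maps
   straight to its byte value. Memory-hungry, but branch-free. */
static uint8_t pair_tab[65536];

static void init_pair_tab(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int hi = 0; hi < 16; ++hi)
        for (int lo = 0; lo < 16; ++lo) {
            /* first char in the string is the high nibble; on a
               little-endian host it lands in the LOW byte of the word */
            uint16_t idx = (uint16_t)(((uint8_t)digits[lo] << 8)
                                      | (uint8_t)digits[hi]);
            pair_tab[idx] = (uint8_t)((hi << 4) | lo);
        }
}

static uint32_t htodw_word_indexed(const char *s) /* expects 8 digits */
{
    uint32_t val = 0;
    for (int i = 0; i < 8; i += 2) {
        uint16_t pair;
        memcpy(&pair, s + i, 2);      /* read two chars as one word */
        val = (val << 8) | pair_tab[pair];
    }
    return val;
}
```

The fixed 8-digit assumption here matches the behaviour Alex is asking about: without branching, short strings need padding or a separate path.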
Since Dave mentioned obsolete, here are the timings for a P3:
☺☺☻♥ (SSE1)
23 cycles for Fast version
28 cycles for Fast version under AMD
60 cycles for Small 1
60 cycles for Small 2
59 cycles for Small 3
59 cycles for Small 3.1
63 cycles for Small 4
18 cycles for MMX 1
17 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
59 cycles for Axhex2dw improved by Hutch (2)
17 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
34 cycles for Jochen's WORD-Indexed version
12 cycles for Dave's version (with minor changes)
23 cycles for Fast version
28 cycles for Fast version under AMD
59 cycles for Small 1
60 cycles for Small 2
59 cycles for Small 3
59 cycles for Small 3.1
60 cycles for Small 4
18 cycles for MMX 1
17 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
59 cycles for Axhex2dw improved by Hutch (2)
19 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
34 cycles for Jochen's WORD-Indexed version
11 cycles for Dave's version (with minor changes)
23 cycles for Fast version
28 cycles for Fast version under AMD
59 cycles for Small 1
60 cycles for Small 2
60 cycles for Small 3
59 cycles for Small 3.1
60 cycles for Small 4
18 cycles for MMX 1
17 cycles for MMX 2
21 cycles for SSE1
Other's Versions:
59 cycles for Axhex2dw improved by Hutch (1)
59 cycles for Axhex2dw improved by Hutch (2)
16 cycles for Lingo's SSE version
27 cycles for Lingo's BIG integer version
34 cycles for Jochen's WORD-Indexed version
12 cycles for Dave's version (with minor changes)
In terms of cycle counts a P3 looks good against the P4s, but not against the more recent processors.
Quote from: MichaelW on August 14, 2010, 11:42:23 PM
Since Dave mentioned obsolete, here are the timings for a P3:
Thanks,
Michael!
The PIII is a good CPU, but I see that procs with a big lookup table do not run very well on PIIIs (I see it in FORTRANS's CPU timings and in yours).
On my CPU MMX/SSE works very poorly. Is this a quirk of Celerons only, or of every PIV?
Alex
I would not take too much notice of what is deemed obsolete and what is not; if it does the job, it does the job. I particularly liked the Northwood P4 I used to develop on, and after endless pissing around to get some legacy boards that stayed working, the two P4s I have running are both fast and useful machines. Yes, you can do faster parallel processing on late model quads, but they are laggier than a single core and not always faster.
Michael, you are lucky to be able to keep a PIII going; I had hell's own problems getting later P4s reliable once my old board died. My only old timer is a fluke: someone gave me a 1200 Celeron 8 years ago, and when I tested it a few months ago it booted straight up, so I shoved it into a box.
i have a p1/mmx running, but it isn't convenient to hook it up to the internet
if i had thought about it, i could have ran an ether line to the router when i installed it
but, getting a modern wireless adapter to work under win98 isn't very likely
Now you know why I still run a fully wired network with a couple of gigabit hubs; you can plonk a gigabit adapter into just about any PCI slot and it will work fine. I have one of the spares in the ancient Celeron box and it's probably faster than the bus on the board, but in performance terms it runs fine.
What I am waiting for is much faster fully optical networking, as you can then start to gang machines in very interesting ways. CAT6 will handle 10 gigabit if routed properly, but full optical has the potential to be much faster again. I have seen AoE data where you can RAID-stripe multiple connections to get massive data transfer rates, and this will start to be possible once really high-speed optical networking reaches PC networks.
Hi!
This is a new test, in which I changed only Dave's (aka KeepingRealBusy) proc.
This proc can report conversion success or failure in ecx. I have added support for different string terminators (a new subproc was added). It can set or reset any char code / char / char sequence to be treated as a valid terminator or not.
So, see the comments; I am going offline, sorry.
A big request to everyone: please test this. Watch Dave's timings (because only his proc changed). I have decreased the timings by one clock on my CPU.
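The terminator-set idea can be sketched in C. This is a hedged illustration of the behaviour described above, not Dave's actual code: a caller-configurable table of terminator characters plus a success/failure status (the asm version returns the status in ecx; all names here are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Table of characters accepted as string terminators; the caller
   configures it, e.g. NUL and space. All names are illustrative. */
static unsigned char is_term[256];

static void set_terminator(unsigned char c, int on)
{
    is_term[c] = (unsigned char)(on != 0);
}

/* Convert up to 8 hex digits; returns 1 on success, 0 on a bad digit,
   an empty string, or overflow. NOTE: '\0' must be registered as a
   terminator, or the loop will run past the end of the string. */
static int htodw_checked(const char *s, uint32_t *out)
{
    uint32_t val = 0;
    int n = 0;
    for (; !is_term[(unsigned char)*s]; ++s, ++n) {
        unsigned char c = (unsigned char)*s;
        if (n >= 8) return 0;                /* 9th digit: overflow */
        if (c >= '0' && c <= '9')      val = (val << 4) | (uint32_t)(c - '0');
        else if (c >= 'A' && c <= 'F') val = (val << 4) | (uint32_t)(c - 'A' + 10);
        else if (c >= 'a' && c <= 'f') val = (val << 4) | (uint32_t)(c - 'a' + 10);
        else return 0;                       /* neither digit nor terminator */
    }
    if (n == 0) return 0;                    /* no digits at all */
    *out = val;
    return 1;
}
```

Usage: register the terminators once (`set_terminator('\0', 1); set_terminator(' ', 1);`), then call `htodw_checked`; the status return mirrors how the asm proc distinguishes a clean stop from an invalid character.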
My timings are this:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
25 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
47 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
164 cycles for Axhex2dw improved by Hutch (2)
28 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
27 cycles for Dave's version (with minor changes)
26 cycles for Fast version
27 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
47 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
31 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
29 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
27 cycles for Dave's version (with minor changes)
25 cycles for Fast version
25 cycles for Fast version under AMD
48 cycles for Small 1
44 cycles for Small 2
43 cycles for Small 3
47 cycles for Small 3.1
43 cycles for Small 4
28 cycles for MMX 1
28 cycles for MMX 2
30 cycles for SSE1
Other's Versions:
48 cycles for Axhex2dw improved by Hutch (1)
83 cycles for Axhex2dw improved by Hutch (2)
43 cycles for Lingo's SSE version
24 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
27 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 160
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 174
krbhtodw: 547
--- ok ---
Alex
Alex,
Here are my P4 timings:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
14 cycles for Fast version
19 cycles for Fast version under AMD
32 cycles for Small 1
35 cycles for Small 2
35 cycles for Small 3
44 cycles for Small 3.1
35 cycles for Small 4
20 cycles for MMX 1
18 cycles for MMX 2
25 cycles for SSE1
Other's Versions:
34 cycles for Axhex2dw improved by Hutch (1)
61 cycles for Axhex2dw improved by Hutch (2)
14 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
5 cycles for Jochen's WORD-Indexed version
15 cycles for Dave's version (with minor changes)
14 cycles for Fast version
18 cycles for Fast version under AMD
21 cycles for Small 1
45 cycles for Small 2
35 cycles for Small 3
35 cycles for Small 3.1
36 cycles for Small 4
20 cycles for MMX 1
18 cycles for MMX 2
23 cycles for SSE1
Other's Versions:
34 cycles for Axhex2dw improved by Hutch (1)
61 cycles for Axhex2dw improved by Hutch (2)
14 cycles for Lingo's SSE version
12 cycles for Lingo's BIG integer version
5 cycles for Jochen's WORD-Indexed version
15 cycles for Dave's version (with minor changes)
14 cycles for Fast version
14 cycles for Fast version under AMD
32 cycles for Small 1
43 cycles for Small 2
35 cycles for Small 3
34 cycles for Small 3.1
37 cycles for Small 4
20 cycles for MMX 1
18 cycles for MMX 2
23 cycles for SSE1
Other's Versions:
97 cycles for Axhex2dw improved by Hutch (1)
85 cycles for Axhex2dw improved by Hutch (2)
18 cycles for Lingo's SSE version
12 cycles for Lingo's BIG integer version
5 cycles for Jochen's WORD-Indexed version
15 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 160
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 174
krbhtodw: 547
--- ok ---
Alex,
Here are my AMD timings:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
47 cycles for Fast version
22 cycles for Fast version under AMD
51 cycles for Small 1
53 cycles for Small 2
59 cycles for Small 3
50 cycles for Small 3.1
39 cycles for Small 4
15 cycles for MMX 1
27 cycles for MMX 2
24 cycles for SSE1
Other's Versions:
47 cycles for Axhex2dw improved by Hutch (1)
39 cycles for Axhex2dw improved by Hutch (2)
15 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
33 cycles for Dave's version (with minor changes)
35 cycles for Fast version
22 cycles for Fast version under AMD
40 cycles for Small 1
36 cycles for Small 2
39 cycles for Small 3
67 cycles for Small 3.1
83 cycles for Small 4
20 cycles for MMX 1
27 cycles for MMX 2
24 cycles for SSE1
Other's Versions:
35 cycles for Axhex2dw improved by Hutch (1)
39 cycles for Axhex2dw improved by Hutch (2)
15 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
23 cycles for Jochen's WORD-Indexed version
33 cycles for Dave's version (with minor changes)
58 cycles for Fast version
39 cycles for Fast version under AMD
50 cycles for Small 1
40 cycles for Small 2
52 cycles for Small 3
50 cycles for Small 3.1
39 cycles for Small 4
16 cycles for MMX 1
27 cycles for MMX 2
24 cycles for SSE1
Other's Versions:
44 cycles for Axhex2dw improved by Hutch (1)
39 cycles for Axhex2dw improved by Hutch (2)
15 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
35 cycles for Jochen's WORD-Indexed version
33 cycles for Dave's version (with minor changes)
==========
Codesizes:
Axhex2dw_Unrolled: 396
Axhex2dw_Unrolled_AMD: 396
Axhex2dw1 - 1: 70
Axhex2dw2 - 2: 69
Axhex2dw3 - 3: 57
Axhex2dw3_1 - 3.1: 56
Axhex2dw3 - 4: 61
Axhex2dw_MMX: 128
Axhex2dw_MMX2: 160
Axhex2dw_SSE: 160
Alex_Short_Hutch: 59
Axhex2dw_Hutch2: 54
Hex2dwLingoSSE: 160
lingo_htodw: 1950
ax_jj_htodw: 174
krbhtodw: 547
--- ok ---