I've been looking at Mark's optimisation webpage, am I correct in thinking that this code :-
movzx eax, BYTE PTR [esi]
inc esi ;or maybe add esi,1?
is faster than this:-
mov al,[esi] ;lob
inc esi ;Macro
and eax,00000000000000000000000011111111b
On a Celeron M, inc and add yield equal timings:
96 cycles for 100*movzx, inc esi
96 cycles for 100*movzx, add esi, 1
396 cycles for 100*mov al
Test it yourself:
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 1000000

.data
MainString db "This is a long string meant for testing the code", 0

.code
start:
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
    mov esi, offset MainString
    REPEAT 100
        movzx eax, BYTE PTR [esi]
        inc esi
    ENDM
    counter_end
    print str$(eax), 9, "cycles for 100*movzx, inc esi", 13, 10, 10 ; --------- end traditional way ---------

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
    mov esi, offset MainString
    REPEAT 100
        movzx eax, BYTE PTR [esi]
        add esi, 1
    ENDM
    counter_end
    print str$(eax), 9, "cycles for 100*movzx, add esi, 1", 13, 10, 10 ; --------- end traditional way ---------

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
    mov esi, offset MainString
    REPEAT 100
        mov al, [esi]   ;lob
        inc esi         ;Macro
        and eax, 00000000000000000000000011111111b
    ENDM
    counter_end
    print str$(eax), 9, "cycles for 100*mov al", 13, 10, 10 ; --------- end traditional way ---------

    inkey "--- ok ---"
    exit
end start
This is what I got:-
95 cycles for 100*movzx, inc esi
95 cycles for 100*movzx, add esi,1
371 cycles for 100*mov al
So inc & add are the same & the first method is much quicker than the second.
Thanks JJ :U
Neil,
Whether INC or ADD REG, 1 is faster depends on the processor hardware. On the PIV family ADD is faster; on most other hardware INC is faster. As most speed issues are related to memory access speed, you may not need to lose any sleep over which one you choose. Go over the algo and reduce any memory accesses that you can and you may see it go faster; twiddling between INC and ADD will very rarely ever give you any useful difference.
Thanks hutch, I'm going to stick with inc, it's quicker to type :bg
Not to mention 1/3 the size!
i must be missing sumpin - lol
LODSB
Neil, generally INC/DEC are considerably faster than ADD/SUB on the AMD Athlon processors.
As always, timing the code is the best bet. Of course, to determine this condition, this requires one actually own these processors. Too bad there isn't some service out there which could time code snippets on all major processor types. (Or a relative comparison of processor instruction latency between all the major brands.)
Thanks for that Mark, my test was done on an Intel processor but I have a spare computer with an Athlon processor, I'll fire it up tomorrow & see what the test results are on that.
Quote from: dedndave on May 01, 2009, 03:58:29 PM
i must be missing sumpin - lol
LODSB
Sorry :bg
Quote: 96 cycles for 100*movzx, inc esi
364 cycles for 100*lodsb
Generally, the lods, scas, movs etc stuff is a bit slow - with one exception: rep movsd is blazingly fast for aligned memcopies, see inter alia this post by Hutch (http://www.masm32.com/board/index.php?topic=6427.msg47991#msg47991). I use lodsb if speed is not important.
ahhhhh - that is good to know
i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Marks' page
btw - REP LODS doesn't make much sense - lol
i don't think i have ever used that
It really depends upon how you write the test. This test uses repeat 1000, and only does it once. lodsb is 3 times faster on my AMD, 4 times faster on my celeron and about 15% slower on my 1.8Ghz pentium M
[attachment deleted by admin]
trying to locate Marks' page
heliosstudios says i don't have permission to access - is that the one ?
http://heliosstudios.net/index.html.disabled
What's that? Oh that page is so antiquated, was started and never completed (like so many other things in my life, sigh.)
I thought you were talking about Mark Larson's page. That has some useful stuff on it. :bg
Quote from: Jimg on May 01, 2009, 10:48:08 PM
It really depends upon how you write the test. This test uses repeat 1000, and only does it once. lodsb is 3 times faster on my AMD, 4 times faster on my celeron and about 15% slower on my 1.8Ghz pentium M
Jim,
3968 cycles for lodsb
997 cycles for 100*movzx, inc esi
997 cycles for 100*movzx, add esi, 1
4017 cycles for 100*mov al
3993 cycles for lodsb
This is your test, but with LOOP_COUNT = 1000000. MichaelW has written the timing routines, and is in a much better position to explain what happens if you reduce the outer count to 1. I had written the REPEAT 1000 because esi is being increased when doing a lodsb - doing that a million times may have undesired side effects :wink
You can do a rougher test by allocating a large buffer for the esi memory access, and then simply use GetTickCount with a large ncnt.
i think i am looking for Mark Larsons' page - lol
i am looking for the page that has code optimization by Mark - lol
the one that Neil was refering to in the first post of the thread
any help ?
It's at the top right of the page, under Forum Links & Websites :U
that page really tells me that i have much to learn - lol
i was fairly proficient at writing fast code for the 8088
in those days, we also went for small code - not as important now - this simple fact will really change how i write code
i would say that half of what applies then still applies (per Marks' page)
half is totally different - even reverse
that is the worst case for my learning curve
having to remember which half it is in is half the battle - lol
Quote: i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Marks' page
I think hutch has mentioned this stuff in his help file about the string instructions (other than rep) being slower.
Mark's page (http://www.mark.masmcode.com/) is full of really useful hints, but don't forget his advice to time the code. On top of The Laboratory, you find the code timing macros (http://www.masm32.com/board/index.php?topic=770.0) - extremely useful.
Two minor points re Mark's page:
- Point 6, PSHUFD faster than MOVDQA: On my Celeron M, MOVDQA is a lot faster
- mmx: If you can, go for xmm instead. SSE2 is in general faster, and it avoids trashing the FPU. Using the FPU is not important for everybody, but it offers high precision at reasonable speeds - and the mmx instructions destroy the content of the FPU register. Combined mmx/FPU code is really really slow.
Precisely my point. It depends upon how you test it.
If you loop a million times, it's a good test of looping a million times.
If you execute a million copies of the instruction once, it's a good test of the instruction.
What do you want to test, looping or instruction time?
In normal code, you are calling this proc or that proc and none of them are in the cache. To test instruction timing, you have to test code that is not in the cache, just the way it is executed most of the time. Looping a million times is just silly. It's just a test of the size and speed of the cache on a particular machine, not the code.
If you're going to loop, time it properly. The current timing macros do not.
Time each loop separately. Pick the fastest one. That's the fastest the code can run.
Or pick the MEDIAN.
If you print out the times for each loop, you will see half a dozen of them that are hundreds or even thousands of times larger than the norm.
Doing an average that includes these, where windows goes off and does its thing, is not in any regard a test of the time it takes any particular piece of code to run.
Doing an average is just silly.
If you want to test real world, duplicate the code many times and time that once.
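Jimg's recipe can be sketched in a few lines of Python for illustration (the helper names are mine, and a real harness would of course read rdtsc or QueryPerformanceCounter rather than perf_counter):

```python
import time

def time_runs(fn, runs=100):
    """Time each run separately instead of averaging one huge loop."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return samples

def fastest(samples):
    """The fastest run is the one least disturbed by background activity."""
    return min(samples)

def best_half_average(samples):
    """Throw away the slowest half and average the rest."""
    kept = sorted(samples)[: max(1, len(samples) // 2)]
    return sum(kept) / len(kept)
```

With synthetic samples like [3, 3, 90, 4], best_half_average keeps [3, 3] and returns 3.0, while a plain average would be dragged up to 25 by the one interrupted run.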
i dunno Jim
seems to me that the code that i really want to optimize is that which is inside loops - that means it is in the cache
(provided it is a short enough loop - not some long abortion)
much of the code that is executed once in a program should be written for clarity and small size - not speed
however, if you generate 1000's of the same instruction, and execute them, most of them get cached also
you also measure the time required to re-load the cache every so often
i agree that you do not want to time the loop itself
it seems to me a practical method is a compromise
instead of executing the same instruction inside a loop 1000 times
or generating 1000 copies of the instruction, then executing it
make a loop of 100 copies, execute it 10 times, then subtract some agreed-to standard overhead time for the loop
Exactly. If you want to know how fast a particular chunk of code executes, e.g. where it normally loops 100 times, then test the chunk of code looping 100 times. If you ask "which instruction is fastest", then test the instruction, not looping. And in both cases, don't do it a million times and average in windows doing housekeeping.
Do what you want to test. If that is normally a loop that executed 36 times, then time how long it takes to do the loop 36 times.
The time it takes is the time it takes. If you want multiple samples, do the test again. Timing it each time. Pick either the first time (most realistic), the fastest, or the median. Not the blooming average.
also - those measured times you mentioned - when windows is doing housecleaning and other programs are executing
statistically, those data points should be thrown out altogether
nor is the fastest time going to be perfect, either
if you have a set of data points like this.....
1 10
2 13
3 11
4 19
5 14
6 12
7 11
8 12
9 17
it is good practice to toss out points 4 and 9, and take the average of the rest
even though you may be measuring other things, it is a good PRACTICAL representation of the real-world
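dedndave's toss-the-outliers idea, sketched here in Python with his nine data points; the cutoff rule (anything more than 40% above the median goes) is my own illustration, not an agreed standard:

```python
from statistics import median

def average_without_outliers(samples, slack=0.4):
    """Discard samples far above the median, then average the rest."""
    cutoff = median(samples) * (1 + slack)
    kept = [s for s in samples if s <= cutoff]
    return sum(kept) / len(kept)

# the nine data points from the post; points 4 (19) and 9 (17) get tossed
samples = [10, 13, 11, 19, 14, 12, 11, 12, 17]
```

Here the median is 12 and the cutoff 16.8, so 17 and 19 are dropped and the average of the remaining seven points comes out to about 11.86.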
I don't think the average is ever a good value, it is too prone to windows effects. If you examine real world values, the median is almost always smaller than the average, unless the code under test always triggers some kind of windows event, in which case, timing it is irrelevant. If you don't want to take the fastest, take the median. Or throw away the slowest half and average what's left. Do something to get rid of windows effects.
Also, in the real world, (lodsb) is slightly slower than (movzx eax,[esi]/add esi,1); however, also in the real world, and in how the code is usually used, the difference is insignificant because the instructions end up in a non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.
I understand what you are saying, Jim.
The fastest time may well represent the most accurate measurement of the instruction, itself.
However, the problem is that it may not.
In other words, if we run instructionA and happen to hit its best time,
then run instructionB and do not happen to hit its best time,
we have invalidated the comparison of the two.
But if we take an average of the two instructions as mentioned above,
we are likely to obtain more useful comparison information.
Let's face it, we really do not want to know how many nanoseconds each takes,
rather, we want to know which of the two is performing the best.
By averaging a set of values, we take into account that we may or may not
have measured the best performance time of each instruction.
Quote: By averaging a set of values, we take into account that we may or may not have measured the best performance time of each instruction.
I agree with everything you've said up to that point. Average will never make a measurement better.
Take a median. Take a standard deviation. The average is still way too prone to other effects. Or if you must average, average only the lower half of the results.
Well, at least we agree to disagree - lol
Truthfully, I doubt there would be much difference between the two methods.
Although you may see a static difference between the two methods, the
comparison ratios of two instructions would be very nearly the same,
assuming we tossed out the same data points.
i.e. Even though...
my method might say 100 cycles for instructionA and 120 cycles for instructionB (20% change)
your method might yield 90 and 108 (20% change)
Quote from: Jimg on May 02, 2009, 02:34:25 PM
In normal code, you are calling this proc or that proc and none of them are in the cache. To test instruction timing, you have to test code that is not in the cache, just the way it is executed most of the time. Looping a million times is just silly.
You might want to see the Timings and Cache (http://www.masm32.com/board/index.php?topic=11036.msg81254#msg81254) thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course :bg
A method that might make us both happy would be to set up a dynamic threshold.
Make measurements (some minimum number) and keep track of minima and maxima along the way.
Calculate a threshold at some arbitrary level, say 0.15 x (max-min) + min.
Once you have acquired 20 data points below that threshold, stop the measurement and calculate the average of those points.
This method would be very repeatable.
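A sketch of that dynamic-threshold scheme (my own Python rendering, with made-up parameter names): track the running min and max, and stop once 20 samples sit below 0.15 x (max-min) + min:

```python
def threshold_average(sample_source, wanted=20, level=0.15):
    """Average the first `wanted` samples below a dynamic threshold."""
    samples = []
    for s in sample_source:
        samples.append(s)
        lo, hi = min(samples), max(samples)
        cutoff = level * (hi - lo) + lo
        kept = [x for x in samples if x <= cutoff]
        if len(kept) >= wanted:
            return sum(kept) / len(kept)
    return None  # stream ended before enough quiet samples were seen
```

Feeding it a stream with one wild outlier, e.g. [100, 104, 1000, 101] followed by steady 102s, the 1000 is pushed above the threshold by its own weight and the result settles at 101.95.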
Sounds good. Have to try it out to see. The key point is that you are timing each loop, not a million loops and dividing the result by a million.
Quote: You might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes! :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy
Quote from: Jimg on May 02, 2009, 07:31:00 PM
Sounds good. Have to try it out to see. The key point is that you are timing each loop, not a million loops and dividing the result by a million.
Quote: You might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes! :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy
That sounds promising, but maybe I have not fully understood what you want - my apologies. How often would you time, for example, a BitBlt loop for a 1920*1600 screen? I mean: Not just choosing the fastest available Microsoft API, but rather rewrite the code that seems to be the bottleneck. You must have gathered some specific experience in this thread (http://www.masm32.com/board/index.php?topic=10421.msg77198#msg77198).
Quote: How often would you time, for example, a BitBlt loop for a 1920*1600 screen?
Strangely enough, just yesterday. Trying to find how to decrease the time it takes to generate a particular screen, I found that 1/3 of the time was taken up by clearing the background. So I tested-
tryclr = 1
starttimetest 7
if tryclr eq 1
    inv PatBlt, pi.td, 0, 0, mwidth, mheight, WHITENESS
else
    mov edi, dibh1.bits
    mov ecx, mwidth
    imul ecx, mheight
    mov eax, 0ffffffh
    rep stosd
endif
endtimetest 7
it turns out the first takes around 15300 clicks, and the second around 15150 clicks. So even though it's faster, it's not worth the effort.
That would be 150/45000, roughly 0.3% of the total.
This is the code I use for testing various sections-
.data?
align 8
strtime  dq 8 dup (?)   ; start counts, one slot per test number 0-7
endtime  dq 8 dup (?)   ; end counts
elapsed0 dd ?           ; elapsed0..elapsed7 are contiguous, so
elapsed1 dd ?           ; endoftest can index them as [elapsed0+esi*4]
elapsed2 dd ?
elapsed3 dd ?
elapsed4 dd ?
elapsed5 dd ?
elapsed6 dd ?
elapsed7 dd ?

.code
starttimetest macro testnum
    if DoDebug
        inv QueryPerformanceCounter, addr [strtime + testnum*8]
    endif
endm

endoftest proc testnum:DWORD
    push esi
    mov esi, testnum
    inv QueryPerformanceCounter, addr [endtime+esi*8]
    finit
    fild qword ptr [endtime+esi*8]
    fild qword ptr [strtime+esi*8]
    fsub                            ; end - start
    fistp dword ptr [elapsed0+esi*4]
    pop esi
    ret
endoftest endp

endtimetest macro testnum
    if DoDebug
        inv endoftest, testnum
    endif
endm
.
.
.
if DoDebug
printxa " qdt=",dd elapsed1,32,dd elapsed0,32,dd elapsed6,32,dd elapsed7
endif
inv SetWindowText,hWin,mbuff
It may not be as precise as doing a cpuid and rdtsc, but in the real world, it is more than sufficient.
Quote from: Jimg on May 02, 2009, 08:51:38 PM
it turns out the first takes around 15300 clicks, and the second around 15150 clicks. So even though it's faster, it's not worth the effort.
That would be 150/45000 = .003%
Sure, rep stosd is one of the exceptions where there is nothing to optimise, as you know (http://www.masm32.com/board/index.php?topic=6576.msg63583#msg63583) :bg
I recall from the many many pages of the code timing thread a general convergence towards the fastest loop timing being ideal. Does not Petezold's PROCTIMERS use the lowest cycle count of 10,000 iterations? (With far fewer iterations, the timing results are rock-steady.)
In my experience, Petezold's timer is very good.
Quote from: Mark Jones on May 02, 2009, 11:09:51 PM
I recall from the many many pages of the code timing thread, a general convergence towards the fastest loop timing also being ideal.
In my spare time, I am still trying to improve the Instr algo, and I stumble all the time over outliers like the one marked below:
Quote: Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Timings:
(_imp__strstr=crt_strstr, InstrCi=my non-SSE version, InString=Masm32 library; TestSub?=Easy, Difficult, Xtreme)
7345 _imp__strstr, addr Mainstr, addr TestSubE
8551 _imp__strstr, addr Mainstr, addr TestSubD
12789 _imp__strstr, addr Mainstr, addr TestSubX
8151 InstrCi, 1, addr Mainstr, addr TestSubE, 0
8314 InstrCi, 1, addr Mainstr, addr TestSubD, 0
10836 InstrCi, 1, addr Mainstr, addr TestSubX, 0
7458 InString, 1, addr Mainstr, addr TestSubE
9767 InString, 1, addr Mainstr, addr TestSubD
13001 InString, 1, addr Mainstr, addr TestSubX
1866 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1870 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1868 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1881 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1884 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1882 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
3814 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3727 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3830 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
Average cycle count:
2513 InstrJJ
10075 MasmLib InstringL
InstrJJ : InstringL = 24 %
Code size InstrJJ=366
I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider eliminating both fast and slow outliers, but it would require some overhead in \masm32\macros\timers.asm - such as 100 loops to calculate the expected average before starting the main exercise...?
:bg
Now do you understand why I only ever test in real time ?
Quote from: hutch-- on May 03, 2009, 06:20:32 AM
:bg
Now do you understand why I only ever test in real time ?
Hutch, you want to provoke me to write
"Now I understand why your code is so slow". But nope, I will not let you provoke me, and I will not write such nasty things about you!!!
:bg
JJ,
The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts. The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count. The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur. Here are the typical results of the code from the attachment running on my P3:
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
441 271
Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.
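The lowest-count idea MichaelW describes can be caricatured in Python (the helper name is mine, and time.sleep(0) is only a rough stand-in for the macros' fresh-time-slice trick):

```python
import time

def lowest_count(fn, loops=50):
    """Keep the lowest single-pass time, on the assumption that the
    lowest pass is the one not interrupted by a context switch."""
    best = float("inf")
    for _ in range(loops):
        time.sleep(0)  # yield first, roughly like starting a new time slice
        t0 = time.perf_counter()
        fn()
        dt = time.perf_counter() - t0
        best = min(best, dt)
    return best
```

Any pass that catches a context switch simply loses to a quieter pass, so the result converges without needing a million iterations.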
[attachment deleted by admin]
:bg
You should know by now that these things leave me sitting up at night, wringing my hands in between wiping the tearstains from my face while losing sleep about it. Further, I have dismally failed to write the world's fastest MessageBoxA() after 20 years of trying, and to cap it off I still can't get SSE4.5 to run on a 486. Such may be the case with matters of such great importance, but when it comes to timing an algo I have done it the right way for many years: design the test/timing method to fit the task, then make it as big as you can fit in memory and bash it long enough to reduce the variations to below 1%. Intel spec 3% but true fanaticism requires better. :bdg
The mad thievish gipsy uses part of my strlen code from here (http://www.masm32.com/board/index.php?topic=1807.240) to produce lame slow code and is shameless enough to post it everywhere... :bdg
No offence, but who uses .if/.elseif/.else or preserves ecx and edx in speed-critical algos? IMO idiots in assembly, and here is the result: :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ: 38 ; clocks: 99
InString - Lingo: 37 ; clocks: 39
Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ: 1008 ; clocks: 22567
InString - Lingo: 1007 ; clocks: 6294
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 712
InString - Lingo: 1008 ; clocks: 502
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 6600
InString - Lingo: 1008 ; clocks: 1418
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 5426
InString - Lingo: 1008 ; clocks: 1308
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 629
InString - Lingo: 1008 ; clocks: 498
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ: 1010 ; clocks: 625
InString - Lingo: 1009 ; clocks: 502
Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ: 0 ; clocks: 781
InString - Lingo: 1001 ; clocks: 102
Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ: 1002 ; clocks: 611
InString - Lingo: 1001 ; clocks: 512
Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ: 1002 ; clocks: 625
InString - Lingo: 1001 ; clocks: 435
Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ: 1002 ; clocks: 635
InString - Lingo: 1001 ; clocks: 496
Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ: 1002 ; clocks: 795
InString - Lingo: 1001 ; clocks: 638
Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ: 1127625 ; clocks: 679836
InString - Lingo: 1127624 ; clocks: 543385
Press ENTER to exit...
Quote from: lingo on May 03, 2009, 12:49:30 PM
IMO idiots in assembly
Lingo, post code, not insults.
Hutch, is it possible to add an icon that says "middle finger"
something like this ? :lol
(http://upload.wikimedia.org/wikipedia/commons/thumb/3/36/The_gesture02.jpg/180px-The_gesture02.jpg)
:dazzled:
It must be the silly season, everbody seems to be unhappy. :boohoo:
To get back on topic-
Quote from: jj2007 on May 03, 2009, 05:21:49 AM
I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider to eliminate both fast and slow outliers, but it would require some overhead in \masm32\macros\timers.asm - such as a 100 loops to calculate the expected average before starting the main exercise...?
Sometimes I think the words I write just stay local to my machine, echo back to me when I read a thread, but never actually go where anyone else can see them ::)
This is exactly what I have been complaining about the last few pages.
Time each execution of the code. Throw away the slowest half, because something was obviously going on in a windows background process. If you do this, there is no need to run it a million times; the fastest values are the ones that weren't affected by something else. I have found that 100 iterations is more than enough: either throw away the slowest half and average the rest, or just pick the fastest one. Doing either of these I get rock solid consistent results. That's why I say it's silly to loop a million times.
I am beginning to agree with you Jim - lol
(that'll cheer him up, for sure, Steve)
I think part of the problem may be the length of time each test takes.
Let's call each pass a "burst" - not a good term for purists, perhaps, but it is descriptive.
If the burst period is brief, in terms of CPU time, the occurrence of anomalies will be minimized.
Of course, if it is too brief, the time measurements have too much overhead.
Carefully selecting the length of each burst seems important.
In fact, the iterations per burst, should be adjusted until the burst period falls within a certain window.
This seems to make sense, particularly when comparing one "Instruction Sequence Under Test" to another.
If we want to compare two ISUT's, we should adjust one or both iteration counts until the burst lengths are nearly the same.
Then, run them enough times to assure acquisition of the fastest time, as Jim suggests.
Also, it should not be difficult to run time measurements on the overhead and subtract it from the results.
This overhead will vary from one platform to another, the same as any other instruction sequence.
Any time measurements we agree upon should 1) produce predictable accurate results with known ISUT's, and
2) produce stable and repeatable results on several platforms with several ISUT's.
As far as 32-bit code is concerned, I am a novice, to be sure.
But, after 30 years in Electronics Engineering, I do have some experience devising certification/verification tests.
Certainly, there is much for me to learn about all the CPU's out there, but basics are basic and statistics still apply.
this subject has piqued my curiosity, for some reason - lol
Unfortunately, I don't feel qualified to develop the entire program, myself. There is too much
about 32-bit code that I have yet to learn in order to cover all the bases.
One simple question popped into my head, though. It has to do with register dependencies.
As most of us know, the NOP instruction was derived from XCHG AX,AX, or XCHG EAX,EAX in protected mode.
I think the assembler will code 90h for either. These timing routines you guys use should show results easily.
ISUT #1:
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
ISUT #2
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
ISUT #3
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
I wonder if NOP is dependent on the value in AX/EAX?
My guess is that the microcode is smart enough to know better.
Those guys at Intel are pretty sharp.
btw - is there a "standard" timing test program used here in MASM32 forum,
or are the results I see posted using different code?
Quote from: MichaelW on May 03, 2009, 07:23:08 AM
JJ,
The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts.
Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is
measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.
Quote
The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count.
Are these the second attachment in the sticky Lab post (http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281)? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen
REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen
Quote
The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur.
This would somehow imply that the loop to be timed must need more than 20ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?
Quote
Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.
Indeed ;-)
@dedndave: Most who do timings here use MichaelW's macros, see first post in the Laboratory.
Quote from: jj2007 on May 03, 2009, 06:37:33 PM
Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from?
My take on this is that there is always going to be some difference between hardware and OS, even depending on the running apps, so +/-3% is not worth the hassle. Just take the fastest time, and consider it "the fastest time." For user-mode code, there are always going to be things to slow it down. The fastest time at least gives a baseline speed value.
I guess that means for real-mode code, a new set of "timers" need to be created. :bg
Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time
-120 cycles, rol ecx,32
For example?
Quote from: jj2007 on May 03, 2009, 06:37:33 PM
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.
With counts in the range of several thousand cycles, I suspect that all of the results include interruptions. I have no idea how to account for the "constant" 3%, but it seems plausible to me that the system is performing some activity that in bursts is using 3% of the processor time.
Quote
Are these the second attachment in the sticky Lab post (http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281)? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen
REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen
Yes, in counter2.zip. Your results look like P4 results, in code without any significant delay after the app starts and before entering the timing loops. Try implementing a 3-4 second delay at the start and, if you have one, try running the test on a non-NetBurst processor. Also, it's not reasonable to expect meaningful counts on any processor for an instruction that executes in less than one clock cycle, or, given the inability to isolate the timed instructions from the timing instructions, for anything that takes only a small number of clock cycles. I would be interested to see the results for my code from the attachment, both with and without the delay at the start.
Quote
This would somehow imply that the loop to be timed must need more than 20ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?
Sorry, I copied most of the text from the documentation for similar macros that I implemented in another language, where I expected some of the readers to have no idea of what a clock cycle actually is.
Michael,
Here are the timings for counter2.zip, on a Celeron M (not a P4):
Quote
99
99
444 192
444 192
444 192
444 192
444 192
444 192
444 192
444 156
444 156
444 156
444 192
444 192
444 156
444 156
444 192
444 192
444 192
444 192
444 192
444 192
Microsoft blames rdtsc for the negative cycles and outliers, and suggests QPC plus clamping (http://msdn.microsoft.com/en-us/library/bb173458.aspx)...
Quote from: jj2007 on May 03, 2009, 07:57:58 PM
Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time
-120 cycles, rol ecx,32
For example?
Clearly the method for determining the loop overhead will occasionally itself catch Windows glitches. Perhaps you need to test the loop overhead a hundred times and use the smallest result. Subtracting the smallest overhead from the fastest time should be quite consistent.
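Mark's recipe - sample the overhead many times, keep the smallest reading, and subtract it from the fastest timing of the code under test - can be sketched outside MASM. This Python mock-up only illustrates the methodology: perf_counter_ns stands in for rdtsc, and empty/work are made-up placeholders:

```python
import time

def measure_ns(fn, runs=100):
    """Time fn repeatedly; keep the smallest (least-disturbed) reading."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn()
        t1 = time.perf_counter_ns()
        best = min(best, t1 - t0)
    return best

def empty():
    pass                      # the 'loop overhead' reference

def work():
    sum(range(1000))          # stand-in for the code under test

overhead = measure_ns(empty)
fastest = measure_ns(work)
print(max(fastest - overhead, 0), "ns, smallest overhead subtracted")
```

Clamping at zero avoids the negative readings discussed above when a glitch lands in the overhead run.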
from what i can see, there is no good way to time these functions - lol
this is especially true for those (myself included) that have a dual-core CPU
one thing MS suggests to help is to confine the thread to a single core
now, how are you supposed to evaluate the advantage of having two cores ? - lol
it appears that the burst needs to be substantially long - which introduces anomalies
i think, at some point, you have to call it "good enough" and accept what you get
is it possible to....
switch to real mode
get a value from the motherboard counter-timer
switch to protected mode
ISUT burst sample
switch back to real mode
get another value from the motherboard counter-timer
switch back to protected mode
yield result
i realize the overhead is large, but it could be subtracted out
scratch that idea - real mode doesn't get you access to the counter-timer, either
let me dust off my stopwatch - lol
:bg
I wonder how long it will take for everyone to realise that real-time testing on large samples is one of the few techniques that does not suffer from most of these problems. Tailor the test data to the task at hand, make it BIG enough and run it LONG enough, and you will get under the magic one percent.
There is yet another variation of this, allocate so much memory that it cannot fit into cache then copy the test data in blocks larger than the cache sequentially in the allocated memory then do random rotation of the bits being read and you will really see how bad some algos are.
I tend to agree, more and more each day. It's not nearly as much fun, but real world testing is the way to go.
............ and test it on a few different CPU's
i fear optimization that leans toward the authors CPU only
Quote
Microsoft blames rdtsc for...
That article is not about timing code in clock cycles; it's about implementing high-resolution timers. A variable clock speed makes the TSC useless as a time base for a timer. With multiple processors/cores there are multiple TSCs, typically running asynchronously, causing obvious problems if the timer thread is not confined to a single processor/core. For both reasons, the article recommends using the high-resolution performance counter.
For timing code in clock cycles the problem with multiple processors/cores should be solvable by restricting the code to running on a single processor/core using SetProcessAffinityMask or SetThreadAffinityMask. For timing code in clock cycles a variable clock speed does not present a problem, but it does present a serious problem for timing in units of time.
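For completeness, here is how the single-core restriction might look outside Win32. The thread discusses the Windows calls SetProcessAffinityMask / SetThreadAffinityMask; on Linux the analogous call is sched_setaffinity, which Python's os module exposes. A sketch, assuming a POSIX system:

```python
import os

def pin_to_one_cpu():
    """Restrict the calling process to a single logical CPU, so that all
    TSC readings come from the same core (the Win32 equivalent is
    SetProcessAffinityMask with a one-bit mask)."""
    first = min(os.sched_getaffinity(0))   # lowest CPU we may run on
    os.sched_setaffinity(0, {first})       # pid 0 = calling process
    return os.sched_getaffinity(0)

print(pin_to_one_cpu())   # a one-element CPU set
```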
Quote from: lingo on May 03, 2009, 12:49:30 PM
No offense, but who uses .if .elseif .else or preserves ecx and edx in speed-critical algos? IMO idiots in assembly, and here is the result: :lol
hmm, as being an idiot, i've just a few things to say (stupid stuff, certainly...):
preserving registers is a programming convenience: it allows you NOT to lose your time debugging, and it allows you to USE EVERY register (instead of considering some of them less usable than the others...).
plus, since, by essence, it's NOT IN the speed-critical algos (it only wraps them), who cares about extra clock cycles executed just once? (IMO people who have never understood what coding consists of...), yep a few clock cycles areN'T measurable by a human (or maybe the algo doesn't need human interaction?).
plus2, (if we go this way...) can you explain to me why you preserve ebx/esi/edi? it's required by the operating system, no? so why are you doing the job once more? (especially if speed is critical, IMO some people have a weird logic...). it's quite understandable that the operating system has the table/source/destination registers preserved "for you", BUT YOU?
Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this, allocate so much memory that it cannot fit into cache
no, coz here you will just measure useless stuff (data reads, which is something we don't care about). or you must also flush the code from the cache, and also the addresses for branch mis/predictions. the result you obtain will essentially consist of data reading (something incompressible), and you will NOT SEE the effect of the algo.
Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then copy the test data in blocks larger than the cache sequentially in the allocated memory
what's the interest? knowing the cost of reading the data again? UNLESS YOU DON'T READ THE SAME THINGS, IT'S THE SAME RESULT IN ALL CASES... (and if you don't read the same things, what the hell are you comparing?)
Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then do random rotation of the bits being read and you will really see how bad some algos are.
no, random access will not allow us to know if it's a memory aliasing case or not. plus, here again, measuring the reading of data has no interest.
MichaelW's macros measure the work from one point (the beginning) to another (the end), and divide the result by the number of loops. and IMO that's the way to test algos. OS interaction certainly alters the results a bit, but that's ALSO the case in normal use.
so the results ARE consistent (and focus on the algo, coz with the cache the code/data reads + branch mispredictions (things that we don't care about, in most cases where speed is critical...) are absorbed by the loop). the "consistency" problem comes from other factors, and also from HOW you use the algo IN your app. to finish, with random access we would not obtain the normal benefit of the hardware prefetching (so the results obtained would be more than debatable... coz not reflecting normal use...).
PS : it was just to join the unhappy club... :bdg
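The scheme defended here - take one reading around a long burst of iterations and divide by the loop count, so that OS interruptions are amortised - might look like this in portable form. A hedged sketch only: perf_counter_ns stands in for rdtsc, and the loop count and algo are illustrative placeholders:

```python
import time

LOOP_COUNT = 100_000          # illustrative, like the macros' LOOP_COUNT

def algo():
    """Placeholder for the routine being timed."""
    x = 0
    for i in range(10):
        x += i
    return x

t0 = time.perf_counter_ns()   # one reading before the burst...
for _ in range(LOOP_COUNT):
    algo()
t1 = time.perf_counter_ns()   # ...and one after

per_call = (t1 - t0) / LOOP_COUNT   # OS noise is spread over the burst
print(f"{per_call:.1f} ns per call")
```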
I wonder is there another assembly idiot in this forum who preserves always ecx and edx by default?
NightWare,
You appear to have missed why you avoid preloaded data in cache: you get false readings by having the data in cache, and this does not help you with real-world situations where the data is rarely ever in cache. The method I described brings into play a phenomenon called "page thrashing" which really does slow down the phony readings you get from data in cache.
In a relatively small number of situations (primarily testing instruction sequences) you can benefit by testing on highly localised data that is very small, but in most instances the test is useless in emulating real-world situations that regularly occur in daily software use.
By allocating a much larger block of memory than will fit into cache you force the algorithms to read the data and process the data directly so all of the processor factors come into play, branch prediction, instruction order, pipeline effects, pairing etc .....
If I have learnt one thing over the years of tweaking C algos in assembler, it is this: reduce the number of memory accesses and you will see real-world speed increases, where the sum total of the rest shows only minor and trivial improvements in timing.
Quote from: lingo on May 04, 2009, 02:42:19 AM
I wonder is there another assembly idiot in this forum who preserves always ecx and edx by default?
Chill out, this is a programming forum, no need to have the attitude of a 14-year old mmorpg player. Why do you see anything as a skill contest ?
Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this, allocate so much memory that it cannot fit into cache then copy the test data in blocks larger than the cache sequentially in the allocated memory then do random rotation of the bits being read and you will really see how bad some algos are.
I doubt whether the "random rotation" will change a lot, but the large data block has been tested already in Timings and Cache (http://www.masm32.com/board/index.php?topic=11036.msg81254#msg81254) (and probably earlier, I doubt that I invented this technique :wink):
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Masm32 lib szLen 126 cycles
crt strlen 101 cycles
strlen32s 33 cycles
strlen64LingoB 28 cycles
_strlen (Agner Fog) 30 cycles
Same but with a scheme that eliminates the influence of the cache:
Masm32 lib szLen *** 672 ms for 7078488 loops
crt strlen *** 609 ms for 7078488 loops
strlen32s *** 328 ms for 7078488 loops
strlen64LingoB *** 343 ms for 7078488 loops
_strlen (Agner Fog) *** 344 ms for 7078488 loops
Differences between algos remain significant (I see Nightware has doubts about that, although I fully agree with 99% of this post), but they are much smaller than for the same tests with data in the cache. Which is nothing sensational. In most cases, we can assume data is in the cache, but in the case of a virus scanner, for example, terabytes have to be read and scanned, no data in cache, therefore "cache-free" timing algo needed.
:bg
Quote
I doubt whether the "random rotation" will change a lot, but the large data block has been tested already in Timings and Cache (and probably earlier, I doubt that I invented this technique
Here is a man who has yet to test page thrashing. At its crudest, do the test so that each successive read is further away than the cache size and watch the timings crash. To produce a long enough timing, randomly pick addresses that are more than the cache size apart and watch your timings change, not by milliseconds but by seconds.
To complicate the matter, try both temporal and non-temporal reads and writes to see the difference and why non-cached writes are useful.
Most of the testing done with small samples already loaded in cache are a waste of space that don't reflect how the algo performs in real world situations.
Quote from: hutch-- on May 04, 2009, 10:49:05 AM
Most of the testing done with small samples already loaded in cache are a waste of space that don't reflect how the algo performs in real world situations.
Hutch,
As I mentioned in my post, my test is constructed in a way that it does work on non-cached data because the allocated buffer is far beyond cache size. But I am really curious to see a code sample that supports your "seconds instead of milliseconds" statement.
:bg
It was a bit big to upload, it was about 1.5 gigabytes of code that built to about a 350 meg test piece. Now what you are testing is very simple, make EVERY read reload the current memory page and suddenly it all gets SSSLLLLLOOOOOOOOOWWWWWWWW. If you are not getting this effect, you are doing something wrong.
Hutch,
The program e1 seems to be using is not the one i found in the laboratory 1st thread.
The one they use IDs the CPU at the beginning. Where is a link to that one ?
Hutch,
Maybe we misunderstand each other. Here is the (simplified) core piece of my test code:
mov esi, len(offset arg) ; the length of the string whose zero delimiter we want to find
..
mov ebx, LoopCt
invoke GetTickCount
push eax ; save timer
push ebx ; save counter
.Repeat
invoke pAlgo, edi ; pAlgo=strlen, strlen32s, etc.
add edi, esi ; move start position higher with EACH loop
dec ebx
.Until Zero?
invoke GetTickCount
pop ebx
pop ecx
sub eax, ecx
print str$(eax)," ms for "
print str$(ebx)," loops",13,10
Since the size of FatBuffer is 100 MB, as far as I can see, in 99% of all cases the strlen algo must read from memory rather than from cache. But please correct me if I am wrong - I freely admit that I am uncertain. I'd like to understand it ::)
Output:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated
codesizes: strlen32s=85, strlen64A=120, strlen64B=87
-- test 16k, misaligned 0, 16384 bytes
Masm32 lib szLen *** 62 ms for 6096 loops
crt strlen *** 32 ms for 6096 loops
strlen32s *** 31 ms for 6096 loops
strlen64LingoB *** 47 ms for 6096 loops
_strlen (Agner Fog) *** 47 ms for 6096 loops
-- test 4k, misaligned 11, 4096 bytes
Masm32 lib szLen *** 63 ms for 24304 loops
crt strlen *** 32 ms for 24304 loops
strlen32s *** 31 ms for 24304 loops
strlen64LingoB *** 32 ms for 24304 loops
_strlen (Agner Fog) *** 31 ms for 24304 loops
JJ - where can I obtain this test program?
Quote from: dedndave on May 04, 2009, 01:38:35 PM
JJ - where can I obtain this test program?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated
Here it is. You need to copy slenSSE2.inc to \masm32\include\slenSSE2.inc. Same for include \masm32\macros\timers.asm (not included, to avoid version confusion; see the top post of The Laboratory).
No warranties about mental sanity etc. after reading my code. Look for AlgoTest inside the asm file...
[attachment deleted by admin]
JJ,
Your example is a linear read. The way to test this is to allocate 100 meg as in your example, write the same 1 meg of data to each megabyte of the allocated buffer, then read the same piece of data from each of the 1 meg blocks that make up the 100 meg buffer. Do it either randomly or in a preset order to ensure that no read is sequential, and make the piece of data small, 32 bytes or similar.
If you do this, on each read you force the processor to reload the page table entries, which ensures you have no cache hits. Watch it drop to a snail's pace in comparison to cached reads.
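The recipe above - many small reads, each landing further apart than the cache, optionally in shuffled order - can be sketched in portable terms. This Python mock-up only demonstrates the access pattern; the buffer size, stride and 32-byte read size are illustrative choices, and interpreted-Python timings would say nothing about cache behaviour, so the sketch just verifies that both orders read the same data:

```python
import random

BUF_SIZE  = 64 * 1024 * 1024   # far larger than a typical cache (illustrative)
READ_SIZE = 32                 # small piece, as suggested above
STRIDE    = 4 * 1024 * 1024    # successive reads land further apart than the cache

buf = bytearray(BUF_SIZE)      # zero-filled stand-in for the test data

def read_offsets(shuffled):
    """Offsets of each small read; shuffling defeats sequential prefetch."""
    offs = list(range(0, BUF_SIZE, STRIDE))
    if shuffled:
        random.shuffle(offs)
    return offs

def touch(offsets):
    """Checksum READ_SIZE bytes at each offset (forces the reads)."""
    return sum(sum(buf[o:o + READ_SIZE]) for o in offsets)

# Same bytes either way, only the order differs; on real hardware the
# shuffled order is what exposes cache misses and page-table reloads.
assert touch(read_offsets(False)) == touch(read_offsets(True))
```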
Thanks JJ,
I ran your exe. My CPU takes a bit more time than yours, as it is a dual core.
I thought the program ID'ed dual-cores.
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
200000000 bytes allocated
codesizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes
-- test 16k, misaligned 0, 16384 bytes
Masm32 lib szLen ** 188 ms for 12193 loops
crt strlen ** 93 ms for 12193 loops
strlen32s ** 47 ms for 12193 loops
strlen64LingoB ** 47 ms for 12193 loops
NWStrLen ** 47 ms for 12193 loops
_strlen (Agner Fog) ** 47 ms for 12193 loops
-- test 4k, misaligned 11, 4096 bytes
Masm32 lib szLen ** 172 ms for 48611 loops
crt strlen ** 94 ms for 48611 loops
strlen32s ** 47 ms for 48611 loops
strlen64LingoB ** 47 ms for 48611 loops
NWStrLen ** 47 ms for 48611 loops
_strlen (Agner Fog) ** 62 ms for 48611 loops
Quote from: hutch-- on May 04, 2009, 02:28:29 PM
read the same piece of data from each of the 1 meg buffers that make up the 100 meg buffer. Do it either randomly or a preset order to ensure that no read is sequential and make the piece of data small, 32 bytes or similar.
Could you give a real life example of an application that would behave like this? I chose linear read because I thought of e.g.
- reading Windows.inc into a buffer, find first occurrence of "Duplicate.inc"
- a virus scanner reading word.exe into a buffer, find first occurrence of a pattern
etc.
In any case I hope we can agree that during the life cycle of the loop, i.e. between the two GetTickCounts, the code reads memory into the cache, at a buffer size of 200 MB.
Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that
From what little testing I have done on a dual-core system my small, simple test apps seemed to run on one core only. I have yet to see any clear demonstration where having multiple cores provided a performance advantage.
it may be that it hasn't been measured yet
until we devise a test that accommodates more than one core, we won't really know
In my experience with the AMD Athlon dual-core, Windows likes to assign a single-thread process to one CPU core, and it alternates cores for each new process. I.e., if you open up two single-threaded programs, they both run on a separate core. So yes, two things can be running at the same time. Of course, they have to share the same busses, so it is not exactly 2x the performance.
Programs like BOINC spawn new worker processes to utilize all available cores. Quick and easy solution. Applications which are multi-threaded utilize the additional cores by creating threads to run on each core. New threads may alternate cores like processes do; I will have to look into that. But thread affinity can also be set, to force the thread to only run on the selected core.
There is performance to be had in utilizing additional cores, but with this comes added complexity. If an app uses two threads on different cores, then the programmer must make provisions for synchronization. If the threads need to communicate with each other, then the EnterCriticalSection and LeaveCriticalSection APIs are helpful to guarantee things don't get desynchronized.
As a side note, a single-CPU system can run a multi-threaded app just fine. Each thread just runs on the same core, and time slices are divided between the threads. There is little overhead in the thread switching, something like a few thousand clocks.
When it comes to implementing multi-threading in general, the best design concept seems to be that of one master-thread which "doles out work units" to n independent worker threads, where n is the number of cores detected at startup. This concept guarantees total processor usage, and is tolerant of thread timing variance. (Sorry if this was a little more info than necessary, lol.)
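The master/worker design described above - one master doling out work units to n workers, n = cores detected at startup - maps directly onto a standard thread pool. A minimal Python sketch, where the squaring "work unit" is a made-up placeholder:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work_unit(n):
    """Stand-in for one real work unit handed out by the master."""
    return n * n

def master(jobs):
    """Dole out work units to n workers, n = number of detected cores."""
    n_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work_unit, jobs))

print(master(range(5)))   # [0, 1, 4, 9, 16]
```

pool.map preserves input order, so no extra synchronization is needed here; workers that shared mutable state would still need the critical-section style locking mentioned above.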
well, that is kind of what I thought, too
but, when I run a simple timing test, the numbers tell me otherwise....
Reference null tests:
Null: 20 clocks.
10x NOP: 25 clocks.
Failure-mode CMP tests:
10x CMP REG,REG: 7 clocks.
10x CMP REG,IMMED: -210 clocks.
10x CMP MEM,REG: -202 clocks.
10x CMP MEM,IMMED: -202 clocks.
Success-mode CMP tests:
10x CMP REG,REG: 1 clocks.
10x CMP REG,IMMED: 554189126 clocks.
10x CMP MEM,REG: -202 clocks.
10x CMP MEM,IMMED: -202 clocks.
Failure-mode TEST tests:
10x TEST REG,REG: 1356305252 clocks.
10x TEST REG,IMMED: 554189125 clocks.
10x TEST MEM,REG: 6 clocks.
10x TEST MEM,IMMED: 54 clocks.
Success-mode TEST tests:
10x TEST REG,REG: 18 clocks.
10x TEST REG,IMMED: 15 clocks.
10x TEST MEM,REG: -330 clocks.
10x TEST MEM,IMMED: 47 clocks.
it seems obvious that the counters are coming from the 2 cores
that kind of implies that this single process is running on both cores, no ?
btw - i am using XP
- this could well be OS dependant
Quote
it seems obvious that the counters are coming from the 2 cores
If you think that is so, then try restricting the process to the first core by adding these statements to your source somewhere above the tests:
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
And you might also want to try the second core, specified with an affinity mask value of 2.
JJ,
> Could you give a real life example of an application that would behave like this?
I just don't have time to write a test piece for you but I wonder what is the problem. The bottom line is ensure that each read is not in cache and that the size of each read is small, a sample of less than 32 bytes comes to mind.
Real-life examples are things like a small in-memory database under 2 gig in size, a very large table of preset data, anything that is large enough to be useful that is loaded directly into memory and accessed in a random manner.
To simulate conditions of this type, ensure the reads are NOT linear and not in cache. An algorithm is only as good as it performs under conditions of this type, and the small test pieces that repeatedly bash the same address in cache almost never emulate these conditions effectively.
"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"
NightWare,
It would be better to teach your lovely kleptomaniac how to preserve ecx and edx faster :lol
For your info kleptomania is an inability or great difficulty in resisting impulses of stealing.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder
Kleptomania can occur after traumatic brain injury...etc.
Example:
What does
"inner loop inspired by Lingo, with adaptions"
mean for the kleptomaniac?
For the kleptomaniac it means copy and paste.... :lol
As an idiot he preserved ecx and edx again (because NightWare preserves registers) and his program will become 'faster' on his 'special' CPUs.
From another point of view, it is not a big deal for anyone on this forum to beat the kleptomaniac's code. Just take a look: :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
codesizes: strlen32=80, strlen64A=93, _strlen=66
-- test 16k return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen : 11096 cycles
strlen32 : 1577 cycles
strlen64LingoA : 1511 cycles
_strlen (Agner Fog): 2761 cycles
-- test 4k return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen : 2727 cycles
strlen32 : 416 cycles
strlen64LingoA : 395 cycles
_strlen (Agner Fog): 707 cycles
-- test 1k return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen : 726 cycles
strlen32 : 97 cycles
strlen64LingoA : 77 cycles
_strlen (Agner Fog): 192 cycles
-- test 0 return values Lingo, jj, Agner: 191, 191, 191
crt_strlen : 148 cycles
strlen32 : 23 cycles
strlen64LingoA : 18 cycles
_strlen (Agner Fog): 59 cycles
-- test 1 return values Lingo, jj, Agner: 191, 191, 191
crt_strlen : 152 cycles
strlen32 : 38 cycles
strlen64LingoA : 33 cycles
_strlen (Agner Fog): 40 cycles
-- test 4 return values Lingo, jj, Agner: 191, 191, 191
crt_strlen : 147 cycles
strlen32 : 23 cycles
strlen64LingoA : 18 cycles
_strlen (Agner Fog): 42 cycles
-- test 7 return values Lingo, jj, Agner: 191, 191, 191
crt_strlen : 150 cycles
strlen32 : 23 cycles
strlen64LingoA : 18 cycles
_strlen (Agner Fog): 40 cycles
Press any key to exit...
[attachment deleted by admin]
Quote from: lingo on May 05, 2009, 03:22:06 AM
"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"
NightWare,
Will be better to teach your lovely kleptomaniac how to preserve ecx and edx faster :lol
For your info kleptomania is an inability or great difficulty in resisting impulses of stealing.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder
Kleptomania can occur after traumatic brain injury...etc.
Lingo, please seek professional advice, at least on the definition of kleptomania and its application in code development.
A propos code: Compliments, it seems your sense of competition is still working fine. Your code beats mine in most cases, except Test 1 (on my archaic Celeron M). Will you make it public domain, or will you sue thieves?
:bg
I have worked out what this antagonism is at last, it must be something in the water. Has there been a reactor leak recently in the EU, or perhaps a chemical spill, or even worse, is the EU water supply fed directly from the GRAY Danube (used to be blue)? Perhaps Berlusconi washed his socks in it, or even worse, Tony Blair gave a speech nearby and it ended up full of sewage.
Now I think there is only one solution: force the contestants to drink bottled water from African water supplies, or perhaps Indian ones, so that they end up with such a severe case of the trots that they don't have time to throw the surplus medium at each other. :clap:
everyone knows - if you go to Mexico - don't drink the water - just tequila
:bg
Dave,
We don't want them to drink Mexican water, they may kiss a pig later. :P
Quote from: hutch-- on May 04, 2009, 11:52:50 PM
JJ,
... ensure the reads are NOT linear and not in cache. An algorithm is as good as it performs under conditions of this type
If and only if you have that type of application - a database in memory that needs many thousand random accesses per second. How realistic is that? Maybe Google needs it that way ::)
As I mentioned earlier, a virus scanner, or a "find RtlZeroMemory in all *.asm files" algo would behave in the way my test was designed.
Quote from: hutch-- on May 05, 2009, 11:46:04 AM
it must be something in the water. Has there been a reactor leak recently in the EU ...
Very funny, Sir Hutch. What is your official policy in this forum regarding calling other members (Nightware, myself)
idiots? Do you recommend it nowadays officially? Do you prefer other labels, can you make suggestions? I have been tempted many times, but until now my good education stopped me from answering in the same language. However, you seem to like this style. What do other members of the forum think about it?
"Your code beats mine in most cases.." Let's see what is "your" and what is "mine"
strlen64B proc szBuffer : dword
pop ecx
pop eax
movdqu xmm2, [eax]
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pxor xmm1, xmm1
pmovmskb edx, xmm2
test edx, edx
jz @f
bsf eax, edx
jmp ecx
@@:
lea ecx, [eax+16]
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
pcmpeqb xmm1, [eax+32]
por xmm1, xmm0
add eax, 32
pmovmskb edx, xmm1
test edx, edx
jz @B
shl edx, 16
sub eax, ecx
pmovmskb ecx, xmm0
or edx, ecx
mov ecx, [esp-8]
bsf edx, edx
add eax, edx
jmp ecx
strlen64B endp
strlen32s proc src:DWORD ; with lots of inspiration from Lingo, NightWare and Agner Fog
pop eax ; trash the return address
pop eax ; the src pointer
pxor xmm0, xmm0 ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
mov edx, eax ; save pointer to string
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we are done
mov MbGlobRet, edx ; edx preserved because Masm32 szLen preserves it
and edx, -16 ; align initial pointer to 16-byte boundary
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@:
pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B
@@:
sub eax, [esp-4] ; subtract original src pointer
shl edx, 16 ; create space for the ecx bytes
push ecx ; all registers preserved, except edx and eax = return value
pmovmskb ecx, xmm0 ; set byte mask in ecx (has to be repeated, sorry)
or edx, ecx ; combine xmm0 and xmm1 results
bsf edx, edx ; bit scan for the index
pop ecx
lea eax, [eax+edx-32] ; add scan index
mov edx, MbGlobRet
Lt16:
jmp dword ptr [esp-4-4] ; ret address, one arg - the Lingo style equivalent to ret 4 ;-)
strlen32s endp
Hutch, I appreciate your knowledge about water old link (http://www.webmd.com/food-recipes/features/top-6-myths-about-bottled-water) but it would be better to see your opinion again about ecx and edx preservation old link (http://www.masm32.com/board/index.php?topic=4205.msg41584#msg41584)
Everyone (including sick people) can do with my code what they want, but when someone tolerates idiotic behavior such as useless register preservation, I can't be quiet.
:bg
> How realistic is that?
Extremely, I used to write them; they are called fixed length records and will generally rip the titz off a relational database.
RE: various forms of name calling, I have explained that admin has enough trouble making mountains into molehills but there is an easy way that we try and avoid, the bulldozer approach is to shut the topic and move it to the trash can, that turns mountains into flat plains very quickly. :bdg
Quote from: lingo on May 05, 2009, 01:43:44 PM
Let see what is "your" and what is "mine"
strlen64B proc szBuffer : dword
pop ecx
pop eax
....
strlen32s proc src:DWORD
pop eax ; trash the return address
pop eax ; the src pointer
Quote from: jj2007 on November 25, 2008, 08:57:12 PM
The SetSmallRect procedure looks ok. Here is another one, just in case - only 20 bytes long and pretty fast.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
SetRect16 proc ps_r:DWORD,left:DWORD,top:DWORD,right:DWORD,bottom:DWORD
pop edx ; trash the return address
pop edx ; move the first argument to edx
pop dword ptr [edx].SMALL_RECT.Left
pop dword ptr [edx].SMALL_RECT.Top
pop dword ptr [edx].SMALL_RECT.Right
pop word ptr [edx].SMALL_RECT.Bottom
sub esp, 5*4+2 ; correct for 5 dword + 1 word pop, restore return address
ret 5*4 ; correct stack for five arguments
SetRect16 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
etc etc - so who is the thief here?
But I simply don't have the time to follow Lingo's game. Nobody "steals" here, I even acknowledge Lingo when I take over bits of his code. The Intel set has only a limited number of mnemonics, so it is inevitable that certain sequences pop up all over the place. Try this Google search for pcmpeqb pmovmskb (http://www.google.it/search?hl=en&safe=off&num=50&newwindow=1&ei=u00ASq6_KMSv-QbU0Y3hBw&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=pcmpeqb+pmovmskb&spell=1). Does he ever admit where he gets his inspiration? Does he "steal" from Intel when he uses their manuals?
I couldn't care less for Lingo, but the whole forum loses credibility if members are insulted as "idiots" and "thieves", and no moderator intervenes.
Quote from: MichaelW on May 04, 2009, 10:26:49 PM
Quote
it seems obvious that the counters are coming from the 2 cores
If you think that is so, then try restricting the process to the first core by adding these statements to your source somewhere above the tests:
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
And you might also want to try the second core, specified with an affinity mask value of 2.
I don't think this is the issue, because the source does set the process affinity as suggested, and the timing routine (Petroizki's ptimers.inc) is included in-line and seems to be a single-threaded routine. Thus all of it should run in one process, thread, and core, no?
Perhaps what Dave is seeing is a power-saving feature of his CPU causing errors in the timing resolution, due to clock variance?
Quote: Perhaps what Dave is seeing is a power-saving feature of his CPU causing errors in the timing resolution, due to clock variance?
AFAIK the TSC count should be independent of the clock frequency. Although I can't back this up, I was under the impression that recent processors use a fully static design that can accommodate any clock frequency from zero to the rated maximum, without missing a step. I think it's more likely that some other process is "borrowing" the processor while the test (or reference loop) is running.
well, i see two possibilities:
1) the RDTSC instruction is reading TS counter values from the 2 cores (makes sense)
2) the floating point math used is executing differently on my machine (as in a FP instruction serialization type problem)
(i.e. it is possible a fwait is missing that does not cause trouble until it gets executed on a dual core cpu)
as for power/standby/hibernation, the very first thing i do after rebuilding a drive is turn all that stuff off
always on (desktop) - never turn off drives - never turn off monitor - disable hibernation - i also select screensaver: none
i have all the toys in place to test it
you guys will probably chuckle when i say, "the hard part is displaying the processor type" - lol
i swear - i thought MS was bad
i am half-tempted to copy/paste JJ's CpuID code in there - lol
but, i don't learn anything by doing that
Quote from: jj2007 on May 05, 2009, 01:37:47 PM
Very funny, Sir Hutch. What is your official policy in this forum regarding calling other members (Nightware, myself) idiots? Do you recommend it nowadays officially?
hmm, personally i don't care, everybody can think what they want (plus, it's just words, if you take them all seriously...). now, drinking WATER ? why not MILK ? it's clearly an insult. i will report the author to the moderators, one day... :bg
Quote from: lingo on May 05, 2009, 01:43:44 PM
Everyone (including sick people) can do with my code what they want but when someone tolerate idiotic behavior as a useless registers preservation I can't be quiet.
you mean like preserving ebx/esi/edi when it has been already done by the OS ? :lol
plus, just for info, i don't preserve ecx/edx by default, i ONLY preserve the registers i use/alter (including ebx/esi/edi) AND for MY OWN use. it's MY OWN calling convention, and i perfectly know why i proceed like that. if i don't preach to impose this calling technique it's because i understand that others can see it differently.
however, blindly following Microsoft's recommendations (or an interpretation of those recommendations) isn't very clever... (hmm... maybe one day, if i want to become another sheep, later...).
hmmmm,
> but the whole forum loses credibility if members are insulted as "idiots" and "thieves", and no moderator intervenes.
Seems that bulldozer approach will have to be put in place soon. Why does the "pregnant schoolgirl" image come to mind ? Perhaps Deja Vu (and it seems, ahhhhhhhhh been here before ..... and it makes me wonder [pause] what's goin' on etc .......) (apologies to Crosby Stills Nash and Young).
I am loath to close down a topic that has code in it but sooner or even sooner still, if the PHUKING nonsense does not stop, I will turn Mount Everest into the Utah Salt Flats. I have no feel whatsoever for "camp" melodramas and I don't see that it has a place in a technical forum for programmers. I will not take sides between members in a dispute as silly as this one, I will just pull the plug on it if it continues.
well - there has been a lot of fruitful discussion on this thread, also
hate to see it disappear, as i am using some of it as reference for a current project
Quote from: dedndave on May 05, 2009, 10:25:51 PM
i am half-tempted to copy/paste JJ's CpuID code in there - lol
but, i don't learn anything by doing that
Attached for copy & paste the minimum code, displaying brand string and SSE level, adds only 142 bytes to the exe. It is commented, but reading it together with the Wikipedia description of CPUID (http://en.wikipedia.org/wiki/CPUID) might help. Don't forget to move the PROTO upstairs :thumbu
Output:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)[attachment deleted by admin]
Quote from: NightWare on May 05, 2009, 11:46:33 PM
i don't preserve ecx/edx by default, i ONLY preserve the registers i use/alter (including ebx/esi/edi) AND for MY OWN use. it's MY OWN calling convention, and i perfectly know why i proceed like that. if i don't preach to impose this calling technique it's because i understand that others can see it differently.
I agree 100%. It is a question of personal taste, everybody is free to do whatever he/she likes. My taste is to preserve ecx and edx when I alter them; it costs me 4 bytes and 3 cycles. Not a big "loss" for routines that typically run in hundreds or thousands of cycles, and are often being called more than once in a context where these two registers already serve a purpose and therefore must be saved anyway.
:bg
Thanks JJ
I found this pdf file from Intel, "Intel Processor Identification and the CPUID Instruction"
www.intel.com/Assets/PDF/appnote/241618.pdf
You can save the file as text, or google "241618.pdf" and use google to convert it to HTML, then use the browser to save it as text
It has a very comprehensive program in assembler that IDs Intel CPUs
Then, use this one from AMD, "CPUID Specification" to add a few touches
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25481.pdf
Really, this is beyond the scope of what I wanted to do for CPU identification
I really just want something like your "Short Version" - I don't even care about the clock frequency
In fact, as a minimalist approach, all I really need is how many cores the CPU has
GetProcessAffinityMask tells me how many the system uses - that is really enough for the program to function
Quote from: dedndave on May 06, 2009, 07:40:44 AM
In fact, as a minimalist approach, all I really need is how many cores the CPU has
GetProcessAffinityMask tells me how many the system uses - that is really enough for the program to function
Remember your own post here (http://www.masm32.com/board/index.php?topic=10848.msg82318#msg82318)?
The code gives me the same results:
Quote: CPU family 15, model 4, Pentium 4 Prescott (2005+), MMX, SSE6
Cores 2
... but according to Wikipedia (http://en.wikipedia.org/wiki/Pentium_4#Prescott) the Prescott has only one core ::)
GetProcessAffinityMask sounds promising, though - thanks for the hint. But it also says I have two cores:
SystemAffinityMask: 00000000000000000000000000000011
ProcessAffinityMask: 00000000000000000000000000000011
Any hardware experts around...?
include \masm32\include\masm32rt.inc
.data?
ProcessAffinityMask dd ?
SystemAffinityMask dd ?
buffer dd 10 dup (?)
.code
start:
invoke GetCurrentProcess
invoke GetProcessAffinityMask, eax, offset ProcessAffinityMask, offset SystemAffinityMask
print "SystemAffinityMask: ", 9
invoke dw2bin_ex, SystemAffinityMask, offset buffer
print offset buffer,13,10
print "ProcessAffinityMask: ", 9
invoke dw2bin_ex, ProcessAffinityMask, offset buffer
print offset buffer,13,10
getkey
exit
end start
Hyperthreading ?
Well, I can tell you the Prescott has 2 cores - lol
I re-read the Wikipedia article you linked and it does not really say it has a single core, per se
In that article, they use the term "core" to mean the overall core - not stating it has 2 (or 1, either)
That is odd that you pointed that out JJ - lol - I had read that page earlier this week and had not noticed the omission
I guess I was more interested in the heat issue it mentions
I am an Electronics Engineer, although I avoid the term "expert" because those who use it are usually showing how little they know
In any event, the system affinity mask returned by the GetProcessAffinityMask really tells you what you are able to access
If you had a CPU with 8 cores, but the system only uses 7, 7 is probably all you could use without generating some kind of protection fault
The real authority on how many cores the CPU has is the manufacturer, I suppose
If you use CPUID (and all its whack-a-mole caveats), it will tell you that the Prescott is dual core
Quote from: dedndave on May 06, 2009, 08:55:55 AM
Well, I can tell you the Prescott has 2 cores - lol
You probably have seen them with your own eyes, so I believe you :bg
That Prescott story is truly confusing. Apparently, before the first "true" Dual Core came out, they fumbled two Prescotts on one board and called it "Smithfield" - see Google search (http://www.google.it/search?hl=en&safe=off&client=firefox-a&rls=org.mozilla%3Aen-GB%3Aofficial&hs=g6I&num=50&newwindow=1&q=Prescott+dual+core+smithfield&btnG=Search). Besides, I could not find any clear documentation of the CPUID code that is supposed to tell you the number of cores :(
check this document out JJ - go to the end look at their masm code
www.intel.com/Assets/PDF/appnote/241618.pdf
the prescott - 2 cores right across the middle
now, you can say you have seen them, too - lol
(http://images.sudhian.com/review/cpu/intel/Prescott/prescott_die_8in.jpg)
I am surprised that your BIOS does not tell you the Intel designation for your processor. I have 2 EM64T processors that identify as Prescott, an earlier one that identifies as a Northwood and all of them can handle hyperthreading which you turn off in win2000 as it is not optimised for this technology. Without turning it off in the BIOS, Win2000 reports 2 processors and runs badly with uneven timings.
Quote from: hutch-- on May 06, 2009, 11:24:32 AM
I have 2 EM64T processors that identify as Prescott, an earlier one that identifies as a Northwood
2 questions for you Hutch.....
1) "identify as" - you mean all CPUID programs tell you the EM64T's are Prescotts ?
2) "an earlier one that identifies as a Northwood" - an earlier EM64T ?
They are all Intel processors - they should ID properly - i could understand if they were manufactured by someone else
Perhaps it is that the CpuID programs used are not doing their thing ?
Of course, if you ask the salesman at radio shack, "is it quad-core", he will answer yes - lol
Do you run a 64-bit OS on any of them?
I bet the Vista-64 installer would tell you right away that they are not Prescotts
Dave,
I know their Intel designation from the Intel box they came in. The 6 year old Northwood is the last true 32 bit processor, the EM64T (3.2 and a 3.0 gig versions in 2 separate boxes) were both designated as Prescott. Here is one of many agreeing dumps from some of the toys I have.
Number of CPU(s) One Physical Processor / One Core / One Logical Processor / 64 bits
Vendor GenuineIntel
CPU Full Name Intel Pentium 4 HT
CPU Name Intel(R) Pentium(R) 4 CPU 3.20GHz
CPU Code Name Prescott
Technology 0.09µ
Platform Name LGA775
Type Original OEM processor
FSB Mode QDR
Platform ID 4
Microcode ID 03
Type ID 0
CPU Clock 3193.53
System Bus Clock 798.38
System Clock 199.60
Multiplier 16.00
Original Clock 3200.00
Original Bus Clock 800.00
Original System Clock 200.00
Original Multiplier 16.00
L2 Cache Speed 3193.53 MHz
L2 Cache Speed Full
CPU Family / Model / Stepping F / 4 / 9
Family Extended 00
HyperThreading 2
L1 T-Cache 12 KµOps
L1 D-Cache 16 KB
L2 Cache 1024 KB
RDMSR 00000000 00000000 10120210 00000000
MMX Yes
MMX+ No
SSE Yes
SSE2 Yes
SSE3 Yes
3DNow! No
3DNow!+ No
DualCore No
HyperThreading Yes
IA-64 No
AMD64 No
EM64T Yes
NX/XD Yes
SpeedStep No
PowerNow! No
LongHaul No
LongRun No
Architecture x86
Just got back from a re-boot - I wanted to see what my BIOS said.
I never looked at the CPU part much, other than to notice all the features were enabled.
It says I have a Pentium P4 - on the next line it says "EM64T Capable"
Dang it Hutch - I was all warm and fuzzy with a Prescott - Now I hafta go find out what the hell "EM64T Capable" means :eek
Check this out :-
http://www.mbreview.com/em64t.php
Thanks Neil
Now I am thoroughly confused :dazzled:
From what I can gather, EM64T is a specification, not a processor
Let me look further.......
Yikes! - another "whack-a-mole" definition
notice the last sentence....
The final sub-mode of EM64T is 64-bit mode. As one would likely assume, 64-bit mode is utilized by 64-bit applications when they're run under a 64-bit operating system. Intel has made several key changes to the IA-32 architecture to allow for these 64-bit applications, such as adding support for 64-bit linear addressing. Linear addressing is a scheme that allows access to the entirety of memory with use of a single address, usually loaded in a register or instruction. Variations of the IA-32 architecture may not offer full 64-bit linear addressing, an example being the current 600 series Pentium 4 processors which only allow for 48-bit linear addressing.
I am now googling for "Vista-48" - lol
Unless I can run Vista-64 with limited 48-bit linear addressing, it makes no sense to me.
I have no intention of going to Windows x64. I think that OS will, in the future, be viewed much as we now view OS2.
Truthfully, I have my hands full with 32-bits.
Intel, Microsoft, and the computer manufacturers are in a damn fast hurry to make us believe we need 64 bits.
Anyone who does not believe that they are all working together, is not seeing the big picture.
They want us to scrap out our "old" 32-bit computers, OS's, and software to buy all new stuff.
Hey! Billy! Don't you have enough #$%@#$% money, as it is?
I am happy with 32-bits and, at my age, I don't foresee a need for 64-bits for myself.
I suppose if I was running some high-powered CAD or emulation software, or making CGI graphics for movies, I might want more machine.
They have to realize that "Joe and Jan Sixpack" that want to get their e-mail, edit a few pictures of the kids, and surf the net a bit
do not want, need, nor can they afford more than they already have. Especially after the big corporate CEO's, Bush/Cheney, and all the
other politicians on the planet have sucked the life out of the world economy.
Oops! Sorry for sliding off topic. It is, however, slightly germane to the discussion at hand.
We would not be in here trying to figure out how to optimize/benchmark code on all these platforms,
or just identify those platforms, if the issues I mentioned did not affect us.
We have to work harder in order to accommodate "the conspiracy".
P.S. I am an old guy who still thinks the 8088 was a powerhouse. Compared to what we had beforehand, it was.
I do, however, think they may have reached a point of diminishing return with respect to the "average" consumers' needs.
Dave, the more you say, the more I like you. :bg
Quote from: dedndave on May 06, 2009, 02:32:58 PM
...Variations of the IA-32 architecture may not offer full 64-bit linear addressing, an example being the current 600 series Pentium 4 processors which only allow for 48-bit linear addressing.
I am now googling for "Vista-48" - lol
I think where some of this ambiguity comes from is that not all bits must be used in an instruction or addressing width. I.e., a 64-bit memory "pointer" may only have 48 bits actually used or implemented, which will give a lot more addressable space than a 32-bit pointer without using the full 64 bits. It is ambiguous, indeed.
Quote
Intel, Microsoft, and the computer manufacturers are in a damn fast hurry to make us believe we need 64 bits.
Well, they are just "propagating the market." If the OS and program size keeps growing exponentially, then the user will be eternally forced to upgrade hardware to keep up... ingenious, if not borderline shady business practice...
Quote
P.S. I am an old guy who still thinks the 8088 was a powerhouse. Compared to what we had beforehand, it was.
I do, however, think they may have reached a point of diminishing return with respect to the "average" consumers' needs.
Indeed. I think a lot of us view the older hardware in such a positive light because it was so simple and manageable that it was elegant. Granted, segmented memory and interrupt tables look nasty, but the instructions and processor itself were nice and concise. Shaving off a few bytes and clocks made amazing improvements, and there was an immediate "reward" for the time invested coding meaningful, good code. However nowadays, a CPU has 4MB of third-level cache and four cores with out-of-order preemptive pipelining or some other crap... making not only programming it effectively a nightmare, but the "golden tweaks" of yesteryear far less valuable or even noticeable.
In short, the hardware is adapting to software bloat, instead of vice-versa. ::)
I think that is totally true.
I also think that XP, for example, is way more of an OS than is required for many day-to-day tasks.
Back in the days, we could do a DIR on a larger directory and get a feel for how fast the machine was.
If the entire list of files could be read, one by one as they went by, it was a 4.77 MHz 8088 - lol.
If the list scrolled off the screen in a flash of green light, you had a fast machine.
The operating systems in use today have so much overhead.
The machine I am typing on now is fairly decent, and I am pleased with its performance.
But, not without some tweaking and tuning of the OS.
If I were to (or even could - don't think it would work - no drivers) put DOS 3.3 on this machine, it would fly! - lol
To bring the point a little closer to home, Windows 95 was a very fast OS.
Windows 98, a little less so, but it had many desirable feature upgrades.
Those OS's would blaze on this hardware, but I doubt I could get a complete set of device drivers.
Each OS that comes out is more and more demanding on the hardware.
Where you can really see "the conspiracy", though, is in things like XP support for newer hardware.
It is slowly trickling away. The hardware manufacturers are well aware of their mortality if they continue to support XP.
They actually have to change the hardware in some way so that XP will not run correctly or completely.
This ensures that Vista (a much slower OS) will be used.
Which, in turn, assures the obsolescence of XP.
Microsoft happily bats the birdie back over the net and assures that each OS requires better, newer, faster hardware to run on.
It all comes out of our pocket, and is part of the reason that the wealthiest 2% of the population have 98% of the money.
One final note - then back to the technical stuff on benchmarking......
Notice how Vista-64 will not run in real mode, making 16-bit code obsolete. MS/Intel/AMD did not have to do that, if they didn't want to.
At the same time, Windows XP-64 is just crippled enough to make it unwanted.
This is one example that really makes me believe that Microsoft, Intel, AMD, the computer manufacturers, and perhaps some of
the larger software companies (other than MS) get together for an annual or bi-annual "meeting" on someone's yacht to discuss how
they are going to slowly walk the consumers away from XP into something they don't really need.
I did not hear the consumers hollering for a 64-bit CPU and/or OS to begin with.
Quote from: dedndave on May 06, 2009, 05:51:01 PM
If I were to (or even could - don't think it would work - no drivers) put DOS 3.3 on this machine, it would fly! - lol
You probably could boot DOS, I was running a benchmark, and
did it on my "newest" machine. Made a nice RAM disk 'cause the
hard disks were too big. And no network. And no sound. But I
got some real nice numbers from my nice little graphics program.
Cheers,
Steve N.
P.S. I would go with DOS 6.x, the last couple of times dealing with
DOS 3.1 were not pretty. <g>
Quote from: Mark Jones on May 06, 2009, 05:14:52 PM
Granted, segmented memory and interrupt tables look nasty, but the instructions and processor itself were nice and concise.
LOAD_ARRP: MOVE.W D4,P_FE0(A6)
MOVE.L (A0),A0 *A0=Arrptr
ADDQ.L #4,A0 *A0=first descriptor
MULU #6,D4
ADD.L D4,A0 *A0=descriptor P_fe%
MOVE.L A0,P_FE_A(A6) *save for backtrailer
MOVE.L D0,D4
RTS
I used to write my own screen and printer drivers in 68000 assembler. At that time, late 80's, people who were fumbling with segments were considered anachronists :bg
Just ran my old word processor on an emulator. A factor 10 faster than on the real thing, at least
That code looks like a foreign language to most in here JJ - lol
May as well be Greek (or Italiano)
I haven't seen 68000 code since i worked at Edge Computer - mid 80's
I do not miss segmentation, much
I do miss interrupts for hardware detection, though
I haven't figured out how all that works on 32-bit yet
One step at a time...
ok, Hutch
What you and I have (I think the same) are Prescott dual-core CPU's. Mine is a model 630.
They are capable of running Vista 64. I am not sure if we have the 48-bit address or 64-bit address capabilities.
Personally, I can't imagine my needing to breach the first one, as it allows for more memory than my motherboard will allow.
One thing to note; The motherboard, as well as the CPU, must be able to support Vista 64. This is probably why determining
whether or not a particular machine is capable is a bit fuzzy. As for my own m/b, it is an Intel, which made that determination
a little easier.
Now - here is what will excite some of us. By poking around on the Intel site, I managed to find an Intel CPU ID program.
They also have a utility to measure frequencies. Following is a link to a page with both downloads.
Intel CPU ID Program
http://support.intel.com/support/processors/sb/CS-015477.htm
Dave,
This has been the most useful toy in the processor identification area that I have found. There are others that work just as well.
http://www.cpuid.com/cpuz.php
Here is the official AMD CPU ID program.
Personally, I would think if you want to ID an Intel CPU, who better to trust than Intel.
If you want to ID an AMD CPU, who better to trust than AMD.
The manufacturer may more completely identify a certain CPU's features.
AMD CPU ID Program
http://support.amd.com/us/Pages/dynamicDetails.aspx?ListID=c5cd2c08-1432-4756-aafa-4d9dc646342f&ItemID=132
Intel CPU ID Program
http://support.intel.com/support/processors/sb/CS-015477.htm
btw Hutch - I like that one too - it tells me a few things about my m/b I was not aware of.
Quote from: Jimg on May 02, 2009, 04:40:15 PM
Also, in the real world, (lodsb) is slightly slower than (movzx eax,[esi]/add esi,1), however, also in the real world, and how the code is usually used, the difference is insignificant because the instructions end up in a non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.
Jim,
Out of boredom, I just replaced a movzx eax,[esi] with a lodsb - and it runs over 20 cycles faster. It's not even a speed-critical loop, but it consistently makes a difference. See UseLodsb switch in new attachment here (http://www.masm32.com/board/index.php?topic=9370.msg84533#msg84533).
push esi
lea ebx, [esi+ecx-16] ; Mainstring (only ebx is free)
mov esi, [esp+6*4+3*4+4] ; lpPattern
add ebx, ebp
if UseLodsb
dec esi
endif
@@:
if UseLodsb
lodsb ; ca. 25 cycles faster!
else
inc esi
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
endif
test al, al ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx" would crash
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
@@: pop esi
Just from looking at the above post, your two possibilities don't seem equivalent.
Quote:
if UseLodsb
add ebx, ebp
dec esi
@@:
lodsb ; ca. 25 cycles faster!
test al, al ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx" would crash
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
else
add ebx, ebp
@@:
inc esi
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
test al, al ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx" would crash
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
endif
I could be missing something, but it looks like you start two bytes earlier when using lodsb?
Quote from: Jimg on May 09, 2009, 02:04:07 PM
Just from looking at the above post, your two possibilities don't seem equivalent.
...
I could be missing something, but it looks like you start two bytes earlier when using lodsb?
You are perfectly right, Jim :red
In addition, I found a bug that shows up for certain unusual patterns, so I better pull back that code until it's correct.
Looking at your latest-
if UseLodsb
add esi, 6
else
add esi, 5
endif
@@:
if UseLodsb
lodsb ; some cycles faster!
else
inc esi
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
endif
test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx" would crash
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
lodsb loads and then increments esi, so esi is pointing at the next character.
for your non lodsb code, you increment first then load, so esi is pointing at the current character.
therefore, the lodsb code will not end at the same esi as the non-lodsb code.
Does this matter?
If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]" so the movzx doesn't have to wait for the inc esi to be done incrementing. Should pick up a cycle.
If it does, you'll have to decrement esi later for the lodsb code.
Yes?
Quote from: Jimg on May 10, 2009, 01:43:12 AM
Does this matter?
If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]"
Jim,
Good point, thanks. It doesn't matter because I pop esi soon after, but it makes the code more readable.
add ebx, ebp
add esi, 6
@@:
if UseLodsb
lodsb ; some cycles faster!
test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx" would crash
else
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
je @F
inc esi
endif
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
@@: pop ebx
pop esi
jne BadLuck ; test al, al not needed, flags still valid
or simply
add esi, 6
@@:
if UseLodsb
lodsb ; some cycles faster!
else
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
inc esi
endif
test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
je @F
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B
Oops, that should have been
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
test al, al
je @F
inc esi
My version is about half a cycle faster because it tests for zero before the inc esi :wink
JJ,
See if "test eax, eax" is faster than the byte test.
Quote from: hutch-- on May 10, 2009, 03:08:19 PM
JJ,
See if "test eax, eax" is faster than the byte test.
Not on a Celeron M (code below):
664 cycles for 1000* test 1000*reg32, reg32
664 cycles for 1000* test reg8, reg8
0.7 cycles for an algo that needs about 2500, so "full algo" testing is simply impossible.
But I'll replace it, thanks - test eax after a movzx eax simply looks better.
The limits are elsewhere anyway. I have tested three variants now for the SSE2 Instring:
1. word scanner in SSE2, 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
2. byte scanner in SSE2, if match: word scanner, then 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
3. byte scanner in SSE2, 3 more bytes through cmp edi, [mem+1]: 2470 cycles on average
See the problem? :bg
No. 2+3 get really complex, so I'll stick with the word scanner posted in the laboratory (http://www.masm32.com/board/index.php?topic=9370.new#new).
Of course, No. 2+3 are blazing fast if you are searching for a pattern starting with a rare letter like X, but that is the exception...
Quote:
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
LOOP_COUNT = 1000000
.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 250
test eax, eax
test ebx, ebx
test ecx, ecx
test edx, edx
ENDM
counter_end
print str$(eax), 9, "cycles for 1000* test reg32, reg32", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 250
test al, al
test bl, bl
test cl, cl
test dl, dl
ENDM
counter_end
print str$(eax), 9, "cycles for 1000* test reg8, reg8", 13, 10
inkey chr$(13, 10, "--- ok ---", 13)
exit
end start