The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: Neil on May 01, 2009, 10:56:52 AM

Title: Which is faster?
Post by: Neil on May 01, 2009, 10:56:52 AM
I've been looking at Mark's optimisation webpage. Am I correct in thinking that this code :-

               movzx eax, BYTE PTR [esi]
               inc esi                              ;or maybe add esi,1?

is faster than this:-

                mov al,[esi]                     ;lob
                inc esi                             ;Macro
                and eax,00000000000000000000000011111111b
Title: Re: Which is faster?
Post by: jj2007 on May 01, 2009, 12:24:11 PM
On a Celeron M, inc and add yield equal timings:

96      cycles for 100*movzx, inc esi
96      cycles for 100*movzx, add esi, 1
396     cycles for 100*mov al

Test it yourself:
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 1000000
.data
MainString db "This is a long string meant for testing the code", 0

.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
mov esi, offset MainString
REPEAT 100
movzx eax, BYTE PTR [esi]
inc esi
ENDM
counter_end
print str$(eax), 9, "cycles for 100*movzx, inc esi", 13, 10, 10 ; --------- end traditional way ---------

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
mov esi, offset MainString
REPEAT 100
movzx eax, BYTE PTR [esi]
add esi, 1
ENDM
counter_end
print str$(eax), 9, "cycles for 100*movzx, add esi, 1", 13, 10, 10 ; --------- end traditional way ---------

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- the traditional way ---------
mov esi, offset MainString
REPEAT 100
mov al, [esi] ;lob
inc esi ;Macro
and eax,00000000000000000000000011111111b
ENDM
counter_end
print str$(eax), 9, "cycles for 100*mov al", 13, 10, 10 ; --------- end traditional way ---------
inkey "--- ok ---"
exit
end start
Title: Re: Which is faster?
Post by: Neil on May 01, 2009, 12:46:04 PM
This is what I got:-

95     cycles for 100*movzx, inc esi

95     cycles for 100*movzx, add esi,1

371    cycles for 100*mov al

So inc & add are the same & the first method is much quicker than the second.
Thanks JJ  :U

Title: Re: Which is faster?
Post by: hutch-- on May 01, 2009, 01:09:14 PM
Neil,

Whether INC or ADD REG, 1 is faster depends on the processor hardware. On the PIV family ADD is faster; on most other hardware INC is faster. As most speed issues are related to memory access speed, you may not need to lose any sleep over which one you choose. Go over the algo and reduce any memory accesses that you can and you may see it go faster; twiddling between INC and ADD will very rarely ever give you any useful difference.
Title: Re: Which is faster?
Post by: Neil on May 01, 2009, 02:50:04 PM
Thanks hutch, I'm going to stick with inc, it's quicker to type :bg
Title: Re: Which is faster?
Post by: Jimg on May 01, 2009, 03:22:26 PM
Not to mention 1/3 the size!
Title: Re: Which is faster?
Post by: dedndave on May 01, 2009, 03:58:29 PM
i must be missing sumpin - lol

     LODSB
Title: Re: Which is faster?
Post by: Mark Jones on May 01, 2009, 04:47:32 PM
Neil, generally INC/DEC are considerably faster than ADD/SUB on the AMD Athlon processors.

As always, timing the code is the best bet. Of course, determining this requires one to actually own these processors. Too bad there isn't some service out there which could time code snippets on all major processor types. (Or a relative comparison of processor instruction latency between all the major brands.)
Title: Re: Which is faster?
Post by: Neil on May 01, 2009, 05:38:50 PM
Thanks for that Mark, my test was done on an Intel processor but I have a spare computer with an Athlon processor, I'll fire it up tomorrow & see what the test results are on that.
Title: Re: Which is faster?
Post by: jj2007 on May 01, 2009, 05:51:54 PM
Quote from: dedndave on May 01, 2009, 03:58:29 PM
i must be missing sumpin - lol

     LODSB

Sorry :bg

Quote
96      cycles for 100*movzx, inc esi
364     cycles for 100*lodsb

Generally, the lods, scas, movs etc stuff is a bit slow - with one exception: rep movsd is blazingly fast for aligned memcopies, see inter alia this post by Hutch (http://www.masm32.com/board/index.php?topic=6427.msg47991#msg47991). I use lodsb if speed is not important.
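For illustration, a minimal sketch of such an aligned rep movsd copy (src, dst and nBytes are made-up names, nBytes is assumed to be a multiple of 4, and this is not the code from the linked post):

    cld                          ; copy forward
    mov esi, offset src          ; source buffer (dword aligned)
    mov edi, offset dst          ; destination buffer (dword aligned)
    mov ecx, nBytes
    shr ecx, 2                   ; bytes -> dwords
    rep movsd                    ; copy ecx dwords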
Title: Re: Which is faster?
Post by: dedndave on May 01, 2009, 08:41:33 PM
ahhhhh - that is good to know
i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Mark's page
btw - REP LODS doesn't make much sense - lol
i don't think i have ever used that
Title: Re: Which is faster?
Post by: Jimg on May 01, 2009, 10:48:08 PM
It really depends upon how you write the test.  This test uses repeat 1000, and only does it once.  lodsb is 3 times faster on my AMD, 4 times faster on my Celeron and about 15% slower on my 1.8GHz Pentium M

[attachment deleted by admin]
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 02:59:49 AM
trying to locate Mark's page
heliosstudios says i don't have permission to access - is that the one ?
Title: Re: Which is faster?
Post by: MichaelW on May 02, 2009, 03:24:46 AM
http://heliosstudios.net/index.html.disabled
Title: Re: Which is faster?
Post by: Mark Jones on May 02, 2009, 04:09:21 AM
What's that? Oh that page is so antiquated, was started and never completed (like so many other things in my life, sigh.)

I thought you were talking about Mark Larson's page. That has some useful stuff on it. :bg
Title: Re: Which is faster?
Post by: jj2007 on May 02, 2009, 05:18:38 AM
Quote from: Jimg on May 01, 2009, 10:48:08 PM
It really depends upon how you write the test.  This test uses repeat 1000, and only does it once.  lodsb is 3 times faster on my AMD, 4 times faster on my Celeron and about 15% slower on my 1.8GHz Pentium M

Jim,
3968    cycles for lodsb
997     cycles for 100*movzx, inc esi
997     cycles for 100*movzx, add esi, 1
4017    cycles for 100*mov al
3993    cycles for lodsb


This is your test, but with LOOP_COUNT = 1000000. MichaelW has written the timing routines, and is in a much better position to explain what happens if you reduce the outer count to 1. I had written the REPEAT 1000 because esi is being increased when doing a lodsb - doing that a million times may have undesired side effects :wink

You can do a rougher test by allocating a large buffer for the esi memory access, and then simply use GetTickCount with a large ncnt.
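Something like this, for example (a rough sketch only - the buffer size, the loop count and the masm32rt.inc environment are my assumptions, not tested code from this thread):

    invoke GlobalAlloc, GMEM_FIXED, 100000000   ; buffer far bigger than any cache
    mov esi, eax
    invoke GetTickCount
    mov ebx, eax                 ; start time in ms
    mov ecx, 100000000           ; large ncnt
  @@:
    movzx eax, BYTE PTR [esi]    ; the code under test
    inc esi
    dec ecx
    jnz @B
    invoke GetTickCount
    sub eax, ebx                 ; elapsed milliseconds
    print str$(eax), " ms", 13, 10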
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 11:19:34 AM
i think i am looking for Mark Larson's page - lol
i am looking for the page that has code optimization by Mark - lol
the one that Neil was referring to in the first post of the thread
any help ?
Title: Re: Which is faster?
Post by: Neil on May 02, 2009, 11:31:13 AM
It's at the top right of the page, under Forum Links & Websites :U
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 12:48:39 PM
that page really tells me that i have much to learn - lol
i was fairly proficient at writing fast code for the 8088
in those days, we also went for small code - not as important now - this simple fact will really change how i write code
i would say that half of what applied then still applies (per Mark's page)
half is totally different - even reverse
that is the worst case for my learning curve
having to remember which half it is in is half the battle - lol
Title: Re: Which is faster?
Post by: Rainstorm on May 02, 2009, 01:20:37 PM
Quote
i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Mark's page

I think hutch has mentioned this stuff in his help file about the string instructions (other than rep) being slower.
Title: Re: Which is faster?
Post by: jj2007 on May 02, 2009, 01:36:49 PM
Mark's page (http://www.mark.masmcode.com/) is full of really useful hints, but don't forget his advice to time the code. On top of The Laboratory, you find the code timing macros (http://www.masm32.com/board/index.php?topic=770.0) - extremely useful.

Two minor points re Mark's page:
- Point 6, PSHUFD faster than MOVDQA: On my Celeron M, MOVDQA is a lot faster
- mmx: If you can, go for xmm instead. SSE2 is in general faster, and it avoids trashing the FPU. Using the FPU is not important for everybody, but it offers high precision at reasonable speeds - and the mmx instructions destroy the content of the FPU registers. Combined mmx/FPU code is really, really slow.
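As a minimal illustration of why that hurts: the mmx registers alias the FPU registers, so if mmx code has to run anywhere near FPU code, an emms is needed before the next FPU instruction (fragment for illustration only):

    movq mm0, qword ptr [esi]   ; any mmx work marks the FPU tag word as in use
    ...
    emms                        ; reset the tag word
    fld dword ptr [edi]         ; only now is FPU code safe again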
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 02:34:25 PM
Precisely my point.  It depends upon how you test it.
If you loop a million times, it's a good test of looping a million times.
If you execute a million copies of the instruction once, it's a good test of the instruction.
What do you want to test, looping or instruction time?
In normal code, you are calling this proc or that proc and none of them are in the cache.  To test instruction timing, you have to test code that is not in the cache, just the way it is executed most of the time.  Looping a million times is just silly.  It's just a test of the size and speed of the cache on a particular machine, not the code.
If you're going to loop, time it properly.  The current timing macros do not.
Time each loop separately. Pick the fastest one.  That's the fastest the code can run.
Or pick the MEDIAN.
If you print out the times for each loop, you will see half a dozen of them that are hundreds or even thousands of times larger than the norm.
Doing an average including these, where windows goes off and does its thing, is not in any regard a test of the time it takes any particular piece of code to run.
Doing an average is just silly.
If you want to test real world, duplicate the code many times and time that once.
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 02:59:12 PM
i dunno Jim

seems to me that the code that i really want to optimize is that which is inside loops - that means it is in the cache
(provided it is a short enough loop - not some long abortion)
much of the code that is executed once in a program should be written for clarity and small size - not speed

however, if you generate 1000's of the same instruction, and execute them, most of them get cached also
you also measure the time required to re-load the cache every so often

i agree that you do not want to time the loop itself
it seems to me a practical method is a compromise
instead of executing the same instruction inside a loop 1000 times
or generating 1000 copies of the instruction, then executing it
make a loop of 100 copies, execute it 10 times, then subtract some agreed-to standard overhead time for the loop
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 03:39:39 PM
Exactly.  If you want to know how fast a particular chunk of code executes, e.g.  where it normally loops 100 times, then test the chunk of code looping 100 times.  If you ask "which instruction is fastest", then test the instruction, not looping.  And in both cases, don't do it a million times and average in windows doing housekeeping.
Do what you want to test. If that is normally a loop that executed 36 times, then time how long it takes to do the loop 36 times.
The time it takes is the time it takes.  If you want multiple samples, do the test again, timing it each time.  Pick either the first time (most realistic), the fastest, or the median.  Not the blooming average.
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 04:01:13 PM
also - those measured times you mentioned - when windows is doing housecleaning and other programs are executing
statistically, those data points should be thrown out altogether
nor is the fastest time going to be perfect, either
if you have a set of data points like this.....

1  10
2  13
3  11
4  19
5  14
6  12
7  11
8  12
9  17

it is good practice to toss out points 4 and 9, and take the average of the rest
even though you may be measuring other things, it is a good PRACTICAL representation of the real-world
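(With the numbers above: tossing 19 and 17 leaves 10+13+11+14+12+11+12 = 83, and 83/7 is roughly 11.9, versus a raw average of 119/9 = 13.2.)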
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 04:15:47 PM
I don't think the average is ever a good value, it is too prone to windows effects.  If you examine real world values, the median is almost always smaller than the average, unless the code under test always triggers some kind of windows event, in which case, timing it is irrelevant.   If you don't want to take the fastest, take the median.  Or throw away the slowest half and average what's left.  Do something to get rid of windows effects.
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 04:40:15 PM
Also, in the real world, (lodsb) is slightly slower than (movzx eax,[esi]/add esi,1); however, also in the real world, and in how the code is usually used, the difference is insignificant because the instructions end up in a non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.

Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 04:52:48 PM
I understand what you are saying, Jim.
The fastest time may well represent the most accurate measurement of the instruction, itself.

However, the problem is that it may not, as well.
In other words, if we run instructionA and happen to hit its best time.
Then, we run instructionB and do not happen to hit its best time.
We have invalidated the comparison of the two.

But if we take an average of the two instructions as mentioned above,
we are likely to obtain more useful comparison information.
Let's face it, we really do not want to know how many nanoseconds each takes,
rather, we want to know which of the two is performing the best.
By averaging a set of values, we take into account that we may or may not
have measured the best performance time of each instruction.
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 06:38:52 PM
Quote
By averaging a set of values, we take into account that we may or may not have measured the best performance time of each instruction.
I agree with everything you've said up to that point.  Average will never make a measurement better.
Take a median.  Take a standard deviation.  The average is still way too prone to other effects.  Or if you must average, average only the lower half of the results.
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 06:44:30 PM
Well, at least we agree to disagree - lol

Truthfully, I doubt there would be much difference in the two methods.
Although you may see a static difference between the two methods, the
comparison ratios of two instructions would be very nearly the same,
assuming we tossed out the same data points.

i.e. Even though...
my method might say 100 cycles for instructionA and 120 cycles for instructionB (20% change)
your method might yield 90 and 108 (20% change)
Title: Re: Which is faster?
Post by: jj2007 on May 02, 2009, 06:53:09 PM
Quote from: Jimg on May 02, 2009, 02:34:25 PM
In normal code, you are calling this proc or that proc and none of them are in the cache.  To test instruction timing, you have to test code that is not in the cache, just the way it is executed most of the time.  Looping a million times is just silly.

You might want to see the Timings and Cache (http://www.masm32.com/board/index.php?topic=11036.msg81254#msg81254) thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course :bg
Title: Re: Which is faster?
Post by: dedndave on May 02, 2009, 07:10:51 PM
A method that might make us both happy would be to set up a dynamic threshold.
Make measurements (some minimum number) and keep track of minima and maxima along the way.
Calculate a threshold at some arbitrary level, say 0.15 x (max-min) + min.
Once you have acquired 20 data points below that threshold, stop the measurement and calculate the average of those points.
This method would be very repeatable.
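Something along these lines (a rough sketch only - it post-processes an already collected array instead of stopping at 20 points, the times[] and NUMPTS names are made up, the 0.15 factor is done as 15/100 in integer arithmetic, and cmov needs .686):

    mov esi, offset times       ; array of DWORD cycle counts
    mov ecx, NUMPTS
    mov eax, [esi]
    mov ebx, eax                ; ebx = running minimum
    mov edx, eax                ; edx = running maximum
  scan:
    mov eax, [esi]
    cmp eax, ebx
    cmovb ebx, eax
    cmp eax, edx
    cmova edx, eax
    add esi, 4
    dec ecx
    jnz scan
    sub edx, ebx                ; max - min
    imul edx, edx, 15
    mov eax, edx
    xor edx, edx
    mov ecx, 100
    div ecx                     ; eax = 0.15*(max-min)
    add eax, ebx                ; threshold = min + 0.15*(max-min)
    mov edi, eax
    mov esi, offset times       ; second pass: average the points at or below the threshold
    mov ecx, NUMPTS
    xor eax, eax                ; sum
    xor ebx, ebx                ; count of points kept
  keep:
    cmp [esi], edi
    ja  skip
    add eax, [esi]
    inc ebx
  skip:
    add esi, 4
    dec ecx
    jnz keep
    xor edx, edx
    div ebx                     ; eax = average of the kept points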
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 07:31:00 PM
Sounds good.  Have to try it out to see.  The key point is that you are timing each loop, not a million loops and dividing the result by a million.

Quote
You might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes!  :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy
Title: Re: Which is faster?
Post by: jj2007 on May 02, 2009, 08:15:37 PM
Quote from: Jimg on May 02, 2009, 07:31:00 PM
Sounds good.  Have to try it out to see.  The key point is that you are timing each loop, not a million loops and dividing the result by a million.

Quote
You might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes!  :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy

That sounds promising, but maybe I have not fully understood what you want - my apologies. How often would you time, for example, a BitBlt loop for a 1920*1600 screen? I mean: Not just choosing the fastest available Microsoft API, but rather rewrite the code that seems to be the bottleneck. You must have gathered some specific experience in this thread (http://www.masm32.com/board/index.php?topic=10421.msg77198#msg77198).
Title: Re: Which is faster?
Post by: Jimg on May 02, 2009, 08:51:38 PM
Quote
How often would you time, for example, a BitBlt loop for a 1920*1600 screen?
Strangely enough, just yesterday.  Trying to find how to decrease the time it takes to generate a particular screen, I found that 1/3 of the time was taken up by clearing the background.  So I tested-
    tryclr = 1
    starttimetest 7
    if tryclr eq 1
        inv PatBlt,pi.td,0,0,mwidth,mheight,WHITENESS   ; let GDI clear the background
    else
        mov edi,dibh1.bits          ; destination: the DIB bits
        mov ecx,mwidth
        imul ecx,mheight            ; dwords to fill
        mov eax,0ffffffh            ; white
        rep stosd                   ; clear it ourselves
    endif
    endtimetest 7

it turns out the first takes around 15300 clicks, and the second around 15150 clicks.  So even though it's faster, it's not worth the effort.
That would be 150/45000, about 0.3%


This is the code I use for testing various sections-
    .data?
    align 8
        strtime dq ?,?,?,?,?,?,?
        endtime dq ?,?,?,?,?,?,?
        elapsed0 dd ?
        elapsed1 dd ?
        elapsed2 dd ?
        elapsed3 dd ?
        elapsed4 dd ?
        elapsed5 dd ?
        elapsed6 dd ?
        elapsed7 dd ?
    .code
   
    starttimetest macro testnum
        if DoDebug
            inv QueryPerformanceCounter,addr [strtime + testnum*8]
        endif
    endm
   
    endoftest proc testnum
        push esi
        mov esi,testnum
        inv QueryPerformanceCounter,addr [endtime+esi*8]
        finit
        fild qword ptr [endtime+esi*8]
        fild qword ptr [strtime+esi*8]
        fsub
        fist dword ptr [elapsed0+esi*4]
        pop esi
    ret
    endoftest endp   
   
    endtimetest macro testnum
        if DoDebug
            inv endoftest,testnum
        endif
    endm

.
.
.
    if DoDebug
        printxa "  qdt=",dd elapsed1,32,dd elapsed0,32,dd elapsed6,32,dd elapsed7
    endif
    inv SetWindowText,hWin,mbuff


It may not be as precise as doing a cpuid and rdtsc, but in the real world, it is more than sufficient.
Title: Re: Which is faster?
Post by: jj2007 on May 02, 2009, 09:40:41 PM
Quote from: Jimg on May 02, 2009, 08:51:38 PM
it turns out the first takes around 15300 clicks, and the second around 15150 clicks.  So even though it's faster, it's not worth the effort.
That would be 150/45000 = .003%

Sure, rep stosd is one of the exceptions where there is nothing to optimise, as you know (http://www.masm32.com/board/index.php?topic=6576.msg63583#msg63583) :bg
Title: Re: Which is faster?
Post by: Mark Jones on May 02, 2009, 11:09:51 PM
I recall from the many, many pages of the code timing thread a general convergence towards the fastest loop timing also being ideal. Does not Petezold's PROCTIMERS use the lowest cycle count of 10,000 iterations? (Much fewer iterations, and the timing results are rock-steady.)

In my experience, Petezold's timer is very good.
Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 05:21:49 AM
Quote from: Mark Jones on May 02, 2009, 11:09:51 PM
I recall from the many many pages of the code timing thread, a general convergence towards the fastest loop timing also being ideal.

In my spare time, I am still trying to improve the Instr algo, and I stumble all the time over outliers such as the one marked below:
Quote
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Timings:
(_imp__strstr=crt_strstr, InstrCi=my non-SSE version, InString=Masm32 library; TestSub?=Easy, Difficult, Xtreme)
7345    _imp__strstr, addr Mainstr, addr TestSubE
8551    _imp__strstr, addr Mainstr, addr TestSubD
12789   _imp__strstr, addr Mainstr, addr TestSubX
8151    InstrCi, 1, addr Mainstr, addr TestSubE, 0
8314    InstrCi, 1, addr Mainstr, addr TestSubD, 0
10836   InstrCi, 1, addr Mainstr, addr TestSubX, 0
7458    InString, 1, addr Mainstr, addr TestSubE
9767    InString, 1, addr Mainstr, addr TestSubD
13001   InString, 1, addr Mainstr, addr TestSubX

1866    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1870    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1868    InstrJJ, 1, addr Mainstr, addr TestSubE, 0

1881    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1884    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1882    InstrJJ, 1, addr Mainstr, addr TestSubD, 0

3814    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3727    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3830    InstrJJ, 1, addr Mainstr, addr TestSubX, 0

Average cycle count:
2513     InstrJJ
10075    MasmLib InstringL
InstrJJ : InstringL = 24 %
Code size InstrJJ=366

I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider eliminating both fast and slow outliers, but it would require some overhead in \masm32\macros\timers.asm - such as 100 loops to calculate the expected average before starting the main exercise...?
Title: Re: Which is faster?
Post by: hutch-- on May 03, 2009, 06:20:32 AM
 :bg

Now do you understand why I only ever test in real time ?
Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 06:33:45 AM
Quote from: hutch-- on May 03, 2009, 06:20:32 AM
:bg

Now do you understand why I only ever test in real time ?

Hutch, you want to provoke me to write "Now I understand why your code is so slow". But nope, I will not let you provoke me, and I will not write such nasty things about you!!!

:bg
Title: Re: Which is faster?
Post by: MichaelW on May 03, 2009, 07:23:08 AM
JJ,

The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts. The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count. The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur. Here are the typical results of the code from the attachment running on my P3:

441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271


Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.
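For what it's worth, a bare-bones sketch of that lowest-count idea (this is not the actual counter2 macro code - Sleep(0) as a crude way of starting near a fresh time slice, and the pass count, are my own assumptions):

    mov edi, -1                 ; running minimum, start at the largest dword
    mov ebx, 1000               ; number of passes
  nextpass:
    invoke Sleep, 0             ; give up the rest of the current time slice
    rdtsc
    push eax                    ; low dword of the start count
    ; --- code under test goes here ---
    rdtsc
    pop ecx
    sub eax, ecx                ; cycles for this single pass
    cmp eax, edi
    jae @F
    mov edi, eax                ; keep the lowest count seen so far
  @@:
    dec ebx
    jnz nextpass
    print str$(edi), 9, "lowest single-pass count", 13, 10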


[attachment deleted by admin]
Title: Re: Which is faster?
Post by: hutch-- on May 03, 2009, 07:24:43 AM
 :bg

You should know by now that these things leave me sitting up at night, wringing my hands in between wiping the tear stains from my face while losing sleep about it. Further, I have dismally failed to write the world's fastest MessageBoxA() after 20 years of trying and, to cap it off, I still can't get SSE4.5 to run on a 486. Such may be the case with matters of such great importance, but when it comes to timing an algo I have done it the right way for many years: design the test/timing method to fit the task, then make it as big as you can fit in memory and bash it long enough to reduce the variations to below 1%. Intel specs 3%, but true fanaticism requires better.  :bdg
Title: Re: Which is faster?
Post by: lingo on May 03, 2009, 12:49:30 PM
The mad thievish gipsy use part of my strlen code from here (http://www.masm32.com/board/index.php?topic=1807.240)  to produce lame slow code and  is shameless to post it everywhere...  :bdg
No offense, but who uses .if .elseif .else or preserves ecx and edx in speed critical algos? IMO idiots in assembly, and here is the result:   :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ:                         38 ; clocks: 99
InString - Lingo:                      37 ; clocks: 39

Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ:                         1008 ; clocks: 22567
InString - Lingo:                      1007 ; clocks: 6294

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 712
InString - Lingo:                      1008 ; clocks: 502

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 6600
InString - Lingo:                      1008 ; clocks: 1418

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 5426
InString - Lingo:                      1008 ; clocks: 1308

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 629
InString - Lingo:                      1008 ; clocks: 498

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ:                         1010 ; clocks: 625
InString - Lingo:                      1009 ; clocks: 502

Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ:                         0 ; clocks: 781
InString - Lingo:                      1001 ; clocks: 102

Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ:                         1002 ; clocks: 611
InString - Lingo:                      1001 ; clocks: 512

Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ:                         1002 ; clocks: 625
InString - Lingo:                      1001 ; clocks: 435

Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ:                         1002 ; clocks: 635
InString - Lingo:                      1001 ; clocks: 496

Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ:                         1002 ; clocks: 795
InString - Lingo:                      1001 ; clocks: 638

Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ:                         1127625 ; clocks: 679836
InString - Lingo:                      1127624 ; clocks: 543385

Press ENTER to exit...


Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 01:11:14 PM
Quote from: lingo on May 03, 2009, 12:49:30 PM
IMO idiots in assembly

Lingo, post code, not insults.
Title: Re: Which is faster?
Post by: lingo on May 03, 2009, 02:23:31 PM
Hutch, is it possible to add an icon that says "middle finger"
something like this ?  :lol
(http://upload.wikimedia.org/wikipedia/commons/thumb/3/36/The_gesture02.jpg/180px-The_gesture02.jpg)

Title: Re: Which is faster?
Post by: hutch-- on May 03, 2009, 02:33:31 PM
 :dazzled:

It must be the silly season, everybody seems to be unhappy.  :boohoo:
Title: Re: Which is faster?
Post by: Jimg on May 03, 2009, 02:34:55 PM
To get back on topic-
Quote from: jj2007 on May 03, 2009, 05:21:49 AM
I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider eliminating both fast and slow outliers, but it would require some overhead in \masm32\macros\timers.asm - such as 100 loops to calculate the expected average before starting the main exercise...?
Sometimes I think the words I write just stay local to my machine, echo back to me when I read a thread, but never actually go where anyone else can see them ::)

This is exactly what I have been complaining about the last few pages.

Time each execution of the code.  Throw away the slowest half, because something was obviously going on in a windows background process.   If you do this, there is no need to run it a million times, the fastest values are the ones that weren't affected by something else.  I have found that 100 iterations is more than enough, either throw away the slowest half and average the rest, or just pick the fastest one.  Doing either of these I get rock solid consistent results.  That's why I say it's silly to loop a million times.
Title: Re: Which is faster?
Post by: dedndave on May 03, 2009, 04:51:11 PM
I am beginning to agree with you Jim - lol
(that'll cheer him up, for sure, Steve)
I think part of the problem may be the length of time each test takes.
Let's call each pass a "burst" - not a good term for purists, perhaps, but it is descriptive.
If the burst period is brief, in terms of CPU time, the occurrence of anomalies will be minimized.
Of course, if it is too brief, the time measurements have too much overhead.
Carefully selecting the length of each burst seems important.
In fact, the iterations per burst should be adjusted until the burst period falls within a certain window.
This seems to make sense, particularly when comparing one "Instruction Sequence Under Test" to another.
If we want to compare two ISUT's, we should adjust one or both iteration counts until the burst lengths are nearly the same.
Then, run them enough times to assure acquisition of the fastest time, as Jim suggests.
Also, it should not be difficult to run time measurements on the overhead and subtract it from the results.
This overhead will vary from one platform to another, the same as any other instruction sequence.
Any time measurements we agree upon should 1) produce predictable accurate results with known ISUT's, and
2) produce stable and repeatable results on several platforms with several ISUT's.
As far as 32-bit code is concerned, I am a novice, to be sure.
But, after 30 years in Electronics Engineering, I do have some experience devising certification/verification tests.
Certainly, there is much for me to learn about all the CPU's out there, but basics are basic and statistics still apply.
Title: Re: Which is faster?
Post by: dedndave on May 03, 2009, 05:39:48 PM
this subject has piqued my curiosity, for some reason - lol

Unfortunately, I don't feel qualified to develop the entire program, myself. There is too much
about 32-bit code that I have yet to learn in order to cover all the bases.

One simple question popped into my head, though. It has to do with register dependencies.
As most of us know, the NOP instruction was derived from XCHG AX,AX, or XCHG EAX,EAX in protected mode.
I think the assembler will code 90h for either. These timing routines you guys use should show results easily.

ISUT #1:

NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP

ISUT #2

XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX

ISUT #3

XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX

I wonder if NOP is dependent on the value in AX/EAX?
My guess is that the microcode is smart enough to know better.
Those guys at Intel are pretty sharp.


btw - is there a "standard" timing test program used here in MASM32 forum,
or are the results I see posted using different code?
Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 06:37:33 PM
Quote from: MichaelW on May 03, 2009, 07:23:08 AM
JJ,

The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts.

Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.

Quote
The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count.

Are these the second attachment in the sticky Lab post (http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281)? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen

Quote
The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur.

This would somehow imply that the loop to be timed must need more than 20ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?

Quote
Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.

Indeed ;-)

@dedndave: Most who do timings here use MichaelW's macros, see first post in the Laboratory.
Title: Re: Which is faster?
Post by: Mark Jones on May 03, 2009, 07:39:21 PM
Quote from: jj2007 on May 03, 2009, 06:37:33 PM
Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from?

My take on this is that there is always going to be some difference between hardware and OS, even depending on the running apps, so +/-3% is not worth the hassle. Just take the fastest time, and consider it "the fastest time." For user-mode code, there are always going to be things to slow it down. The fastest time at least gives a baseline speed value.

I guess that means for real-mode code, a new set of "timers" need to be created. :bg
Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 07:57:58 PM
Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time

-120 cycles, rol ecx,32

For example?
Title: Re: Which is faster?
Post by: MichaelW on May 03, 2009, 08:08:34 PM
Quote from: jj2007 on May 03, 2009, 06:37:33 PM
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.

With counts in the range of several thousand cycles, I suspect that all of the results include interruptions. I have no idea how to account for the "constant" 3%, but it seems plausible to me that the system is performing some activity that in bursts is using 3% of the processor time.

Quote
Are these the second attachment in the sticky Lab post (http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281)? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen


Yes, in counter2.zip. Your results look like P4 results, in code without any significant delay after the app starts and before entering the timing loops. Try implementing a 3-4 second delay at the start and, if you have one, try running the test on a non-NetBurst processor. Also, it's not reasonable to expect meaningful counts on any processor for an instruction that executes in less than one clock cycle, or due to the inability to isolate the timed instructions from the timing instructions, in any small number of clock cycles. I would be interested to see the results for my code from the attachment, both with and without the delay at the start.

Quote
This would somehow imply that the loop to be timed must need more than 20ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?

Sorry, I copied most of the text from the documentation for similar macros that I implemented in another language, where I expected some of the readers to have no idea of what a clock cycle actually is.
Title: Re: Which is faster?
Post by: jj2007 on May 03, 2009, 08:45:10 PM
Michael,
Here are the timings for counter2.zip, on a Celeron M (not a P4):
Quote
99
99
444     192
444     192
444     192
444     192
444     192
444     192
444     192
444     156
444     156
444     156
444     192
444     192
444     156
444     156
444     192
444     192
444     192
444     192
444     192
444     192

Microsoft blames rdtsc for the negative cycles and outliers, and suggests QPC plus clamping (http://msdn.microsoft.com/en-us/library/bb173458.aspx)...
Title: Re: Which is faster?
Post by: Jimg on May 03, 2009, 10:06:49 PM
Quote from: jj2007 on May 03, 2009, 07:57:58 PM
Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time

-120 cycles, rol ecx,32

For example?

Clearly the method for determining the loop overhead will occasionally itself contain windows glitches.  Perhaps you need to test the loop overhead a hundred times, and use the smallest result.  Subtracting the smallest overhead from the fastest time should be quite consistent.
Title: Re: Which is faster?
Post by: dedndave on May 03, 2009, 10:30:01 PM
from  what i can see, there is no good way to time these functions - lol
this is especially true for those (myself included) that have a dual-core CPU
one thing MS suggests to help is to confine the thread to a single core
now, how are you supposed to evaluate the advantage of having two cores ? - lol
it appears that the burst needs to be substantially long - which introduces anomalies
i think, at some point, you have to call it "good enough" and accept what you get

is it possible to....
switch to real mode
get a value from the motherboard counter-timer
switch to protected mode
ISUT burst sample
switch back to real mode
get another value from the motherboard counter-timer
switch back to protected mode
yield result

i realize the overhead is large, but it could be subtracted out
scratch that idea - real mode doesn't get you access to the counter-timer, either
let me dust off my stopwatch - lol
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 12:13:42 AM
 :bg

I wonder how long it will take for everyone to realise that real time testing on large samples is one of the few techniques that does not suffer most of these problems. Tailor the test data to the task at hand, make it BIG enough and run it LONG enough and you will get under the magic one percent.

There is yet another variation of this: allocate so much memory that it cannot fit into cache, then copy the test data in blocks larger than the cache sequentially in the allocated memory, then do random rotation of the bits being read, and you will really see how bad some algos are.
Title: Re: Which is faster?
Post by: Jimg on May 04, 2009, 12:33:35 AM
I tend to agree, more and more each day.  It's not nearly as much fun, but real world testing is the way to go.
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 01:23:11 AM
............ and test it on a few different CPU's
i fear optimization that leans toward the author's CPU only
Title: Re: Which is faster?
Post by: MichaelW on May 04, 2009, 01:42:21 AM
Quote
Microsoft blames rdtsc for...

That article is not about timing code in clock cycles; it's about implementing high-resolution timers. A variable clock speed makes the TSC useless as a time base for a timer, so the article recommends using the high-resolution performance counter. With multiple processors/cores there are multiple TSCs, typically running asynchronously, causing obvious problems if the timer thread is not confined to a single processor/core, so again the article recommends the high-resolution performance counter.

For timing code in clock cycles the problem with multiple processors/cores should be solvable by restricting the code to running on a single processor/core using SetProcessAffinityMask or SetThreadAffinityMask. For timing code in clock cycles a variable clock speed does not present a problem, but it does present a serious problem for timing in units of time.
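For example, something like this before the timing run (a minimal sketch; the mask value 1, i.e. the first core, is an arbitrary choice):

    invoke GetCurrentThread
    invoke SetThreadAffinityMask, eax, 1   ; bit 0 set -> this thread runs only on the first core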
Title: Re: Which is faster?
Post by: NightWare on May 04, 2009, 01:43:48 AM
Quote from: lingo on May 03, 2009, 12:49:30 PM
No offense, but who uses .if .elseif .else or preserves ecx and edx in speed critical algos? IMO idiots in assembly, and here is the result:   :lol

hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :

preserving registers is a programming convenience; it allows you to NOT lose time debugging, and allows you to USE EVERY register (instead of considering some of them not usable like the others...).

plus, since by essence it's NOT IN speed critical algos (it only wraps the algos), who cares about extra clock cycles executed just once? (IMO people who have never understood what coding consists of...), yep, a few clock cycles areN'T measurable by a human (or maybe this algo doesn't need human interaction?).

plus2, (if we go this way...) can you explain to me why you preserve ebx/esi/edi? it's done by the operating system, no? so why are you doing the job once more? (especially if speed is critical, IMO some people have a weird logic...). it's quite understandable that the operating system preserves the table/source/destination registers "for you", BUT YOU?

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this, allocate so much memory that it cannot fit into cache

no, coz here you will just measure useless stuff (data reads, which is something we don't care about). or you must also flush the code from the cache, and also the addresses for branch mis/predictions. the result you will obtain will essentially consist of data reading (something incompressible), and you will NOT SEE the effect of the algo.

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then copy the test data in blocks larger than the cache sequentially in the allocated memory

the interest? knowing the cost of reading data again? UNLESS YOU DON'T READ THE SAME THINGS, IT'S THE SAME RESULT IN ALL CASES... (and if you don't read the same things, what the hell are you comparing?)

Quote from: hutch-- on May 04, 2009, 12:13:42 AM
then do random rotation of the bits being read and you will really see how bad some algos are.

no, random access will not allow us to know if it's a memory aliasing case or not. plus, here again, measuring the reading of data has no interest.
MichaelW's macros measure a piece of work from one point (the beginning) to another one (the end), and divide the result by the number of loops. and IMO that is the way to test algos. OS interaction certainly alters the results a bit, but it's ALSO the case in normal use.
so the results ARE consistent (and focus on the algo, coz with the cache the code/data reads + branch mispredictions (things that we don't care about, in most cases where speed is critical...) are absorbed by the loop). the "consistent" problem comes from other factors, and also HOW you use the algo IN your app. to finish, with random access we will not obtain the normal benefit of the hardware prefetching (so the results obtained will be more than debatable... coz not reflecting the normal use...).

PS : it was just to join the unhappy club...  :bdg
Title: Re: Which is faster?
Post by: lingo on May 04, 2009, 02:42:19 AM
I wonder, is there another assembly idiot in this forum who always preserves ecx and edx by default?
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 02:59:41 AM
NightWare,

You appear to have missed why you avoid preloaded data in cache: you get false readings by having the data in cache, and this does not help you with real world situations where the data is rarely ever in cache. The method I described brings into play a phenomenon called "page thrashing", which really does slow down the phony readings you get with data in cache.

In a relatively small number of situations (primarily testing instruction sequences) you can benefit by testing on highly localised data that is very small, but in most instances the test is useless in emulating real world situations that regularly occur in daily software use.

By allocating a much larger block of memory than will fit into cache you force the algorithms to read the data and process the data directly so all of the processor factors come into play, branch prediction, instruction order, pipeline effects, pairing etc .....

If I have learnt one thing over the years of tweaking C algos in assembler, it is this: reduce the number of memory accesses and you will see real world speed increases, where the sum total of the rest shows only minor and trivial improvements in timing.
Title: Re: Which is faster?
Post by: BlackVortex on May 04, 2009, 05:34:18 AM
Quote from: lingo on May 04, 2009, 02:42:19 AM
I wonder is there another assembly idiot in this forum who preserves always ecx and edx by default?
Chill out, this is a programming forum, no need to have the attitude of a 14-year old mmorpg player. Why do you see anything as a skill contest ?
Title: Re: Which is faster?
Post by: jj2007 on May 04, 2009, 07:01:08 AM
Quote from: hutch-- on May 04, 2009, 12:13:42 AM
There is yet another variation of this: allocate so much memory that it cannot fit into cache, then copy the test data in blocks larger than the cache sequentially in the allocated memory, then do random rotation of the bits being read, and you will really see how bad some algos are.

I doubt whether the "random rotation" will change a lot, but the large data block has been tested already in Timings and Cache (http://www.masm32.com/board/index.php?topic=11036.msg81254#msg81254) (and probably earlier, I doubt that I invented this technique :wink):

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

  Masm32 lib szLen   126 cycles
  crt strlen         101 cycles
strlen32s            33 cycles
strlen64LingoB       28 cycles
_strlen (Agner Fog)  30 cycles


Same but with a scheme that eliminates the influence of the cache:

Masm32 lib szLen   *** 672 ms for 7078488 loops
  crt strlen         *** 609 ms for 7078488 loops
strlen32s            *** 328 ms for 7078488 loops
strlen64LingoB       *** 343 ms for 7078488 loops
_strlen (Agner Fog)  *** 344 ms for 7078488 loops


Differences between algos remain significant (I see Nightware has doubts about that, although I fully agree with 99% of this post), but they are much smaller than for the same tests with data in the cache. Which is nothing sensational. In most cases, we can assume data is in the cache; but in the case of a virus scanner, for example, terabytes have to be read and scanned, with no data in the cache, and therefore a "cache-free" timing algo is needed.
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 10:49:05 AM
 :bg

Quote
I doubt whether the "random rotation" will change a lot, but the large data block has been tested already in Timings and Cache (and probably earlier, I doubt that I invented this technique

Here is a man who has yet to test page thrashing. At its crudest, do the test so that each successive read is further up the offset than the cache size and watch the timings crash. To produce a long enough timing, randomly pick addresses that are more than the cache size apart and watch your timings change, not by milliseconds but by seconds.

To complicate the matter, try both temporal and non-temporal reads and writes to see the difference and why non-cached writes are useful.
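
For instance, a non-temporal write can be sketched with an SSE2 fragment like this (illustration only, not code from this thread):

    movdqa  xmm0, [esi]        ; load 16 aligned bytes
    movntdq [edi], xmm0        ; store them without polluting the cache
    ...
    sfence                     ; make the non-temporal stores globally visible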

Most of the testing done with small samples already loaded in cache is a waste of space that doesn't reflect how the algo performs in real world situations.
Title: Re: Which is faster?
Post by: jj2007 on May 04, 2009, 11:37:04 AM
Quote from: hutch-- on May 04, 2009, 10:49:05 AM

Most of the testing done with small samples already loaded in cache is a waste of space that doesn't reflect how the algo performs in real world situations.

Hutch,
As I mentioned in my post, my test is constructed in a way that it does work on non-cached data because the allocated buffer is far beyond cache size. But I am really curious to see a code sample that supports your "seconds instead of milliseconds" statement.
:bg
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 12:48:25 PM
It was a bit big to upload; it was about 1.5 gigabytes of code that built to about a 350 meg test piece. Now what you are testing is very simple: make EVERY read reload the current memory page and suddenly it all gets SSSLLLLLOOOOOOOOOWWWWWWWW. If you are not getting this effect, you are doing something wrong.
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 01:03:11 PM
Hutch,
   The program e1 seems to be using is not the one i found in the laboratory 1st thread.
The one they use IDs the CPU at the beginning. Where is a link to that one ?
Title: Re: Which is faster?
Post by: jj2007 on May 04, 2009, 01:36:53 PM
Hutch,
Maybe we misunderstand each other. Here is the (simplified) core piece of my test code:

mov esi, len(offset arg)      ; the length of the string whose zero delimiter we want to find
..
mov ebx, LoopCt
invoke GetTickCount
push eax                      ; save timer
push ebx                      ; save counter
.Repeat
    invoke pAlgo, edi         ; pAlgo=strlen, strlen32s, etc.
    add edi, esi              ; move start position higher with EACH loop
    dec ebx
.Until Zero?
invoke GetTickCount
pop ebx
pop ecx
sub eax, ecx
print str$(eax), " ms for "
print str$(ebx), " loops", 13, 10


Since the size of FatBuffer is 100 MB, as far as I can see, in 99% of all cases the strlen algo must read from memory rather than from cache. But please correct me if I am wrong - I freely admit that I am uncertain. I'd like to understand it ::)

Output:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated

codesizes: strlen32s=85, strlen64A=120, strlen64B=87

-- test 16k, misaligned 0, 16384 bytes
  Masm32 lib szLen   *** 62 ms for 6096 loops
  crt strlen         *** 32 ms for 6096 loops
strlen32s            *** 31 ms for 6096 loops
strlen64LingoB       *** 47 ms for 6096 loops
_strlen (Agner Fog)  *** 47 ms for 6096 loops

-- test 4k, misaligned 11, 4096 bytes
  Masm32 lib szLen   *** 63 ms for 24304 loops
  crt strlen         *** 32 ms for 24304 loops
strlen32s            *** 31 ms for 24304 loops
strlen64LingoB       *** 32 ms for 24304 loops
_strlen (Agner Fog)  *** 31 ms for 24304 loops
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 01:38:35 PM
JJ - where can I obtain this test program?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated

codesizes: strlen32s=85, strlen64A=120, strlen64B=87

-- test 16k, misaligned 0, 16384 bytes
  Masm32 lib szLen   *** 62 ms for 6096 loops
  crt strlen         *** 32 ms for 6096 loops
strlen32s            *** 31 ms for 6096 loops
strlen64LingoB       *** 47 ms for 6096 loops
_strlen (Agner Fog)  *** 47 ms for 6096 loops

-- test 4k, misaligned 11, 4096 bytes
  Masm32 lib szLen   *** 63 ms for 24304 loops
  crt strlen         *** 32 ms for 24304 loops
strlen32s            *** 31 ms for 24304 loops
strlen64LingoB       *** 32 ms for 24304 loops
_strlen (Agner Fog)  *** 31 ms for 24304 loops
Title: Re: Which is faster?
Post by: jj2007 on May 04, 2009, 02:17:47 PM
Quote from: dedndave on May 04, 2009, 01:38:35 PM
JJ - where can I obtain this test program?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
100000000 bytes allocated

Here it is. You need to move slenSSE2.inc to \masm32\include\slenSSE2.inc
Same for include \masm32\macros\timers.asm (not included to avoid version confusion, see top post of The Laboratory).

No warranties about mental sanity etc. after reading my code. Look for AlgoTest inside the asm file...

[attachment deleted by admin]
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 02:28:29 PM
JJ,

Your example is a linear read. The way to test this is to allocate 100 meg as in your example, write the same 1 meg of data to each megabyte in the allocated buffer, then read the same piece of data from each of the 1 meg buffers that make up the 100 meg buffer. Do it either randomly or in a preset order to ensure that no read is sequential, and make the piece of data small, 32 bytes or similar.

If you do this, on each read you force the processor to reload the page table, which ensures you have no cache reads. Watch it drop to a snail's pace in comparison to cached reads.
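A rough sketch of that access pattern (the block size, the 32 byte piece and the (i*37) mod 100 shuffle are illustrative choices of mine, not hutch's actual test code):

    invoke GlobalAlloc, GMEM_FIXED, 100*1024*1024   ; 100 meg, far larger than any cache
    mov esi, eax
    xor ebx, ebx                ; block index 0..99
  nextblock:
    imul eax, ebx, 37           ; 37 is coprime to 100, so (i*37) mod 100 visits every block once
    xor edx, edx
    mov ecx, 100
    div ecx                     ; edx = (i*37) mod 100
    mov eax, edx
    shl eax, 20                 ; * 1 meg -> offset of that block
    mov edi, [esi+eax]          ; touch the first dword of a 32 byte piece...
    mov edi, [esi+eax+28]       ; ...and the last one
    inc ebx
    cmp ebx, 100
    jb nextblock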
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 02:43:15 PM
Thanks JJ,
   I ran your exe. My CPU takes several more cycles than yours, as it is a dual core.
I thought the program ID'ed dual-cores.

              Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
200000000 bytes allocated
codesizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16384 bytes
  Masm32 lib szLen   ** 188 ms for 12193 loops
  crt strlen         ** 93 ms for 12193 loops
strlen32s            ** 47 ms for 12193 loops
strlen64LingoB       ** 47 ms for 12193 loops
NWStrLen             ** 47 ms for 12193 loops
_strlen (Agner Fog)  ** 47 ms for 12193 loops

-- test 4k, misaligned 11, 4096 bytes
  Masm32 lib szLen   ** 172 ms for 48611 loops
  crt strlen         ** 94 ms for 48611 loops
strlen32s            ** 47 ms for 48611 loops
strlen64LingoB       ** 47 ms for 48611 loops
NWStrLen             ** 47 ms for 48611 loops
_strlen (Agner Fog)  ** 62 ms for 48611 loops
Title: Re: Which is faster?
Post by: jj2007 on May 04, 2009, 02:54:07 PM
Quote from: hutch-- on May 04, 2009, 02:28:29 PM
read the same piece of data from each of the 1 meg buffers that make up the 100 meg buffer. Do it either randomly or in a preset order to ensure that no read is sequential, and make the piece of data small, 32 bytes or similar.

Could you give a real life example of an application that would behave like this? I chose linear read because I thought of e.g.
- reading Windows.inc into a buffer, find first occurrence of "Duplicate.inc"
- a virus scanner reading word.exe into a buffer, find first occurrence of a pattern
etc.

In any case I hope we can agree that during the life cycle of the loop, i.e. between the two GetTickCounts, the code reads memory into the cache, at a buffer size of 200 Mega.
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 06:07:00 PM
Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that
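One way to at least read the counter on each core in turn is to pin the thread to one core at a time. A rough sketch (the proc name is hypothetical; .686 and masm32rt.inc assumed; this relies on Sleep(0) actually migrating the thread, and the TSCs of the two cores are not guaranteed to be in sync):

.data?
TscLo dd 2 dup(?)           ; low dword of each core's TSC
TscHi dd 2 dup(?)           ; high dword of each core's TSC

.code
GrabCoreTsc proc uses ebx edi
    invoke GetCurrentThread
    mov edi, eax            ; pseudo-handle for the current thread
    xor ebx, ebx            ; core index: 0, then 1
NextCore:
    mov eax, 1
    mov ecx, ebx
    shl eax, cl             ; affinity mask = 1 shl core
    invoke SetThreadAffinityMask, edi, eax
    invoke Sleep, 0         ; let the scheduler actually move the thread
    push ebx
    xor eax, eax
    cpuid                   ; serialise (cpuid trashes ebx, hence the push)
    rdtsc
    pop ebx
    mov [TscLo+ebx*4], eax
    mov [TscHi+ebx*4], edx
    inc ebx
    cmp ebx, 2
    jb NextCore
    ret
GrabCoreTsc endp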
Title: Re: Which is faster?
Post by: MichaelW on May 04, 2009, 07:02:37 PM
From what little testing I have done on a dual-core system my small, simple test apps seemed to run on one core only. I have yet to see any clear demonstration where having multiple cores provided a performance advantage.
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 07:44:13 PM
it may be that it hasn't been measured yet
until we devise a test that accommodates more than one core, we won't really know
Title: Re: Which is faster?
Post by: Mark Jones on May 04, 2009, 09:45:54 PM
In my experience with the AMD Athlon dual-core, Windows likes to assign a single-threaded process to one CPU core, and it alternates cores for each new process. I.e., if you open up two single-threaded programs, they each run on a separate core. So yes, two things can be running at the same time. Of course, they have to share the same buses, so it is not exactly 2x the performance.

Programs like BOINC spawn new worker processes to utilize all available cores. Quick and easy solution. Applications which are multi-threaded utilize the additional cores by creating threads to run on each core. New threads may alternate cores like processes do; I will have to look into that. But thread affinity can also be set, to force the thread to only run on the selected core.

There is performance to be had in utilizing additional cores, but with this comes added complexity. If an app uses two threads on different cores, then the programmer must make provisions for synchronization. If the threads need to communicate with each other, then the EnterCriticalSection and LeaveCriticalSection APIs are helpful to guarantee things don't get desynchronized.

As a side note, a single-CPU system can run a multi-threaded app just fine. Each thread just runs on the same core, and time slices are divided between the threads. There is little overhead in the thread switching, something like a few thousand clocks.

When it comes to implementing multi-threading in general, the best design concept seems to be that of one master-thread which "doles out work units" to n independent worker threads, where n is the number of cores detected at startup. This concept guarantees total processor usage, and is tolerant of thread timing variance. (Sorry if this was a little more info than necessary, lol.)
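A bare-bones sketch of that master/worker layout (not from the thread; WorkerProc and the other names are hypothetical, masm32rt.inc assumed; a real worker would loop, pulling work units from a queue guarded by EnterCriticalSection/LeaveCriticalSection):

include \masm32\include\masm32rt.inc

.data?
sysinfo  SYSTEM_INFO <>
hThreads dd 32 dup(?)

.code
WorkerProc proc lpParam:DWORD
    ; hypothetical worker: would take work units from a shared,
    ; critical-section-protected queue until none are left
    xor eax, eax
    ret
WorkerProc endp

start:
    invoke GetSystemInfo, offset sysinfo
    mov ebx, sysinfo.dwNumberOfProcessors   ; n = cores detected at startup
    xor esi, esi
MakeWorker:
    invoke CreateThread, NULL, 0, offset WorkerProc, esi, 0, NULL
    mov [hThreads+esi*4], eax
    inc esi
    cmp esi, ebx
    jb MakeWorker
    invoke WaitForMultipleObjects, ebx, offset hThreads, TRUE, INFINITE
    inkey "--- all workers finished ---"
    exit
end start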
Title: Re: Which is faster?
Post by: dedndave on May 04, 2009, 09:57:48 PM
well, that is kind of what I thought, too
but, when I run a simple timing test, the numbers tell me otherwise....

Reference null tests:
Null:               20 clocks.
10x NOP:            25 clocks.

Failure-mode CMP tests:
10x CMP REG,REG:    7 clocks.
10x CMP REG,IMMED:  -210 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Success-mode CMP tests:
10x CMP REG,REG:    1 clocks.
10x CMP REG,IMMED:  554189126 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Failure-mode TEST tests:
10x TEST REG,REG:   1356305252 clocks.
10x TEST REG,IMMED: 554189125 clocks.
10x TEST MEM,REG:   6 clocks.
10x TEST MEM,IMMED: 54 clocks.

Success-mode TEST tests:
10x TEST REG,REG:   18 clocks.
10x TEST REG,IMMED: 15 clocks.
10x TEST MEM,REG:   -330 clocks.
10x TEST MEM,IMMED: 47 clocks.

it seems obvious that the counters are coming from the 2 cores
that kind of implies that this single process is running on both cores, no ?

btw - i am using XP
- this could well be OS dependent
Title: Re: Which is faster?
Post by: MichaelW on May 04, 2009, 10:26:49 PM
Quote: it seems obvious that the counters are coming from the 2 cores

If you think that is so, then try restricting the process to the first core by adding these statements to your source somewhere above the tests:

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1


And you might also want to try the second core, specified with an affinity mask value of 2.
Title: Re: Which is faster?
Post by: hutch-- on May 04, 2009, 11:52:50 PM
JJ,

> Could you give a real life example of an application that would behave like this?

I just don't have time to write a test piece for you, but I wonder what the problem is. The bottom line is to ensure that each read is not in cache and that the size of each read is small; a sample of less than 32 bytes comes to mind.

Real-life examples are things like a small in-memory database under 2 gig in size, a very large table of preset data, anything that is large enough to be useful, that is loaded directly into memory and accessed in a random manner.

To simulate conditions of this type, ensure the reads are NOT linear and not in cache. An algorithm is only as good as it performs under conditions of this type, and small test pieces that repeatedly bash the same address in cache almost never emulate these conditions effectively.
Title: Re: Which is faster?
Post by: lingo on May 05, 2009, 03:22:06 AM
"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"


NightWare,
Will be better to teach your lovely kleptomaniac how to preserve ecx and edx faster  :lol
For your info kleptomania  is an inability or great difficulty in resisting impulses of stealing.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder
Kleptomania can occur after traumatic brain injury...etc.
Example:
What means for the kleptomaniac 
"inner loop inspired by Lingo, with adaptions'"
For the kleptomaniac that means copy and paste....  :lol
As an idiot he preserved ecx and edx again (because NightWare preserved registers) and his program will become 'faster' on his 'special' CPUs.
From another point of view it is not a big deal for everyone from this forum to beat kleptomaniac's code. Just take a look:  :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
codesizes: strlen32=80, strlen64A=93, _strlen=66

-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       11096 cycles
strlen32      :       1577 cycles
strlen64LingoA :      1511 cycles
_strlen (Agner Fog):  2761 cycles

-- test 4k            return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       2727 cycles
strlen32      :       416 cycles
strlen64LingoA :      395 cycles
_strlen (Agner Fog):  707 cycles

-- test 1k            return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       726 cycles
strlen32      :       97 cycles
strlen64LingoA :      77 cycles
_strlen (Agner Fog):  192 cycles

-- test 0             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       148 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  59 cycles

-- test 1             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       152 cycles
strlen32      :       38 cycles
strlen64LingoA :      33 cycles
_strlen (Agner Fog):  40 cycles

-- test 4             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       147 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  42 cycles

-- test 7             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       150 cycles
strlen32      :       23 cycles
strlen64LingoA :      18 cycles
_strlen (Agner Fog):  40 cycles

Press any key to exit...





[attachment deleted by admin]
Title: Re: Which is faster?
Post by: jj2007 on May 05, 2009, 06:39:16 AM
Quote from: lingo on May 05, 2009, 03:22:06 AM
"hmm, as being an idiot, i've just few things to say (stupid stuff, certainly...) :
preserving registers is a programming commodity, allow you to NOT loose your time to debug...bla,blah, bla"


NightWare,
Will be better to teach your lovely kleptomaniac how to preserve ecx and edx faster  :lol
For your info kleptomania  is an inability or great difficulty in resisting impulses of stealing.
People with this disorder are likely to have a comorbid condition, specifically paranoid, schizoid or borderline personality disorder
Kleptomania can occur after traumatic brain injury...etc.


Lingo, please seek professional advice, at least on the definition of kleptomania and its application in code development.

A propos code: compliments, it seems your sense of competition is still working fine. Your code beats mine in most cases, except Test 1 (on my archaic Celeron M). Will you make it public domain, or will you sue thieves?
Title: Re: Which is faster?
Post by: hutch-- on May 05, 2009, 11:46:04 AM
 :bg

I have worked out what this antagonism is at last, it must be something in the water. Has there been a reactor leak recently in the EU, or perhaps a chemical spill, or even worse, is the EU water supply fed directly from the GRAY Danube (used to be blue)? Perhaps Berlusconi washed his socks in it, or even worse, Tony Blair gave a speech nearby and it ended up full of sewage.

Now I think there is only one solution, force the contestants to drink bottled water from African water supplies or perhaps Indian ones so that they end up with such a severe case of the trots that they don't have time to throw the surplus medium at each other.  :clap:
Title: Re: Which is faster?
Post by: dedndave on May 05, 2009, 11:49:46 AM
everyone knows - if you go to Mexico - don't drink the water - just tequila
Title: Re: Which is faster?
Post by: hutch-- on May 05, 2009, 12:49:36 PM
 :bg

Dave,

We don't want them to drink Mexican water, they may kiss a pig later.  :P
Title: Re: Which is faster?
Post by: jj2007 on May 05, 2009, 01:37:47 PM
Quote from: hutch-- on May 04, 2009, 11:52:50 PM
JJ,
...  ensure the reads are NOT linear and not in cache. An algorithm is as good as it performs under conditions of this type

If and only if you have that type of application - a database in memory that needs many thousand random accesses per second. How realistic is that? Maybe Google needs it that way ::)

As I mentioned earlier, a virus scanner, or a "find RtlZeroMemory in all *.asm files" algo would behave in the way my test was designed.

Quote from: hutch-- on May 05, 2009, 11:46:04 AM
it must be something in the water. Has there been a reactor leak recently in the EU ...

Very funny, Sir Hutch. What is your official policy in this forum regarding calling other members (Nightware, myself) idiots? Do you recommend it nowadays officially? Do you prefer other labels, can you make suggestions? I have been tempted many times, but until now my good education stopped me from answering in the same language. However, you seem to like this style. What do other members of the forum think about it?
Title: Re: Which is faster?
Post by: lingo on May 05, 2009, 01:43:44 PM
"Your code beats mine in most cases.."

Let's see what is "yours" and what is "mine"

strlen64B     proc szBuffer : dword
   pop        ecx
   pop        eax
   movdqu     xmm2, [eax]
   pxor       xmm0, xmm0
   pcmpeqb    xmm2, xmm0
   pxor       xmm1, xmm1
   pmovmskb   edx, xmm2
   test       edx, edx
   jz         @f
   bsf        eax, edx
   jmp        ecx
@@:
   lea       ecx,   [eax+16]
   and       eax,    -16
@@:
   pcmpeqb    xmm0, [eax+16]
   pcmpeqb   xmm1, [eax+32]
   por       xmm1, xmm0
   add       eax,    32
   pmovmskb   edx,    xmm1
   test       edx,    edx
   jz       @B
   shl       edx,    16
   sub       eax,    ecx
   pmovmskb    ecx,    xmm0
   or       edx,    ecx
   mov       ecx,    [esp-8]
   bsf       edx,    edx
   add       eax,    edx
   jmp       ecx
strlen64B       endp

strlen32s    proc      src:DWORD   ; with lots of inspiration from Lingo, NightWare and Agner Fog
      pop       eax         ; trash the return address
      pop       eax         ; the src pointer
      pxor       xmm0, xmm0   ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
      movups    xmm1, [eax]    ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
      pcmpeqb    xmm1, xmm0   ; set bytes in xmm1 to FF if nullbytes found in xmm1
      mov       edx,     eax      ; save pointer to string
      pmovmskb    eax,     xmm1   ; set byte mask in eax
      bsf       eax,     eax      ; bit scan forward
      jne       Lt16         ; less than 16 bytes, we are done
      mov       MbGlobRet, edx   ; edx preserved because Masm32 szLen preserves it
      and       edx,      -16      ; align initial pointer to 16-byte boundary
      lea       eax,      [edx+16]    ; aligned pointer + 16 (first 0..15 dealt with by movups above)
@@:   
      pcmpeqb    xmm0, [eax]    ; ---- inner loop inspired by Lingo, with adaptions -----
      pcmpeqb   xmm1, [eax+16]    ; compare packed bytes in [m128] and xmm1 for equality
      lea       eax,      [eax+32]    ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
      por       xmm1, xmm0   ; or them: one of the mem locations may contain a nullbyte
      pmovmskb    edx,      xmm1   ; set byte mask in edx
      test       edx,      edx
      jz      @B
@@:
      sub       eax,   [esp-4]    ; subtract original src pointer
      shl       edx,    16      ; create space for the ecx bytes
      push       ecx         ; all registers preserved, except edx and eax = return value
      pmovmskb    ecx,    xmm0   ; set byte mask in ecx (has to be repeated, sorry)
      or       edx,    ecx      ; combine xmm0 and xmm1 results
      bsf       edx,    edx      ; bit scan for the index
      pop       ecx
      lea       eax,    [eax+edx-32] ; add scan index
      mov       edx,    MbGlobRet
Lt16:      
      jmp       dword ptr [esp-4-4] ; ret address, one arg - the Lingo style equivalent to ret 4 ;-)
strlen32s    endp

Hutch,
I appreciate your knowledge about water old link (http://www.webmd.com/food-recipes/features/top-6-myths-about-bottled-water), but it will be better to see your opinion again about ecx and edx preservation old link (http://www.masm32.com/board/index.php?topic=4205.msg41584#msg41584)
Everyone (including sick people) can do with my code what they want, but when someone tolerates idiotic behavior
such as useless register preservation, I can't be quiet.

Title: Re: Which is faster?
Post by: hutch-- on May 05, 2009, 01:45:07 PM
 :bg

> How realistic is that?

Extremely. I used to write them; they are called fixed-length records and will generally rip the titz off a relational database.

RE: the various forms of name calling, I have explained that admin has enough trouble turning mountains into molehills, but there is an easy way that we try to avoid: the bulldozer approach is to shut the topic and move it to the trash can, and that turns mountains into flat plains very quickly.  :bdg
Title: Re: Which is faster?
Post by: jj2007 on May 05, 2009, 02:39:04 PM
Quote from: lingo on May 05, 2009, 01:43:44 PM
Let's see what is "yours" and what is "mine"

strlen64B     proc szBuffer : dword
   pop        ecx
   pop        eax
....
strlen32s    proc      src:DWORD
      pop       eax         ; trash the return address
      pop       eax         ; the src pointer


Quote from: jj2007 on November 25, 2008, 08:57:12 PM
The SetSmallRect procedure looks ok. Here is another one, just in case - only 20 bytes long and pretty fast.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
SetRect16 proc ps_r:DWORD,left:DWORD,top:DWORD,right:DWORD,bottom:DWORD
pop edx ; trash the return address
pop edx ; move the first argument to edx
pop dword ptr [edx].SMALL_RECT.Left
pop dword ptr [edx].SMALL_RECT.Top
pop dword ptr [edx].SMALL_RECT.Right
pop word ptr [edx].SMALL_RECT.Bottom
sub esp, 5*4+2 ; correct for 5 dword + 1 word pop, restore return address
ret 5*4 ; correct stack for five arguments
SetRect16 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


etc etc - so who is the thief here?

But I simply don't have the time to follow Lingo's game. Nobody "steals" here; I even acknowledge Lingo when I take over bits of his code. The Intel instruction set has only a limited number of mnemonics, so it is inevitable that certain sequences pop up all over the place. Try this Google search for pcmpeqb pmovmskb (http://www.google.it/search?hl=en&safe=off&num=50&newwindow=1&ei=u00ASq6_KMSv-QbU0Y3hBw&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=pcmpeqb+pmovmskb&spell=1). Does he ever admit where he gets his inspiration? Does he "steal" from Intel when he uses their manuals?
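For readers following along, the shared backbone of all these routines is the standard SSE2 zero-byte scan, which in isolation looks roughly like this (a fragment only; .686/.xmm assumed, and eax is assumed to hold a 16-byte aligned pointer into the string):

    pxor     xmm0, xmm0          ; 16 zero bytes to compare against
    movdqa   xmm1, [eax]         ; load 16 string bytes (aligned load)
    pcmpeqb  xmm1, xmm0          ; bytes become FFh wherever a zero byte sits
    pmovmskb edx, xmm1           ; gather the 16 comparison results into a bit mask
    bsf      edx, edx            ; index of the first zero byte; ZF=1 if none in these 16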

I couldn't care less for Lingo, but the whole forum loses credibility if members are insulted as "idiots" and "thieves", and no moderator intervenes.
Title: Re: Which is faster?
Post by: Mark Jones on May 05, 2009, 08:52:01 PM
Quote from: MichaelW on May 04, 2009, 10:26:49 PM
Quoteit seems obvious that the counters are coming from the 2 cores

If you think that is so, then try restricting the process to the first core by adding these statements to your source somewhere above the tests:

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1


And you might also want to try the second core, specified with an affinity mask value of 2.

I don't think this is the issue, because the source does set the process affinity as suggested, and the timing routine (Petroizki's ptimers.inc) is included in-line and seems to be a single-threaded routine. Thus all of it should run in one process, thread, and core, no?

Perhaps what Dave is seeing is a power-saving feature of his CPU causing errors in the timing resolution, due to clock variance?
Title: Re: Which is faster?
Post by: MichaelW on May 05, 2009, 10:02:19 PM
Quote: Perhaps what Dave is seeing is a power-saving feature of his CPU causing errors in the timing resolution, due to clock variance?

AFAIK the TSC count should be independent of the clock frequency. Although I can't back this up, I was under the impression that recent processors use a fully static design that can accommodate any clock frequency from zero to the rated maximum, without missing a step. I think it's more likely that some other process is "borrowing" the processor while the test (or reference loop) is running.
Title: Re: Which is faster?
Post by: dedndave on May 05, 2009, 10:25:51 PM
well, i see two possibilities:
1) the RDTSC instruction is reading TS counter values from the 2 cores (makes sense)
2) the floating point math used is executing differently on my machine (as in a FP instruction serialization type problem)
    (i.e. it is possible a fwait is missing that does not cause trouble until it gets executed on a dual core cpu)

as for power/standby/hibernation, the very first thing i do after rebuilding a drive is turn all that stuff off
always on (desktop) - never turn off drives - never turn off monitor - disable hibernation - i also select screensaver: none

i have all the toys in place to test it
you guys will probably chuckle when i say, "the hard part is displaying the processor type" - lol
i swear - i thought MS was bad
i am half-tempted to copy/paste JJ's CpuID code in there - lol
but, i don't learn anything by doing that
Title: Re: Which is faster?
Post by: NightWare on May 05, 2009, 11:46:33 PM
Quote from: jj2007 on May 05, 2009, 01:37:47 PM
Very funny, Sir Hutch. What is your official policy in this forum regarding calling other members (Nightware, myself) idiots? Do you recommend it nowadays officially?

hmm, personally i don't care, everybody can think what they want (plus, it's just words, if you take them all seriously...). now, drinking WATER ? why not MILK ? it's clearly an insult. i will report the author to the moderators, one day...  :bg

Quote from: lingo on May 05, 2009, 01:43:44 PM
Everyone (including sick people) can do with my code what they want but when someone tolerate idiotic behavior as a useless registers preservation I can't be quiet.

you mean like preserving ebx/esi/edi when it has already been done by the OS ?  :lol

plus, just for info, i don't preserve ecx/edx by default, i ONLY preserve the registers i use/alter (including ebx/esi/edi) AND for MY OWN use. it's MY OWN calling convention, and i perfectly know why i proceed like that. if i don't preach to impose this calling technique, it's because i understand that others can see it differently.

however, blindly following Microsoft's recommendations (or an interpretation of those recommendations) isn't very clever... (hmm... maybe one day, if i want to become another sheep, later...).


Title: Re: Which is faster?
Post by: hutch-- on May 06, 2009, 02:31:22 AM
hmmmm,

> but the whole forum loses credibility if members are insulted as "idiots" and "thieves", and no moderator intervenes.

Seems that bulldozer approach will have to be put in place soon. Why does the "pregnant schoolgirl" image come to mind? Perhaps Deja Vu (and it seems, ahhhhhhhhh been here before ..... and it makes me wonder [pause] what's goin' on etc .......) (apologies to Crosby Stills Nash and Young).

I am loath to close down a topic that has code in it, but sooner or even sooner still, if the PHUKING nonsense does not stop, I will turn Mount Everest into the Utah Salt Flats. I have no feel whatsoever for "camp" melodramas and I don't see that they have a place in a technical forum for programmers. I will not take sides between members in a dispute as silly as this one, I will just pull the plug on it if it continues.
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 02:48:52 AM
well - there has been a lot of fruitful discussion on this thread, also
hate to see it disappear, as i am using some of it as reference for a current project
Title: Re: Which is faster?
Post by: jj2007 on May 06, 2009, 06:22:53 AM
Quote from: dedndave on May 05, 2009, 10:25:51 PM

i am half-tempted to copy/paste JJ's CpuID code in there - lol
but, i don't learn anything by doing that


Attached for copy & paste is the minimum code, displaying the brand string and SSE level; it adds only 142 bytes to the exe. It is commented, but reading it together with the Wikipedia description of CPUID (http://en.wikipedia.org/wiki/CPUID) might help. Don't forget to move the PROTO upstairs :thumbu
Output:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

[attachment deleted by admin]
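Since the attachment is gone from the archive, the brand-string part boils down to the three extended CPUID leaves 80000002h..80000004h, which each return 16 bytes of the name in eax/ebx/ecx/edx. A minimal sketch (proc and buffer names are hypothetical; .686 and masm32rt.inc assumed; a robust version would first check via leaf 80000000h that leaf 80000004h is supported):

.data?
BrandName db 52 dup(?)          ; 48 bytes of brand string + terminating zero

.code
GetBrandString proc uses ebx esi edi
    mov edi, offset BrandName
    mov esi, 80000002h          ; leaves 80000002h..80000004h hold the string
NextLeaf:
    mov eax, esi
    cpuid                       ; 16 bytes come back in eax, ebx, ecx, edx
    mov [edi], eax
    mov [edi+4], ebx
    mov [edi+8], ecx
    mov [edi+12], edx
    add edi, 16
    inc esi
    cmp esi, 80000004h
    jbe NextLeaf
    mov byte ptr [edi], 0       ; zero-terminate the 48 bytes
    ret
GetBrandString endp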
Title: Re: Which is faster?
Post by: jj2007 on May 06, 2009, 06:57:36 AM
Quote from: NightWare on May 05, 2009, 11:46:33 PM

i don't preserve ecx/edx by default, i ONLY preserve the registers i use/alterate (including ebx/esi/edi) AND for MY OWN use. it's MY OWN calling convention, and i perfectly know why i proceed like that. if i don't preach to impose this calling technic it's because i understand that others can see it differently.


I agree 100%. It is a question of personal taste, everybody is free to do whatever he/she likes. My taste is to preserve ecx and edx when I alter them, it costs me 4 bytes and 3 cycles. Not a big "loss" for routines that typically run in hundreds or thousands of cycles, and are often being called more than once in a context where these two registers already serve a purpose and therefore must be saved anyway.

:bg
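For what it's worth, the 4 bytes are simply the four one-byte push/pop opcodes; a sketch of the pattern (hypothetical proc name):

SomeProc proc arg:DWORD
    push ecx                ; 51h
    push edx                ; 52h   (4 bytes of code in total)
    ; ... work that may clobber ecx and edx ...
    pop edx                 ; 5Ah
    pop ecx                 ; 59h
    ret
SomeProc endp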
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 07:40:44 AM
Thanks JJ

I found this pdf file from Intel, "Intel Processor Identification and the CPUID Instruction"

www.intel.com/Assets/PDF/appnote/241618.pdf

You can save the file as text, or google "241618.pdf" and use google to convert it to HTML, then use the browser to save it as text
It has a very comprehensive program in assembler that IDs Intel CPUs

Then, use this one from AMD, "CPUID Specification" to add a few touches

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25481.pdf

Really, this is beyond the scope of what I wanted to do for CPU identification
I really just want something like your "Short Version" - I don't even care about the clock frequency
In fact, as a minimalist approach, all I really need is how many cores the CPU has
GetProcessAffinityMask tells me how many the system uses - that is really enough for the program to function
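A small sketch of that minimalist approach, counting the set bits in the system affinity mask (variable names hypothetical, the usual masm32rt.inc skeleton assumed):

.data?
ProcMask dd ?
SysMask  dd ?

.code
    invoke GetCurrentProcess
    invoke GetProcessAffinityMask, eax, offset ProcMask, offset SysMask
    mov eax, SysMask
    xor ecx, ecx              ; ecx = number of processors the system lets us use
CountCores:
    shr eax, 1                ; shift the lowest mask bit into CF
    adc ecx, 0                ; and add it to the count
    test eax, eax
    jnz CountCores
    ; ecx now holds the count of usable logical processors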
Title: Re: Which is faster?
Post by: jj2007 on May 06, 2009, 08:20:02 AM
Quote from: dedndave on May 06, 2009, 07:40:44 AM

In fact, as a minimalist approach, all I really need is how many cores the CPU has
GetProcessAffinityMask tells me how many the system uses - that is really enough for the program to function


Remember your own post here (http://www.masm32.com/board/index.php?topic=10848.msg82318#msg82318)?
The code gives me the same results:
Quote: CPU family 15, model 4, Pentium 4 Prescott (2005+), MMX, SSE6
Cores           2
... but according to Wikipedia (http://en.wikipedia.org/wiki/Pentium_4#Prescott) the Prescott has only one core ::)

GetProcessAffinityMask sounds promising, though - thanks for the hint. But it also says I have two cores:

SystemAffinityMask:     00000000000000000000000000000011
ProcessAffinityMask:    00000000000000000000000000000011


Any hardware experts around...?

include \masm32\include\masm32rt.inc

.data?
ProcessAffinityMask dd ?
SystemAffinityMask dd ?
buffer dd 10 dup (?)

.code
start:
invoke GetCurrentProcess
invoke GetProcessAffinityMask, eax, offset ProcessAffinityMask, offset SystemAffinityMask
print "SystemAffinityMask: ", 9
invoke dw2bin_ex, SystemAffinityMask, offset buffer
print offset buffer,13,10
print "ProcessAffinityMask: ", 9
invoke dw2bin_ex, ProcessAffinityMask, offset buffer
print offset buffer,13,10
getkey
exit

end start
Title: Re: Which is faster?
Post by: BlackVortex on May 06, 2009, 08:38:06 AM
Hyperthreading ?
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 08:55:55 AM
Well, I can tell you the Prescott has 2 cores - lol
I re-read the Wikipedia article you linked and it does not really say it has a single core, per se
In that article, they use the term "core" to mean the overall core - not stating it has 2 (or 1, either)
That is odd that you pointed that out JJ - lol - I had read that page earlier this week and had not noticed the omission
I guess I was more interested in the heat issue it mentions
I am an Electronics Engineer, although I avoid the term "expert" because those who use it are usually showing how little they know
In any event, the system affinity mask returned by the GetProcessAffinityMask really tells you what you are able to access
If you had a CPU with 8 cores, but the system only uses 7, 7 is probably all you could use without generating some kind of protection fault
The real authority on how many cores the CPU has is the manufacturer, I suppose
If you use CPUID (and all its whack-a-mole caveats), it will tell you that the Prescott is dual core
Title: Re: Which is faster?
Post by: jj2007 on May 06, 2009, 09:38:55 AM
Quote from: dedndave on May 06, 2009, 08:55:55 AM
Well, I can tell you the Prescott has 2 cores - lol

You probably have seen them with your own eyes, so I believe you :bg

That Prescott story is truly confusing. Apparently, before the first "true" Dual Core came out, they fumbled two Prescotts on one board and called it "Smithfield" - see Google search (http://www.google.it/search?hl=en&safe=off&client=firefox-a&rls=org.mozilla%3Aen-GB%3Aofficial&hs=g6I&num=50&newwindow=1&q=Prescott+dual+core+smithfield&btnG=Search). Besides, I could not find any clear documentation of the CPUID code that is supposed to tell you the number of cores :(
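For what it's worth, the piece most ID programs use is CPUID leaf 1: EDX bit 28 is the Hyper-Threading flag, and EBX bits 23-16 give the logical processor count per physical package. On a Hyper-Threading Prescott that reports 2 logical processors even with a single physical core, which may be part of the confusion. A rough sketch of a fragment (.686 assumed, labels hypothetical):

    mov eax, 1
    cpuid
    test edx, 1 shl 28        ; HTT flag set? (more than one logical processor)
    jz  SingleLogical
    shr ebx, 16
    and ebx, 0FFh             ; EBX[23:16] = logical processors per physical package
    jmp GotCount
SingleLogical:
    mov ebx, 1
GotCount:
    ; ebx = logical processors (physical cores * HT threads), not physical cores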
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 09:46:43 AM

check this document out JJ - go to the end look at their masm code
www.intel.com/Assets/PDF/appnote/241618.pdf

the prescott - 2 cores right across the middle
now, you can say you have seen them, too - lol

(http://images.sudhian.com/review/cpu/intel/Prescott/prescott_die_8in.jpg)
Title: Re: Which is faster?
Post by: hutch-- on May 06, 2009, 11:24:32 AM
I am surprised that your BIOS does not tell you the Intel designation for your processor. I have 2 EM64T processors that identify as Prescott, and an earlier one that identifies as a Northwood, and all of them can handle hyperthreading, which you turn off in win2000 as it is not optimised for this technology. Without turning it off in the BIOS, Win2000 reports 2 processors and runs badly with uneven timings.
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 01:24:08 PM
Quote from: hutch-- on May 06, 2009, 11:24:32 AM
I have 2 EM64T processors that identify as Prescott, an earlier one that identifies as a Northwood

2 questions for you Hutch.....
1) "identify as" - you mean all CPUID programs tell you the EM64T's are Prescotts ?
2) "an earlier one that identifies as a Northwood" - an earlier EM64T ?

They are all Intel processors - they should ID properly - i could understand if they were manufactured by someone else
Perhaps it is that the CpuID programs used are not doing their thing ?
Of course, if you ask the salesman at radio shack, "is it quad-core", he will answer yes - lol
Do you run a 64-bit OS on any of them?
I bet the Vista-64 installer would tell you right away that they are not Prescotts
Title: Re: Which is faster?
Post by: hutch-- on May 06, 2009, 02:02:59 PM
Dave,

I know their Intel designation from the Intel box they came in. The 6 year old Northwood is the last true 32 bit processor; the EM64T (3.2 and 3.0 gig versions in 2 separate boxes) were both designated as Prescott. Here is one of many agreeing dumps from some of the toys I have.


Number of CPU(s) One Physical Processor / One Core / One Logical Processor / 64 bits
Vendor GenuineIntel
CPU Full Name Intel Pentium 4 HT
CPU Name Intel(R) Pentium(R) 4 CPU 3.20GHz
CPU Code Name Prescott
Technology 0.09µ
Platform Name LGA775
Type Original OEM processor
FSB Mode QDR
Platform ID 4
Microcode ID 03
Type ID 0
CPU Clock 3193.53
System Bus Clock 798.38
System Clock 199.60
Multiplier 16.00
Original Clock 3200.00
Original Bus Clock 800.00
Original System Clock 200.00
Original Multiplier 16.00
L2 Cache Speed 3193.53 MHz
L2 Cache Speed Full
CPU Family / Model / Stepping F / 4 / 9
Family Extended 00
HyperThreading 2
L1 T-Cache 12 KµOps
L1 D-Cache 16 KB
L2 Cache 1024 KB
RDMSR 00000000 00000000 10120210 00000000
MMX Yes
MMX+ No
SSE Yes
SSE2 Yes
SSE3 Yes
3DNow! No
3DNow!+ No
DualCore No
HyperThreading Yes
IA-64 No
AMD64 No
EM64T Yes
NX/XD Yes
SpeedStep No
PowerNow! No
LongHaul No
LongRun No
Architecture x86

Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 02:03:29 PM
Just got back from a re-boot - I wanted to see what my BIOS said.
I never looked at the CPU part much, other than to notice all the features were enabled.
It says I have a Pentium P4 - on the next line it says "EM64T Capable"
Dang it Hutch - I was all warm and fuzzy with a Prescott - Now I hafta go find out what the hell "EM64T Capable" means   :eek

Title: Re: Which is faster?
Post by: Neil on May 06, 2009, 02:13:22 PM
Check this out :-


http://www.mbreview.com/em64t.php
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 02:28:43 PM
Thanks Neil
Now I am thoroughly confused  :dazzled:
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 02:32:58 PM
From what I can gather, EM64T is a specification, not a processor
Let me look further.......

Yikes! - another "whack-a-mole" definition
notice the last sentence....

The final sub-mode of EM64T is 64-bit mode. As one would likely assume, 64-bit mode is utilized by 64-bit applications when they're run under a 64-bit operating system. Intel has made several key changes to the IA-32 architecture to allow for these 64-bit applications, such as adding support for 64-bit linear addressing. Linear addressing is a scheme that allows access to the entirety of memory with use of a single address, usually loaded in a register or instruction. Variations of the IA-32 architecture may not offer full 64-bit linear addressing, an example being the current 600 series Pentium 4 processors which only allow for 48-bit linear addressing.

I am now googling for "Vista-48" - lol

Unless I can run Vista-64 with limited 48-bit linear addressing, it makes no sense to me.
I have no intention of going to Windows x64. I think that OS will, in the future, be viewed much as we now view OS/2.
Truthfully, I have my hands full with 32 bits.
Intel, Microsoft, and the computer manufacturers are in a damn fast hurry to make us believe we need 64 bits.
Anyone who does not believe that they are all working together is not seeing the big picture.
They want us to scrap out our "old" 32-bit computers, OS's, and software to buy all new stuff.
Hey! Billy! Don't you have enough #$%@#$% money, as it is?
I am happy with 32-bits and, at my age, I don't foresee a need for 64-bits for myself.
I suppose if I was running some high-powered CAD or emulation software, or making CGI graphics for movies, I might want more machine.
They have to realize that "Joe and Jan Sixpack" that want to get their e-mail, edit a few pictures of the kids, and surf the net a bit
do not want, need, nor can they afford more than they already have. Especially after the big corporate CEO's, Bush/Cheney, and all the
other politicians on the planet have sucked the life out of the world economy.

Oops! Sorry for sliding off topic. It is, however, slightly germane to the discussion at hand.
We would not be in here trying to figure out how to optimize/benchmark code on all these platforms,
or just identify those platforms, if the issues I mentioned did not affect us.
We have to work harder in order to accommodate "the conspiracy".

P.S. I am an old guy who still thinks the 8088 was a powerhouse. Compared to what we had beforehand, it was.
      I do, however, think they may have reached a point of diminishing return with respect to the "average" consumers' needs.
Title: Re: Which is faster?
Post by: Mark Jones on May 06, 2009, 05:14:52 PM
Dave, the more you say, the more I like you. :bg

Quote from: dedndave on May 06, 2009, 02:32:58 PM
...Variations of the IA-32 architecture may not offer full 64-bit linear addressing, an example being the current 600 series Pentium 4 processors which only allow for 48-bit linear addressing.

I am now googling for "Vista-48" - lol

I think where some of this ambiguity comes from is that not all bits must be used in an instruction or addressing width. I.e., a 64-bit memory "pointer" may only have 48 bits actually used or implemented, which still gives a lot more addressable space than a 32-bit pointer (48 bits can address 256 TB, versus 4 GB for 32 bits) without using the full 64 bits. It is ambiguous, indeed.

Quote
Intel, Microsoft, and the computer manufacturers are in a damn fast hurry to make us believe we need 64 bits.

Well, they are just "propagating the market." If the OS and program size keeps growing exponentially, then the user will be eternally forced to upgrade hardware to keep up... ingenious, if not borderline shady business practice...

Quote
P.S. I am an old guy who still thinks the 8088 was a powerhouse. Compared to what we had beforehand, it was.
      I do, however, think they may have reached a point of diminishing return with respect to the "average" consumers' needs.

Indeed. I think a lot of us view the older hardware in such a positive light because it was so simple and manageable that it was elegant. Granted, segmented memory and interrupt tables look nasty, but the instructions and the processor itself were nice and concise. Shaving off a few bytes and clocks made amazing improvements, and there was an immediate "reward" for the time invested coding meaningful, good code. Nowadays, however, a CPU has 4MB of third-level cache and four cores with out-of-order preemptive pipelining or some other crap... making not only programming it effectively a nightmare, but the "golden tweaks" of yesteryear far less valuable or even noticeable.

In short, the hardware is adapting to software bloat, instead of vice-versa. ::)
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 05:51:01 PM
I think that is totally true.
I also think that XP, for example, is way more of an OS than is required for many day-to-day tasks.
Back in the days, we could do a DIR on a larger directory and get a feel for how fast the machine was.
If the entire list of files could be read, one by one as they went by, it was a 4.77 mhz 8088 - lol.
If the list scrolled off the screen in a flash of green light, you had a fast machine.
The operating systems in use today have so much overhead.
The machine I am typing on now is fairly decent, and I am pleased with its performance.
But, not without some tweaking and tuning of the OS.
If I were to (or even could - don't think it would work - no drivers) put DOS 3.3 on this machine, it would fly! - lol
To bring the point a little closer to home, Windows 95 was a very fast OS.
Windows 98, a little less so, but it had many desirable feature upgrades.
Those OS's would blaze on this hardware, but I doubt I could get a complete set of device drivers.
Each OS that comes out is more and more demanding on the hardware.
Where you can really see "the conspiracy", though, is in things like XP support for newer hardware.
It is slowly trickling away. The hardware manufacturers are well aware of their mortality if they continue to support XP.
They actually have to change the hardware in some way so that XP will not run correctly or completely.
This ensures that Vista (a much slower OS) will be used.
Which, in turn, assures the obsolescence of XP.
Microsoft happily bats the birdie back over the net and assures that each OS requires better, newer, faster hardware to run on.
It all comes out of our pocket, and is part of the reason that the wealthiest 2% of the population have 98% of the money.

One final note - then back to the technical stuff on benchmarking......
Notice how Vista-64 will not run real-mode code, making 16-bit code obsolete. MS/Intel/AMD did not have to do that, if they didn't want to.
At the same time, Windows XP-64 is just crippled enough to make it unwanted.
This is one example that really makes me believe that Microsoft, Intel, AMD, the computer manufacturers, and perhaps some of
the larger software companies (other than MS) get together for an annual or bi-annual "meeting" on someone's yacht to discuss how
they are going to slowly walk the consumers away from XP into something they don't really need.
I did not hear the consumers hollering for a 64-bit CPU and/or OS to begin with.
Title: Re: Which is faster?
Post by: FORTRANS on May 06, 2009, 06:54:31 PM
Quote from: dedndave on May 06, 2009, 05:51:01 PM
If I were to (or even could - don't think it would work - no drivers) put DOS 3.3 on this machine, it would fly! - lol

   You probably could boot DOS; I was running a benchmark, and
did it on my "newest" machine.  Made a nice RAM disk 'cause the
hard disks were too big.  And no network.  And no sound.  But I
got some real nice numbers from my nice little graphics program.

Cheers,

Steve N.

P.S.  I would go with DOS 6.x, the last couple of times dealing with
DOS 3.1 were not pretty.  <g>

Title: Re: Which is faster?
Post by: jj2007 on May 06, 2009, 07:15:10 PM
Quote from: Mark Jones on May 06, 2009, 05:14:52 PM

Granted, segmented memory and interrupt tables look nasty, but the instructions and processor itself were nice and concise.


LOAD_ARRP:     MOVE.W  D4,P_FE0(A6)
               MOVE.L  (A0),A0        *A0=Arrptr
               ADDQ.L  #4,A0          *A0=first descriptor
               MULU    #6,D4
               ADD.L   D4,A0          *A0=descriptor P_fe%
               MOVE.L  A0,P_FE_A(A6)  *save for backtrailer
               MOVE.L  D0,D4
               RTS


I used to write my own screen and printer drivers in 68000 assembler. At that time, the late 80's, people who were fumbling with segments were considered anachronists :bg
Just ran my old word processor on an emulator. A factor of 10 faster than on the real thing, at least.
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 07:37:45 PM
That code looks like a foreign language to most in here JJ - lol
May as well be Greek (or Italiano)
I haven't seen 68000 code since i worked at Edge Computer - mid 80's

I do not miss segmentation, much
I do miss interrupts for hardware detection, though
I haven't figured out how all that works on 32-bit yet
One step at a time...
Title: Re: Which is faster?
Post by: dedndave on May 06, 2009, 11:43:39 PM
ok, Hutch
  What you and I have (I think the same) are Prescott dual-core CPUs. Mine is a model 630.
They are capable of running Vista 64. I am not sure if we have the 48-bit or 64-bit address capabilities.
Personally, I can't imagine my needing to breach the first one, as it allows for more memory than my motherboard will allow.
One thing to note: the motherboard, as well as the CPU, must be able to support Vista 64. This is probably why determining
whether or not a particular machine is capable is a bit fuzzy. As for my own m/b, it is an Intel, which made that determination
a little easier.
  Now - here is what will excite some of us. By poking around on the Intel site, I managed to find an Intel CPU ID program.
They also have a utility to measure frequencies. Following is a link to a page with both downloads.

Intel CPU ID Program
http://support.intel.com/support/processors/sb/CS-015477.htm
Title: Re: Which is faster?
Post by: hutch-- on May 06, 2009, 11:57:05 PM
Dave,

This has been the most useful toy in the processor identification area that I have found. There are others that work just as well.

http://www.cpuid.com/cpuz.php
Title: Re: Which is faster?
Post by: dedndave on May 07, 2009, 12:23:21 AM
Here is the official AMD CPU ID program.
Personally, I would think if you want to ID an Intel CPU, who better to trust than Intel.
If you want to ID an AMD CPU, who better to trust than AMD.
The manufacturer may more completely identify a certain CPU's features.

AMD CPU ID Program
http://support.amd.com/us/Pages/dynamicDetails.aspx?ListID=c5cd2c08-1432-4756-aafa-4d9dc646342f&ItemID=132

Intel CPU ID Program
http://support.intel.com/support/processors/sb/CS-015477.htm

btw Hutch - I like that one too - it tells me a few things about my m/b I was not aware of.
Title: Re: Which is faster?
Post by: jj2007 on May 08, 2009, 08:19:03 PM
Quote from: Jimg on May 02, 2009, 04:40:15 PM
Also, in the real world, (lodsb) is slightly slower than (movzx eax,[esi]/add esi,1); however, also in the real world, and in how the code is usually used, the difference is insignificant because the instructions end up in a non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.


Jim,

Out of boredom, I just replaced a movzx eax,[esi] with a lodsb - and it runs over 20 cycles faster. It's not even a speed-critical loop, but it consistently makes a difference. See UseLodsb switch in new attachment here (http://www.masm32.com/board/index.php?topic=9370.msg84533#msg84533).

   push esi
   lea ebx, [esi+ecx-16]            ; Mainstring (only ebx is free)
   mov esi, [esp+6*4+3*4+4]      ; lpPattern
   add ebx, ebp
   if UseLodsb
      dec esi
   endif
@@:
   if UseLodsb
      lodsb                           ; ca. 25 cycles faster!
   else
      inc esi
      movzx eax, byte ptr [esi]         ; esi=Substr (mov al is a few cycles slower)
   endif
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

@@:   pop esi
Title: Re: Which is faster?
Post by: Jimg on May 09, 2009, 02:04:07 PM
Just from looking at the above post, your two possibilities don't seem equivalent.

Quote:
if UseLodsb

   add ebx, ebp
   dec esi
@@:
   lodsb                           ; ca. 25 cycles faster!
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

else

   add ebx, ebp
@@:
   inc esi
   movzx eax, byte ptr [esi]         ; esi=Substr (mov al is a few cycles slower)
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

endif

I could be missing something, but it looks like you start two bytes earlier when using lodsb?
Title: Re: Which is faster?
Post by: jj2007 on May 09, 2009, 04:42:44 PM
Quote from: Jimg on May 09, 2009, 02:04:07 PM
Just from looking at the above post, your two possibilities don't seem equivalent.
...
I could be missing something, but it looks like you start two bytes earlier when using lodsb?

You are perfectly right, Jim :red

In addition, I found a bug that shows up for certain unusual patterns, so I better pull back that code until it's correct.
Title: Re: Which is faster?
Post by: Jimg on May 10, 2009, 01:43:12 AM
Looking at your latest-
if UseLodsb
    add esi, 6
else
    add esi, 5
endif
@@:
if UseLodsb
    lodsb                           ; some cycles faster!
else
    inc esi
    movzx eax, byte ptr [esi]       ; esi=Substr (mov al is a few cycles slower)
endif
    test al, al                     ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
    je @F                           ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
    inc ebx
    cmp al, [ebx]                   ; ebx=Mainstr
    je @B

lodsb loads and then increments esi, so esi is pointing at the next character.

for your non lodsb code, you increment first then load, so esi is pointing at the current character.

therefore, the lodsb code will not end at the same esi as the non-lodsb code.

Does this matter?

If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]" so the movzx doesn't have to wait for the inc esi to be done incrementing.  Should pick up a cycle.

If it does, you'll have to decrement esi later for the lodsb code.

Yes?
Title: Re: Which is faster?
Post by: jj2007 on May 10, 2009, 06:44:16 AM
Quote from: Jimg on May 10, 2009, 01:43:12 AM
Does this matter?

If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]"


Jim,

Good point, thanks. It doesn't matter because soon after I pop esi, but it makes the code more readable.

    add ebx, ebp
    add esi, 6
@@:
if UseLodsb
    lodsb                           ; some cycles faster!
    test al, al                     ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
    je @F                           ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
else
    movzx eax, byte ptr [esi]       ; esi=Substr (mov al is a few cycles slower)
    je @F
    inc esi
endif
    inc ebx
    cmp al, [ebx]                   ; ebx=Mainstr
    je @B

@@: pop ebx
    pop esi
    jne BadLuck                     ; test al, al not needed, flags still valid
Title: Re: Which is faster?
Post by: Jimg on May 10, 2009, 02:09:15 PM
or simply
add esi, 6
@@:
if UseLodsb
    lodsb ; some cycles faster!
else
    movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
    inc esi
endif   
    test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
    je @F
    inc ebx
    cmp al, [ebx] ; ebx=Mainstr
    je @B

Title: Re: Which is faster?
Post by: jj2007 on May 10, 2009, 02:53:04 PM
Oops, that should have been

      movzx eax, byte ptr [esi]      ; esi=Substr (mov al is a few cycles slower)
      test al, al
      je @F
      inc esi

My version is about half a cycle faster because it tests for zero before the inc esi :wink
Title: Re: Which is faster?
Post by: hutch-- on May 10, 2009, 03:08:19 PM
JJ,

See if "test eax, eax" is faster than the byte test.
Title: Re: Which is faster?
Post by: jj2007 on May 10, 2009, 04:15:03 PM
Quote from: hutch-- on May 10, 2009, 03:08:19 PM
JJ,

See if "test eax, eax" is faster than the byte test.

Not on a Celeron M (code below):
664     cycles for 1000* test reg32, reg32
664     cycles for 1000* test reg8, reg8

That's roughly 0.7 cycles per test, in an algo that needs about 2500, so seeing the difference in "full algo" testing is simply impossible.
But I'll replace it, thanks - test eax after a movzx eax simply looks better.

The limits are elsewhere anyway. I have tested three variants now for the SSE2 Instring:
1. word scanner in SSE2, 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
2. byte scanner in SSE2, if match: word scanner, then 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
3. byte scanner in SSE2, 3 more bytes through cmp edi, [mem+1]: 2470 cycles on average

See the problem? :bg

No. 2+3 get really complex, so I'll stick with the word scanner posted in the laboratory (http://www.masm32.com/board/index.php?topic=9370.new#new).

Of course, No. 2+3 are blazing fast if you are searching for a pattern starting with a rare letter like X, but that is the exception...

Quote:
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm         ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
   LOOP_COUNT   = 1000000

.code
start:
   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 250
         test eax, eax
         test ebx, ebx
         test ecx, ecx
         test edx, edx
      ENDM
   counter_end
   print str$(eax), 9, "cycles for 1000* test reg32, reg32", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 250
         test al, al
         test bl, bl
         test cl, cl
         test dl, dl
      ENDM
   counter_end
   print str$(eax), 9, "cycles for 1000* test reg8, reg8", 13, 10

   inkey chr$(13, 10, "--- ok ---", 13)
   exit
end start