News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM

Previous topic - Next topic

jj2007

Quote from: Jimg on May 02, 2009, 02:34:25 PM
In normal code, you are calling this proc or that proc and none of them are in the cache.  To test instruction timing, you have to test code that is not in the cache, just the way is is executed most of the time.  Looping a million times is just silly.

You might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course :bg

dedndave

A method that might make us both happy would be to set up a dynamic threshold.
Make measurements (some minimum number) and keep track of minima and maxima along the way.
Calculate a threshold at some arbitrary level, say 0.15 x (max-min) + min.
Once you have acquired 20 data points below that theshold, stop the measurement and calculate the average of those points.
This method would be very repeatable.

Jimg

Sound good.  Have to try it out to see.  The key point, is that you are timing each loop, not a million loops and dividing the result by a million.

QuoteYou might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes!  :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy

jj2007

Quote from: Jimg on May 02, 2009, 07:31:00 PM
Sound good.  Have to try it out to see.  The key point, is that you are timing each loop, not a million loops and dividing the result by a million.

QuoteYou might want to see the Timings and Cache thread. By the way, "Looping a million times is just silly" means there are lots of silly people in this forum. Me included, of course BigGrin
Yes!  :P and me too!
I've been doing these games for ten years now, which is why I've come to the conclusions I have. :toothy

That sounds promising, but maybe I have not fully understood what you want - my apologies. How often would you time, for example, a BitBlt loop for a 1920*1600 screen? I mean: Not just choosing the fastest available Microsoft API, but rather rewrite the code that seems to be the bottleneck. You must have gathered some specific experience in this thread.

Jimg

QuoteHow often would you time, for example, a BitBlt loop for a 1920*1600 screen?
Strangely enough, just yesterday.  Trying to find how to decrease the time it takes to generate a particular screen, I found that 1/3 of the time was taken up by clearing the background.  So I tested-
    tryclr = 1
    starttimetest 7
    if tryclr eq 1
        inv PatBlt,pi.td,0,0,mwidth,mheight,WHITENESS
    else
        mov edi,dibh1.bits
        mov ecx,mwidth
        imul ecx,mheight
        mov eax,0ffffffh
        rep stosd
    endif
    endtimetest 7

it turns out the first takes around 15300 clicks, and the second around 15150 clicks.  So even though it's faster, it's not worth the effort.
That would be 150/45000 = .003%


This is the code I use for testing various sections-
    .data?
    align 8
        strtime dq ?,?,?,?,?,?,?
        endtime dq ?,?,?,?,?,?,?
        elapsed0 dd ?
        elapsed1 dd ?
        elapsed2 dd ?
        elapsed3 dd ?
        elapsed4 dd ?
        elapsed5 dd ?
        elapsed6 dd ?
        elapsed7 dd ?
    .code
   
    starttimetest macro testnum
        if DoDebug
            inv QueryPerformanceCounter,addr [strtime + testnum*8]
        endif
    endm
   
    endoftest proc testnum
        push esi
        mov esi,testnum
        inv QueryPerformanceCounter,addr [endtime+esi*8]
        finit
        fild qword ptr [endtime+esi*8]
        fild qword ptr [strtime+esi*8]
        fsub
        fist dword ptr [elapsed0+esi*4]
        pop esi
    ret
    endoftest endp   
   
    endtimetest macro testnum
        if DoDebug
            inv endoftest,testnum
        endif
    endm

.
.
.
    if DoDebug
        printxa "  qdt=",dd elapsed1,32,dd elapsed0,32,dd elapsed6,32,dd elapsed7
    endif
    inv SetWindowText,hWin,mbuff


It may not be as precise as doing a cpuid and rdtsc, but in the real world, it more than sufficient.

jj2007

Quote from: Jimg on May 02, 2009, 08:51:38 PM
it turns out the first takes around 15300 clicks, and the second around 15150 clicks.  So even though it's faster, it's not worth the effort.
That would be 150/45000 = .003%

Sure, rep stosd is one of the exceptions where there is nothing to optimise, as you know :bg

Mark Jones

I recall from the many many pages of the code timing thread, a general convergence towards the fastest loop timing also being ideal. Does not Petezold's PROCTIMERS use the lowest cycle count of 10,000 iterations? (Much less iterations, and the timing results are rock-steady.)

In my experience, Petezold's timer is very good.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on May 02, 2009, 11:09:51 PM
I recall from the many many pages of the code timing thread, a general convergence towards the fastest loop timing also being ideal.

In my spare time, I am still trying to improve the Instr algo, and I stumble all the time over outliers as the one marked below:
QuoteIntel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Timings:
(_imp__strstr=crt_strstr, InstrCi=my non-SSE version, InString=Masm32 library; TestSub?=Easy, Difficult, Xtreme)
7345    _imp__strstr, addr Mainstr, addr TestSubE
8551    _imp__strstr, addr Mainstr, addr TestSubD
12789   _imp__strstr, addr Mainstr, addr TestSubX
8151    InstrCi, 1, addr Mainstr, addr TestSubE, 0
8314    InstrCi, 1, addr Mainstr, addr TestSubD, 0
10836   InstrCi, 1, addr Mainstr, addr TestSubX, 0
7458    InString, 1, addr Mainstr, addr TestSubE
9767    InString, 1, addr Mainstr, addr TestSubD
13001   InString, 1, addr Mainstr, addr TestSubX

1866    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1870    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1868    InstrJJ, 1, addr Mainstr, addr TestSubE, 0

1881    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1884    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1882    InstrJJ, 1, addr Mainstr, addr TestSubD, 0

3814    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3727    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
3830    InstrJJ, 1, addr Mainstr, addr TestSubX, 0

Average cycle count:
2513     InstrJJ
10075    MasmLib InstringL
InstrJJ : InstringL = 24 %
Code size InstrJJ=366

I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider to eliminate both fast and slow outliers, but it would require some overhead in \masm32\macros\timers.asm - such as a 100 loops to calculate the expected average before starting the main exercise...?

hutch--

 :bg

Now do you understand why I only ever test in real time ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on May 03, 2009, 06:20:32 AM
:bg

Now do you understand why I only ever test in real time ?

Hutch, you want to provoke me to write "Now I understand why your code is so slow". But nope, I will not let you provoke me, and I will not write such nasty things about you!!!

:bg

MichaelW

JJ,

The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts. The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count. The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur. Here are the typical results of the code from the attachment running on my P3:

441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271
441     271


Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.


[attachment deleted by admin]
eschew obfuscation

hutch--

 :bg

You should know by now that these things leave me sitting up at night, wringing my hands inbetween wiping the tearstains from my face while losing sleep about it. Further I have dismally failed to write the world's fastest MessageBoxA() after 20 years of trying and to cap it off I still can't get SSE4.5 to run on a 486. Such may be the case with matter of such great importance but when it comes to timing an algo I have done it the right way for many years, design the test/timing method to fit the task then make as big as you can fit in memory and bash it long enough to reduce the variations to below 1%. Intel spec 3% but true fanaticism requires better.  :bdg
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

The mad thievish gipsy use part of my strlen code from here  to produce lame slow code and  is shameless to post it everywhere...  :bdg
No offend, but who uses .if  .elseif .else or preserve ecx and edx in  speed critical algos? IMO idiots in assembly and it is the result:   :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ:                         38 ; clocks: 99
InString - Lingo:                      37 ; clocks: 39

Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ:                         1008 ; clocks: 22567
InString - Lingo:                      1007 ; clocks: 6294

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 712
InString - Lingo:                      1008 ; clocks: 502

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 6600
InString - Lingo:                      1008 ; clocks: 1418

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 5426
InString - Lingo:                      1008 ; clocks: 1308

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 629
InString - Lingo:                      1008 ; clocks: 498

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ:                         1010 ; clocks: 625
InString - Lingo:                      1009 ; clocks: 502

Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ:                         0 ; clocks: 781
InString - Lingo:                      1001 ; clocks: 102

Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ:                         1002 ; clocks: 611
InString - Lingo:                      1001 ; clocks: 512

Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ:                         1002 ; clocks: 625
InString - Lingo:                      1001 ; clocks: 435

Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ:                         1002 ; clocks: 635
InString - Lingo:                      1001 ; clocks: 496

Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ:                         1002 ; clocks: 795
InString - Lingo:                      1001 ; clocks: 638

Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ:                         1127625 ; clocks: 679836
InString - Lingo:                      1127624 ; clocks: 543385

Press ENTER to exit...



jj2007


lingo

Hutch, is it possible to add an icon that says "middle finger"
something like this ?  :lol