The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: loki_dre on April 26, 2008, 08:14:14 AM

Title: preformance
Post by: loki_dre on April 26, 2008, 08:14:14 AM
It appears that my program seems to have different performance times each time I run it with the same input file.
Sometimes it is 15ms, and sometimes it is 30ms or more.
I am not starting or stopping any other programs between runs (i.e. the system is relatively idle).

Is it normal to get different performance times like this?
Is there anything I can do to minimize this time and keep it consistent on a relatively idle PC?
      I'm new to MASM.......... is there something in my code that should be changed (some sort of common mistake made by beginners)?
      Or are there any settings in Windows that I can use to maximize performance (I used "start /high program.exe", & killed every process in the Windows Task Manager that I Possibly could)?

PS:
I Would like to avoid the use of Windows Embedded....Anyone got any tips or tricks?
Title: Re: preformance
Post by: donkey on April 26, 2008, 08:32:21 AM
Windows is a multitasking OS; it switches processes based on several factors, but mainly the priority assigned to the process. When switching processes it performs a context switch, which saves the machine state for your process and loads the machine state for another, then gives that process its time slice. When it has run through all the processes (not really, but it's easier to explain this way) it gets back to yours and continues execution. So, depending on when the context switch takes place, you can get different run times for exactly the same program with the same data. Also, since you are using a file, the hard disk may be in use or seek times may differ from run to run; for example, the indexing service may be reading the drive when the app starts for one run and be idle for another. Virtual memory can also play a part: one run of the process may have enough free memory to be completely resident while another is partially in the swap file. There could be a lot of other reasons that fall lower on the probability ladder as well...

You can set the process priority in program if you like but it is not something to be done lightly...

invoke SetPriorityClass,[hProcess], REALTIME_PRIORITY_CLASS
invoke SetThreadPriority,[hThread], THREAD_PRIORITY_TIME_CRITICAL


Donkey
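
As a rough sketch of how this might look in MASM32 syntax: priority is an attribute of the process and thread, not of individual procedures, so setting it once near the entry point should be enough. The pseudo-handles from GetCurrentProcess and GetCurrentThread stand in for hProcess and hThread here, and the milder HIGH_PRIORITY_CLASS / THREAD_PRIORITY_HIGHEST pair is a substitution chosen for this example, not what the post above uses.

```asm
include \masm32\include\masm32rt.inc

.code
start:
    ; Set the priority once at startup; it stays in effect for the
    ; whole process/thread until it is changed again.
    invoke GetCurrentProcess                  ; eax = pseudo-handle for this process
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
    invoke GetCurrentThread                   ; eax = pseudo-handle for this thread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_HIGHEST

    ; ... code to be timed goes here ...

    invoke ExitProcess, 0
end start
```

Both calls return zero in eax on failure, so a real program would probably want to check that before trusting any timings.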
Title: Re: preformance
Post by: loki_dre on April 26, 2008, 08:59:08 AM
do I have to do that for each function?????
ie.
EXAMPLE proc
     invoke SetPriorityClass,[hProcess], REALTIME_PRIORITY_CLASS
     invoke SetThreadPriority,[hThread], THREAD_PRIORITY_TIME_CRITICAL
ret
EXAMPLE endp

Or is just once after "start:" sufficient?

I run my program at the command prompt with "start /high masmIP.exe".........but it didn't really seem to have any effect on a relatively idle PC? any idea why?

Title: Re: preformance
Post by: MichaelW on April 26, 2008, 09:58:14 AM
The effective resolution of the value returned by GetTickCount is no better than 10ms. Timing a period requires two calls to GetTickCount, and since you don't know where in the timer cycle the calls occur, each timed period has an uncertainty of at least plus or minus 20ms. You can cut the uncertainty in half by synchronizing with GetTickCount before you start timing. Either way, to get meaningful times the timed period must be at least several seconds. The  High-Resolution Timer (http://msdn2.microsoft.com/en-us/library/ms644900(VS.85).aspx#high_resolution) has an effective resolution of several microseconds, so it can be used to get meaningful times for periods down to perhaps 100ms. Below that you need to loop your code to get the period up to something reasonable, and then divide the total time by the number of loops, or measure your execution times in processor clock cycles. Boosting the process/thread priority will help reduce the number of context switches that occur during the timing period, and may improve the accuracy/consistency of the results, but using REALTIME_PRIORITY_CLASS with buggy code can cause Windows to crash.

If you examine the timing methods used in the Laboratory you will basically see two schools of thought, one that favors GetTickCount with no synchronization, no priority boost, and many loops, and one that favors counting clock cycles, with a smaller number of loops and a priority boost. Within limits, either method will work.
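
A minimal sketch of the "synchronize, then loop" approach with GetTickCount, assuming MASM32 and a console build (link with /SUBSYSTEM:CONSOLE so the print macro has somewhere to write). LOOPS and the label names are hypothetical, and the loop body is a placeholder for the code being tested.

```asm
include \masm32\include\masm32rt.inc

LOOPS EQU 1000000               ; hypothetical count; raise it until the run lasts seconds

.code
start:
    ; Synchronize: spin until the tick value changes, so timing
    ; starts right at a timer tick boundary.
    invoke GetTickCount
    mov ebx, eax
  @@:
    invoke GetTickCount
    cmp eax, ebx
    je @B
    mov ebx, eax                ; ebx = start time in ms

    mov esi, LOOPS
  timed:
    ; --- code under test goes here ---
    sub esi, 1
    jnz timed

    invoke GetTickCount
    sub eax, ebx                ; eax = elapsed ms for all LOOPS passes
    print ustr$(eax), " ms total", 13, 10

    inkey "Press any key to exit..."
    exit
end start
```

Dividing the printed total by LOOPS gives the time per pass. With the placeholder body left empty the total will be close to zero, so the number only means something once real code is in the loop.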

Title: Re: preformance
Post by: hutch-- on April 26, 2008, 10:14:21 AM
loki,

Michael is correct here; different timing methods test different things and it's worth understanding what each method is useful for. If you download Michael's timer code you will find it very useful for timing short sequences of instructions, as it is designed to perform that function among others.

If you use GetTickCount you need to be aware of its limitations, and one of them is that its resolution is poor at low time intervals. To start to get reliable timings you need to set up the test with enough data to run for a quarter to half a second before the timing error falls below a couple of percentage points.

Where the timer code gets used to time small instruction sequences, which is very useful when designing the inner guts of an algorithm, the GetTickCount method, when run for half a second or so, tests real time, which is also very useful.

To further tailor this technique to what you are testing, if its an algo that handles a very large amount of data like some search or sorting algorithms, you feed it large data to get the speed of the main algorithm without taking much notice of how fast it starts or finishes. At the other end if its a very short algo where its start and finish speed is important, you tend to feed it much smaller data but with a much higher loop count.
Title: Re: preformance
Post by: loki_dre on April 26, 2008, 10:58:50 AM
thanks guys


hhmmmm....QueryPerformanceCounter requires the use of a 64-bit variable.......and as a result a 64-bit register would be useful
how can I create a 64-bit variable?
        CPU_Time1   QWORD   ?         ;<<<<<<=========is that correct?
how can I access the 64-bit registers?
Title: Re: preformance
Post by: MichaelW on April 26, 2008, 12:15:25 PM
Timers.asm, available  here (http://www.masm32.com/board/index.php?topic=770.0), includes a pair of macros that use the High-Resolution Timer. The code is a little more complex than absolutely necessary, because it attempts to eliminate the effects of the loop overhead by timing an empty reference loop and then subtracting that time from the total time, so the result reasonably represents the execution time for the code being tested.
Title: Re: preformance
Post by: donkey on April 26, 2008, 12:35:40 PM
Quote from: loki_dre on April 26, 2008, 10:58:50 AM
thanks guys


hhmmmm....QueryPerformanceCounter requires the use of a 64-bit variable.......and as a result a 64-bit register would be useful
how can I create a 64-bit variable?
        CPU_Time1   QWORD   ?         ;<<<<<<=========is that correct?
how can I access the 64-bit registers?


There are no 64 bit GP registers in a 32 bit machine, the value is split over two registers with EAX containing the low DWORD and EDX containing the high DWORD.
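
One place the EDX:EAX pairing shows up directly is MUL and DIV: a 32-bit MUL leaves its 64-bit product in EDX:EAX, and a 32-bit DIV takes its dividend from there. Below is a hedged sketch, assuming MASM32 and a console build, that uses this to convert a QueryPerformanceCounter interval to milliseconds. The variable names are made up, Sleep stands in for the code being timed, and the arithmetic assumes both the elapsed count and the counter frequency fit in 32 bits.

```asm
include \masm32\include\masm32rt.inc

.data?
    freq    QWORD ?             ; counts per second (hypothetical names)
    t0      QWORD ?
    t1      QWORD ?

.code
start:
    invoke QueryPerformanceFrequency, ADDR freq
    invoke QueryPerformanceCounter, ADDR t0
    invoke Sleep, 250                   ; stand-in for the code being timed
    invoke QueryPerformanceCounter, ADDR t1

    mov eax, DWORD PTR [t1]
    sub eax, DWORD PTR [t0]             ; elapsed counts, low DWORD only (short interval assumed)
    mov ecx, 1000
    mul ecx                             ; EDX:EAX = elapsed * 1000 (64-bit product)
    div DWORD PTR [freq]                ; EAX = elapsed ms (assumes freq fits in 32 bits)
    print ustr$(eax), " ms", 13, 10

    inkey "Press any key to exit..."
    exit
end start
```

For intervals long enough that the high DWORDs matter, the full 64-bit difference would have to be carried through instead of just the low half.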
Title: Re: preformance
Post by: loki_dre on April 26, 2008, 02:29:18 PM
hmmmm.....I found an alternate answer on another post & am now getting more consistent results:
    mov eax, DWORD PTR [QW1+0]   ; Low DWORD of QW1
    mov edx, DWORD PTR [QW1+4]   ; High DWORD of QW1
    sub eax, DWORD PTR [QW2+0]   ; Low DWORD of QW2
    sbb edx, DWORD PTR [QW2+4]   ; High DWORD of QW2

but doesn't MMX mean the processor has a couple 64-bit registers?


I also noticed that my program seems to run faster on the first loop immediately after I compile it...........anyone know why this is?
Title: Re: preformance
Post by: donkey on April 26, 2008, 07:35:36 PM
GP stands for General Purpose. The MMX registers are not used to return values from the API.
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 02:50:04 PM
any thoughts on why I get better performance if I compile first and run immediately after?
I created a bat file called compile with the following code to compile & run:
cls
del masmIP.exe
del masmIP.obj
\masm32\bin\ml /c /Zd /coff masmIP.asm
\masm32\bin\Link /SUBSYSTEM:WINDOWS masmIP.obj
start /high masmIP.exe

Title: Re: preformance
Post by: MichaelW on April 27, 2008, 03:41:40 PM
I can't see any reason why code would execute faster immediately after it was compiled. The program would load faster if it were in the disk cache, and immediately after the exe was compiled and linked it would be in the cache. I think the main reason is likely to be the command line:

start /high masmIP.exe

Which starts the program in high priority class. Depending on the program, this could cause a significant increase in performance.
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 05:32:47 PM
when you run a small program is it loaded into memory (RAM) and then run.....or would it read instructions off the hard drive?

is there anyway I could load my program into memory(RAM) and then run it?.....assuming it is not done automatically
Title: Re: preformance
Post by: donkey on April 27, 2008, 06:02:10 PM
Quote from: loki_dre on April 27, 2008, 05:32:47 PM
when you run a small program is it loaded into memory (RAM) and then run.....or would it read instructions off the hard drive?

is there anyway I could load my program into memory(RAM) and then run it?.....assuming it is not done automatically


All programs are run from memory. I think you should read some Randall Hyde (http://webster.cs.ucr.edu/) about now; you seem to have a complete lack of knowledge about computer architecture and the very basics of how computers work. Though we are more than happy to help you, this is not a classroom and you should maybe try to research a few things yourself. You should definitely not be playing with priority classes without at least an idea of how pre-emptive multitasking operates and the consequences of modifying a process's priority.
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 08:22:02 PM

it seems that the following command:
fn MessageBox,0,str$(eax),str$(ebx),MB_OK

was the problem..........since it delays (waits for user input)........windows must have removed it from memory/cache.

Title: Re: preformance
Post by: loki_dre on April 27, 2008, 08:25:20 PM
fn MessageBox,0,str$(eax),str$(ebx),MB_OK    <<<<<<<<<========was reporting delay time
Title: Re: preformance
Post by: donkey on April 27, 2008, 09:28:40 PM
Quote from: loki_dre on April 27, 2008, 08:25:20 PM
fn MessageBox,0,str$(eax),str$(ebx),MB_OK    <<<<<<<<<========was reporting delay time


The function pauses for user input, a time delay there is meaningless.
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 09:39:29 PM
apparently not,
since the program is very small 4Kb
looping with that message box to report the delay will cause windows to re-allocate memory if the user waits too long to press ok.
I have to press (enter/space) very fast to get the message box to report lower performance times.
in my case the difference was a program cycle of 40 times per a second to 20 times per a second.

Title: Re: preformance
Post by: loki_dre on April 27, 2008, 09:46:31 PM
btw .....if windows copies & runs all programs from HD to memory why don't you have the option to delete a program from the harddrive while it is running....you can do it with editing text files etc.....
Title: Re: preformance
Post by: donkey on April 27, 2008, 10:11:51 PM
I have no idea what you're talking about here. Are you saying that Windows will copy your program to the page file if you wait too long to respond to a modal dialog (a message box is a modal dialog)? If that's the case, it's not true; copying to and from the page file is something Windows does when it runs low on physical memory. So if your program is being moved to the page file then you are running low on memory; it has nothing to do with any delay in responding to a message box. Here's an explanation of the page file process...

Quote from: PC911 © Copyright 1998-2008. All rights reserved
To execute a program in Windows, it first needs to be loaded into memory (RAM). Windows lets you run multiple programs simultaneously and chances are that they won't all fit into memory at the same time. For that purpose, Windows uses what is called Virtual Memory to simulate RAM, pretending it has more memory than what is actually built into the PC. It does this by moving data from real memory to a special file on the hard drive, called the swap file in Windows 95/98 or page file in Windows NT. This, in effect, allows Windows to address more memory than the amount of physical RAM installed. Without it, we would not be able to run windows on machines with limited RAM. For example, think back to when Windows 95 first came out, the average computer had 8 to 16 Mb of Ram. It would not have been possible to run Win95 and applications without using virtual memory. Program code and data are moved in pages (memory allocated in 4K or 16K segments within a 64K page frame) from physical memory to the swap file. As the information is needed by a process, it is paged back into physical memory on demand and, if necessary, windows may page other code or data to the swap file in its place.
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 10:26:45 PM
i could not tell you what it does exactly.....I don't work for Microsoft and their source code is not released to the public

but I can tell you looping with
        invoke write_disk_file, addr performanceFileName, str$(eax), 3
Or looping with
        fn MessageBox,0,str$(eax),str$(eax),MB_OK
        & pressing OK rapidly
gives me a smaller number than looping with
        fn MessageBox,0,str$(eax),str$(eax),MB_OK
        and pressing OK every second or so...

Title: Re: preformance
Post by: donkey on April 27, 2008, 10:32:36 PM
Hi,

I give up......
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 11:58:14 PM

include \masm32\include\masm32rt.inc

LOOP_VAL             EQU     9999*3;
;CONSTANTS
.data
        performanceFileName   db      "performance.txt", 0       
.data? 
        ;PERFORMANCE TIME VARIABLES
        CPU_Time    DWORD   ?
        CPU_Time1   QWORD   ?
        CPU_Time2   QWORD   ?
        CPU_Time3   QWORD   ?
.const
.code
start:
               
_MainLoop:
        ;invoke GetTickCount ; Milliseconds since system start (max of 49.7days)
        ;mov CPU_Time,eax
        ;invoke QueryPerformanceFrequency,XXXX      ;<<<<<<=======3,579,545cycle/sec
        invoke QueryPerformanceCounter, addr CPU_Time1
        mov eax, DWORD PTR [CPU_Time1+0]   ; Low DWORD of QW1
        mov edx, DWORD PTR [CPU_Time1+4]   ; High DWORD of QW1
        mov DWORD PTR [CPU_Time3+0], eax   ; Low DWORD of QW1
        mov DWORD PTR [CPU_Time3+4], edx   ; High DWORD of QW1


            ;ONLY DIALATE IN ONE DIRECTION (BACKWARDS)
            Repeat LOOP_VAL
                mov eax,100
                mov edx,1
                mov ebx,10000
                div ebx
            endM
           
        ;DONE PROCESSING
        invoke QueryPerformanceCounter, addr CPU_Time2
        invoke QueryPerformanceFrequency,addr CPU_Time1
    ; Subtract QWORDS (QW1 - QW2 = QW3)
    mov eax, DWORD PTR [CPU_Time2+0]   ; Low DWORD of QW1
    mov edx, DWORD PTR [CPU_Time2+4]   ; High DWORD of QW1
    sub eax, DWORD PTR [CPU_Time3+0]   ; Low DWORD of QW2
    sbb edx, DWORD PTR [CPU_Time3+4]   ; High DWORD of QW2
        mov ebx, eax
        mov ecx, edx
        mov eax, DWORD PTR [CPU_Time1+0]
        mov edx, DWORD PTR [CPU_Time1+4]
        div ebx
        mov ebx,ecx
        fn MessageBox,0,str$(eax),str$(ebx),MB_OK
        ; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««


jmp _MainLoop

end start
Title: Re: preformance
Post by: loki_dre on April 27, 2008, 11:59:59 PM
With the above code I get approx 600 program cycles per second if I press enter slow & 1200 if I press it rapidly
Title: Re: preformance
Post by: MichaelW on April 28, 2008, 10:06:28 AM
I'm having problems understanding the purpose of your code. The point of timing code is normally to compare the execution times between algorithms and/or implementations, to determine which executes faster. For this purpose, execution times that include a highly variable user response time are of little use. For your code the measured time is almost entirely the user response time. On my relatively slow system the REPEAT loop actually executes in around 2.4ms, and the shortest possible user response time is many times this. This example compares the execution time for two substantially different implementations of the Sieve of Eratosthenes algorithm.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      pcFreq  dq 0
      pcCount dq 0
      msCount dd 0
      total1  dd 0
      total2  dd 0
      pMem    dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

; -------------------------------------------------
; This code is a somewhat optimized implementation
; of the Sieve of Eratosthenes algorithm.
; -------------------------------------------------

Sieve proc uses ebx esi pFlags:DWORD, nFlags:DWORD
    mov esi, pFlags
    fild nFlags
    fsqrt
    push ebx
    fistp DWORD PTR [esp]
    pop ebx
    mov ecx, 1
  outer:
    add ecx, 1
    cmp ecx, ebx
    ja  finished
    cmp BYTE PTR [esi+ecx], 0
    jne outer
    mov edx, ecx
    shl edx, 1
  inner:
    mov BYTE PTR [esi+edx], 1
    add edx, ecx
    cmp edx, nFlags
    jna inner
    jmp outer
  finished:
    ret
Sieve endp

; ------------------------------------------------------
; This code is an adaption of a Microsoft MASM example.
; ------------------------------------------------------

Sieve_ms proc uses ebx p:DWORD, sz:DWORD
    mov edx, p
    push 2
    pop eax
  iloop:
    mov ecx, eax
    shl ecx, 1
  jloop:
    mov ebx, sz
    cmp ecx, ebx
    ja @F
    mov BYTE PTR [edx+ecx], 1
    add ecx, eax
    jmp jloop
  @@:
    inc eax
    shr ebx, 1
    cmp eax, ebx
    jb iloop
    ret
Sieve_ms ENDP

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    N EQU 15485863

    invoke Sleep, 3000

    REPEAT 4

      ; ------------------------------------------------
      ; Allocate an array of byte flags large enough to
      ; represent the first million primes. The array
      ; should be zeroed before each test, and an easy
      ; way to do this is just to free the array and
      ; reallocate it.
      ; ------------------------------------------------

      mov pMem, alloc( N )

      ; ------------------------------------------
      ; Flag the first million primes, timing the
      ; process with GetTickCount.
      ; ------------------------------------------

      invoke GetTickCount
      push eax
      invoke Sieve, pMem, N
      invoke GetTickCount
      pop ebx
      sub eax, ebx
      add total1, eax
      print ustr$(eax), "ms", 9

      ; ------------------------------------------
      ; Flag the first million primes, timing the
      ; process with the High-Resolution Timer.
      ; ------------------------------------------

      invoke QueryPerformanceFrequency, ADDR pcFreq
      invoke QueryPerformanceCounter, ADDR pcCount
      push DWORD PTR pcCount+4
      push DWORD PTR pcCount
      invoke Sieve, pMem, N
      invoke QueryPerformanceCounter, ADDR pcCount
      pop ecx
      sub DWORD PTR pcCount, ecx
      pop ecx
      sbb DWORD PTR pcCount+4, ecx

      fild pcCount
      fild pcFreq
      fdiv                    ; pcCount / pcFreq = seconds
      mov  msCount, 1000
      fild msCount
      fmul                    ; seconds * 1000 = milliseconds
      fistp msCount
      mov eax, msCount
      add total2, eax
      print ustr$(eax), "ms", 13, 10

      free( pMem )

    ENDM

    print "GetTickCount average "
    shr total1, 2
    print ustr$(total1), "ms", 13, 10

    print "High-Resolution Timer average "
    shr total2, 2
    print ustr$(total2), "ms", 13, 10, 13, 10

    ; pMem was already freed at the end of the last REPEAT pass,
    ; so only a fresh allocation is needed here.
    mov pMem, alloc( N )

    ; ---------------------------------------
    ; The Microsoft code is very slow, so to
    ; save time do the timing only once.
    ; ---------------------------------------

    ; ------------------------------------------
    ; Flag the first million primes, timing the
    ; process with GetTickCount.
    ; ------------------------------------------

    invoke GetTickCount
    push eax
    invoke Sieve_ms, pMem, N
    invoke GetTickCount
    pop ebx
    sub eax, ebx
    print ustr$(eax), "ms", 9

    ; ------------------------------------------
    ; Flag the first million primes, timing the
    ; process with the High-Resolution Timer.
    ; ------------------------------------------

    invoke QueryPerformanceFrequency, ADDR pcFreq
    invoke QueryPerformanceCounter, ADDR pcCount
    push DWORD PTR pcCount+4
    push DWORD PTR pcCount
    invoke Sieve_ms, pMem, N
    invoke QueryPerformanceCounter, ADDR pcCount
    pop ecx
    sub DWORD PTR pcCount, ecx
    pop ecx
    sbb DWORD PTR pcCount+4, ecx

    fild pcCount
    fild pcFreq
    fdiv                    ; pcCount / pcFreq = seconds
    mov  msCount, 1000
    fild msCount
    fmul                    ; seconds * 1000 = milliseconds
    fistp msCount

    print ustr$(msCount), "ms", 13, 10

    free( pMem )

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Typical results on my P3:

3075ms  3083ms
3065ms  3064ms
3095ms  3068ms
3084ms  3060ms
GetTickCount average 3079ms
High-Resolution Timer average 3079ms

28651ms 28564ms

Title: Re: preformance
Post by: Tedd on April 28, 2008, 11:29:57 AM
Loki's code appears to be doing the following:

@@:
  t1 = perfCount()
  t3 = t1
  test_code()
  t2 = perfCount()
  t1 = perfFreq()
  diff = t2 - t3
  Msgbox( str(t1/LOW_DWORD(diff)), str(HIGH_DWORD(diff)) )
  jmp @B

The measured time isn't including the delay caused by response to the messagebox, directly.

So, why should creating a messagebox cause a slow down? Simple - creating any kind of dialog means loading various dlls into the process memory space, and initialising structures, etc.. This is particularly noticeable for the first call to any of the common dialogs or controls. But wait, that comes after the measure, so it shouldn't be included. True, but once the dialog is destroyed, it's cleaned up - that goes on in the background. So the next loop measures a longer time because the OS is trying to do other things at the same time. By responding fast enough, it may be that things are still in cache/buffers, so their loading is faster next time around, reducing the delays, etc.
The messagebox itself doesn't change the run time of the code. However, the method you're using to measure the time assumes your code is the only thing running - at all. This isn't too bad for a short time period, but the OS tries to multi-task, and that means it will do other things and they will interrupt your timing. By creating and destroying dialogs, you're forcing it to do other things, and thus messing up your own timing. If you want to measure code accurately, do as little else as possible.


Quote
btw .....if windows copies & runs all programs from HD to memory why don't you have the option to delete a program from the harddrive while it is running....you can do it with editing text files etc.....

Only the required sections are copied directly into memory. Small programs will usually fit their whole working set in memory in one go, but larger programs won't. The exe is kept locked (meaning you can't delete it) in case other parts/sections need to be loaded from it.
When editing text files, it depends on the editor - notepad loads the whole file into memory and no longer needs the file, but it was never really meant for editing large files. Not all editors do that.
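
To illustrate "do as little else as possible": one option is to keep all UI and file output out of the measurement loop, store each reading in a small array, and report only after the last measurement. This is a sketch under assumptions (MASM32, console build; RUNS, results and the label names are invented for the example), not a drop-in replacement for the code posted earlier.

```asm
include \masm32\include\masm32rt.inc

RUNS EQU 10                       ; hypothetical number of measurements

.data?
    tStart  QWORD ?
    tEnd    QWORD ?
    results DWORD RUNS dup (?)    ; elapsed counter ticks for each run

.code
start:
    xor esi, esi                  ; esi = run index
  measure:
    invoke QueryPerformanceCounter, ADDR tStart
    ; --- code under test goes here ---
    invoke QueryPerformanceCounter, ADDR tEnd
    mov eax, DWORD PTR [tEnd]
    sub eax, DWORD PTR [tStart]   ; low DWORD only; assumes a short interval
    mov results[esi*4], eax       ; just store it, no output inside the loop
    add esi, 1
    cmp esi, RUNS
    jb measure

    ; Report only after every measurement has been taken.
    xor esi, esi
  report:
    mov eax, results[esi*4]
    print ustr$(eax), " counts", 13, 10
    add esi, 1
    cmp esi, RUNS
    jb report

    inkey "Press any key to exit..."
    exit
end start
```

The first reading or two may still come out higher than the rest while code and data are being pulled into the caches, so it can be worth discarding them before averaging.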
Title: Re: preformance
Post by: loki_dre on April 29, 2008, 09:08:13 PM
Thanx guys that helps clear things up

Quote
I'm having problems understanding the purpose of your code. The point of timing code is normally to compare the execution times between algorithms and/or implementations

I'm basically writing code to do image processing on bmp files (detect objects in a picture)....the purpose of measuring my performance right now is to determine the average/max number of frames per second (FPS) that I can process.  My target is 60FPS (typical screen refresh rate).....the code still needs some more work right now,  & I probably have some HW & SW changes to make.....
Most high speed/high resolution cameras you can buy & interface to easily have better support for windows than other OS's so I'm developing it on Windows XP...

I've been looking at buying a Quad-Core PC right now but i'm not quite sure if it will have a significant increase in performance..........I mostly have for loops with some math...setx, & jxx.  And I am typically looking at surrounding pixels etc.
If anyone has any HW or SW tips it would be greatly appreciated....