GetTickCount from Win32 API

Started by cork, July 10, 2010, 12:39:05 PM

cork

I'm just getting my feet wet with x86 assembly. Previously I've done some Win32 programming using C and the Win32 API, so I have a little understanding.

Using MASM32 I do this:
     invoke  GetTickCount

The GetTickCount call returns a DWORD. My question is: where does this DWORD get stored? In a register? On the stack? If it's passed back in a 32-bit register, which one, and is that register always used for the return value from a Win32 API call?

Oh, and I'm using 64-bit Windows Vista, and assembling using MASM32.

jj2007

It gets returned in eax - always. Here are some tips for register usage etc.
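
For example, a minimal sketch of catching the return value (dwTicks is just an illustrative name, not part of the API):

.data?
dwTicks dd ?                            ; DWORD to receive the return value

.code
        invoke  GetTickCount            ; milliseconds since Windows was started
        mov     dwTicks, eax            ; the DWORD comes back in eax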

Welcome to the Forum :thumbu

dedndave

Quote from: jj2007
It gets returned in eax - always.
unless it is RAX   :P

welcome to the forum   :U

jj2007

Quote from: dedndave on July 10, 2010, 03:54:12 PM
Unless it is RAX   :P
hmmm... sure that works with invoke?
:wink

dedndave

pretty sure it won't   :bg
but, he said he's running 64-bit vista
i figured he was talking 64-bit
i see, now, the question was about win32 api

cork

Okay, thanks for the info about EAX. Now, I have a follow on question, if you don't mind - again related to the Win32 API.

I am using the following two lines of code in MASM32:
   qwPerformanceCount  dq   0  ; 64-bits (SIGNED) (LARGE_INTEGER union)
   invoke QueryPerformanceCounter, ADDR qwPerformanceCount   

The type of the qwPerformanceCount parameter (LARGE_INTEGER) is defined in Windows.h as:

typedef union _LARGE_INTEGER {
  struct {
    DWORD LowPart;
    LONG  HighPart;
  } ;
  struct {
    DWORD LowPart;
    LONG  HighPart;
  } u;
  LONGLONG QuadPart;
} LARGE_INTEGER, *PLARGE_INTEGER;

Is the value in qwPerformanceCount 8 consecutive bytes representing a 64-bit integer, stored in little-endian order? Or do I have to access the most-significant 32 bits and the least-significant 32 bits separately?

Because the structure is defined with LowPart first and then HighPart, it looks like it is already in little-endian order, so I can just treat it as a little-endian 64-bit integer...

Sorry if I'm a little long-winded in my description; better safe than sorry. I'm assembling with MASM32 to make a 32-bit EXE (I think!), and the platform I have MASM32 installed on is 64-bit Windows Vista.

dedndave

in 32-bit code, there is no way to directly access the 64-bit integer in one instruction
if you look in the masm32\include\windows.inc file...
LARGE_INTEGER UNION
    STRUCT
      LowPart  DWORD ?
      HighPart DWORD ?
    ENDS
  QuadPart QWORD ?
LARGE_INTEGER ENDS

if you want to pass the address of the value, you can refer to LowPart or QuadPart, or the structure name
if you want to grab the values...

.data?
PerfCnt LARGE_INTEGER <>
.
.
.
.code
        INVOKE  QueryPerformanceCounter,offset PerfCnt
        mov     eax,PerfCnt.LowPart
        mov     edx,PerfCnt.HighPart

another approach is to use the stack...
        sub     esp,8
        INVOKE  QueryPerformanceCounter,esp
        pop     eax
        pop     edx

or
        push    edx
        push    eax
        INVOKE  QueryPerformanceCounter,esp
        pop     eax
        pop     edx


EDIT - oops - replaced GetTickCount with QueryPerformanceCounter - thanks Jochen   :P

Antariy

Quote
I am using the following two lines of code in MASM32:
   qwPerformanceCount  dq   0  ; 64-bits (SIGNED) (LARGE_INTEGER union)
   invoke QueryPerformanceCounter, ADDR qwPerformanceCount   

..................skipped....................

Is the value in qwPerformanceCount 8 consecutive bytes representing a 64-bit integer, stored in little-endian order? Or do I have to access the most-significant 32 bits and the least-significant 32 bits separately?


Yes, you can access all of the bits.


push eax ; push eax twice to reserve 8 bytes on the stack
push eax ;   (this tends to be friendlier to the CPU than sub esp,8)
push esp ; pass the address of that 8-byte slot
         ; (or: "lea edx,[esp-8] / push eax / push eax / push edx")
call QueryPerformanceCounter
pop eax  ; low DWORD of the counter
pop edx  ; high DWORD of the counter


So now the 64-bit integer is in edx:eax: edx holds the high DWORD of the QWORD and eax holds the low DWORD.
Bit 31 of edx is bit 63 of the QWORD, bit 30 of edx is bit 62, bit 0 of edx is bit 32; bit 31 of eax is bit 31 of the QWORD, bit 30 of eax is bit 30, and so on.

That is just the normal little-endian layout, nothing more.
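
For example, a rough sketch of subtracting two readings in 32-bit code (Count1 and Count2 are just placeholder names, assumed to be declared as LARGE_INTEGER):

        mov     eax, Count2.LowPart
        mov     edx, Count2.HighPart
        sub     eax, Count1.LowPart     ; subtract the low DWORDs
        sbb     edx, Count1.HighPart    ; propagate the borrow into the high DWORDs
        ; edx:eax now holds the 64-bit difference Count2 - Count1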

Also, using the "rdtsc" instruction can be better than using "QueryPerformanceCounter", because... it is better :)


rdtsc
mov dword ptr [qwPerformanceCount],eax
mov dword ptr [qwPerformanceCount+4],edx


Now "qwPerformanceCount" holds the CPU's tick counter.
Don't forget to set the ".586" directive in the source file...

Rockoon

Don't forget that the x87 FPU works with 80-bit reals that just happen to have 64-bit mantissas, so you can work with 64-bit integers without dropping to MMX, SSEx, or 64-bit mode.
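
For example, a rough sketch of taking the difference of two 64-bit counters entirely on the FPU (qwCount1, qwCount2 and qwDelta are just placeholder QWORD variables):

.data?
qwCount1 dq ?
qwCount2 dq ?
qwDelta  dq ?

.code
        fild    qwCount2                ; exact - the 64-bit mantissa holds any 64-bit integer
        fild    qwCount1
        fsubp   st(1), st               ; st0 = qwCount2 - qwCount1
        fistp   qwDelta                 ; store the result back as a 64-bit integer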
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

cork

Antariy - I've been looking into rdtsc, but it requires a little more advanced thought than using QueryPerformanceCounter(). Just using QueryPerformanceCounter() has been a little adventure....

I have to lock the thread onto just 1 core, then give the thread a high priority. This is what I have so far (error-handling removed for ease of reading). Using MASM32:

    hProcess               dd  0
    hThread                dd  0
    dwProcessAffinityMask  dd  0
    dwSystemAffinityMask   dd  0
    dwThreadAffinityMask   dd  00000001h  ; set value to 1, 2, 4, 8 (1 bit for each processor)
    qwPerformanceFrequency dq  0  ; 64-bit signed integer value.
    qwPerformanceCount1    dq  0   ; 64-bit signed integer value.
    qwPerformanceCount2    dq  0   ; 64-bit signed integer value.
   
    invoke GetCurrentThread
    mov hThread, EAX
    invoke GetCurrentProcess
    mov hProcess, EAX
    invoke GetProcessAffinityMask, hProcess, ADDR dwProcessAffinityMask, ADDR dwSystemAffinityMask
    invoke SetThreadAffinityMask, hThread, dwThreadAffinityMask
    invoke SetPriorityClass, hProcess, 128  ; HIGH_PRIORITY_CLASS = 0x00000080 (128D)
    invoke SetThreadPriority, hThread, 2  ; THREAD_PRIORITY_HIGHEST=2D
    invoke QueryPerformanceFrequency, ADDR qwPerformanceFrequency
    invoke QueryPerformanceCounter, ADDR qwPerformanceCount1
      ;
      ;  Put code to be timed here. Do many loops so I can average.
      ;
    invoke QueryPerformanceCounter, ADDR qwPerformanceCount2
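
Using the FPU idea Rockoon mentioned, I think the elapsed time in seconds would then come from the difference of the two counts divided by the frequency; roughly like this (dblSeconds is just a placeholder REAL8 I'd add to the data section):

    dblSeconds  real8  0.0          ; elapsed time in seconds (placeholder)

    fild    qwPerformanceCount2     ; st0 = second reading
    fild    qwPerformanceCount1     ; st0 = first reading, st1 = second reading
    fsubp   st(1), st               ; st0 = count2 - count1
    fild    qwPerformanceFrequency  ; st0 = counts per second, st1 = difference
    fdivp   st(1), st               ; st0 = difference / frequency = seconds
    fstp    dblSeconds              ; store as a REAL8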

In order to use rdtsc, it seems like you have to lock the thread to 1 core also, but there are other issues I'm learning about.

I've been reading an old document on rdtsc: http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

The first issue is to execute a serializing instruction immediately before rdtsc, so that rdtsc isn't executed out of order. For instance:
    cpuid   ; forces all previous instructions to complete, according to Intel
    rdtsc   ; read time stamp counter into edx:eax
      ; (save edx:eax here - the second cpuid below will overwrite them)
      ;
      ; code to be timed goes here
    cpuid   ; force completion of the timed instructions before the second rdtsc
    rdtsc

Since there is an extra cpuid instruction between the two rdtsc reads, you have to take into account how many cycles that overhead takes and subtract it from the total. I suppose you could time just the cpuid/rdtsc overhead over and over and then average it:
   cpuid
   rdtsc
   cpuid  ; only this instruction is being timed
   rdtsc
I won't be using it for anything too fine-grained, and will mostly be using it to time a routine many times in a loop, so I won't need to do this. But if a person were being extra precise, it's a consideration.

Next, it looks like you have to warm up the data and instruction caches, so you get consistent readings.
I don't understand caches well enough to know how hits and misses affect the precise timings, or how to go about setting up the code to control for them.
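
I suppose the simplest thing is to run the code once, untimed, before the measured passes, so the code and data are already in the caches; something like this (TimedRoutine is just a placeholder):

    call   TimedRoutine                                   ; untimed warm-up pass
    invoke QueryPerformanceCounter, ADDR qwPerformanceCount1
    call   TimedRoutine                                   ; measured pass(es)
    invoke QueryPerformanceCounter, ADDR qwPerformanceCount2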

Also, the OS or BIOS can kick in power-saving measures and ratchet down the speed of the processor. I'd have to understand that and figure out how to ensure it doesn't happen.

And lastly, I don't know how to force the processor to run the code straight through without a context switch, which further adds variability to the results.

All in all, I suppose most of this won't matter in the instances where I'm timing code since I'll loop the code and get an average time and that should be good enough.

dedndave

hiya Cork
you may be trying to re-invent the wheel   :bg

http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281

many of us use Michael's macros to time code
i usually select a single core...
        INVOKE  GetCurrentProcess
        INVOKE  SetProcessAffinityMask,eax,1

then, use Michael's macros
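
if i remember the macro interface right, usage looks roughly like this (after including the timing macros from that thread; the loop count and priority are whatever suits your test):

        counter_begin 1000000, HIGH_PRIORITY_CLASS
        ; code to be timed goes here
        counter_end
        ; cycle count is returned in eax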

cork

Thanks  :U

I'll start using Michael's macros, too.