News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing macros rebirth

Started by Petroizki, January 22, 2006, 07:40:46 AM

Previous topic - Next topic

MichaelW

P3:

Petroizki's inline macros clock counts:
lstrcmp:  1123, 1123, 1147, 1123, 1148, 1123, 1123, 1149, 1123, 1122,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

Petroizki's procmacros clock counts:
lstrcmp:  1123, 1128, 1123, 1128, 1128, 1123, 1128, 1123, 1128, 1123,
szCmp:    37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

MichaelW's macros clock counts:
lstrcmp:  1374, 1395, 1354, 1341, 1353, 1384, 1353, 1465, 1376, 1378,
szCmp:    56, 58, 60, 61, 56, 54, 61, 57, 54, 53,
multiply: 3, 0, 1, 4294967295, 2, 4294967295, 0, 4294967295, 4294967295, 1,
division: 44, 42, 43, 40, 43, 39, 42, 41, 43, 43,


Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.
eschew obfuscation

Petroizki

The problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops? So far i have been aiming at stable average. Should i just dump all the average calculations?

_func0 segment
FUNC0_START label dword
func0:
mov ebx, 1
xor edx, edx
div ebx

retn
db 4096*2-($-FUNC0_START) dup (0)
_func0 ends

start:
print chr$("Procedure macros TEST.",13,10,)

mov ebp, 3
@loop:
PTIMER_PROC_LOOP_COUNT func0

mov ebx, __PROCTIMER_LOOPS - 1
or edi, -1
print chr$(13,10,"Loop times: ")
@@: print ustr$(dword ptr [__PROCTIMER_TABLE + ebx*8])
print chr$(", ")

mov ecx, edi
mov eax, dword ptr [__PROCTIMER_TABLE + ebx*8]
sub ecx, eax
sbb edi, edi
and edi, ecx
add edi, eax

sub ebx, 1
jnc @B

print chr$(13,10,"Average as calculated: ")
print ustr$(dword ptr [qwPROCTIMER_RESULT])
print chr$(13,10,"Minimum value: ")
print ustr$(edi)
print chr$(13,10)
sub ebp, 1
jnz @loop

inkey chr$(13,10,13,10,"Press any key to exit...")
ret
end start


Quote from: MichaelW on January 26, 2006, 09:55:09 AMLooking at my last results, the cycle count for div seems far too low. I tested again and I still get 7.
mul clock count is 7, div clock count is 41?

MichaelW

QuoteThe problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops?

Assuming the results from my limited testing are typical, if you discarded the counts for the first few trials, and were running say 20 or more trials, the average would equal the lowest value, so I would lean towards doing whatever is easiest.

I changed my code to add:

xor  eax, eax
cpuid

Immediately before the test instructions (and to the overhead loop), and this changed the result for div to 37 cycles. So far I cannot see any way to isolate the test instructions from the xor  eax, eax that precedes the next cpuid, because the only other non-privileged serializing instructions are IRET and RSM. I at first thought an IRET might be possible, but setting the stack up correctly is going to be difficult at best.

eschew obfuscation

zooba

Quote from: MichaelW on January 26, 2006, 09:55:09 AM
P3:

multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,


Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.


According to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate :U

MichaelW

QuoteAccording to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate.

Sorry, my statement was not clear. I was not referring to the results that I posted, but to the results of the code that I posted.

I am now fairly certain that an IRET cannot be made to work from ring 3. With no way to completely isolate the test instructions from the timing instructions there will always be the potential for one or more of the test instructions to execute in parallel with the trailing xor  eax,eax. For long sequences of instructions the effect of this would probably be negligible, but not for short sequences.
eschew obfuscation

Mark Jones

Keep up the great work on the timing macros. :bg

Processor: AMD Athlon(tm) XP 2500+

Petroizki's inline macros clock counts:
lstrcmp:  881, 881, 881, 881, 881, 881, 881, 881, 881, 906,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,

Petroizki's procmacros clock counts:
lstrcmp:  877, 877, 877, 877, 877, 877, 877, 877, 877, 877,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

MichaelW's macros clock counts:
lstrcmp:  1391, 1264, 1264, 1118, 1177, 1189, 1353, 1124, 1286, 1141,
szCmp:    35, 34, 25, 44, 40, 49, 37, 49, 49, 49,
multiply: 4294967287, 0, 4294967295, 4294967288, 4294967293, 0, 4294967283, 4294
967295, 4294967293, 55,
division: 49, 48, 58, 43, 49, 42, 49, 47, 32, 49,

Press any key to exit...


proctimers example:
multiple-pass lstrlen: 80 clock counts
single-pass lstrlen: 267 clock counts


ptimers example:
multiple-pass lstrlen: 81 clock counts
single-pass lstrlen: 328 clock counts
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Mark Jones

Interesting, if the project is built as a release version, we get:


Petroizki's inline macros clock counts:
lstrcmp:  899, 899, 899, 899, 899, 899, 899, 899, 877, 877,


and if we build as a debug version, we get:


Petroizki's inline macros clock counts:
lstrcmp:  877, 877, 877, 877, 877, 877, 877, 877, 877, 877,


  :wink  Everything else matches. The command lines are:

Quote
ml /c /Cp /coff /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj

Quote
ml /c /Cp /coff /Fm /Zi /Zd /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /DEBUG /DEBUGTYPE:BOTH /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08