I wanted to kill a few birds with one stone in this thread. Jochen was working on a times-ten table for his float library.
This is my version of the generator. His is 22 bytes smaller, but uses some pre-existing data. This one is self-contained, as
it generates the entire table from scratch. Mine is probably a little slower than his, also - lol.
I also wrote an "EnumerateCPUs" routine and would like to see some results from that. This one displays a line that
shows how many cores are in the machine, as well as a line for each CPU package.
For me, the real meat in this thread has to do with algorithm timing. This program uses MichaelW's timing macros.
For some reason, my machine shows numbers that are way out of whack with the rest of the world (i.e. masm32 members).
I think it may be related to the fact that I am running Windows XP Media Center Edition. Although, it may be the BIOS, as
the machine was designed specifically to run that OS. It is a Sony VAIO VG-RB42GS machine. It came with a Sony
MPEG-Encoder/TV Tuner card that uses a Conexant MPEG chip. As TV tuners go, this one is a bit of an oddball. It may
be that the BIOS or tuner drivers alter the Time-Stamp Counter tick rate in order to provide higher resolution counts.
Here are the numbers I get - I think they are a factor of 5 higher than they should be....
Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
13356 clock cycles
13257 clock cycles
13273 clock cycles
Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data: 38 bytes
Total: 210 bytes
I have attached the program and source for those who are interested...
EDIT Jul 22 2009
updated Table10 - replaced core per package count code and eliminated loop instructions - Thank You Michael
also, used lea eax,[eax+4*eax] to multiply by 5 - Thank You Hutch
and - added a space after 3DNow!/3DNow!+
EDIT Jul 22 2009
added static display time updates
EDIT Jul 22 2009
added more time measurement lines
added push/pop around timer routines
release allocated heap at the end of the program
[attachment deleted by admin]
Running on a P3 I get:
9182 clock cycles
9180 clock cycles
9180 clock cycles
Considering that you are running on a Pentium 4 with a lower IPC, I think your cycle counts are reasonably reasonable. You should be able to speed up the code somewhat by replacing the LOOP instructions with dec ecx/jnz, or similar.
Also, I had to comment out the call to EnumerateCPUs because on my system it hangs.
dang ! - lol
well - the clock counts are not what i expected
size was the issue with the generator routine - so i used loop in a couple places
300 passes in loop isn't that bad
i was trying to get down to Jochen's 188 bytes - lol
as for the enumeration hanging - very disappointing :(
i will have to have a closer look at it
i may have used a cpuid function that isn't supported or something
Many Thanks, Michael
Total System Processor Cores: 4
CPU 0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz MMX SSSE3 Cores: 3
CPU 1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz MMX SSSE3 Cores: 3
8936 clock cycles
8927 clock cycles
8918 clock cycles
i was just going to say - the little loop that counts cores per package is whacko - lol
it uses a cpuid function that i forgot to test support for, then can get stuck in an endless loop (Michael's machine)
i see it doesn't even count right on yours - lol
i went by the intel cpuid manual - my first mistake
Thanks Sinsi
i take it you actually have one cpu package with 4 cores
EDIT
i am going to try the AMD method and update the d/l in the first post - maybe later tonight
On my system the EcpusQ loop is hanging.
This is a demonstration of why LOOP should be avoided in code where speed matters.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000                      ; wait 4 seconds before timing

    ; --- 1 iteration per loop, LOOP instruction ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*1)*10",13,10

    ; --- 1 iteration per loop, DEC/JNZ ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*1)*10",13,10

    ; --- 10 iterations per loop, LOOP instruction ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*10)*10",13,10

    ; --- 10 iterations per loop, DEC/JNZ ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*10)*10",13,10

    ; --- 100 iterations per loop, LOOP instruction ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*100)*10",13,10

    ; --- 100 iterations per loop, DEC/JNZ ---
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*100)*10",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Running on a P3:
56 cycles, (loop*1)*10
10 cycles, (dec ecx/jnz*1)*10
712 cycles, (loop*10)*10
276 cycles, (dec ecx/jnz*10)*10
5813 cycles, (loop*100)*10
2076 cycles, (dec ecx/jnz*100)*10
I have no way to test this, but on a Pentium 4 sub ecx, 1 might be faster.
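To put the figures above on a per-iteration basis, here is a minimal C sketch (C rather than MASM only so the arithmetic is easy to run anywhere). The name cycles_per_iteration is made up for illustration. It assumes the printed total covers count * repeats loop iterations, with the mov ecx setup and any macro overhead simply folded into the average.

```c
/* Average cycles per loop iteration for one timed test (hypothetical helper).
   total_cycles: the figure printed by the test
   count:        the value loaded into ECX before the loop
   repeats:      the REPEAT factor (10 in the listing above) */
static double cycles_per_iteration(unsigned total_cycles,
                                   unsigned count,
                                   unsigned repeats)
{
    return (double)total_cycles / ((double)count * (double)repeats);
}
```

With the P3 numbers above, cycles_per_iteration(5813, 100, 10) comes to about 5.8 cycles per LOOP, against about 2.1 per dec ecx/jnz pair from cycles_per_iteration(2076, 100, 10).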
I may have missed the point here but if the task is a fixed multiply by 10 try this.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
    .data?
      value dd ?

    .data
      item dd 0

    mul10 MACRO number
      mov eax, number
      lea eax, [eax+eax*4]      ; mul by 5
      add eax, eax              ; double it
    ENDM

    .code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    call main
    inkey
    exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc

    mul10 333                   ; eax = 3330
    print str$(eax),13,10

    ret

main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Dave, one processor, 4 cores.
55 cycles, (loop*1)*10
5 cycles, (dec ecx/jnz*1)*10
563 cycles, (loop*10)*10
179 cycles, (dec ecx/jnz*10)*10
5172 cycles, (loop*100)*10
1225 cycles, (dec ecx/jnz*100)*10
Looks great, Dave. I will steal that green colour idea from you, if you don't mind. And no, I won't send money, but if you come over, I'll offer you a beer :bg
Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.40GHz MMX SSE3 Cores: 2
13304 clock cycles
13286 clock cycles
13288 clock cycles
lol
Thanks to all
and Jochen, the beer sounds great - can't wait to sample some Euro-grog (btw - we like it cold - put it on ice for me)
you are more than welcome to use that - lol
i used it in another program where i had a static display (continually updated numeric values - no scroll)
i just grabbed the display position of the "any key" message and calculated the other screen positions from that
i fixed the EnumerateCPUs function, hopefully
previously, i had used the method prescribed in Intel's CPUID reference manual (which, quite frankly, didn't make sense to me)
on this update, i test the HTT (Hyper-Threading Technology) bit - CPUID function 1, EDX bit 28
if it is 0 - indicates a single core
if it is 1 - i can use the logical processor count from CPUID function 1, EBX bits 23-16
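For anyone who wants to check the bit positions, the HTT test described above might look roughly like this in C. This is only a sketch, not the code from the download: the function name logical_per_package is invented here, and it takes the raw EDX and EBX values from CPUID function 1 as arguments instead of executing CPUID itself, so it runs on any machine.

```c
#include <stdint.h>

/* Hypothetical helper: decode the per-package logical processor count
   from the EDX and EBX values returned by CPUID function 1. */
static unsigned logical_per_package(uint32_t edx, uint32_t ebx)
{
    const uint32_t HTT_BIT = 1u << 28;      /* EDX bit 28 */

    if ((edx & HTT_BIT) == 0)
        return 1;                           /* HTT clear: report one */

    return (ebx >> 16) & 0xFFu;             /* EBX bits 23-16 */
}
```

Note that the EBX field counts logical processors, so a Hyper-Threading part can report 2 here even with one physical core.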
i have updated the first post in the thread with the new program and source
Thank You Everyone for testing it
as for the loop instruction, i had no idea it was that slow - lol
i don't get it - but - i eliminated it from my code in the updated d/l above
on the table generator, i was trying to get down to 188 bytes
i think replacing LOOP with DEC ECX|JNZ will add a byte - oh well - i wasn't close to 188, anyways - lol
EDIT - without LOOP, my gen code is 212 bytes, but ~500 cycles faster
by using Hutch's lea eax,[eax+4*eax], i got it back down to 210 bytes
i also removed LOOP from the EnumerateCPUs function
FYI,
DednDave Times Ten Table Generator
Total System Processor Cores: 1
CPU 0: AMD Athlon(tm) 64 Processor 3200+ MMX+ SSE2 3DNow!+Cores: 1
7303 clock cycles
7303 clock cycles
7305 clock cycles
Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data: 38 bytes
Total: 210 bytes
Steve
Thanks Steve - oops forgot to put a space after 3DNow! - lol - fixed it
that is a nice processor - very fast
mine is now....
Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
12904 clock cycles
12746 clock cycles
12876 clock cycles
i should use a static display - lol
but - no way to copy/paste more than one reading if i did that
i guess i could keep 3 readings and keep updating them
Quote from: dedndave on July 22, 2009, 12:54:43 PM
as for the loop instruction, i had no idea it was that slow - lol
i don't get it
It was discussed in one (at least) of the assembly newsgroups. LOOP
in one of AMD's processors was too fast for the M$ Windows 95 install
program. A timing loop croaked. So they deliberately slowed the
instruction down. As to why it was never fixed later?
Bleah,
Steve N.
that is SO stupid !!!! - lol
fix windows - not the processor that wasn't broken to begin with - lol
anyways, i updated the Table10 program again
this time, i made the times update on a static screen until a key is pressed
if you liked the green, Jochen, you'll love this
(better put another one on ice)
Hi Dave,
DednDave Times Ten Table Generator
Total System Processor Cores: 1
CPU 0: Fam 6 Mod 5 xFam 0 xMod 0 Type 0 Step 2 MMX Cores: 1
8636 clock cycles
8638 clock cycles
8699 clock cycles
Can't believe my P2 IBM box is still fas'n'furious after all those years! Used to call it Lentium (lento=español ->slow), no more I guess. :U
Cheers,
KhipuCoder
Quote from: dedndave on July 22, 2009, 04:04:52 PM
that is SO stupid !!!! - lol
fix windows - not the processor that wasn't broken to begin with - lol
anyways, i updated the Table10 program again
this time, i made the times update on a static screen until a key is pressed
if you liked the green, Jochen, you'll love this
(better put another one on ice)
Yeah, looks nice, and seems to like my Celeron :U
Total System Processor Cores: 1
CPU 0: Intel(R) Celeron(R) M CPU 420 @ 1.60GHz MMX SSE3 Cores: 1
7981 clock cycles
7948 clock cycles
7938 clock cycles
DednDave Times Ten Table Generator
Total System Processor Cores: 1
CPU 0: AMD Sempron(tm) 3000+ MMX+ SSE 3DNow!+ Cores: 1
7241 clock cycles
7172 clock cycles
7161 clock cycles
Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data: 38 bytes
Total: 210 bytes
Press any key to continue ...
Just timed my own routine - it's a lot slower than Dave's algo, around 8,600 cycles. At 1.6 GHz, that means roughly 0.005 milliseconds. Fortunately, it has to run only once :bg
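The conversion from a cycle count to wall time is just cycles divided by clock rate. A small C sketch of that arithmetic follows. The helper name cycles_to_ms is invented for illustration, and it assumes the counter ticks at the nominal clock rate, which (going by the first post) may not hold on every machine.

```c
/* Hypothetical helper: convert a cycle count to milliseconds,
   assuming the counter runs at clock_hz ticks per second. */
static double cycles_to_ms(double cycles, double clock_hz)
{
    return cycles / clock_hz * 1000.0;
}
```

cycles_to_ms(8600, 1.6e9) gives 0.005375, which matches the "roughly 0.005 milliseconds" figure.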
Quote: Fortunately, it has to run only once
i can't believe you said that - lol - that's just not like you, Jochen
still, you beat me by 22 bytes - fair and square
i had to work hard to get it down to that
did you see that i had to pack the correction table ? - lol
Everything works on my P3 now, and it's ~8% faster.
Total System Processor Cores: 1
CPU 0: Fam 6 Mod 7 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1
8328 clock cycles
8328 clock cycles
8328 clock cycles
One problem with the continual update is that it's difficult to copy the output without affecting one or more of the cycle counts.
yes - i noticed that Michael
console mode copy/paste is very strange
the best way is to exit the program, then get the last set
maybe i should put several timing text lines in it, instead of 3
i added more lines
also, i found a couple small flaws
i released the heap before i was done with it - oops - i am surprised i didn't see my c0000005 friend (that should be my nic)
also, using Michael's timing routine requires push/pop ebx if in a function - unless they use the updated counter2 file
so, i added push/pop
playing around with the console mode copy/paste is always a disappointment
i swear - i have tried everything to simply disable it altogether, with no luck
i have found three bugs - at least in the XP MCE 2005 console window
they may have been fixed in later OS's
again, the best thing to do is to stop the program before using copy/paste
i am going to leave that part as it is and work on the EnumerateCPUs function if i have time
Thanks to Everyone for your help :U
Dave
This link from AMD may be of interest.
http://developer.amd.com/documentation/articles/pages/ProcessorCoreEnumeration.aspx
thank you, Bruce
i did wind up using the AMD method
they outline the same info in their CPUID reference manual