The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: dedndave on July 22, 2009, 01:18:41 AM

Title: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 01:18:41 AM
  I wanted to kill a few birds with one stone in this thread. Jochen was working on a times-ten table for his float library.
This is my version of the generator. His is 22 bytes smaller, but uses some pre-existing data. This one is self-contained, as
it generates the entire table from scratch. Mine is probably a little slower than his, also - lol.

  I also wrote an "EnumerateCPUs" routine and would like to see some results from that. This one displays a line that
shows how many cores are in the machine, as well as a line for each CPU package.

  For me, the real meat in this thread has to do with algorithm timing. This program uses MichaelW's timing macros.
For some reason, my machine shows numbers that are way out of whack with the rest of the world (i.e. masm32 members).
I think it may be related to the fact that I am running Windows XP Media Center Edition. Although, it may be the BIOS, as
the machine was designed specifically to run that OS. It is a Sony VAIO VG-RB42GS machine. It came with a Sony
MPEG-Encoder/TV Tuner card that uses a Conexant MPEG chip. As TV tuners go, this one is a bit of an oddball. It may
be that the BIOS or tuner drivers alter the Time-Stamp Counter tick rate in order to provide higher resolution counts.

Here are the numbers I get - I think they are a factor of 5 higher than they should be....

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
13356   clock cycles
13257   clock cycles
13273   clock cycles

Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data:  38 bytes
Total: 210 bytes

I have attached the program and source for those who are interested...

EDIT Jul 22 2009
updated Table10 - replaced core per package count code and eliminated loop instructions - Thank You Michael
also, used lea eax,[eax+4*eax] to multiply by 5 - Thank You Hutch
and - added a space after 3DNow!/3DNow!+

EDIT Jul 22 2009
added static display time updates

EDIT Jul 22 2009
added more time measurement lines
added push/pop around timer routines
release allocated heap at the end of the program

[attachment deleted by admin]
Title: Re: Timing for times-ten table generator
Post by: MichaelW on July 22, 2009, 02:08:17 AM
Running on a P3 I get:

9182    clock cycles
9180    clock cycles
9180    clock cycles


Considering that you are running on a Pentium 4 with a lower IPC, I think your cycle counts are reasonably reasonable. You should be able to speed up the code somewhat by replacing the LOOP instructions with dec ecx/jnz, or similar.

Also, I had to comment out the call to EnumerateCPUs because on my system it hangs.
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 02:16:17 AM
dang ! - lol
well - the clock counts are not what i expected
size was the issue with the generator routine - so i used loop in a couple places
300 passes in loop isn't that bad
i was trying to get down to Jochen's 188 bytes - lol

as for the enumeration hanging - very disappointing   :(
i will have to have a closer look at it
i may have used a cpuid function that isn't supported or something

Many Thanks, Michael
Title: Re: Timing for times-ten table generator
Post by: sinsi on July 22, 2009, 02:43:46 AM

Total System Processor Cores: 4
CPU 0: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz MMX SSSE3 Cores: 3
CPU 1: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz MMX SSSE3 Cores: 3
8936    clock cycles
8927    clock cycles
8918    clock cycles

Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 02:47:59 AM
i was just going to say - the little loop that counts cores per package is whacko - lol
it uses a cpuid function that i forgot to test support for, then can get stuck in an endless loop (Michael's machine)
i see it doesn't even count right on yours - lol
i went by the intel cpuid manual - my first mistake
Thanks Sinsi
i take it you actually have one cpu package with 4 cores

EDIT
i am going to try the AMD method and update the d/l in the first post - maybe later tonight
Title: Re: Timing for times-ten table generator
Post by: MichaelW on July 22, 2009, 06:11:49 AM
On my system the EcpusQ loop is hanging.

This is a demonstration of why LOOP should be avoided in code where speed matters.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*1)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*1)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*10)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*10)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*100)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*100)*10",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Running on a P3:

56 cycles, (loop*1)*10
10 cycles, (dec ecx/jnz*1)*10
712 cycles, (loop*10)*10
276 cycles, (dec ecx/jnz*10)*10
5813 cycles, (loop*100)*10
2076 cycles, (dec ecx/jnz*100)*10


I have no way to test this, but on a Pentium 4 sub ecx, 1 might be faster.
Title: Re: Timing for times-ten table generator
Post by: hutch-- on July 22, 2009, 06:40:51 AM
I may have missed the point here but if the task is a fixed multiply by 10 try this.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      item dd 0

    mul10 MACRO number
      mov eax, number
      lea eax, [eax+eax*4]        ; mul by 5
      add eax, eax                ; double it
    ENDM


    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    mul10 333

    print str$(eax),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Timing for times-ten table generator
Post by: sinsi on July 22, 2009, 06:44:53 AM
Dave, one processor, 4 cores.


55 cycles, (loop*1)*10
5 cycles, (dec ecx/jnz*1)*10
563 cycles, (loop*10)*10
179 cycles, (dec ecx/jnz*10)*10
5172 cycles, (loop*100)*10
1225 cycles, (dec ecx/jnz*100)*10

Title: Re: Timing for times-ten table generator
Post by: jj2007 on July 22, 2009, 10:08:05 AM
Looks great, Dave. I will steal that green colour idea from you, if you don't mind. And no, I won't send money, but if you come over, I'll offer you a beer :bg

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.40GHz MMX SSE3 Cores: 2
13304   clock cycles
13286   clock cycles
13288   clock cycles
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 12:54:43 PM
lol
Thanks to all
and Jochen, the beer sounds great - can't wait to sample some Euro-grog (btw - we like it cold - put it on ice for me)
you are more than welcome to use that - lol
i used it in another program where i had a static display (continually updated numeric values - no scroll)
i just grabbed the display position of the "any key" message and calculated the other screen positions from that

i fixed the EnumerateCPUs function, hopefully
previously, i had used the method prescribed in Intel's CPUID reference manual (which, quite frankly, didn't make sense to me)
on this update, i test the HTT (hyper-thread technology) bit from CPUID[0_1]EDX:28
if it is 0 - indicates a single core
if it is 1 - i can use the logical processor count from CPUID[0_1]EBX:23-16
i have updated the first post in the thread with the new program and source
Thank You Everyone for testing it
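
The HTT check described above could be sketched roughly as follows. This is only an illustration, not the actual EnumerateCPUs source: the label `show_count` and the use of ESI are assumptions made for the example. It first confirms that CPUID function 1 is supported, which is the kind of test the earlier version skipped. Note that the EBX[23:16] field counts logical processors per package, so on a hyper-threaded part it may not equal the number of physical cores.

```masm
; Sketch only: checks CPUID function 1 support, then reads the HTT flag
; (EDX bit 28) and the logical processor count (EBX bits 23-16).
    include \masm32\include\masm32rt.inc
    .686                                ; CPUID needs .586 or later

    .code
start:
    push ebx                            ; CPUID overwrites EBX
    mov esi, 1                          ; default: one logical processor
    xor eax, eax
    cpuid                               ; EAX = highest supported standard function
    cmp eax, 1
    jb show_count                       ; function 1 not available, keep default
    mov eax, 1
    cpuid                               ; feature flags in EDX, misc info in EBX
    bt edx, 28                          ; CF = HTT flag
    jnc show_count                      ; HTT = 0: single logical processor
    shr ebx, 16                         ; move EBX[23:16] down into BL
    movzx esi, bl                       ; logical processors per package
show_count:
    pop ebx
    print str$(esi)," logical processor(s) per package",13,10
    inkey "Press any key to exit..."
    exit
end start
```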

as for the loop instruction, i had no idea it was that slow - lol
i don't get it - but - i eliminated it from my code in the updated d/l above
on the table generator, i was trying to get down to 188 bytes
i think replacing LOOP with DEC ECX|JNZ will add a byte - oh well - i wasn't close to 188, anyways - lol
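
The extra byte can be seen from the encodings: LOOP with a short displacement is 2 bytes, while DEC ECX plus a short JNZ is 1 + 2 = 3 bytes. A minimal sketch, assuming a 300-pass loop like the one mentioned earlier in the thread (the loop body here is made up for illustration and is not the table generator):

```masm
; Sketch only: the same 300-pass loop written both ways.
    include \masm32\include\masm32rt.inc

    .code
start:
    xor esi, esi
    mov ecx, 300
  @@:
    add esi, 10
    loop @B                             ; E2 rel8       = 2 bytes

    xor edi, edi
    mov ecx, 300
  @@:
    add edi, 10
    dec ecx                             ; 49            = 1 byte
    jnz @B                              ; 75 rel8       = 2 bytes

    print str$(esi),13,10               ; both loops accumulate 300 * 10
    print str$(edi),13,10
    inkey "Press any key to exit..."
    exit
end start
```

Both loops produce the same result; the second costs one more byte per loop but avoids the slow LOOP instruction.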

EDIT - without LOOP, my gen code is 212 bytes, but ~500 cycles faster
by using Hutch's lea eax,[eax+4*eax], i got it back down to 210 bytes
i also removed LOOP from the EnumerateCPUs function
Title: Re: Timing for times-ten table generator
Post by: FORTRANS on July 22, 2009, 01:35:48 PM
FYI,

DednDave Times Ten Table Generator

Total System Processor Cores: 1
CPU 0: AMD Athlon(tm) 64 Processor 3200+ MMX+ SSE2 3DNow!+Cores: 1
7303    clock cycles
7303    clock cycles
7305    clock cycles

Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data:  38 bytes
Total: 210 bytes


Steve
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 01:38:10 PM
Thanks Steve - oops forgot to put a space after 3DNow! - lol - fixed it
that is a nice processor - very fast

mine is now....

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
12904   clock cycles
12746   clock cycles
12876   clock cycles

i should use a static display - lol
but - no way to copy/paste more than one reading if i did that
i guess i could keep 3 readings and keep updating them
Title: Re: Timing for times-ten table generator
Post by: FORTRANS on July 22, 2009, 03:39:48 PM
Quote from: dedndave on July 22, 2009, 12:54:43 PM
as for the loop instruction, i had no idea it was that slow - lol
i don't get it

   It was discussed in one (at least) of the assembly newsgroups. LOOP
in one of AMD's processors was too fast for the M$ Windows 95 install
program.  A timing loop croaked.  So they deliberately slowed the
instruction down.  As to why it was never fixed later?

Bleah,

Steve N.
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 04:04:52 PM
that is SO stupid !!!! - lol
fix windows - not the processor that wasn't broken to begin with - lol

anyways, i updated the Table10 program again
this time, i made the times update on a static screen until a key is pressed

if you liked the green, Jochen, you'll love this
(better put another one on ice)
Title: Re: Timing for times-ten table generator
Post by: KhipuCoder on July 22, 2009, 04:29:18 PM
Hi Dave,


DednDave Times Ten Table Generator

Total System Processor Cores: 1
CPU 0: Fam 6 Mod 5 xFam 0 xMod 0 Type 0 Step 2 MMX Cores: 1
8636    clock cycles
8638    clock cycles
8699    clock cycles


Can't believe my P2 IBM box is still fas'n'furious after all those years! Used to call it Lentium (lento=español ->slow), no more I guess. :U

Cheers,
KhipuCoder
Title: Re: Timing for times-ten table generator
Post by: jj2007 on July 22, 2009, 06:14:27 PM
Quote from: dedndave on July 22, 2009, 04:04:52 PM
that is SO stupid !!!! - lol
fix windows - not the processor that wasn't broken to begin with - lol

anyways, i updated the Table10 program again
this time, i made the times update on a static screen until a key is pressed

if you liked the green, Jochen, you'll love this
(better put another one on ice)

Yeah, looks nice, and seems to like my Celeron :U

Total System Processor Cores: 1
CPU 0: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz MMX SSE3 Cores: 1
7981    clock cycles
7948    clock cycles
7938    clock cycles
Title: Re: Timing for times-ten table generator
Post by: bruce1948 on July 22, 2009, 07:14:05 PM

DednDave Times Ten Table Generator

Total System Processor Cores: 1
CPU 0: AMD Sempron(tm)   3000+ MMX+ SSE 3DNow!+ Cores: 1
7241    clock cycles
7172    clock cycles
7161    clock cycles

Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data:  38 bytes
Total: 210 bytes
Press any key to continue ...
Title: Re: Timing for times-ten table generator
Post by: jj2007 on July 22, 2009, 07:33:37 PM
Just timed my own routine - it's a lot slower than Dave's algo, around 8,600 cycles. At 1.6 GHz, that means roughly 0.005 milliseconds. Fortunately, it has to run only once :bg
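
For reference, the conversion from cycles to time works out as follows, assuming the time-stamp counter ticks at the nominal 1.6 GHz clock:

```latex
t = \frac{N_{\text{cycles}}}{f}
  = \frac{8600}{1.6 \times 10^{9}\ \text{Hz}}
  \approx 5.4 \times 10^{-6}\ \text{s}
  \approx 0.005\ \text{ms}
```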
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 07:41:23 PM
Quote: Fortunately, it has to run only once
i can't believe you said that - lol - that's just not like you, Jochen
still, you beat me by 22 bytes - fair and square
i had to work hard to get it down to that
did you see that i had to pack the correction table ? - lol
Title: Re: Timing for times-ten table generator
Post by: MichaelW on July 22, 2009, 08:58:57 PM
Everything works on my P3 now, and it's ~8% faster.

Total System Processor Cores: 1
CPU 0: Fam 6 Mod 7 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1
8328    clock cycles
8328    clock cycles
8328    clock cycles


One problem with the continual update is that it's difficult to copy the output without affecting one or more of the cycle counts.

Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 09:44:47 PM
yes - i noticed that Michael
console mode copy/paste is very strange
the best way is to exit the program, then get the last set
maybe i should put several timing text lines in it, instead of 3
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 22, 2009, 11:10:19 PM
i added more lines
also, i found a couple small flaws
i released the heap before i was done with it - oops - i am surprised i didn't see my c0000005 friend (that should be my nick)
also, using Michael's timing routine requires push/pop ebx if in a function - unless they use the updated counter2 file
so, i added push/pop
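
A rough sketch of what that looks like, assuming the original timers.asm macros as used earlier in the thread. The procedure name `TimeMul10` and the timed code are made up for the example; this is not the Table10 source.

```masm
; Sketch only: preserving EBX around the timing macros inside a procedure.
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .code

TimeMul10 proc
    push ebx                            ; the timing macros execute CPUID, which overwrites EBX
    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov eax, 333
      lea eax, [eax+eax*4]              ; multiply by 5
      add eax, eax                      ; double it
    counter_end                         ; cycle count returned in EAX
    pop ebx                             ; restore the caller's EBX
    ret
TimeMul10 endp

start:
    call TimeMul10
    print ustr$(eax)," cycles",13,10
    inkey "Press any key to exit..."
    exit
end start
```

EBX is one of the registers a Win32 function is expected to preserve, which is why the push/pop matters once the macros are used inside a procedure rather than at the top level.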

playing around with the console mode copy/paste is always a disappointment
i swear - i have tried everything to simply disable it altogether, with no luck
i have found three bugs - at least in the XP MCE 2005 console window
they may have been fixed in later OS's
again, the best thing to do is to stop the program before using copy/paste
i am going to leave that part as it is and work on the EnumerateCPUs function if i have time
Thanks to Everyone for your help  :U
Title: Re: Timing for times-ten table generator
Post by: bruce1948 on July 29, 2009, 10:56:22 PM
Dave

This link from AMD may be of interest.


http://developer.amd.com/documentation/articles/pages/ProcessorCoreEnumeration.aspx
Title: Re: Timing for times-ten table generator
Post by: dedndave on July 29, 2009, 11:55:05 PM
thank you, Bruce
i did wind up using the AMD method
they outline the same info in their CPUID reference manual