Timing for times-ten table generator

Started by dedndave, July 22, 2009, 01:18:41 AM

dedndave

  I wanted to kill a few birds with one stone in this thread. Jochen was working on a times-ten table for his float library.
This is my version of the generator. His is 22 bytes smaller, but uses some pre-existing data. This one is self-contained, as
it generates the entire table from scratch. Mine is probably a little slower than his, also - lol.

  I also wrote an "EnumerateCPUs" routine and would like to see some results from that. This one displays a line that
shows how many cores are in the machine, as well as a line for each CPU package.

  For me, the real meat in this thread has to do with algorithm timing. This program uses MichaelW's timing macros.
For some reason, my machine shows numbers that are way out of whack with the rest of the world (i.e. masm32 members).
I think it may be related to the fact that I am running Windows XP Media Center Edition. Although, it may be the BIOS, as
the machine was designed specifically to run that OS. It is a Sony VAIO VG-RB42GS machine. It came with a Sony
MPEG-Encoder/TV Tuner card that uses a Conexant MPEG chip. As TV tuners go, this one is a bit of an oddball. It may
be that the BIOS or tuner drivers alter the Time-Stamp Counter tick rate in order to provide higher resolution counts.

Here are the numbers I get - I think they are a factor of 5 higher than they should be....

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
13356   clock cycles
13257   clock cycles
13273   clock cycles

Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data:  38 bytes
Total: 210 bytes

I have attached the program and source for those who are interested...

EDIT Jul 22 2009
updated Table10 - replaced core per package count code and eliminated loop instructions - Thank You Michael
also, used lea eax,[eax+4*eax] to multiply by 5 - Thank You Hutch
and - added a space after 3DNow!/3DNow!+

EDIT Jul 22 2009
added static display time updates

EDIT Jul 22 2009
added more time measurement lines
added push/pop around timer routines
release allocated heap at the end of the program

[attachment deleted by admin]

MichaelW

Running on a P3 I get:

9182    clock cycles
9180    clock cycles
9180    clock cycles


Considering that you are running on a Pentium 4 with a lower IPC, I think your cycle counts are reasonable. You should be able to speed up the code somewhat by replacing the LOOP instructions with dec ecx/jnz, or similar.

Also, I had to comment out the call to EnumerateCPUs because on my system it hangs.
eschew obfuscation

dedndave

dang ! - lol
well - the clock counts are not what i expected
size was the issue with the generator routine - so i used loop in a couple places
300 passes in loop isn't that bad
i was trying to get down to Jochen's 188 bytes - lol

as for the enumeration hanging - very disappointing   :(
i will have to have a closer look at it
i may have used a cpuid function that isn't supported or something

Many Thanks, Michael

sinsi


Total System Processor Cores: 4
CPU 0: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz MMX SSSE3 Cores: 3
CPU 1: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz MMX SSSE3 Cores: 3
8936    clock cycles
8927    clock cycles
8918    clock cycles

Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

i was just going to say - the little loop that counts cores per package is whacko - lol
it uses a cpuid function that i forgot to test support for, then can get stuck in an endless loop (Michael's machine)
i see it doesn't even count right on yours - lol
i went by the intel cpuid manual - my first mistake
Thanks Sinsi
i take it you actually have one cpu package with 4 cores

EDIT
i am going to try the AMD method and update the d/l in the first post - maybe later tonight

MichaelW

On my system the EcpusQ loop is hanging.

This is a demonstration of why LOOP should be avoided in code where speed matters.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*1)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 1
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*1)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*10)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 10
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*10)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        loop @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (loop*100)*10",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 10
        mov ecx, 100
      @@:
        dec ecx
        jnz @B
      ENDM
    counter_end
    print ustr$(eax)," cycles, (dec ecx/jnz*100)*10",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Running on a P3:

56 cycles, (loop*1)*10
10 cycles, (dec ecx/jnz*1)*10
712 cycles, (loop*10)*10
276 cycles, (dec ecx/jnz*10)*10
5813 cycles, (loop*100)*10
2076 cycles, (dec ecx/jnz*100)*10


I have no way to test this, but on a Pentium 4 sub ecx, 1 might be faster.

hutch--

I may have missed the point here but if the task is a fixed multiply by 10 try this.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      item dd 0

    mul10 MACRO number
      mov eax, number
      lea eax, [eax+eax*4]        ; mul by 5
      add eax, eax                ; double it
    ENDM


    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    mul10 333

    print str$(eax),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

Dave, one processor, 4 cores.


55 cycles, (loop*1)*10
5 cycles, (dec ecx/jnz*1)*10
563 cycles, (loop*10)*10
179 cycles, (dec ecx/jnz*10)*10
5172 cycles, (loop*100)*10
1225 cycles, (dec ecx/jnz*100)*10


jj2007

Looks great, Dave. I will steal that green colour idea from you, if you don't mind. And no, I won't send money, but if you come over, I'll offer you a beer :bg

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.40GHz MMX SSE3 Cores: 2
13304   clock cycles
13286   clock cycles
13288   clock cycles

dedndave

lol
Thanks to all
and Jochen, the beer sounds great - can't wait to sample some Euro-grog (btw - we like it cold - put it on ice for me)
you are more than welcome to use that - lol
i used it in another program where i had a static display (continually updated numeric values - no scroll)
i just grabbed the display position of the "any key" message and calculated the other screen positions from that

i fixed the EnumerateCPUs function, hopefully
previously, i had used the method prescribed in Intel's CPUID reference manual (which, quite frankly, didn't make sense to me)
on this update, i test the HTT (Hyper-Threading Technology) bit from CPUID[0_1]EDX:28
if it is 0 - indicates a single core
if it is 1 - i can use the logical processor count from CPUID[0_1]EBX:23-16
i have updated the first post in the thread with the new program and source
Thank You Everyone for testing it
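
For readers following along, here is a minimal sketch of the HTT test described above. It is not the EnumerateCPUs code from the attachment, which is not shown in the thread. The label show and the use of ESI as a scratch register are illustrative choices, and the sketch assumes the CPUID instruction itself is available. Note that CPUID[0_1]EBX:23-16 reports logical processors per package, so on a hyper-threaded part it can be larger than the number of physical cores.

```asm
    include \masm32\include\masm32rt.inc
    .686                        ; CPUID needs .586 or later

    .code
start:
    ; assumption: CPUID is supported (the EFLAGS.ID check is omitted)
    mov esi, 1                  ; default: one logical processor per package
    xor eax, eax
    cpuid                       ; leaf 0: EAX = highest standard leaf
    cmp eax, 1
    jb  show                    ; leaf 1 not supported, keep the default

    mov eax, 1
    cpuid                       ; leaf 1: feature flags in EDX, misc info in EBX
    bt  edx, 28                 ; HTT flag -> CF
    jnc show                    ; HTT = 0: EBX[23:16] is not meaningful
    shr ebx, 16
    movzx esi, bl               ; ESI = EBX[23:16]

show:
    print "Logical processors per package: "
    print str$(esi),13,10

    inkey
    exit
end start
```

Testing the highest supported leaf before using it is the kind of guard whose absence caused the earlier hang, so it is kept even in this short version.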

as for the loop instruction, i had no idea it was that slow - lol
i don't get it - but - i eliminated it from my code in the updated d/l above
on the table generator, i was trying to get down to 188 bytes
i think replacing LOOP with DEC ECX|JNZ will add a byte - oh well - i wasn't close to 188, anyways - lol

EDIT - without LOOP, my gen code is 212 bytes, but ~500 cycles faster
by using Hutch's lea eax,[eax+4*eax], i got it back down to 210 bytes
i also removed LOOP from the EnumerateCPUs function
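
The one-byte difference mentioned above can be checked by letting the assembler measure both loop forms. This is a minimal sketch, assuming both jumps are short backward jumps. The labels are made up for the illustration and are not from the Table10 source.

```asm
    include \masm32\include\masm32rt.inc

    .code
start:
    mov ecx, 3
  loop_a:
    loop loop_a                     ; E2 FE : 2 bytes
  loop_a_end:

    mov ecx, 3
  loop_b:
    dec ecx                         ; 49    : 1 byte
    jnz loop_b                      ; 75 FD : 2 bytes
  loop_b_end:

    mov esi, loop_a_end - loop_a    ; assembly-time size of the LOOP form
    mov edi, loop_b_end - loop_b    ; assembly-time size of the DEC/JNZ form

    print "LOOP:    "
    print str$(esi)," bytes",13,10
    print "DEC/JNZ: "
    print str$(edi)," bytes",13,10

    inkey
    exit
end start
```

Each LOOP replaced this way costs one extra byte, which fits the 210 to 212 byte growth reported in the EDIT above if two LOOP instructions were replaced.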

FORTRANS

FYI,

DednDave Times Ten Table Generator

Total System Processor Cores: 1
CPU 0: AMD Athlon(tm) 64 Processor 3200+ MMX+ SSE2 3DNow!+Cores: 1
7303    clock cycles
7303    clock cycles
7305    clock cycles

Generated table values are identical to Masm-assembled table values
Code: 172 bytes
Data:  38 bytes
Total: 210 bytes


Steve

dedndave

Thanks Steve - oops forgot to put a space after 3DNow! - lol - fixed it
that is a nice processor - very fast

mine is now....

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
12904   clock cycles
12746   clock cycles
12876   clock cycles

i should use a static display - lol
but - no way to copy/paste more than one reading if i did that
i guess i could keep 3 readings and keep updating them

FORTRANS

Quote from: dedndave on July 22, 2009, 12:54:43 PM
as for the loop instruction, i had no idea it was that slow - lol
i don't get it

   It was discussed in one (at least) of the assembly newsgroups. LOOP
in one of AMD's processors was too fast for the M$ Windows 95 install
program.  A timing loop croaked.  So they deliberately slowed the
instruction down.  As to why it was never fixed later?

Bleah,

Steve N.

dedndave

that is SO stupid !!!! - lol
fix windows - not the processor that wasn't broken to begin with - lol

anyways, i updated the Table10 program again
this time, i made the times update on a static screen until a key is pressed

if you liked the green, Jochen, you'll love this
(better put another one on ice)

KhipuCoder

Hi Dave,


DednDave Times Ten Table Generator

Total System Processor Cores: 1
CPU 0: Fam 6 Mod 5 xFam 0 xMod 0 Type 0 Step 2 MMX Cores: 1
8636    clock cycles
8638    clock cycles
8699    clock cycles


Can't believe my P2 IBM box is still fas'n'furious after all those years! Used to call it Lentium (lento=español ->slow), no more I guess. :U

Cheers,
KhipuCoder