News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

peculiar result

Started by denise_amiga, June 05, 2005, 09:35:55 PM

Previous topic - Next topic

denise_amiga

Quote

  xor ecx,ecx    sub ecx,ecx     mov ecx,0      and ecx,0
      109 ms         67 ms          66 ms          110 ms


on PIV 2800@3300



; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include     masm32rt.inc
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
align 4
    hHnd        dd 0
    cPos        COORD<6,1>
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetStdHandle,STD_OUTPUT_HANDLE
    mov hHnd, eax
   
    cls
    print "  xor ecx,ecx    sub ecx,ecx     mov ecx,0      and ecx,0",13,10
   
        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            xor ecx,ecx
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            sub ecx,ecx
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            mov ecx,0
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            and ecx,0
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize


    mov   eax,input(13,10,13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start


hutch--

denise,

You have not posted the "timer_begin" and "timer_end" code.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Phil

Hi Denise,
  I wasn't able to get your program working on my system because I don't currently have the masm32rt.inc file that is available in the upgrade. Instead, I modified something that MichaelW had done earlier to compare the various alignment instruction timings. I also remember MichaelW saying that he doesn't use REALTIME often because of potential problems with instability if a loop runs away.

  I've added a few other instructions that I was curious about and created the following:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition
      sz SIZESTR <definition>
      instruction SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          instruction
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(definition," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    ; "+00h" indicates a one-byte displacement of zero.
    ; "+00000000h" indicates a four-byte displacement of zero.
   
    time "nop                    "
    time "mov eax,0              "
    time "mov eax,ecx            "
    time "xor eax,eax            "
    time "sub ecx,ecx            "
    time "and ecx,0              "
    time "add ecx,1              "
    time "inc ecx                "
    time "mov edi,edi            "
    time "shl ecx,2              "
    time "lea ecx,[4*ecx]        "
    time "shl ecx,3              "
    time "lea ecx,[8*ecx]        "
    time "add ecx,ebx            "
    time "adc ecx,ebx            "

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


I got the following timings on a P3 at 996MHz
C:\ASM>xtime
nop                     : 45 cycles
mov eax,0               : 45 cycles
mov eax,ecx             : 45 cycles
xor eax,eax             : 94 cycles
sub ecx,ecx             : 94 cycles
and ecx,0               : 94 cycles
add ecx,1               : 94 cycles
inc ecx                 : 94 cycles
mov edi,edi             : 94 cycles
shl ecx,2               : 94 cycles
lea ecx,[4*ecx]         : 94 cycles
shl ecx,3               : 94 cycles
lea ecx,[8*ecx]         : 94 cycles
add ecx,ebx             : 94 cycles
adc ecx,ebx             : 195 cycles

Press enter to exit...

denise_amiga

sorry hutch--

it is certain I have made modifications and I have not explained the changes, it´s logic that don´t works. I am sorryyyyyyyy very much.

the code (it´s macros) for "timer_begin" and "timer_end" are in timers.asm (from MichaelW):



include timers.asm



I have included timers.asm inside the macros.asm. and masm32rt.inc is from masm8.2 sp2



comment * «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

The MASM32 Runtime Library include file.

Differing from most compilers, MASM does not contain any built in
run time library so it is difficult for a programmer starting with
MASM to to get any code up and running without having to learn a lot
of extra information just to do basic things.

This file simplifies entry into assembler programming by making the
full capacity of the MASM32 library, macro system and include files
available to programmers undertaking this quest.

It specifies the normal conditions for building a 32 bit Windows
program with the minimum processor type, memory model and the need
for case sensitive capacity.

The include files are declared in the correct order so that the
windows.inc file is always first followed by static libraries and
import libraries for Windows API functions.

Where there is a corresponding library for either static or import
libraries, it is included after the include files.

NOTE : It is to the advantage of the programmer once they have their
basic code up and running to properly understand the architecture
of a MASM executable file so that they can construct their own
projects to more accurately reflect their own application design.

  ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *

.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive

;~~~~~~~~~~~~~
;include files
;~~~~~~~~~~~~~
include windows.inc ; main windows include file
include dialogs.inc ; macro file for dialogs
include masm32.inc ; masm32 library include
include macros.asm ; masm32 macro file         <--- timers inside
;include timers.asm

; -------------------------
; Windows API include files
; -------------------------
include gdi32.inc
include user32.inc
include kernel32.inc
include comctl32.inc
include comdlg32.inc
include shell32.inc
include oleaut32.inc
include ole32.inc
include debug.inc      ;not included in original masm32rt.inc

;~~~~~~~~~
;libraries
;~~~~~~~~~
includelib masm32.lib ; masm32 static library

; ------------------------------------------
; import libraries for Windows API functions
; ------------------------------------------
includelib gdi32.lib
includelib user32.lib
includelib kernel32.lib
includelib comctl32.lib
includelib comdlg32.lib
includelib shell32.lib
includelib oleaut32.lib
includelib ole32.lib
includelib debug.lib      ;not included in original masm32rt.inc

DBGWIN_DEBUG_ON = 0 ; turn it on/off if you want debug info into the program
DBGWIN_EXT_INFO = 0 ; turn it on/off if you want extra debug info into the program

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««




denise_amiga

hi Phil

I like the style to make the test.  :U

hi is my results:
Quote

nop                     : 22 cycles
mov eax,0               : 21 cycles
mov eax,ecx             : 22 cycles
xor eax,eax             : 36 cycles
sub ecx,ecx             : 23 cycles
and ecx,0               : 47 cycles
add ecx,1               : 35 cycles
inc ecx                 : 54 cycles
mov edi,edi             : 35 cycles
shl ecx,2               : 393 cycles
lea ecx,[4*ecx]         : 487 cycles
shl ecx,3               : 397 cycles
lea ecx,[8*ecx]         : 487 cycles
add ecx,ebx             : 35 cycles
adc ecx,ebx             : 686 cycles


Phil

Wow! Those timings are amazing ... especially the add vs. adc. Before you fully trust the numbers you'll need to make sure that you get the same timings +/- 1 or 2 every time you run the program. I'm wondering if some of the numbers might have been inflated because your CPU microcode was busy doing other things like handling I/O or something else. Also, since you have a faster machine it might be necessary to modify the inner loop count to repeat each instruction 200 times instead of the 100 as coded. MichaelW had also mentioned something like that. I certainly believe you got those results ... I'm just not sure I'd fully trust them without some more investigations. The certainly do seem peculiar! Maybe someone else can explain what's going on ... I just don't know!


denise_amiga

I have executed the program several times, because I not believed the times of several operations, as you comment, CPU microcode busy etc, but the times are those WOW!!!
I already have readed the deficiencies of p4 in certain instructions as Shifts/Rotations (http://www.website.masmforum.com/mark/index.htm), the surprise has been the sum with carry

MichaelW

Neither of the modifications below make any change on a P3, except the last one takes twice as many cycles because there are twice as many instructions.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition
      sz SIZESTR <definition>
      instruction SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          instruction
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(definition," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    adc ecx,ebx
    adc edx,eax

    time "nop                    "
    time "mov eax,0              "
    time "mov eax,ecx            "
    time "xor eax,eax            "
    time "sub ecx,ecx            "
    time "and ecx,0              "
    time "add ecx,1              "
    time "inc ecx                "
    time "mov edi,edi            "
    time "shl ecx,2              "
    time "lea ecx,[4*ecx]        "
    time "lea edx,[4*ecx]        " ; eliminate dependency?
    time "shl ecx,3              "
    time "lea ecx,[8*ecx]        "
    time "lea edx,[8*ecx]        " ; eliminate dependency?
    time "add ecx,ebx            "
    time "adc ecx,ebx            "
    time "db 13h,0cbh,13h,0d0h   " ; adc ecx,ebx adc edx,eax

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

eschew obfuscation

Mark Jones

Here's an AMD XP 1800+


nop                     : 29 cycles
mov eax,0               : 39 cycles
mov eax,ecx             : 34 cycles
xor eax,eax             : 34 cycles
sub ecx,ecx             : 34 cycles
and ecx,0               : 96 cycles
add ecx,1               : 96 cycles
inc ecx                 : 96 cycles
mov edi,edi             : 96 cycles
shl ecx,2               : 96 cycles
lea ecx,[4*ecx]         : 207 cycles
lea edx,[4*ecx]         : 54 cycles
shl ecx,3               : 96 cycles
lea ecx,[8*ecx]         : 198 cycles
lea edx,[8*ecx]         : 54 cycles
add ecx,ebx             : 96 cycles
adc ecx,ebx             : 96 cycles
db 13h,0cbh,13h,0d0h    : 197 cycles


Using REPEAT_COUNT = 100, it seems to place 99 of each instruction in the code. The nop's are inserted from 00401070 to 004010D3. Am I looking at this incorrectly? I modified it slightly to add 1 instruction so now the nop's count is exactly 100. I also added a 250ms sleep, which did very little.. Here are the new results:


nop                     : 29 cycles
mov eax,0               : 39 cycles
mov eax,ecx             : 34 cycles
xor eax,eax             : 34 cycles
sub ecx,ecx             : 34 cycles
and ecx,0               : 97 cycles
add ecx,1               : 97 cycles
inc ecx                 : 97 cycles
mov edi,edi             : 96 cycles
shl ecx,2               : 97 cycles
lea ecx,[4*ecx]         : 199 cycles
lea edx,[4*ecx]         : 54 cycles
shl ecx,3               : 97 cycles
lea ecx,[8*ecx]         : 200 cycles
lea edx,[8*ecx]         : 54 cycles
add ecx,ebx             : 97 cycles
adc ecx,ebx             : 97 cycles
db 13h,0cbh,13h,0d0h    : 200 cycles


So then would it be safe to assume that one of each instruction is:

nop                     : 0.3 cycle
mov eax,0               : 0.4 cycle
mov eax,ecx             : 0.33 cycle
xor eax,eax             : 0.33 cycle
sub ecx,ecx             : 0.33 cycle
and ecx,0               : 1 cycle
add ecx,1               : 1 cycle
inc ecx                 : 1 cycle
mov edi,edi             : 1 cycle
shl ecx,2               : 1 cycle
lea ecx,[4*ecx]         : 2 cycles
lea edx,[4*ecx]         : 0.5 cycle
shl ecx,3               : 1 cycle
lea ecx,[8*ecx]         : 2 cycles
lea edx,[8*ecx]         : 0.5 cycle
add ecx,ebx             : 1 cycle
adc ecx,ebx             : 1 cycle
db 13h,0cbh,13h,0d0h    : 2 cycles


[attachment deleted by admin]
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Jimg

If I may ask a question here, isn't register stall playing a big part in thest tests?  I've found that, using the original 4 test instructions, the instruction " and ecx,0" took three times longer than the others, but if I follow each test instruction with 3 nops, the " and ecx,0" instruction was the fastest. ?

pbrennick

AMD XP 1000

nop  : 30 cycles
mov eax,0  : 39 cycles
mov eax,ecx  : 34 cycles
xor eax,eax  : 34 cycles
sub ecx,ecx  : 34 cycles
and ecx,0  : 96 cycles
add ecx,1  : 96 cycles
inc ecx  : 96 cycles
mov edi,edi  : 96 cycles
shl ecx,2  : 96 cycles
lea ecx,[4*ecx]  : 197 cycles
lea edx,[4*ecx]  : 54 cycles
shl ecx,3  : 96 cycles
lea ecx,[8*ecx]  : 197 cycles
lea edx,[8*ecx]  : 54 cycles
add ecx,ebx  : 96 cycles
adc ecx,ebx  : 96 cycles
db 13h,0cbh,13h,0d0h  : 196 cycles

Adding 3 nops to the time macro...

nop  : 131 cycles
mov eax,0  : 147 cycles
mov eax,ecx  : 134 cycles
xor eax,eax  : 134 cycles
sub ecx,ecx  : 134 cycles
and ecx,0  : 131 cycles
add ecx,1  : 131 cycles
inc ecx  : 131 cycles
mov edi,edi  : 135 cycles
shl ecx,2  : 131 cycles
lea ecx,[4*ecx]  : 197 cycles
lea edx,[4*ecx]  : 135 cycles
shl ecx,3  : 131 cycles
lea ecx,[8*ecx]  : 198 cycles
lea edx,[8*ecx]  : 135 cycles
add ecx,ebx  : 135 cycles
adc ecx,ebx  : 134 cycles
db 13h,0cbh,13h,0d0h  : 196 cycles

nops seem to give more consistent results and as Jimq notes, it shows a speed-up for "and ecx,0"  meaning, I think, that there is definitely a register stall problem with the code.

Paul

MichaelW

I'm not sure "register stall" is the correct term. What I get from Agner Fog's Pentium Optimization document is that the repeated identical instructions form a "dependence" chain, where each instruction must wait for the preceding instruction to complete. For a P3 an AND reg,immed  instruction generates 1 uop that can go through port 0 or 1, so I think this means that the instruction can execute in 0.5 cycles under the best conditions. If I break up the dependence chain by interleaving "independent" instructions:

time "db 83h,0E1h,0,83h,0E2h,0" ;and ecx,0 and edx,0

On my P3 I get:

and ecx,0               : 95 cycles
db 83h,0E1h,0,83h,0E2h,0 : 96 cycles

With two instructions running in a single cycle.

For a P4, the instruction generates 1 uop, but it can only go through port 0, so with the double-clocked ALU the throughput should be 2 instructions per cycle, with or without the interleaving. Perhaps someone with a P4 could test this.

And for the ADC reg,reg:

On a P3 it generates 2 uops that can go through port 0 or 1, so with a dependency chain the throughput should be 0.5 instructions per cycle, which is what I am getting. And without dependency it should be 1 instruction per cycle, which is only half of what I'm getting, so the dependency still exists. There does not appear to be any other instruction that will exactly pair with an ADC reg,reg, but I can break the dependency chain with an XCHG EAX,EDX which generates 3 uops that can go through port 0 or 1, and both instructions execute in <3 cycles.


time "db 92h,13h,0cbh        " ;xchg eax,edx, adc ecx,ebx

adc ecx,ebx             : 196 cycles
db 92h,13h,0cbh         : 273 cycles


On a P4 it generates 4 uops that can go through port 1, so with or without a dependency chain the throughput should be 0.5 instructions per cycle (I think). And because the XCHG EAX,EDX is only 3 uops and can go through port 0 or 1, it should be able to execute in parallel with the ADC ECX,EBX with the same throughput.
eschew obfuscation

AeroASM

Quote from: MichaelW on June 06, 2005, 06:32:28 PM
And for the ADC reg,reg:

On a P3 it generates 2 uops that can go through port 0 or 1, so with a dependency chain the throughput should be 0.5 instructions per cycle, which is what I am getting.

Surely 2 instructions per cycle, or 0.5 cycles per instruction?

Where can I find out about uops? Also what are port 0 and port 1? Are they the U and V pipelines?

pbrennick

Hi Michael,
You know more about these things than I do.  All I know is that there was a definite change when I added the 3 nops and the change was pretty uniform across all the results, not just the one I mentioned.  All except the final one, that is...

For whatever reason, the last test is the only one that did not change whether there are nops or not.  Why is that?  Is this the dependency thing you are talking about?

Paul

QvasiModo

A dependency chain occurs when you have instructions that depend on the results of previous ones to be run. For example:

mov eax, 1
add eax, 2
sub eax, 3

This is a dependency chain, because instruction 2 can't be executed until 1 is finished, and the same for instruction 3. This means that they will have to be executed in that order, sequentially.

That means the instructions being tested here are not identical... some of them depend on the previous value of ECX and some don't.

BTW, the U and V pipes don't exist anymore, that's Pentium and Pentium MMX stuff. The Pentium II and above work in a very different fashion, check out the Intel manuals, they're really worth it. :)