peculiar result

denise_amiga · June 05, 2005, 09:35:55 PM

Quote

xor ecx,ecx sub ecx,ecx mov ecx,0 and ecx,0
109 ms 67 ms 66 ms 110 ms

on PIV 2800@3300



; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include     masm32rt.inc
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
align 4
    hHnd        dd 0
    cPos        COORD<6,1>
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetStdHandle,STD_OUTPUT_HANDLE
    mov hHnd, eax
    
    cls
    print "  xor ecx,ecx    sub ecx,ecx     mov ecx,0      and ecx,0",13,10
    
        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            xor ecx,ecx
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            sub ecx,ecx
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            mov ecx,0
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize

        timer_begin 10000000, REALTIME_PRIORITY_CLASS
        repeat 100              ; really very simple test
            and ecx,0
        endm
        timer_end
        push    eax
        invoke  SetConsoleCursorPosition, hHnd, dword ptr cPos
        add     [cPos.x], 15    ; makes nice spacing
        pop     eax
        print   ustr$(eax)," ms"
        invoke  Sleep,250       ; short break to let Windows stabilize


    mov   eax,input(13,10,13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

hutch-- · June 06, 2005, 02:02:05 AM

denise,

You have not posted the "timer_begin" and "timer_end" code.

Phil · June 06, 2005, 05:11:21 AM

Hi Denise,
I wasn't able to get your program working on my system because I don't currently have the masm32rt.inc file that is available in the upgrade. Instead, I modified something that MichaelW had done earlier to compare the various alignment instruction timings. I also remember MichaelW saying that he doesn't use REALTIME often because of potential problems with instability if a loop runs away.

I've added a few other instructions that I was curious about and created the following:

Code Select

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition
      sz SIZESTR <definition>
      instruction SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          instruction
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(definition," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    ; "+00h" indicates a one-byte displacement of zero.
    ; "+00000000h" indicates a four-byte displacement of zero.
    
    time "nop                    "
    time "mov eax,0              "
    time "mov eax,ecx            "
    time "xor eax,eax            "
    time "sub ecx,ecx            "
    time "and ecx,0              "
    time "add ecx,1              "
    time "inc ecx                "
    time "mov edi,edi            "
    time "shl ecx,2              "
    time "lea ecx,[4*ecx]        "
    time "shl ecx,3              "
    time "lea ecx,[8*ecx]        "
    time "add ecx,ebx            "
    time "adc ecx,ebx            "

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

I got the following timings on a P3 at 996MHz

Code Select

C:\ASM>xtime
nop                     : 45 cycles
mov eax,0               : 45 cycles
mov eax,ecx             : 45 cycles
xor eax,eax             : 94 cycles
sub ecx,ecx             : 94 cycles
and ecx,0               : 94 cycles
add ecx,1               : 94 cycles
inc ecx                 : 94 cycles
mov edi,edi             : 94 cycles
shl ecx,2               : 94 cycles
lea ecx,[4*ecx]         : 94 cycles
shl ecx,3               : 94 cycles
lea ecx,[8*ecx]         : 94 cycles
add ecx,ebx             : 94 cycles
adc ecx,ebx             : 195 cycles

Press enter to exit...

denise_amiga · June 06, 2005, 07:13:13 AM

sorry hutch--

it is certain I have made modifications and I have not explained the changes, it´s logic that don´t works. I am sorryyyyyyyy very much.

the code (it´s macros) for "timer_begin" and "timer_end" are in timers.asm (from MichaelW):

Code Select



include timers.asm

I have included timers.asm inside the macros.asm. and masm32rt.inc is from masm8.2 sp2

Code Select



comment * «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

	The MASM32 Runtime Library include file.

	Differing from most compilers, MASM does not contain any built in
	run time library so it is difficult for a programmer starting with
	MASM to to get any code up and running without having to learn a lot
	of extra information just to do basic things.

	This file simplifies entry into assembler programming by making the
	full capacity of the MASM32 library, macro system and include files
	available to programmers undertaking this quest.

	It specifies the normal conditions for building a 32 bit Windows
	program with the minimum processor type, memory model and the need
	for case sensitive capacity.

	The include files are declared in the correct order so that the
	windows.inc file is always first followed by static libraries and
	import libraries for Windows API functions.

	Where there is a corresponding library for either static or import
	libraries, it is included after the include files.

	NOTE : It is to the advantage of the programmer once they have their
	basic code up and running to properly understand the architecture
	of a MASM executable file so that they can construct their own
	projects to more accurately reflect their own application design.

  ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *

.586					; create 32 bit code
.model flat, stdcall			; 32 bit memory model
option casemap :none			; case sensitive

;~~~~~~~~~~~~~
;include files
;~~~~~~~~~~~~~
include windows.inc			; main windows include file
include dialogs.inc			; macro file for dialogs
include masm32.inc			; masm32 library include
include macros.asm			; masm32 macro file         <--- timers inside
;include timers.asm

; -------------------------
; Windows API include files
; -------------------------
include gdi32.inc
include user32.inc
include kernel32.inc
include comctl32.inc
include comdlg32.inc
include shell32.inc
include oleaut32.inc
include ole32.inc
include debug.inc      ;not included in original masm32rt.inc

;~~~~~~~~~
;libraries
;~~~~~~~~~
includelib masm32.lib			; masm32 static library

; ------------------------------------------
; import libraries for Windows API functions
; ------------------------------------------
includelib gdi32.lib
includelib user32.lib
includelib kernel32.lib
includelib comctl32.lib
includelib comdlg32.lib
includelib shell32.lib
includelib oleaut32.lib
includelib ole32.lib
includelib debug.lib      ;not included in original masm32rt.inc

DBGWIN_DEBUG_ON = 0 ; turn it on/off if you want debug info into the program
DBGWIN_EXT_INFO = 0 ; turn it on/off if you want extra debug info into the program

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

denise_amiga · June 06, 2005, 08:03:45 AM

hi Phil

I like the style to make the test. :U

hi is my results:

Quote

nop : 22 cycles
mov eax,0 : 21 cycles
mov eax,ecx : 22 cycles
xor eax,eax : 36 cycles
sub ecx,ecx : 23 cycles
and ecx,0 : 47 cycles
add ecx,1 : 35 cycles
inc ecx : 54 cycles
mov edi,edi : 35 cycles
shl ecx,2 : 393 cycles
lea ecx,[4*ecx] : 487 cycles
shl ecx,3 : 397 cycles
lea ecx,[8*ecx] : 487 cycles
add ecx,ebx : 35 cycles
adc ecx,ebx : 686 cycles

Phil · June 06, 2005, 09:29:45 AM

Wow! Those timings are amazing ... especially the add vs. adc. Before you fully trust the numbers you'll need to make sure that you get the same timings +/- 1 or 2 every time you run the program. I'm wondering if some of the numbers might have been inflated because your CPU microcode was busy doing other things like handling I/O or something else. Also, since you have a faster machine it might be necessary to modify the inner loop count to repeat each instruction 200 times instead of the 100 as coded. MichaelW had also mentioned something like that. I certainly believe you got those results ... I'm just not sure I'd fully trust them without some more investigations. The certainly do seem peculiar! Maybe someone else can explain what's going on ... I just don't know!

denise_amiga · June 06, 2005, 10:24:14 AM

I have executed the program several times, because I not believed the times of several operations, as you comment, CPU microcode busy etc, but the times are those WOW!!!
I already have readed the deficiencies of p4 in certain instructions as Shifts/Rotations (http://www.website.masmforum.com/mark/index.htm), the surprise has been the sum with carry

MichaelW · June 06, 2005, 10:55:57 AM

Neither of the modifications below make any change on a P3, except the last one takes twice as many cycles because there are twice as many instructions.

Code Select


; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition
      sz SIZESTR <definition>
      instruction SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          instruction
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(definition," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    adc ecx,ebx
    adc edx,eax

    time "nop                    "
    time "mov eax,0              "
    time "mov eax,ecx            "
    time "xor eax,eax            "
    time "sub ecx,ecx            "
    time "and ecx,0              "
    time "add ecx,1              "
    time "inc ecx                "
    time "mov edi,edi            "
    time "shl ecx,2              "
    time "lea ecx,[4*ecx]        "
    time "lea edx,[4*ecx]        " ; eliminate dependency?
    time "shl ecx,3              "
    time "lea ecx,[8*ecx]        "
    time "lea edx,[8*ecx]        " ; eliminate dependency?
    time "add ecx,ebx            "
    time "adc ecx,ebx            "
    time "db 13h,0cbh,13h,0d0h   " ; adc ecx,ebx adc edx,eax

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Mark Jones · June 06, 2005, 01:43:11 PM

Here's an AMD XP 1800+

Code Select


nop                     : 29 cycles
mov eax,0               : 39 cycles
mov eax,ecx             : 34 cycles
xor eax,eax             : 34 cycles
sub ecx,ecx             : 34 cycles
and ecx,0               : 96 cycles
add ecx,1               : 96 cycles
inc ecx                 : 96 cycles
mov edi,edi             : 96 cycles
shl ecx,2               : 96 cycles
lea ecx,[4*ecx]         : 207 cycles
lea edx,[4*ecx]         : 54 cycles
shl ecx,3               : 96 cycles
lea ecx,[8*ecx]         : 198 cycles
lea edx,[8*ecx]         : 54 cycles
add ecx,ebx             : 96 cycles
adc ecx,ebx             : 96 cycles
db 13h,0cbh,13h,0d0h    : 197 cycles

Using REPEAT_COUNT = 100, it seems to place 99 of each instruction in the code. The nop's are inserted from 00401070 to 004010D3. Am I looking at this incorrectly? I modified it slightly to add 1 instruction so now the nop's count is exactly 100. I also added a 250ms sleep, which did very little.. Here are the new results:

Code Select


nop                     : 29 cycles
mov eax,0               : 39 cycles
mov eax,ecx             : 34 cycles
xor eax,eax             : 34 cycles
sub ecx,ecx             : 34 cycles
and ecx,0               : 97 cycles
add ecx,1               : 97 cycles
inc ecx                 : 97 cycles
mov edi,edi             : 96 cycles
shl ecx,2               : 97 cycles
lea ecx,[4*ecx]         : 199 cycles
lea edx,[4*ecx]         : 54 cycles
shl ecx,3               : 97 cycles
lea ecx,[8*ecx]         : 200 cycles
lea edx,[8*ecx]         : 54 cycles
add ecx,ebx             : 97 cycles
adc ecx,ebx             : 97 cycles
db 13h,0cbh,13h,0d0h    : 200 cycles

So then would it be safe to assume that one of each instruction is:

nop : 0.3 cycle
mov eax,0 : 0.4 cycle
mov eax,ecx : 0.33 cycle
xor eax,eax : 0.33 cycle
sub ecx,ecx : 0.33 cycle
and ecx,0 : 1 cycle
add ecx,1 : 1 cycle
inc ecx : 1 cycle
mov edi,edi : 1 cycle
shl ecx,2 : 1 cycle
lea ecx,[4*ecx] : 2 cycles
lea edx,[4*ecx] : 0.5 cycle
shl ecx,3 : 1 cycle
lea ecx,[8*ecx] : 2 cycles
lea edx,[8*ecx] : 0.5 cycle
add ecx,ebx : 1 cycle
adc ecx,ebx : 1 cycle
db 13h,0cbh,13h,0d0h : 2 cycles

[attachment deleted by admin]

Jimg · June 06, 2005, 03:28:42 PM

If I may ask a question here, isn't register stall playing a big part in thest tests? I've found that, using the original 4 test instructions, the instruction " and ecx,0" took three times longer than the others, but if I follow each test instruction with 3 nops, the " and ecx,0" instruction was the fastest. ?

pbrennick · June 06, 2005, 04:45:23 PM

AMD XP 1000

nop : 30 cycles
mov eax,0 : 39 cycles
mov eax,ecx : 34 cycles
xor eax,eax : 34 cycles
sub ecx,ecx : 34 cycles
and ecx,0 : 96 cycles
add ecx,1 : 96 cycles
inc ecx : 96 cycles
mov edi,edi : 96 cycles
shl ecx,2 : 96 cycles
lea ecx,[4*ecx] : 197 cycles
lea edx,[4*ecx] : 54 cycles
shl ecx,3 : 96 cycles
lea ecx,[8*ecx] : 197 cycles
lea edx,[8*ecx] : 54 cycles
add ecx,ebx : 96 cycles
adc ecx,ebx : 96 cycles
db 13h,0cbh,13h,0d0h : 196 cycles

Adding 3 nops to the time macro...

nop : 131 cycles
mov eax,0 : 147 cycles
mov eax,ecx : 134 cycles
xor eax,eax : 134 cycles
sub ecx,ecx : 134 cycles
and ecx,0 : 131 cycles
add ecx,1 : 131 cycles
inc ecx : 131 cycles
mov edi,edi : 135 cycles
shl ecx,2 : 131 cycles
lea ecx,[4*ecx] : 197 cycles
lea edx,[4*ecx] : 135 cycles
shl ecx,3 : 131 cycles
lea ecx,[8*ecx] : 198 cycles
lea edx,[8*ecx] : 135 cycles
add ecx,ebx : 135 cycles
adc ecx,ebx : 134 cycles
db 13h,0cbh,13h,0d0h : 196 cycles

nops seem to give more consistent results and as Jimq notes, it shows a speed-up for "and ecx,0" meaning, I think, that there is definitely a register stall problem with the code.

Paul

MichaelW · June 06, 2005, 06:32:28 PM

I'm not sure "register stall" is the correct term. What I get from Agner Fog's Pentium Optimization document is that the repeated identical instructions form a "dependence" chain, where each instruction must wait for the preceding instruction to complete. For a P3 an AND reg,immed instruction generates 1 uop that can go through port 0 or 1, so I think this means that the instruction can execute in 0.5 cycles under the best conditions. If I break up the dependence chain by interleaving "independent" instructions:

time "db 83h,0E1h,0,83h,0E2h,0" ;and ecx,0 and edx,0

On my P3 I get:

and ecx,0 : 95 cycles
db 83h,0E1h,0,83h,0E2h,0 : 96 cycles

With two instructions running in a single cycle.

For a P4, the instruction generates 1 uop, but it can only go through port 0, so with the double-clocked ALU the throughput should be 2 instructions per cycle, with or without the interleaving. Perhaps someone with a P4 could test this.

And for the ADC reg,reg:

On a P3 it generates 2 uops that can go through port 0 or 1, so with a dependency chain the throughput should be 0.5 instructions per cycle, which is what I am getting. And without dependency it should be 1 instruction per cycle, which is only half of what I'm getting, so the dependency still exists. There does not appear to be any other instruction that will exactly pair with an ADC reg,reg, but I can break the dependency chain with an XCHG EAX,EDX which generates 3 uops that can go through port 0 or 1, and both instructions execute in <3 cycles.

time "db 92h,13h,0cbh " ;xchg eax,edx, adc ecx,ebx

adc ecx,ebx : 196 cycles
db 92h,13h,0cbh : 273 cycles

On a P4 it generates 4 uops that can go through port 1, so with or without a dependency chain the throughput should be 0.5 instructions per cycle (I think). And because the XCHG EAX,EDX is only 3 uops and can go through port 0 or 1, it should be able to execute in parallel with the ADC ECX,EBX with the same throughput.

AeroASM · June 06, 2005, 07:32:34 PM

Quote from: MichaelW on June 06, 2005, 06:32:28 PM
And for the ADC reg,reg:

On a P3 it generates 2 uops that can go through port 0 or 1, so with a dependency chain the throughput should be 0.5 instructions per cycle, which is what I am getting.

Surely 2 instructions per cycle, or 0.5 cycles per instruction?

Where can I find out about uops? Also what are port 0 and port 1? Are they the U and V pipelines?

pbrennick · June 06, 2005, 08:52:26 PM

Hi Michael,
You know more about these things than I do. All I know is that there was a definite change when I added the 3 nops and the change was pretty uniform across all the results, not just the one I mentioned. All except the final one, that is...

For whatever reason, the last test is the only one that did not change whether there are nops or not. Why is that? Is this the dependency thing you are talking about?

Paul

QvasiModo · June 06, 2005, 09:01:34 PM

A dependency chain occurs when you have instructions that depend on the results of previous ones to be run. For example:

Code Select


mov eax, 1
add eax, 2
sub eax, 3

This is a dependency chain, because instruction 2 can't be executed until 1 is finished, and the same for instruction 3. This means that they will have to be executed in that order, sequentially.

That means the instructions being tested here are not identical... some of them depend on the previous value of ECX and some don't.

BTW, the U and V pipes don't exist anymore, that's Pentium and Pentium MMX stuff. The Pentium II and above work in a very different fashion, check out the Intel manuals, they're really worth it. :)

News:

peculiar result

Phil

Phil

pbrennick

AeroASM

pbrennick

QvasiModo