Possible problems with SSE usage.

MichaelW · July 16, 2010, 10:21:57 PM

Quote from: clive on July 16, 2010, 07:44:28 PM
Trying to quantify the memory speed.

PC133 SDRAM and IIRC I set it up to use the fastest supported timings.

KeepingRealBusy · July 16, 2010, 11:08:52 PM

dedndave,

Quote
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?

No, this is not a quote, but an observation from the documentation. If you are at a user task level, anything you do to get back to the OS must cause a task switch, and this automatically saves your stack pointer (and selector) in the TSS, then loads the stack pointer and selector with appropriate values depending on the reason for the switch (fault, interrupt, call), then starts saving your IP on the NEW (system) stack. Anything that happens to YOUR stack must happen at user task level, i.e., push, pop, mov and call (to your local procedure). With multiple threads, I believe, each thread has its own stack.

Could the OS possibly alter that? The OS is capable of putting anything anywhere in memory once it gets control. With a single core, a task switch must happen for the OS to get control, but multi-core means the other core may be running the OS. Yes, it could change something while you are running. Are we talking virus conditions here? If so, anything could happen; otherwise, I doubt it will. To quote "Pogo": "We have met the enemy and he is us."

Watch where you step, it gets pretty deep in some places.

Dave.

KeepingRealBusy · July 16, 2010, 11:14:40 PM

Quote from: clive on July 16, 2010, 07:44:28 PM
Quote from: MichaelW
Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ, followed by a WRITE, that occur back-to-back at the same address across the entire bit line width of the memory subsystem. In your case here, that is about 19 ns for the READ and 19 ns for the WRITE. Say 52 MHz.

Quote from: Queue
Why would xchg mem,reg be so extra costly on a P4?

As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e., RMW) and a serializing event. Therefore the processor must complete/retire all pending operations (OoO, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDR SDRAM, RAMBUS, etc.), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.

In order to allow the processor to speed along, most everything sent to memory is buffered/deferred/delayed and written back in a lazy manner, and prefetching/cache line reads are prioritized so as not to stall forward motion of the processor.

It's not so much a cycles issue as a time issue.
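
The cycles-versus-time arithmetic in the quote can be sketched in a few lines of C. This is only a rough illustration: it assumes the xchg cost on the 500 MHz P3 was about 19 cycles (that measurement is not shown here) and that READ and WRITE split the time evenly. The helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helpers, not from the thread: convert a cycle count to
   wall-clock time, and an event duration to an event rate. */

/* One cycle lasts 1000/clock_mhz nanoseconds. */
double cycles_to_ns(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}

/* An event that takes `ns` nanoseconds can repeat 1000/ns million times
   per second, i.e. at 1000/ns MHz. */
double ns_to_mhz(double ns)
{
    assert(ns > 0.0);
    return 1000.0 / ns;
}
```

Under those assumptions, cycles_to_ns(19, 500) gives 38 ns for the READ plus WRITE pair, so 19 ns each, and ns_to_mhz(19) comes out a little under 53 MHz, in line with the "say 52 MHz" estimate. The same 19 cycles would be a much shorter time on a faster core, which is why the quote calls it a time issue and not a cycles issue.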

Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

Dave.

dedndave · July 17, 2010, 12:50:41 AM

as i mentioned, i am from the old-school side of the fence on this issue
so far, i have seen no harm come from using the stack that way
but, it seems to me that leaving the barn door open doesn't mean the horses are going to leave :P
i feel more comfortable by adjusting the stack pointer
and, let's face it - it doesn't cost that much in terms of code size or clock cycles

KeepingRealBusy · July 17, 2010, 01:00:13 AM

I totally agree.

Dave.

FORTRANS · July 17, 2010, 01:57:34 PM

Hi,

Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions. An old reference mentions
the following.

Code Select


BT, BTC, BTR, BTS   mem, reg/imm
XCHG   reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR   mem, reg/imm
DEC, INC, NEG, NOT   mem

Regards,

Steve N.
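
For readers who meet these instructions through a compiler, here is a minimal C11 sketch of the same two ideas: an atomic exchange and a locked read-modify-write. The function names are hypothetical. On x86, compilers typically translate atomic_exchange to xchg with a memory operand and atomic_fetch_add to a LOCK-prefixed add or xadd, but the exact instruction selection is up to the compiler and is an assumption here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical wrappers, for illustration only. */

/* Atomically store new_value in *slot and return the previous contents. */
int swap_in(atomic_int *slot, int new_value)
{
    assert(slot != NULL);
    return atomic_exchange(slot, new_value);
}

/* Atomically add 1 to *counter and return the value after the increment. */
int locked_increment(atomic_int *counter)
{
    assert(counter != NULL);
    /* atomic_fetch_add returns the value held before the addition. */
    return atomic_fetch_add(counter, 1) + 1;
}
```

Both calls are indivisible with respect to other threads, which is the property the LOCK prefix provides and also the source of the extra cost being discussed in this thread.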

KeepingRealBusy · July 17, 2010, 04:08:59 PM

Quote from: FORTRANS on July 17, 2010, 01:57:34 PM
Hi,

Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions. An old reference mentions
the following.

Code Select

BT, BTC, BTR, BTS   mem, reg/imm
XCHG   reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR   mem, reg/imm
DEC, INC, NEG, NOT   mem

Regards,

Steve N.

Thank you, thank you, thank you. I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

Dave.

jj2007 · July 17, 2010, 04:27:27 PM

Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink

KeepingRealBusy · July 17, 2010, 04:37:57 PM

Quote from: jj2007 on July 17, 2010, 04:27:27 PM
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink

JJ,

I think the restriction applies to the mem, reg/imm forms; reg, reg should be OK. That's another thing I have to read up on in the specs.

Dave.

jj2007 · July 17, 2010, 04:49:50 PM

Here is a snippet:

Code Select

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10

... and various results:

Code Select

170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100

xchg seems to be the worst case.

MichaelW · July 17, 2010, 04:50:02 PM

I think the ability of an instruction to take a LOCK prefix is not the problem; it's the actual presence of the prefix or, in the case of

XCHG mem, reg

the implicit lock. Per the Intel manual:

"If a memory operand is referenced, the processor's locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL."

KeepingRealBusy · July 17, 2010, 05:23:09 PM

Quote from: jj2007 on July 17, 2010, 04:49:50 PM
Here is a snippet:
Code Select

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10

... and various results:

Code Select

170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100

xchg seems to be the worst case.

JJ, could you post the .zip? I'll try it on my AMD.

Dave.

MichaelW · July 17, 2010, 05:42:57 PM

Code Select


;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================
    .data
        mem dd 0
    .code
;==============================================================================
start:
;==============================================================================
    invoke Sleep, 3000

    REPEAT 3

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        lock inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (lock inc mem)*100",13,10

    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start

Code Select


627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
638 cycles, (inc mem)*100
2246 cycles, (lock inc mem)*100
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
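
To read these totals as a per-instruction cost, divide by the 100 repeats in each timed block. A small C sketch of that arithmetic follows; the helper names are hypothetical, and the figures are simply the ones printed above.

```c
#include <assert.h>

/* Hypothetical helpers, for illustration only. */

/* Average cycles per instruction for a timed block of `count` repeats. */
double cycles_per_op(unsigned total_cycles, unsigned count)
{
    assert(count > 0);
    return (double)total_cycles / (double)count;
}

/* How many times slower the locked form is than the plain form. */
double slowdown(unsigned locked_cycles, unsigned plain_cycles)
{
    assert(plain_cycles > 0);
    return (double)locked_cycles / (double)plain_cycles;
}
```

With the first pair of results, cycles_per_op(627, 100) is about 6.3 cycles per inc and cycles_per_op(2239, 100) is about 22.4 cycles per lock inc, so slowdown(2239, 627) works out to roughly 3.6 times on this machine.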

jj2007 · July 17, 2010, 06:10:16 PM

Quote from: KeepingRealBusy on July 17, 2010, 05:23:09 PM
JJ, could you post the .zip, I'll try on my AMD. Dave

Here it is.

KeepingRealBusy · July 17, 2010, 08:54:12 PM

JJ,

Here are my timings. I added my cpuid for identification - (why else would I add it?):

AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
144 cycles, (xchg reg,reg)*100
1853 cycles, (xchg reg,mem)*100
1819 cycles, (xchg mem,reg)*100
149 cycles, (exchange reg,reg)*100 using mov
506 cycles, (exchange reg,mem)*100 using mov
553 cycles, (exchange reg,mem)*100 using pop [ebx]
552 cycles, (exchange reg,mem)*100 using push [ebx]
793 cycles, (and mem)*100
777 cycles, (or mem)*100
819 cycles, (inc mem)*100
792 cycles, (inc mem)*100 using eax
808 cycles, (inc dec mem)*100
687 cycles, (inc mem)*100
