When I saw [mov edx, [eax]; add eax, 4] in my code, it occurred to me to use the stack's auto-updating addressing instead. Since pop is such a common operation, it is heavily optimized, and its encoding is shorter.
On my home machine the difference was barely visible, but on my machine at work the gain was something like +20%.
What if we used this as a common idiom? It should not affect the cache in any way, so performance should be at least the same. The code must be written carefully so that it never triggers a non-local goto (exceptions, say), and function calls are out of the question. It worked okay on Windows, but if a context switch occurs before the stack pointer is restored, can this trick crash the environment (is there an OS that assumes a valid esp)?
What can you say about that?
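For readers who think in C, the pair being replaced is just a load followed by a pointer bump, which is also what pop does with esp. A minimal sketch of that behaviour; pop_dword is a made-up name used only for illustration, not something from the thread or any library:

```c
#include <stdint.h>

/* Hypothetical model of "pop edx": read the dword the pointer refers to,
   then advance the pointer by one dword (4 bytes), just like
   mov edx, [eax] followed by add eax, 4. */
uint32_t pop_dword(const uint32_t **sp)
{
    uint32_t value = **sp;  /* mov edx, [eax] */
    *sp += 1;               /* add eax, 4     */
    return value;
}
```

The trick in question simply lets the hardware do both steps in one instruction by putting the data pointer in esp.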
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.
Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.
Yeah, that may be so. But I'm not considering it for trivial loads, only for a loop like this:
xchg eax, esp
.loop:
pop edx
...
jmp .loop
xchg eax, esp
So, there's one XCHG per hundreds or thousands of cycles.
Thank you.
Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.
XCHG reg,mem is slow, because of the implicit LOCK that gets done. AFAIK, XCHG reg, reg is a perfectly reasonable instruction to use.
As to the original question, interrupts and task switches use a different stack, so you don't have to worry about those events corrupting your stack.
Cheers,
Randy Hyde
I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
include \masm32\include\oleaut32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\oleaut32.lib
include \masm32\macros\macros.asm
include macros2.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
lpString1 dd 0
lpString2 dd 0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
STRLENGTH EQU 1048576 * 4
LOOPCOUNT EQU 1
mov lpString1, alloc$(STRLENGTH)
mov lpString2, alloc$(STRLENGTH)
print chr$("rep movsd : ")
clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
mov esi, lpString1
mov edi, lpString2
mov ecx, STRLENGTH SHR 2
cld
rep movsd
clockctr_end
print ustr$(eax)
print chr$(" clock cycles",13,10)
print chr$("move through reg32 in loop : ")
clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
mov esi, lpString1
mov edi, lpString2
mov ecx, STRLENGTH SHR 2
@@:
mov eax, [esi]
mov [edi], eax
add esi, 4
add edi, 4
sub ecx, 1
jnz @B
clockctr_end
print ustr$(eax)
print chr$(" clock cycles",13,10)
print chr$("stack auto-increment : ")
clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
mov esi, lpString1
mov edi, lpString2
mov ecx, STRLENGTH SHR 2
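xchg esi, esp ; esp = source pointer, esi = saved stack pointer
@@:
pop DWORD PTR [edi] ; load dword from [esp], store it to [edi], esp += 4
add edi, 4
sub ecx, 1
jnz @B
xchg esi, esp ; restore the real stack pointer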
clockctr_end
print ustr$(eax)
print chr$(" clock cycles",13,10)
free$ lpString1
free$ lpString2
mov eax, input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Typical output on a P3:
rep movsd : 16461048 clock cycles
move through reg32 in loop : 19885052 clock cycles
stack auto-increment : 20035101 clock cycles
Quote from: MichaelW on February 02, 2005, 06:23:23 AM
I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.
Interrupts, multitasking, and other events pretty much guarantee that you will not get consistent results if your sample period is long enough.
Cheers,
Randy Hyde
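One common way to cope with that noise is to time the same loop several times and keep the smallest count, on the assumption that the fastest run is the one least disturbed by interrupts and task switches. A minimal sketch in C, not MichaelW's macros: copy_dwords and min_copy_ticks are hypothetical names, and clock() is only a coarse stand-in for a real cycle counter.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the loop under test: copy n dwords. */
void copy_dwords(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Time the copy `runs` times and return the smallest elapsed value,
   in clock() ticks. The minimum is kept because outside events can
   only make a run slower, never faster. */
clock_t min_copy_ticks(uint32_t *dst, const uint32_t *src, size_t n, int runs)
{
    clock_t best = 0;
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        copy_dwords(dst, src, n);
        clock_t elapsed = clock() - t0;
        if (r == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

With a LOOPCOUNT of 1, as in the listing above, there is only a single sample, so no minimum can be taken.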
With only 8 registers it is hard to free up one to save ESP to. Try this instead.
movd mm0, esp
.loop:
pop edx
...
jmp .loop
movd esp, mm0
I have used this technique in the past.
funcname proc args etc ...
.data?
reg_esp dd ?
.code
mov reg_esp, esp
; write code here using ESP
mov esp, reg_esp
ret
funcname endp
You must be careful when using [xchg xxx, esp],
because esp (the stack pointer) is changed.
If an NMI comes in, maybe ...???
All these interrupt handlers are still implemented in ring0 in every operating system I know, and according to the Intel System Programming Manual it's even forbidden to do them in ring3. Each switch to ring0 means that the stack is swapped first (away from your non-stack, but who cares about that except a debugger), and then the kernel stack assigned to you (or a generic kernel one) is used for storing those registers. Only for a few exceptions (see also this link on a different forum (http://www.mega-tokyo.com/forum/index.php?board=1;action=display;threadid=7259) where people are discussing implementing exceptions in ring3 and their limitations) can you expect them to be handled in user space at some point, but no current OS does this as far as I know.
Summary: Nothing yet, might be a few in the future, just don't do them.
@Randall Hyde: I agree with you, as a CPU cannot access another's registers. And thank you for your information.
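@MichaelW: in this case, the routine using the trick in question could avoid xchg. I'll use the following (obviously, only if it works):
mov esi, esp ; save the real stack pointer
mov esp, lpString1 ; esp now walks the source buffer
mov edi, lpString2
mov ecx, STRLENGTH SHR 2
@@:
dec ecx ; sets ZF; pop leaves the flags alone
pop DWORD PTR [edi+ecx*4] ; destination index counts down, so the dwords land in reverse order
jnz @B
mov esp, esi ; restore the real stack pointer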
But I see this trick is not intended for move operations.
@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. And is there a need for emms afterwards?
Quote from: UncannyDude on February 03, 2005, 11:14:17 AM
@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. And is there a need for emms afterwards?
All times are for a P4. The actual timing will vary depending on dependencies and other stalls. Xchg runs in 1.5 cycles. "movd register,mmx register" runs in 5 cycles. "movd mmx register,register" runs in 2 cycles. So xchg is faster, however the way you are doing it wastes a CPU register to save it. So what is the point in even using it when you have to waste another register to save the value in the register? Does that make sense?
Also, if you aren't doing floating point code you don't have to use EMMS. If you do use floating point, then put the EMMS once, right before any floating point code. As an alternative you can use this; it won't require an EMMS, but it's slower.
movd xmm0,esp
movd esp,xmm0
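To put the P4 figures above in proportion: an xchg pair costs about 3 cycles (1.5 + 1.5) and a movd save/restore pair about 7 (2 + 5). A rough sketch of how that amortizes over a loop, ignoring the dependency stalls mentioned above; overhead_fraction is a made-up helper and the 1000-cycle loop used below is purely an assumption for illustration:

```c
/* Fraction of the total cycles spent saving and restoring esp.
   pair_cycles: cost of the save instruction plus the restore instruction.
   loop_cycles: assumed cost of everything that runs between them. */
double overhead_fraction(double pair_cycles, double loop_cycles)
{
    return pair_cycles / (pair_cycles + loop_cycles);
}
```

Under that assumption, overhead_fraction(3.0, 1000.0) is about 0.003 and overhead_fraction(7.0, 1000.0) is about 0.007, so either way the save/restore stays well below 1% of the run.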
At the time I considered using "xchg esp, eax" I felt no register pressure, so the point was precisely to "waste" a spare register.