Is there a way to set all registers to 0? (without explicitly assigning them val

Started by starsiege, March 30, 2009, 05:36:22 PM

NightWare

 :bg no one uses cdq to clear edx once eax has been cleared ?

Quote from: dedndave on March 30, 2009, 07:24:50 PM
XOR AX,AX
MOV CX,AX   ; <- not a good idea, there is a stall here...
MOV DX,AX
XOR AX,AX
XOR CX,CX
MOV DX,AX   ; <- here it's ok, coz dependency broken
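
The cdq remark at the top of this post, as a minimal sketch. cdq copies the sign bit of eax into every bit of edx, so it clears edx only when eax is already known to be 0 (or at least non-negative). It still reads eax, so it waits on the instruction that produced eax just like a mov would.

```asm
    xor eax, eax    ; eax = 0 (2 bytes)
    cdq             ; edx = sign extension of eax = 0 (1 byte, opcode 99h)
```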

dedndave

huh ?
i don't get it

first let's put it in 32 bit

XOR eax,eax
MOV ecx,eax
MOV edx,eax
MOV ebx,eax
.
.
.
etc

why is there a stall ?

NightWare

Quote from: dedndave on April 01, 2009, 12:47:30 AM
why is there a stall ?
XOR eax,eax
MOV ecx,eax ; because the cpu needs to wait for the result of the previous line before it can use eax in this line

more precisely, the cpu must internally mark eax as usable before it can be read again, so there is a "wait" until that is the case.  :bg and xor is a "destructive" instruction: it alters eax (and sets the flags).
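
A sketch of the two orderings being compared, for illustration only. Whether the second form is actually faster depends on the CPU: some models recognise xor reg,reg as a zeroing idiom that does not depend on the register's old value, and others do not, so treat the comments as the claim under discussion rather than a measured result.

```asm
    ; dependent form: both movs read eax, so they cannot
    ; execute until the xor has produced its result
    xor eax, eax
    mov ecx, eax
    mov edx, eax

    ; independent form: no instruction reads a register
    ; written by another instruction in the group
    xor eax, eax
    xor ecx, ecx
    xor edx, edx
```

Both forms are the same size, since xor r32,r32 and mov r32,r32 each encode in 2 bytes.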

MichaelW

Over time I have tried multiple variations of this, and I can't recall ever seeing evidence of a significant stall.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        xor eax, eax
        xor ecx, ecx
        mov ebx, eax
        mov edx, ecx
      ENDM
    counter_end
    print ustr$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        xor eax, eax
        xor ebx, eax
        mov ecx, eax
        mov edx, eax
      ENDM
    counter_end
    print ustr$(eax),13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        add eax, 1
        add ecx, 1
        mov ebx, eax
        mov edx, ecx
      ENDM
    counter_end
    print ustr$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        add eax, 1
        xor ebx, eax
        mov ecx, eax
        mov edx, eax
      ENDM
    counter_end
    print ustr$(eax),13,10,13,10

    inkey "Press any key to exit..."
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Running on my P3, I would expect to see a much larger effect from 100 or more stalls:

206
208

207
205

eschew obfuscation

dedndave

xor eax,eax is 3 clock cycles, if i remember correctly
i have never heard of this "condition dependency" before
this is not the same as a queue hit at all
is this something new ?

NightWare

Quote from: MichaelW on April 01, 2009, 02:27:12 AM
Running on my P3, I would expect to see a much larger effect from 100 or more stalls:

206
208

207
205
hmm...yep, here the stall is partially balanced by the non-change of the source argument (a work/branch avoided by the cpu), so in this specific case, the difference is minimised.

        xor eax, eax
        xor ecx, ecx
        mov ebx, eax
        mov edx, ecx

technically the "result" stall is slower (but never by enough to justify adding a useless instruction, not even a nop !). a stall is also possible on the source argument (with r+i/r forms), but not here. so you don't need to change the register in the last line; you keep the benefit of an unchanged source (there is no dependency/alteration here) :

        xor eax, eax
        xor ecx, ecx
        mov ebx, eax
        mov edx, eax

Quote from: dedndave on April 01, 2009, 02:36:19 AM
xor eax,eax is 3 clock cycles, if i remember correctly
i have never heard of this "condition dependency" before
this is not the same as a queue hit at all
is this something new ?
xor is just 1 clock cycle (same as mov); they have the same speed coz they do approximately the same thing :
xor = xor reg,reg + flag fix
mov = xor reg1,reg1 (without flag fix) + or reg1,reg2 (without flags fix)

no, it's not something new (the first time i read an asm doc, it was already mentioned, by agner i think...). you can't produce code without stalls; all you can do is reduce the effects by avoiding the slowest cases (when possible)...

jj2007

Quote from: Mark Jones on March 31, 2009, 11:42:23 PM
"Please place all holier-than-thou drivel in the collesseum where it may putrify properly, thank you."

No fun allowed here, eh ::)

Ok, so let's get damn serious :8)
1. While it may seem difficult to find a serious application for a macro or an algo that sets all registers to zero (esp??), those here who are not just seeking fun are obviously taking that possibility seriously.
2. So as good assembler programmers, we must optimise it!
3. Maybe due to my overall lack of experience, I had severe difficulty imagining a case where zeroing all registers would be needed in a speed-critical innermost loop. Therefore I tried to optimise it for size. My apologies if that was not serious.
:8)

jj2007

Quote from: jj2007 on April 01, 2009, 06:07:03 AM
Quote from: MichaelW on April 01, 2009, 02:27:12 AM
Running on my P3, I would expect to see a much larger effect from 100 or more stalls:

I added an inc eax variant. 10 cycles (203->213) is 5%, can that be explained by stalls..?

Quote
Celeron M
203 xm interleaved
213

208 add/mov
205 inc/mov
212 add/xor/mm

EDIT: P4

309 xm interleaved
157

162 add/mov
239 inc/mov
162 add/xor/mm


    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        add eax, 1
        add ecx, 1
        mov ebx, eax
        mov edx, ecx
      ENDM
    counter_end
    print ustr$(eax), " add/mov", 13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        inc eax
        inc ecx
        mov ebx, eax
        mov edx, ecx
      ENDM
    counter_end
    print ustr$(eax)," inc/mov", 13,10


rags

[NoMorningCoffee]
I think Donkey said it best in another thread when he essentially said what's the use in spending 6 hours writing a piece of code to save 10 seconds over the lifespan of a program?
[\NoMorningCoffee]     :bg
God made Man, but the monkey applied the glue -DEVO

lingo

"... less than 12 bytes in less than 5 cycles...." :lol

.data
zer  dd 0,0,0,0,0,0,0,0, zer

.code
xchg esp, [zer+8*4]
popad
pop esp

00401004 87 25 E4 E4 42 00           xchg        esp, dword ptr ds:[42E4E4h]
0040100A 61                          popad           
0040100B 5C                          pop         esp     ; just 8 bytes



jj2007

Quote from: lingo on April 02, 2009, 02:01:09 AM
"... less than 12 bytes in less than 5 cycles...." :lol

.data
zer  dd 0,0,0,0,0,0,0,0, zer

.code
xchg esp, [zer+8*4]
popad
pop esp

00401004 87 25 E4 E4 42 00           xchg        esp, dword ptr ds:[42E4E4h]
0040100A 61                          popad           
0040100B 5C                          pop         esp     ; just 8 bytes



You got me - almost :U

First, you are cheating: That is 8+9*4=44 bytes, because you need initialised memory.

Second, even when considering that this memory could be used once, and the macro/algo could be called a thousand times: your code will work only once...

:green2
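
Why the second call fails, traced by hand as comments on the quoted code. S is a hypothetical name for the value of esp on entry; it is not part of the original code.

```asm
    ; assumption: esp = S on entry, [zer+32] holds the address of zer
    xchg esp, [zer+8*4]   ; 1st use: esp = offset zer, [zer+32] = S
    popad                 ; loads the eight zeros (the esp slot is skipped), esp = zer+32
    pop esp               ; esp = S again, stack restored

    ; [zer+32] now holds S, no longer the address of zer
    xchg esp, [zer+8*4]   ; 2nd use: swaps S with S, so esp still points at the real stack
    popad                 ; pops eight dwords of live stack data into the registers
    pop esp               ; esp = whatever dword happened to sit at S+32
```

After the second sequence the registers hold stack contents instead of zeros and esp is arbitrary, so the next push or call is likely to fault.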


include \masm32\include\masm32rt.inc

.data
zer dd 0,0,0,0,0,0,0,0, zer

.data?
Null8 dd ?, ?, ?, ?, ?, ?, ?, ?, ?

.code

start:
int 3 ; let Olly speak
nop

mov eax, offset Null8+8*4  ; jj, first use
xchg esp, eax
mov [esp+32], eax
popad
pop esp

nop

mov eax, offset Null8+8*4  ; jj, 2nd use
xchg esp, eax
mov [esp+32], eax
popad
pop esp

nop

xchg esp, [zer+8*4]  ; Lingo, first use
popad
pop esp

nop

xchg esp, [zer+8*4]  ; Lingo, 2nd use
popad
pop esp

nop
exit
end start


lingo

"and the macro/algo could be called a thousand times:.."

Where...bla,bla,blah... Lingo is an idea generator...jj just steals other's ideas... :lol


jj2007

Quote from: lingo on April 02, 2009, 03:07:07 AM
"and the macro/algo could be called a thousand times:.."

Where...bla,bla,blah... Lingo is an idea generator...jj just steals other's ideas... :lol


:boohoo:

Here is an 11-byte version that works repeatedly:

mov eax, offset Null8+8*4
mov [eax+32], esp
xchg esp, eax
popad
pop esp


Ever noticed that difference:
   mov [esp+32], eax   ; 4 bytes
   mov [eax+32], esp   ; 3 bytes
:toothy
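
The extra byte comes from the addressing mode: in 32-bit code a memory operand with esp as its base register needs a SIB byte after the ModRM byte, while eax as base does not. The encodings below are worked out by hand from the Intel encoding rules, not copied from an assembler listing.

```asm
    mov [esp+32], eax   ; 89 44 24 20  opcode, ModRM, SIB (base = esp), disp8
    mov [eax+32], esp   ; 89 60 20     opcode, ModRM, disp8
```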

herge

 Hi jj2007:


(238c.2390): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=ffffffff ebx=8054b6ed ecx=84f75930 edx=0012ffc8 esi=00a0f6f2 edi=7c817067
eip=0040102d esp=7c839ac0 ebp=00a0f75a iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
stall!start+0x2d:
0040102d 6a00            push    0



I dropped the int 3

Break into Debug Trace Mode or Send Message to Big M$rcoSoft.

Is this C5 supposed to happen?

Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy