
a shift (shl shr) on a 64 bits constants

Started by ToutEnMasm, June 11, 2009, 06:13:05 AM


jj2007

#45
Quote from: ToutEnMasm on June 15, 2009, 04:39:39 PM
Sorry, it is your code that is responsible. It leaves the FPU in an indeterminate state. Add FINIT before the call to my routine and all will be OK, except for yours. The end of my routine is a finit, so it couldn't trash anything; surely a bad copy.


When you launch a program, you get a fresh FPU. You can use finit to initialise it, but it is not necessary.

Your code leaves two registers (ST0, ST1) with valid entries on the stack. When you call it a second time, 4 registers are left valid. After four calls, it fails because ST0 is no longer in a free state.
EDIT: More precisely, in call #4, fild qword ptr MyQw1 fails because ST7 is not empty. You can watch that in OllyDbg.

Of course, you can use finit on each call, but first, it is horribly slow, and second, other parts of the code might use the FPU too, so it is not good programming practice to trash it for a low-level operation such as a shift.

EDIT: Just to give you an idea of the difference in speed:

20      cycles for 10*shr64
584     cycles for 10*FPU, fstp*2
1089    cycles for 10*FPU, finit


I love the FPU, but it's just not a good idea to use it for a 64-bit shift...
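For comparison, a non-FPU 64-bit right shift can be sketched with shrd and shr on an edx:eax pair. This is only a minimal sketch, not the shr64 macro timed above. MyQw and SHIFT_COUNT are illustrative names, the count is assumed to be an assembly-time constant in the range 0..63, and the MASM32 SDK is assumed to be installed at \masm32.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime include (assumed path)

SHIFT_COUNT equ 4                       ; illustrative constant, 0..63

.data
MyQw    qword 8FFFFFFFFFFFFFFFh         ; value to shift (illustrative name)

.code
start:
    mov eax, dword ptr MyQw             ; low dword
    mov edx, dword ptr MyQw[4]          ; high dword
  if SHIFT_COUNT ge 32
    mov eax, edx                        ; high dword moves down 32 bits
    xor edx, edx                        ; logical shift: fill with zeros
    shr eax, SHIFT_COUNT-32             ; remaining bits
  else
    shrd eax, edx, SHIFT_COUNT          ; low dword takes bits from the high dword
    shr edx, SHIFT_COUNT                ; high dword shifted, zero-filled
  endif
    mov dword ptr MyQw, eax
    mov dword ptr MyQw[4], edx          ; with count 4: MyQw = 08FFFFFFFFFFFFFFh
    invoke ExitProcess, 0
end start
```

The count is split at 32 because shrd and shr mask their count to 5 bits on 32-bit registers, so a single shrd cannot move bits across more than 31 positions. The FPU is not touched at all, so nothing needs to be cleaned up afterwards.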

ToutEnMasm


Seems you need to review it.
Quote
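MyQw   qword  0000FFFFFFFF0000h
NumberOfShifts equ 48
   mov eax,NumberOfShifts
   push eax
   fild dword ptr [esp]   ; ST0 = shift count
   pop eax
   fld1                   ; ST0 = 1.0, ST1 = shift count
   fscale                 ; ST0 = 2^count
   fild qword ptr MyQw    ; ST0 = value (loaded as signed), ST1 = 2^count
   fdiv st(0),st(1)       ; ST0 = value / 2^count
   fistp qword ptr MyQw   ; store the rounded result and pop
   finit                  ; reinitialise the FPU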

The finit is here to leave the FPU in its original state. The routine can be repeated without problems.

Quote
Your code leaves two registers (ST0, ST1) with valid entries on the stack. When you call it a second time, 4 registers are left valid. After four calls, it fails because ST0 is no longer in a free state.

NOT TRUE








GregL

Hey jj, quit worrying so much about the absolute fastest speed.  You are obsessed, and you expect everyone else to be the same way.  I don't worry about code speed unless I have to.

Using finit is a good idea (most of the time).


jj2007

Quote from: Greg on June 15, 2009, 06:41:25 PM
Hey jj, quit worrying so much about the absolute fastest speed.  You are obsessed, and you expect everyone else to be the same way.  I don't worry about code speed unless I have to.

Using finit is a good idea (most of the time).


Greg, I may seem obsessed but where is the point in choosing an algo that is 20%50% longer and a factor 55 slower? Sure you can use Visual Basic, too, but after all, why are we here, in an assembler forum??

> The finit is here to leave the FPU in its original state.
Noobs might actually believe this, therefore I correct it: finit initialises the FPU, i.e. everything that happened to be in the eight ST registers is gone. That has nothing to do with "original state".

dedndave

i think it is always good to know the fastest way to do things
that is one of the things that makes writing in assembler unique from all other languages
we all understand that not every routine has to be written for speed
but, when it comes time to speed up a repetitive task, the forum is a good place to search
each algo they research adds one more chapter to the reference book

raymond

Quote
When you launch a program, you get a fresh FPU.

TRUE

Quote
You can use finit to initialise it, but it is not necessary.

TRUE

HOWEVER, (at least under Windows) if you don't use finit, the Precision Control is set to double-precision, i.e. 64 bits with 53 bits for the mantissa. Using finit sets the PC bits of the Control Word to extended double-precision, i.e. the full 80-bit precision.

Although finit is slow, it can be done as the first instruction at the start of the program and will be performed in parallel with the other initializing code which does not need the FPU. Then you simply keep the FPU registers clean.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com
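
The precision control setting described above can be checked by reading the FPU control word before and after finit. The following is a minimal sketch; cwStart and cwInit are illustrative names, and the MASM32 SDK is assumed to be installed at \masm32. The PC field sits in bits 8-9 of the control word: 0 = 24-bit, 2 = 53-bit, 3 = 64-bit mantissa.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime include (assumed path)

.data?
cwStart dw ?                            ; control word as found at startup
cwInit  dw ?                            ; control word after finit

.code
start:
    fstcw cwStart                       ; save the control word the program starts with
    finit                               ; initialise the FPU (control word becomes 037Fh)
    fstcw cwInit                        ; save it again for comparison

    movzx eax, cwStart
    shr eax, 8
    and eax, 3                          ; eax = PC field at startup
    movzx ecx, cwInit
    shr ecx, 8
    and ecx, 3                          ; ecx = PC field after finit (3 = 64-bit mantissa)

    invoke ExitProcess, 0
end start
```

Stepping through this in a debugger such as OllyDbg shows both values in eax and ecx just before the exit.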

ToutEnMasm

Hello,
Quote
When you launch a program, you get a fresh FPU.
FALSE, the status word isn't clear, and this can produce false results. If you want, I can post a visual sample of this.
The FINIT at the start of a program is a very good practice.




jj2007

Quote from: ToutEnMasm on June 16, 2009, 08:41:30 AM
Hello,
Quote
When you launch a program, you get a fresh FPU.
FALSE, the status word isn't clear, and this can produce false results. If you want, I can post a visual sample of this.

Go check in Olly. The status word is clear at startup. What differs is the precision control in the control word: it is set to 53 bits of mantissa, i.e. a slightly reduced precision. With finit, you set it to the full 64 bits of precision.

Quote
The FINIT at the start of a program is a very good practice.

Yes indeed. I am glad that you gave up the idea of trashing the FPU after each and every call. I am afraid, however, that it does not solve the inherent problem of using the FPU for a 64-bit shift. It works for certain values but fails for others, irrespective of the FPU precision setting. Higher precision just means that you can use somewhat higher source values.

Results below are for full 64-bit precision, and a correctly balanced FPU stack.

Quote
1234567812345678  original value 1
0123456781234567  shr64 MyQw1, 4, MyQwRes
0123456781234568  shr64TeM MyQw1, 4, MyQwRes         <<<<<<<<<<<< almost OK
8FFFFFFFFFFFFFFF  original value 2
08FFFFFFFFFFFFFF  shr64 MyQw2, 4, MyQwRes
F900000000000000  shr64TeM MyQw2, 4, MyQwRes         <<<<<<<<<<<< ????

1234567812345678  original value 1
0012345678123456  shr64 MyQw1, 8, MyQwRes
0012345678123456  shr64TeM MyQw1, 8, MyQwRes         <<<<<<<<<<<< OK
8FFFFFFFFFFFFFFF  original value 2
008FFFFFFFFFFFFF  shr64 MyQw2, 8, MyQwRes
FF90000000000000  shr64TeM MyQw2, 8, MyQwRes         <<<<<<<<<<<< ????

Code sizes:
fpu     32
alu     25

Timings (Prescott P4):
86      cycles for 10 * shl64, non-FPU
952     cycles for 10 * shl64 FPU, 2*fstp ST
13865   cycles for 10 * shl64 FPU, finit


Full code attached.

[attachment deleted by admin]

ToutEnMasm

I don't want to review your test; just post clear code that can be clearly understood.

jj2007

Quote from: ToutEnMasm on June 16, 2009, 10:08:25 AM
I don't want to review your test; just post clear code that can be clearly understood.

Yves, before complaining about the lack of clarity of my code, you might at least look at it: Download counter is at 1, and that was my own test download. Anyway, here is the macro of your code, in case you want to verify yourself:

shr64TeM MACRO qwarg, NumberOfShifts, destarg
LOCAL dest
   ifb <destarg>
dest equ qwarg
   else
dest equ destarg
   endif
   mov eax, NumberOfShifts
   push eax
;   int 3 ; set a breakpoint for Olly
   fild dword ptr [esp]
   pop eax
   fld1   
   fscale
   fild qword ptr qwarg
   fdiv st(0),st(1)
   fistp qword ptr dest
;   mov eax, dword ptr dest ; check in Olly what
;   mov edx, dword ptr dest[4] ; happens to your FPU
   fstp st ; balance the
   fstp st ; FPU stack correctly
;   finit ; not a good idea
ENDM


Usage:    shr64TeM MyQw1, 40 [, MyDestinationQword]

ToutEnMasm

It is not very difficult to see where the problem is: the finit is commented out at the end.
You also don't show how it is used.
Clear code has .code, start: and ExitProcess.
Anything else is a waste of time.

jj2007

Quote from: ToutEnMasm on June 16, 2009, 10:39:57 AM
It is not very difficult to see where the problem is: the finit is commented out at the end.
You also don't show how it is used.
Clear code has .code, start: and ExitProcess.
Anything else is a waste of time.

You are inconsistent: Just a few posts above, you had agreed that once is enough. Which is what I did.

Suggestion: Uncomment the finit, assemble and run. Post the results here. I am really curious if your CPU produces a different result.

ToutEnMasm


Quote
You are inconsistent: Just a few posts above, you had agreed that once is enough. Which is what I did.
Test my routine as it is, not how you want it to be. It also seems that you assert many things that are false.
Quote
you had agreed that once is enough         
Where is that??? The rule is: on entry to the routine, the FPU is in a normal state; at the end it must be in a normal state. These two conditions must be satisfied.
The routine I posted has been tested seriously. I repeat: use it as it is and under normal conditions. It works perfectly.







jj2007

Quote from: ToutEnMasm on June 16, 2009, 12:25:07 PM
The routine I posted has been tested seriously. I repeat: use it as it is and under normal conditions. It works perfectly.

Great. So what do you get when you shift 8FFFFFFFFFFFFFFFh one nibble (=4 bits) to the right? 08FFFFFFFFFFFFFFh like for my routine, or something else?

ToutEnMasm


I'd say that is the first useful thing you have said in a while.
The qword is interpreted as a negative number, which shows the limits of the function.
That is something clear to say, no?
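
The negative-number effect can be reproduced in isolation, since fild qword always treats its operand as a signed 64-bit integer. The sketch below divides by 16 with fidiv instead of the fscale sequence used in the routine, purely to keep it short; Sixteen, MyQw2 and MyQwRes are illustrative names, the default round-to-nearest mode is assumed, and the MASM32 SDK is assumed to be installed at \masm32.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK runtime include (assumed path)

.data
MyQw2   qword 8FFFFFFFFFFFFFFFh         ; bit 63 set: negative when read as signed
Sixteen dword 16                        ; divisor for a 4-bit right shift

.data?
MyQwRes qword ?

.code
start:
    fild  qword ptr MyQw2               ; signed load: ST0 = -7000000000000001h
    fidiv dword ptr Sixteen             ; ST0 = -0700000000000000.1h
    fistp qword ptr MyQwRes             ; rounds to -0700000000000000h and pops,
                                        ; stored as F900000000000000h
    invoke ExitProcess, 0
end start
```

The stored value matches the F900000000000000 reported in the results table for a 4-bit shift, whereas an unsigned shift of the same bit pattern gives 08FFFFFFFFFFFFFFh. The fild is balanced by the fistp, so the FPU stack is left as it was found.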