Anyone have any code for writing EDX:EAX as k's/megs/gigs? I've played around with it but using shifts loses the 0.5's and all.
I know there's a shell32? api but would like to know if there's an asm version.
I asked because of the thread about printing a decimal number - I think that a number like "4.0 GiB" is easier to understand than "4294967296" (or "18446744073709551615", which is 0FFFFFFFFFFFFFFFFh - what's that? 18 exabytes?)
The attachment is a quick attempt at a macro to do the scaling. With the floating-point approximations and the decimal conversions I could not think of any simple way to verify the correctness of the code, but the results at least look OK.
[attachment deleted by admin]
I was thinking more along the lines of automatic scaling e.g. the largest number would be 1023, so we would have "1023 bytes", then "1.00 KiB", "1023 KiB", "1023 MiB" etc.
I'm trying to use shifts, to avoid the overhead of div - they're fine for whole numbers, but lose any fractional part (1.9 MiB is a bit different to 1 MiB).
The main thing is avoiding any sort of API, since I was wanting to use this in my roll-your-own OS (or even actual DOS).
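For checking an asm routine against known-good values, the shift-only idea can be modelled in C. This is a minimal sketch, not code from the thread: the struct and the function name `scale_by_shift` are made up for illustration, and the fraction is truncated rather than rounded. The point is that the 10 bits just below the integer part are the fraction in 1/1024ths, so no div is needed to keep the decimals.

```c
#include <stdint.h>

/* Result of scaling: whole.hundredths, unit 0 = bytes, 1 = KiB, 2 = MiB, ... */
struct scaled_size {
    unsigned unit;
    unsigned whole;
    unsigned hundredths;
};

/* Hypothetical helper: scale a 64-bit byte count using shifts and masks only. */
struct scaled_size scale_by_shift(uint64_t value)
{
    struct scaled_size r = { 0, 0, 0 };
    uint64_t whole = value;

    /* Drop 10 bits at a time until the integer part is below 1024. */
    while (whole >= 1024) {
        whole >>= 10;
        r.unit++;
    }
    r.whole = (unsigned)whole;

    if (r.unit > 0) {
        /* The 10 bits just below the integer part: fraction in 1/1024ths. */
        unsigned frac1024 = (unsigned)(value >> (10 * (r.unit - 1))) & 1023u;
        /* Convert to hundredths (truncating): frac1024 * 100 / 1024. */
        r.hundredths = (frac1024 * 100u) >> 10;
    }
    return r;
}
```

With this model, 1099 comes out as unit 1, whole 1, hundredths 7 ("1.07 KiB"), and 0FFFFFFFFFFFFFFFFh as 15.99 in unit 6.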
sinsi wrote:
> I was thinking more along the lines of automatic scaling
> e.g. the largest number would be 1023, so we would
> have "1023 bytes", then "1.00 Kib", "1023 KiB", "1023 MiB" etc.
> I'm trying to use shifts, to avoid the overhead of div - they're
> fine for whole numbers, but lose any fractional part (1.9 Mib
> is a bit different to 1 Mib).
Using TEST you could do a binary search for the max value, and
then shift the value to a wanted size. Ten bits for the integer
and maybe four bits to use a lookup table for the fraction?
Steve N.
- Test if your value is bigger than 1023 etc.
- for 1024, divide (using FPU) by 102.4
- result is 10
- move to a buffer, using dwtoa
- insert a dot at endofbuf-2, and append " kBytes"
- you got 1.0 kBytes
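The steps above can be modelled in C to see what the buffer looks like at each stage. This is only a sketch under assumptions: `u32_to_dec` is a stand-in for masm32's dwtoa, `format_kbytes` is a made-up name, and the divide by 102.4 is done as an integer multiply by 10 followed by a shift by 10 instead of on the FPU.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for dwtoa: unsigned 32-bit value to decimal text. */
static void u32_to_dec(uint32_t v, char *out)
{
    char tmp[10];
    int n = 0;
    do {
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (n > 0)
        *out++ = tmp[--n];
    *out = '\0';
}

/* Hypothetical helper: format a byte count in [1024, 1024*1024) as "x.y kBytes". */
void format_kbytes(uint32_t bytes, char *out)
{
    /* bytes / 102.4 == bytes * 10 / 1024: the value in tenths of a kByte. */
    uint32_t tenths = (uint32_t)(((uint64_t)bytes * 10u) >> 10);
    u32_to_dec(tenths, out);                 /* 1024 -> "10" */

    size_t len = strlen(out);                /* at least 2 digits here */
    memmove(out + len, out + len - 1, 2);    /* move last digit and NUL one to the right */
    out[len - 1] = '.';                      /* "10" -> "1.0" */
    strcat(out, " kBytes");
}
```

The caller has to supply a buffer of at least 14 bytes for the longest result ("1023.9 kBytes" plus the terminator).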
hi,
A while ago I wrote a function that uses binary prefixes. I don't know if it all works correctly ... but I think so :toothy
It uses some functions from masmlib.
regards, qWord
[attachment deleted by admin]
OK, here's my first attempt. I've never used fpu instructions before, and the code is a mess (since it's work-in-progress), so be gentle please. Am I on the right track here?
[attachment deleted by admin]
You are certainly on the right track, Sinsi. Wrap it as print xByte$(123456789)... :wink
By the way, the FPU understands
push 100
fmul
Quote from: FORTRANS on July 30, 2008, 01:51:13 PM
sinsi wrote:
> I was thinking more along the lines of automatic scaling
> e.g. the largest number would be 1023, so we would
> have "1023 bytes", then "1.00 Kib", "1023 KiB", "1023 MiB" etc.
> I'm trying to use shifts, to avoid the overhead of div - they're
> fine for whole numbers, but lose any fractional part (1.9 Mib
> is a bit different to 1 Mib).
Using TEST you could do a binary search for the max value, and
then shift the value to a wanted size. Ten bits for the integer
and maybe four bits to use a lookup table for the fraction?
Steve N.
There is actually a simpler way to do this. There is an X86 instruction you can use to automatically scan for the highest bit that is set in the integer. It is called BSR ( bit scan reverse). It clears the zero flag if there is a high bit set. And the high bit is returned in the destination. Note it returns the BIT position. And from there you could do a lookup table for 0-31 different bit positions, that has the largest number you can represent if that bit is set.
You can do something similar with BSF from the front of the integer and find the first bit set.
EDIT: Welcome to the board FORTRANS! :) Forgot to say Hi on the other board.
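As a rough C model of the BSR idea: once you have the index of the highest set bit, the unit is simply that index divided by 10 (bits 0-9 are bytes, 10-19 are KiB, and so on). The sketch below uses a portable loop in place of the BSR instruction, and the names `highest_set_bit` and `unit_from_value` are invented for illustration.

```c
#include <stdint.h>

/* Portable stand-in for BSR on a 64-bit value:
   index of the highest set bit, or -1 when the value is zero. */
int highest_set_bit(uint64_t v)
{
    int idx = -1;
    while (v != 0) {
        v >>= 1;
        idx++;
    }
    return idx;
}

/* 0 = bytes, 1 = KiB, ... 6 = EiB. Zero is treated as "bytes". */
unsigned unit_from_value(uint64_t v)
{
    int bit = highest_set_bit(v);
    return bit < 0 ? 0u : (unsigned)bit / 10u;
}
```

In asm the division by 10 would normally be replaced by the lookup table mentioned above, indexed by the bit position.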
Mark_Larson wrote:
Quote: There is actually a simpler way to do this. There is an X86 instruction you can use to automatically scan for the highest bit that is set in the integer. It is called BSR ( bit scan reverse). It clears the zero flag if there is a high bit set. And the high bit is returned in the destination. Note it returns the BIT position. And from there you could do a lookup table for 0-31 different bit positions, that has the largest number you can represent if that bit is set.
Oops. Too much 16-bit coding... Now if I had read your
message _before_ I coded something up. With the magic of
Cut-n-Paste here is a 16-bit DOS subroutine pretending to
be a generic/no OS 32-bit subroutine. Tested, though barely.
And to code it up quickly, no binary search, just brute forced.
Quote: EDIT: Welcome to the board FORTRANS! :) Forgot to say Hi on the other board.
Thanks. Still figuring out the interface. (Or not. I will work
on an avatar if I figure that one out.)
Regards,
Steve N.
I found an error.
[attachment deleted by admin]
Quote from: Mark_Larson on July 31, 2008, 03:35:14 PM
There is an X86 instruction you can use to automatically scan for the highest bit that is set in the integer. It is called BSR ( bit scan reverse).
Seems to be incredibly slow - 103 cycles for a single instruction??
BSR - Bit Scan Reverse (386+)
Usage: BSR dest,src
Modifies flags: ZF
Scans source operand for first bit set. Clears ZF if a bit is found
set and loads the destination with an index to first set bit. Sets
ZF if no bits are found set. BSF scans forward across bit pattern
(0-n) while BSR scans in reverse (n-0).

                     Clocks                  Size
Operands        808x  286   386     486      Bytes
reg,reg          -     -    10+3n   6-103    3
reg,mem          -     -    10+3n   7-104    3-7
reg32,reg32      -     -    10+3n   6-103    3-7
reg32,mem32      -     -    10+3n   7-104    3-7
Quote from: jj2007 on July 31, 2008, 07:36:59 AM
By the way, the FPU understands
push 100
fmul
Huh? Do you mean
push 100.0
fmul dword ptr [esp]
Quote from: jj2007
By the way, the FPU understands
push 100
fmul
It does not. This works:
push 100
fimul DWORD PTR [esp]
Quote from: jj2007
Seems to be incredibly slow - 103 cycles for a single instruction??
Speed isn't everything.
I am struggling with the proper representation of qwords. In the attachment [EDIT: obsolete, removed, see later post], I use a series of example values like this:
qw1a dq 1023 ; 1023.00 bytes
qw1b dq 1099 ; 1.07 kBytes
qw2a dq 1023*1024
qw2b dq 1099*1024
qw3a dq 1023*1024*1024
qw3b dq 1099*1024*1024
I pass the values to the proc like this:
lea ecx, [qw1a+8*ebx]
invoke ShowQw, ecx, addr buffer
...
mov esi, pNum ; pointer to our 8-byte number
xor edx, edx ; counter
mov IsLow, edx ; flag
mov eax, [esi+4] ; eax is high dword of qword to convert
.if eax==0
mov ecx, 1024*1024*1024 ; 2^30
mov eax, [esi] ; low dword only
.endif
This works fine until No. 6:
qw4a dq 1023*1024*1024*1024
qw4b dq 1099*1024*1024*1024
Maybe related: Is there a correct way to convince the FPU that we are working with unsigned CPU registers?
push eax ; our number, either high or low dword
fild dword ptr [esp] ; move into ST (0)
.if signed eax<0 ; if eax==C000000,
fabs ; the FPU thinks it's negative...
One reason for my convoluted code is that fpu stuff is signed, so using 'fild 0ffffffffh' (you get the idea) is treated as signed...help, Raymond! :bdg
jj, I'm thinking that if you want to use 'qw1a dq 1023' as a real8, it needs to be 'qw1a dq 1023.0'.
That way masm knows it's an fpu number (my god, the fpu has been around for yonks and I don't know jack shit about it ::))
The bsf thing I didn't worry about, since I saw the same 100+ cycles that jj did, but qword's code and michael's helped me figure out what the hell is happening.
(my fpu code is basically michael's - I don't really get it, but my fpu programming is growing exponentially, thanks mate).
One thing, I used Windbg to go through the code, the 'view registers' really helped here - st(0) and such really f*cked me up until I could see what the hell was happening.
Anyway, tomorrow (maybe) is the big rewrite of the proc, then it's fair game.
PS sorry jj, your love of the 'xxx$()' macro is for you to do... :bg
Quote from: sinsi on August 01, 2008, 01:43:42 PM
jj, I'm thinking that if you want to use 'qw1a dq 1023' as a real8, it needs to be 'qw1a dq 1023.0'
Actually, I thought of a 64-bit integer, not a Real8, but that is apparently not possible.
I get the impression, though, that this could be done very elegantly with the MMX instructions, using this logic:
; move the number into mm0
movq mm0, QWORD ptr [pNum]
; move a copy to mm1
movq mm1, mm0
; put the "divisor" into mm2, and let it start with 1
xor ecx, ecx
inc ecx
movd mm2, ecx
; loop
.While mm0>=mm2 ; original value bigger or equal divisor?
psllq mm2, 10 ; multiply divisor by 1024
pslrq mm1, 10 ; divide number by 1024
inc Counter ; choose the next higher unit
.Endw
movd eax, mm1 ; the remainder
invoke dwtoa, ....
This way, we would avoid these problems. However, I am stuck with the loop condition (mm0>=mm2, shown in red in the original post): there is no simple way to compare MMX registers, apparently. Any help from our mmx freaks?
jj2007 wrote:
Quote: Maybe related: Is there a correct way to convince the FPU that we are working with unsigned CPU registers?
Code:
push eax ; our number, either high or low dword
fild dword ptr [esp] ; move into ST (0)
.if signed eax<0 ; if eax==C000000,
fabs ; the FPU thinks it's negative...
The FPU only understands signed numbers (Otherwise
entering a negative number would be a pain).
If you load 0FFFFFFFFH and do an FABS you
will get one. Probably not what you want if
you are trying to use unsigned numbers.
Two ways come to mind to get around that.
One, divide by 2, load, multiply by 2, and add
one if the original was odd.
Two, mask off the high bit, load, load 080000000H,
FABS, and FADD. I think that works. You can
generate the constant in other ways of course.
Regards,
Steve N.
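Both workarounds are easy to check in a C model where the only conversion used is signed 32-bit to floating point, which is what FILD with a dword operand does. This is a sketch with made-up function names. Note one assumption in the second variant: the 2^31 is added back only when the high bit was actually set, which the description above leaves implicit.

```c
#include <stdint.h>

/* Method 1: halve, convert as signed, double, add back the dropped low bit. */
double u32_to_double_halving(uint32_t x)
{
    int32_t half = (int32_t)(x >> 1);        /* always non-negative */
    return (double)half * 2.0 + (double)(x & 1u);
}

/* Method 2: clear bit 31, convert as signed, add 2^31 if bit 31 was set. */
double u32_to_double_masking(uint32_t x)
{
    int32_t low31 = (int32_t)(x & 0x7FFFFFFFu);
    double d = (double)low31;
    if (x & 0x80000000u)
        d += 2147483648.0;                   /* |80000000h| as produced by FABS */
    return d;
}
```

Both give 4294967295.0 for 0FFFFFFFFh, where a plain signed load followed by FABS would give 1.0.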
Thanks, Steve - much appreciated. The problem seems to occur earlier, however, when I try to load low and high dword separately. Switching to MMX seems really a lot more elegant; but the compare is somewhat complex.
Quote from: jj2007 on August 01, 2008, 02:53:38 PM
However, I am stuck with the part in red: There is no simple way to compare mmx registers, apparently.
by using sse2 ::) it could be easy
;works only with unsigned values
;mm0>=mm2
psubq mm0,mm2
pextrw eax,mm0,3
test eax,08000h
.if !ZERO? ;negative result
;mm2 > mm0
.else ;positive result
;mm2 <= mm0
.endif
Do you want to replace bsr? It isn't as slow as you think: take a look at Intel's Optimization Reference Manual (Appendix C).
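The subtract-and-test-the-sign-bit compare can be modelled in C to see where it holds. This is a sketch with an invented function name, mirroring the psubq / pextrw / test 08000h sequence. The caveat is an observation about the technique rather than something stated in the post: the answer is only right while the difference fits in 63 bits, for example when both operands are below 2^63.

```c
#include <stdint.h>

/* Returns 1 when a >= b, judged only from bit 63 of (a - b). */
int ge_by_sign_bit(uint64_t a, uint64_t b)
{
    uint64_t diff = a - b;        /* wraps modulo 2^64, like psubq */
    return (diff >> 63) == 0;     /* bit 63 clear: treated as non-negative */
}
```

For a = 8000000000000000h and b = 0 this model reports "a < b", so a full-range unsigned compare needs an extra step (for example comparing the high dwords first).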
jj2007 wrote:
Quote: The problem seems to occur earlier, however, when I try to load low and high dword separately.
Load low DWORD, load high DWORD, load 000001000H,
load ST(0), FMUL, FMUL, and FADD? Or am I completely
off track?
HTH,
Steve N.
Arrh, 000010000H. Bad eyes.
HTH,
Steve N.
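In C terms the recipe is high * 2^32 + low, with 2^32 built as 10000h * 10000h because it does not fit in a signed dword. A small sketch (the function name is made up, and C's unsigned-to-double conversion hides the signed-load problem that the asm still has to deal with):

```c
#include <stdint.h>

/* Combine two unsigned dwords into one floating-point value. */
double qword_from_dwords(uint32_t lo, uint32_t hi)
{
    double scale = 65536.0 * 65536.0;    /* 10000h * 10000h = 2^32 */
    return (double)hi * scale + (double)lo;
}
```

A double has 53 bits of mantissa, so values above 2^53 get rounded; the x87 extended format used by FILD qword does not have that limit.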
Quote from: qWord on August 01, 2008, 07:16:55 PM
by using sse2 ::) it could be easy
Hutch, could you please post the sse2 macros? The old forum is unavailable. I need pslrq and psubq
Thanks, jj
Quote from: jj2007 on August 01, 2008, 09:32:21 PM
Hutch, could you please post the sse2 macros? The old forum is unavailable. I need pslrq and psubq
http://www.masm32.com/board/index.php?topic=973.0
Thanks for the link, but no luck. Here is the complete sse2.inc - pslrq and psubq are missing, unfortunately. Lingo once used
db 0fh, 0fbh, 0d9h ; psubq mm3, mm1
... where from?
;SSE2 macros for MASM 6.14 by daydreamer aka Magnus Svensson
ADDPD MACRO M1,M2
db 066h
ADDPS M1,M2
ENDM
ADDSD MACRO M1,M2
DB 0F2H
ADDPS M1,M2
ENDM
ANDPD MACRO M1,M2
DB 066H
ANDPS M1,M2
ENDM
ANDNPD MACRO M1,M2
DB 066H
ANDNPS M1,M2
ENDM
ORPD MACRO M1,M2
DB 066H
ORPS M1,M2
ENDM
XORPD MACRO M1,M2
DB 066H
XORPS M1,M2
ENDM
SUBPD MACRO M1,M2
DB 066H
SUBPS M1,M2
ENDM
SUBSD MACRO M1,M2
DB 0F2H
SUBPS M1,M2
ENDM
MULPD MACRO M1,M2
DB 066H
MULPS M1,M2
ENDM
MULSD MACRO M1,M2
DB 0F2H
MULPS M1,M2
ENDM
DIVPD MACRO M1,M2
DB 066H
DIVPS M1,M2
ENDM
DIVSD MACRO M1,M2
DB 0F2H
DIVPS M1,M2
ENDM
RCPPD MACRO M1,M2
DB 066H
RCPPS M1,M2
ENDM
RCPSD MACRO M1,M2
DB 0F2H
RCPPS M1,M2
ENDM
RSQRTPD MACRO M1,M2
DB 066H
RSQRTPS M1,M2
ENDM
RSQRTSD MACRO M1,M2
DB 0F2H
RSQRTPS M1,M2
ENDM
SQRTPD MACRO M1,M2
DB 066H
SQRTPS M1,M2
ENDM
SQRTSD MACRO M1,M2
DB 0F2H
SQRTPS M1,M2
ENDM
MAXPD MACRO M1,M2
DB 066H
MAXPS M1,M2
ENDM
MAXSD MACRO M1,M2
DB 0F2H
MAXPS M1,M2
ENDM
MINPD MACRO M1,M2
DB 066H
MINPS M1,M2
ENDM
MINSD MACRO M1,M2
DB 0F2H
MINPS M1,M2
ENDM
MOVAPD MACRO M1,M2
DB 066H
MOVAPS M1,M2
ENDM
MOVHLSPD MACRO M1,M2
DB 066H
MOVHLSPS M1,M2
ENDM
MOVHPD MACRO M1,M2
DB 066H
MOVHPS M1,M2
ENDM
MOVLPD MACRO M1,M2
DB 066H
MOVLPS M1,M2
ENDM
MOVNTPD MACRO M1,M2
DB 066H
MOVNTPS M1,M2
ENDM
MOVUPD MACRO M1,M2
DB 066H
MOVUPS M1,M2
ENDM
CMPPD MACRO M1,M2,M3
DB 066H
CMPPS M1,M2,M3
ENDM
CMPSD MACRO M1,M2,M3
DB 0F2H
CMPPS M1,M2,M3
ENDM
CMPEQPD MACRO M1,M2
DB 066H
CMPEQPS M1,M2
ENDM
CMPEQSD MACRO M1,M2
DB 0F2H
CMPEQPS M1,M2
ENDM
CMPLTPD MACRO M1,M2
DB 066H
CMPLTPS M1,M2
ENDM
CMPLTSD MACRO M1,M2
DB 0F2H
CMPLTPS M1,M2
ENDM
CMPLEPD MACRO M1,M2
DB 066H
CMPLEPS M1,M2
ENDM
CMPLESD MACRO M1,M2
DB 0F2H
CMPLEPS M1,M2
ENDM
CMPUNORDPD MACRO M1,M2
DB 066H
CMPUNORDPS M1,M2
ENDM
CMPUNORDSD MACRO M1,M2
DB 0F2H
CMPUNORDPS M1,M2
ENDM
CMPNEQPD MACRO M1,M2
DB 066H
CMPNEQPS M1,M2
ENDM
CMPNEQSD MACRO M1,M2
DB 0F2H
CMPNEQPS M1,M2
ENDM
CMPNLTPD MACRO M1,M2
DB 066H
CMPNLTPS M1,M2
ENDM
CMPNLTSD MACRO M1,M2
DB 0F2H
CMPNLTPS M1,M2
ENDM
CMPNLEPD MACRO M1,M2
DB 066H
CMPNLEPS M1,M2
ENDM
CMPNLESD MACRO M1,M2
DB 0F2H
CMPNLEPS M1,M2
ENDM
CMPORDPD MACRO M1,M2
DB 066H
CMPORDPS M1,M2
ENDM
CMPORDSD MACRO M1,M2
DB 0F2H
CMPORDPS M1,M2
ENDM
First problem solved:
This one I found by misusing OllyDbg:
db 0fh, 0fbh, 0d8h ; psubq mm3, mm0
... and this one was simply a typo, sorry :red
psrlq mm1, 10 ; divide number by 1024
Now it assembles fine and starts throwing runtime errors :green
OK, it's time for some testing... code attached in next post.
0 bytes AUTOEXEC.BAT
211 bytes boot.ini
4.83 kB Bootfont.bin
0 bytes CDS
0 bytes CONFIG.SYS
0 bytes db_circs
0 bytes Documents and Settings
1015.42 MB hiberfil.sys
0 bytes IO.SYS
0 bytes MSDOS.SYS
0 bytes MSOCache
46.44 kB NTDETECT.COM
245.18 kB ntldr
756.00 MB pagefile.sys
0 bytes Programmi
0 bytes RECYCLER
0 bytes System Volume Information
0 bytes WINDOWS
Quote from: sinsi on August 01, 2008, 01:43:42 PM
PS sorry jj, your love of the 'xxx$()' macro is for you to do... :bg
Your wish is my command:
invoke GetCurrentDirectory, 260, offset buffer
invoke lstrcat, offset buffer, chr$("\*.*")
invoke FindFirstFile, offset buffer, addr wfd
mov fHandle, eax
mov ecx, offset fSize ; we need to store the file size elsewhere ...
mov eax, wfd.nFileSizeHigh ; ... because stupidly enough, the wfd order...
mov [ecx+4], eax
mov eax, wfd.nFileSizeLow ; ... is wrong, so we have to invert it
mov [ecx], eax
print flexbyte$(ecx) ; format [ecx] nicely: ecx is a pointer to a QWORD
Full code no longer attached here - see later post. The proc has become awfully long - 136 bytes. I hope Vista left some space on your harddisk :green
Hi,
I found an error in the code I posted earlier. Posted
a correction. Sorry. Output now looks like:
490.58 GiB
2.58 GiB
0 Bytes
1,023 Bytes
1.00 KiB
15.99 EiB
1,023.99 PiB
Regards,
Steve N.
Please forgive my lateness on this - I saw the thread last week but didn't have a chance to post earlier...
You can achieve the result you want with a single divide (it gives the modulo too - which is important).
You need to drive the whole thing off the exponent of a real4.
Pseudo code:
Take the exponent (bits 30 down to 23)
Subtract 127
Divide by 10
Add 127 to the remainder, and replace the exponent in the original number
The result of the division is 0 = bytes, 1 = kilobytes, 2 = megabytes etc.
Print the resulting floating point number (it will be in the range of 0 to 1023.999ish), and append some string pointed to by the result of the division.
Mirno
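The pseudo code translates fairly directly into C, which makes it easy to try on a few values before writing the asm. This is a sketch under assumptions: `float` is taken to be an IEEE 754 REAL4, the function name is invented, and zero is special-cased because it has no usable exponent.

```c
#include <stdint.h>
#include <string.h>

/* Scale `bytes` into [1, 1024) by rewriting the REAL4 exponent.
   *unit receives 0 = bytes, 1 = kilobytes, 2 = megabytes, ... */
float scale_by_exponent(uint64_t bytes, unsigned *unit)
{
    float f = (float)bytes;              /* like fild qword / fstp real4 */
    uint32_t bits;

    *unit = 0;
    if (bytes == 0)
        return 0.0f;

    memcpy(&bits, &f, sizeof bits);      /* reinterpret the float as raw bits */
    uint32_t exponent = ((bits >> 23) & 0xFFu) - 127u;   /* bits 30..23, unbiased */
    *unit = exponent / 10u;              /* quotient selects the unit */
    uint32_t rem = exponent % 10u;       /* remainder becomes the new exponent */
    bits = (bits & 0x807FFFFFu) | ((rem + 127u) << 23);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For 12345678 this yields unit 2 and a value of about 11.77. Because the number is rounded to 24 bits of mantissa first, something like 2^30 - 1 shows up as 1.0 in unit 3 rather than 1023.99 in unit 2.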
Quote from: Mirno on August 04, 2008, 08:36:34 AM
Pseudo code:
Take the exponent (bits 30 down to 23)
Subtract 127
Divide by 10
Add 127 to the remainder, and replace the exponent in the original number
The result of the division is 0 = bytes, 1 = kilobytes, 2 = megabytes etc.
Looks interesting. I tried my luck but the result is not convincing. Any idea?
.data
MyQword dq 12345678
MyReal4 REAL4 0.0
.code
int 3 ; let Olly say Hi
mov esi, offset MyQword
fild MyQword ; convert quad
fst MyReal4 ; to real4
mov ecx, offset MyReal4
xor eax, eax ; not really needed
mov al, [ecx+3] ; take the exponent
and al, 127 ; mask out bit 31, i.e. take bits 30 down to 23
sub al, 127 ; sub 127
aam ; divide al by 10, result is in ah
add ah, 127 ; add 127
mov [ecx+3], ah ; replace the exponent
fld MyReal4 ; show it in Olly
"mov al, [ecx+3] ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg
Quote from: sinsi on August 04, 2008, 09:59:42 AM
"mov al, [ecx+3] ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg
You make a lot of fuss about one little bit! :toothy
.data
MyQword dq 12345678
MyReal4 REAL4 0.0
.code
int 3 ; let Olly say Hi
mov esi, offset MyQword
fild MyQword ; convert quad
fst MyReal4 ; to real4
mov ecx, offset MyReal4
mov eax, [ecx] ; take the full number
sar eax, 7+16 ; take bits 30 down to 23
sub al, 127 ; sub 127
aam ; divide al by 10
add al, 127 ; add 127 to the remainder
sal eax, 7+16 ; shift left
mov edx, [ecx] ; take original, and free the exponent slot
and edx, 10000000011111111111111111111111b
add edx, eax ; move bits 30 .. 23 to their old positions
mov [ecx], edx ; replace the exponent
fld MyReal4 ; show it in Olly
Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...
I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.
You also want to save the result of the divide (using aam - ah, using div edx) as it's the index into your array of strings.
Mirno
Will look into it. I wonder if there is a faster way to convert qw to real4 and vice versa... CVTDQ2PS and CVTTPS2DQ are SSE2, and not particularly handy.
jj2007,
This is equivalent to the last code you posted.
.DATA
MyQword QWORD 12345678
MB DWORD 1024*1024
MyReal4 REAL4 0.0
.CODE
fild MyQword
fidiv MB
fstp MyReal4
Faster.
.DATA
MyQword QWORD 12345678
MB REAL8 0.00000095367431640625
MyReal4 REAL4 0.0
.CODE
fild MyQword
fmul MB
fstp MyReal4
Later: I just looked at MichaelW's macro, it's pretty much the same code at the core.
Quote from: Mirno on August 04, 2008, 04:25:56 PM
Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...
I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.
You also want to save the result of the divide (using aam - ah, using div edx) as it's the index into your array of strings.
Mirno
Thanxalot for the good ideas. New version is attached; there is still the UseMMX option to revert to the old code. Size is 136 bytes for the MMX and 132 bytes for the Mirno version.
[attachment deleted by admin]
Quote from: Greg on August 04, 2008, 10:09:39 PM
jj2007,
This is equivalent to the last code you posted.
What you write is technically correct but no longer relevant... I don't divide reals:
ShowQw proc uses edi esi ebx pNum:DWORD
LOCAL TmpR4:REAL4
ffree st(7) ; instead of expensive finit
mov esi, pNum ; pointer to our 8-byte number
lea edi, TmpR4
fild qword ptr [esi] ; convert quad
fstp dword ptr [edi] ; to real4
xor ebx, ebx
mov eax, [edi] ; our number as real4 - if it's zero, better do nothing
.if eax
mov ebx, 1065353216 ; 127 shl 23
sub eax, ebx
mov ecx, 83886080 ; 10 shl 23
xor edx, edx ; quite useful before a divide ;-)
div ecx ; divide eax by ecx
add edx, ebx ; 127 shl 23
mov ebx, eax ; ebx holds unit
mov [edi], edx ; replace with new number (exponent in bits 23...30)
fld dword ptr [edi] ; push Real4 on FPU
push 100 ; reserve a stack slot; it holds 100 for the multiply
.if ebx
fimul dword ptr [esp] ; mul ST (0), 100
.endif
fistp dword ptr [esp] ; pop result from FPU stack to CPU stack
pop eax ; correct the CPU stack, and get the result
.endif
mov esi, offset FlexByteBuffer
invoke dwtoa, eax, esi
mov edx, offset t0 ; default is bytes
.if ebx ; add the dot if it's more than bytes
mov edx, [esi+len(esi)-2] ; len returns eax, so we use ecx as accu
mov cl, "."
mov [esi+eax-2], cl
mov [esi+eax-1], edx
lea edx, [t1+4*ebx-4] ; our unit
.endif
invoke lstrcat, esi, edx ; add the xByte unit
ret
ShowQw endp
But thanks anyway, Greg.
Quote from: jj2007 on July 31, 2008, 09:10:21 PM
Quote from: Mark_Larson on July 31, 2008, 03:35:14 PM
There is an X86 instruction you can use to automatically scan for the highest bit that is set in the integer. It is called BSR ( bit scan reverse).
Seems to be incredibly slow - 103 cycles for a single instruction??
it is significantly faster on P4s. I think it is in the 1-2 cycle range. The timing you got was for a 486.
Quote from: Mark_Larson on August 05, 2008, 03:32:11 PM
it is significantly faster on P4s. I think it is in the 1-2 cycle range. The timing you got was for a 486.
In the meantime, we found another solution, but good to know anyway :bg
Is there a more up-to-date free alternative to opcodes.hlp?
Agner Fog has some docs on his site - they're pretty useful but more in-depth (given the out-of-order execution, different engines being able to execute different micro-ops, micro-op fusion, latency, and suchlike) than the good old days when instruction X took Y clock cycles no matter what.
http://www.agner.org/optimize/#manuals
Quote from: Mirno on August 05, 2008, 04:21:14 PM
Agner Fog has some docs on his site
Fascinating lecture indeed, but often I simply need to find an appropriate opcode, plus some basic info about size and latency. Opcodes.hlp is really handy but apparently very outdated...
Quote from: jj2007
What you write is technically correct but no longer relevant...
I fail to see how it's no longer relevant.
Quote from: Greg on August 05, 2008, 07:20:35 PM
Quote from: jj2007
What you write is technically correct but no longer relevant...
I fail to see how it's no longer relevant.
Greg, replacing a slow div with a faster mul complement will always be relevant, but in the meantime the code took another road, the one proposed by Mirno, and therefore does not need divs any more. If somebody volunteers to add the bsr variant, we might need it again, of course. Thanks anyway :thumbu
OK, here's my crack at using BSR and the FPU for this.
edit: fixed a little problem, new file.
edit: spiffed up a little more, new file.
edit: a little faster, new file
edit: reversed the '.IF hibit' code to test for smallest first
[attachment deleted by admin]
Quote from: Greg on August 05, 2008, 11:09:49 PM
OK, here's my crack at using BSR and the FPU for this.
You are faster than the police allows! Here is mine...
[attachment deleted by admin]
Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself
Looking to the future, instructions like AAM aren't legal in 64-bit programming :(
Quote from: sinsi on August 06, 2008, 05:21:27 AM
Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself
Looking to the future, instructions like AAM aren't legal in 64-bit programming :(
No problem, I live in Europe, and our laws are pretty liberal, hehe :bg
Seeing as we're all having a go, here's mine (stole the proc format mostly from Greg)...
ShowBytes PROC lo:DWORD, hi:DWORD
LOCAL tmp:REAL8
.DATA
fmt BYTE "%.2f %s", 0
; Do this with a macro so it looks nicer!
ALIGN 8
szB BYTE "bytes",0
ALIGN 8
BYTE "kB",0
ALIGN 8
BYTE "MB",0
ALIGN 8
BYTE "GB",0
ALIGN 8
BYTE "TB",0
ALIGN 8
BYTE "PB",0
ALIGN 8
BYTE "EB",0
.CODE
fild QWORD PTR [lo]
fstp tmp
mov eax, DWORD PTR [tmp + 4]
mov ecx, (10 SHL 20)
xor edx,edx
and eax, (07FFh SHL 20)
jz @F
sub eax, (1023 SHL 20)
div ecx
mov ecx, DWORD PTR [tmp + 4]
add edx, (1023 SHL 20)
and ecx, NOT (07FFh SHL 20)
or edx, ecx
@@:
lea eax, [offset szB + eax*8]
INVOKE crt_printf, ADDR fmt, DWORD PTR [tmp], edx, eax
ret
ShowBytes ENDP
It seems the crt_printf doesn't accept REAL4 (float), they must be REAL8 (double).
So I made the necessary changes (11 bits of exponent, bias by 1023).
It should also deal with zero - which is a corner case when dealing with the floating point numbers.
Mirno
One more, this time with cycle counts. The BSR version is slightly shorter than the MMX version but seems to be slower, too - I am not quite sure because as you can easily see, the MMX version fails miserably for the TB and PB examples ::)
Furthermore, the MMX version has a considerable rounding error. Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.
BSR, 113 bytes:
Test0 492 cycles 123 bytes
Test1 474 cycles 123.45 MB
Test2 510 cycles 123.45 GB
Test3 460 cycles 123.45 TB
Test4 459 cycles 123.45 PB
MMX, 138 bytes:
Test0 357 cycles 123 bytes
Test1 386 cycles 123.44 MB
Test2 384 cycles 123.44 GB
Test3 324 cycles .0
Test4 320 cycles .0
EDIT 2: Attachment removed, see later post.
EDIT 1: Celeron M, 1.6 GHz:
BSR, 113 bytes:
Test0 225 cycles 123 bytes
Test1 253 cycles 123.45 MB
Test2 249 cycles 123.45 GB
Test3 247 cycles 123.45 TB
Test4 249 cycles 123.45 PB
MMX, 138 bytes:
Test0 184 cycles 123 bytes
Test1 236 cycles 123.44 MB
Test2 241 cycles 123.44 GB
Test3 177 cycles .0
Test4 176 cycles .0
I am working on my own BSR version but there are some bugs I am working out.
Quote from: jj2007 on August 06, 2008, 10:25:12 AM
... MMX version fails miserably for the TB and PB examples ...
there was a small bug in your code:
...
@@: movq mm3, mm2
psubq mm3, mm0 ;db 0fh, 0fbh, 0d8h
pextrw eax, mm3, 3
test eax, 08000h
jz @F
psllq mm2, 10
psrlq mm1, 10
dec edi
jnz @B
@@:
...
Quote from: jj2007 on August 06, 2008, 10:25:12 AM
Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.
here are my results on a C2D (corrected version):
BSR, 113 bytes:
Test0 207 cycles 123 bytes
Test1 243 cycles 123.45 MB
Test2 225 cycles 123.45 GB
Test3 227 cycles 123.45 TB
Test4 223 cycles 123.45 PB
MMX, 141 bytes:
Test0 170 cycles 123 bytes
Test1 206 cycles 123.44 MB
Test2 212 cycles 123.44 GB
Test3 216 cycles 123.44 TB
Test4 213 cycles 123.44 PB
The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).
Quote from: Mirno on August 06, 2008, 05:23:02 PM
The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
That's why I chose the little dwtoa hack :bg
Quote
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).
I incorporated the two decimals into the complements table... I doubt there is a faster solution.
Quote from: qWord on August 06, 2008, 05:20:09 PM
there was some little failure in your code:
...
test eax, 08000h
@@:
...
You're a darling, qWord, thanxalot :thumbu
My timings on Celeron M:
BSR, 113 bytes:
Test0 225 cycles 123 bytes
Test1 251 cycles 123.45 MB
Test2 248 cycles 123.45 GB
Test3 248 cycles 123.45 TB
Test4 247 cycles 123.45 PB
MMX, 141 bytes:
Test0 189 cycles 123 bytes
Test1 234 cycles 123.44 MB
Test2 243 cycles 123.44 GB
Test3 243 cycles 123.44 TB
Test4 266 cycles 123.44 PB
Corrected version attached.
[attachment deleted by admin]
Quote from: jj2007 on August 06, 2008, 06:02:10 PM
You're a darling, qWord, thanxalot :thumbu
I just looked at your BSR code, and you do a lot more work than me to get the result. I don't even have one loop. Just two BSRs and one lookup table. I think I'll go ahead and post my buggy code, so that you can see a different approach.
The bug is in the floating point code.
my approach is to use the BSR on the upper 32-bits. If it is 0, then we do it on the lower 32-bits.
I use the BSR bit value as a value into a lookup table to get ONE value to divide by. You can switch this to a multiply of course, you just need to set up the lookup table that way. I was going to do that later.
Again there is no looping. There is only ONE conditional jump in the code. If you can find the bug in the floating point code, it should work. I didn't do extensive testing. But I did test 10 random 64-bit random #'s
here is my data lookup table. I use the string of text to print as a lookup table as well. You can divide these values into 1.0 to flip them, since it's a lookup table and then just do a multiply.
align 8
divide_values dq 1
dq 1024
dq 1024*1024
dq 1024*1024*1024
dq 1024*1024*1024*1024
dq 1024*1024*1024*1024*1024
dq 1024*1024*1024*1024*1024*1024
dq 1024*1024*1024*1024*1024*1024*1024
string lookup table. The strings have to be exactly 4 bytes long for the code to work correctly, 3 bytes of chars and an ascii0. I use a trick later on to make looking up these values quicker if you do that.
string_size_table db "BBs",0
db "KBs",0
db "MBs",0
db "GBs",0
db "TBs",0
db "PBs",0
db "EBs",0
db "?Bs",0
I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.
invoke nseed,34521345
invoke nrandom,0ffffffffh
mov ebx,eax ;save in EBX
invoke nrandom,0ffffffffh
;got 64-bit number
mov edx,ebx
pushad
fn crt_printf,"Hex Value: %.8X%::%.8X%c%c", edx,eax, 13, 10
popad
;edx:eax already has 64-bit
bsr ebx,edx
jz lower_32_bits
add ebx,4*8 ;say we are in the upper part of the table., *8 because we shift right 3 later
jmp @F
lower_32_bits:
xor ebx,ebx ;if the register is 0, BSR won't correctly update the register
bsr ebx,eax
@@:
shr ebx,3 ;divide by 1024 bits ( not bytes)
; fn crt_printf,"BSR: %d%c%c", ebx, 13, 10
;for debugging print the divide table entry we are using
lea ecx,[divide_values + ebx*8]
fn crt_printf,"%I64X%c%c", ecx, 13, 10
lea ecx,[string_size_table + ebx*4]
;for debugging print the string we are using
; fn crt_printf,"String: %s%c%c", ecx, 13, 10
mov dword ptr [fp],eax
mov dword ptr [fp+4],edx
ffree st(7) ;finit
fild [fp] ; st1
fild [divide_values + ebx*8] ; st0
.data?
align 8
fp dq ?
.code
fdivp st(1),st(0)
fstp [fp]
;this is the only value that needs to be printed
fn crt_printf,"%f %s%c%c", [fp], ecx, 13, 10
Invoke ExitProcess, 0
Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.
=> Zebibyte (ZiB) == 2^70, but that doesn't matter because we are using 64-bit numbers
regards, qWord
Quote from: qWord on August 06, 2008, 10:02:23 PM
Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.
=> Zebibyte (ZiB) == 2^70, but that doesn't matter because we are using 64-bit numbers
regards, qWord
Is there one between Zebi and Exa?
EDIT: Never mind, I found it on Wikipedia.
http://en.wikipedia.org/wiki/Exabyte
Mark
Found my bug. It wasn't floating point related. I needed a 3rd lookup table to convert the bit index returned by BSR into the offset to use in the divide table. I am still testing it, so I will post it later.
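As a rough cross-check of the bug described above, here is a small C sketch. It is not code from the thread, and the function names are made up for illustration. It assumes the BSR result is a bit index from 0 to 63 and shows that the table index has to be the bit index divided by 10, whereas shifting right by 3 divides by 8 and picks a unit that is too large.

```c
#include <stdint.h>

/* Index of the highest set bit (0..63), or -1 for zero.
   Stands in for the two BSR instructions used in the posts. */
static int highest_bit(uint64_t v)
{
    int bit = -1;
    while (v != 0) {
        v >>= 1;
        bit++;
    }
    return bit;
}

/* Each 1024x step spans 10 bits, so the unit index is bit / 10
   (0 = bytes, 1 = KB, ... 6 = EB). */
static int unit_index(uint64_t v)
{
    int bit = highest_bit(v);
    return bit < 0 ? 0 : bit / 10;
}

/* What the first attempt computed: bit >> 3, i.e. bit / 8. */
static int unit_index_shr3(uint64_t v)
{
    int bit = highest_bit(v);
    return bit < 0 ? 0 : bit >> 3;
}
```

For example, 256 bytes has its highest bit at index 8, so `unit_index_shr3` already reports index 1 (KB) while `unit_index` still reports 0 (bytes).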
I found one additional bug. I commented out all the DEBUG crt_printfs.
I changed my divide_values lookup table to REAL8 and stored the reciprocals (1/1024.0 and so on) so I could multiply instead of divide. Again there is only one conditional branch and no looping.
I didn't do extensive testing, so there still might be a bug or two lurking. I tested 5 different random numbers.
If you find any bugs, please let me know, thanks :)
The code is very small. 16 lines not including labels but including the crt_printf. Let me know if you find it useful.
.data
align 8
fp REAL8 0.0
; 7 entries
divide_values REAL8 1.0 ;bytes
REAL8 0.0009765625 ;kilobytes
REAL8 0.00000095367431640625 ;megabytes
REAL8 0.000000000931322574615478515625 ;gigabytes
REAL8 9.094947017729282379150390625e-13 ;terabytes
REAL8 8.8817841970012523233890533447266e-16 ;petabytes
REAL8 8.6736173798840354720596224069595e-19 ;exabytes
;70 entries, 10 * 7
bit_count_table dd 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ; the 1st 10 bits belong to the 0th offset in the divide table
dd 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ; the 2nd 10 bits belong to the 1st offset in the divide table
dd 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
dd 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
dd 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
dd 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
dd 6, 6, 6, 6, 6, 6, 6, 6, 6, 6
align 4
string_size_table db "BBs",0
db "KBs",0
db "MBs",0
db "GBs",0
db "TBs",0
db "PBs",0
db "EBs",0
.code
Start:
;edx:eax already has 64-bit
bsr ebx,edx
jz lower_32_bits
add bl,32 ;say we are in the UPPER 32-bits of the 64-bit value
jmp @F
lower_32_bits:
xor ebx,ebx ;if the source is 0, BSR won't correctly update the destination
bsr ebx,eax
@@:
mov ebx, dword ptr [bit_count_table + ebx*4]
;ecx used later, needs to have this value for the crt_printf
lea ecx,[string_size_table + ebx*4]
mov dword ptr [fp],eax
mov dword ptr [fp+4],edx
ffree st(7) ;finit
fild [fp] ; st1
fld [divide_values + ebx*8] ; st0
fmulp st(1),st(0)
fstp [fp]
;this is the only value that needs to be printed
fn crt_printf,"%f %s%c%c", [fp], ecx, 13, 10
I used Michael's timing macros. For greater accuracy I commented out the crt_printf, since we are primarily concerned with how we calculate the data for it. I did 100,000,000 loops and set the priority level to REALTIME. The code ran in 20 cycles on my Core 2 Duo.
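For anyone who wants to check the results of the routine above without an assembler, here is a minimal C model of the same idea. It is only a sketch: `format_size`, `highest_bit`, `recip` and `unit_name` are names invented here, the loop stands in for the BSR pair, and `bit / 10` stands in for the bit_count_table lookup.

```c
#include <stdint.h>
#include <stdio.h>

/* Reciprocals of 1024^n, mirroring the REAL8 divide_values table. */
static const double recip[7] = {
    1.0,
    1.0 / 1024.0,
    1.0 / 1048576.0,
    1.0 / 1073741824.0,
    1.0 / 1099511627776.0,
    1.0 / 1125899906842624.0,
    1.0 / 1152921504606846976.0
};

/* Four bytes per entry (3 characters plus the terminator), as in the post. */
static const char unit_name[7][4] = {
    "BBs", "KBs", "MBs", "GBs", "TBs", "PBs", "EBs"
};

/* Index of the highest set bit (0..63), or -1 for zero. */
static int highest_bit(uint64_t v)
{
    int bit = -1;
    while (v != 0) {
        v >>= 1;
        bit++;
    }
    return bit;
}

/* Scale v to the largest unit that keeps the integer part below 1024
   and write it in the same "%f %s" layout as the crt_printf call. */
static int format_size(char *buf, size_t n, uint64_t v)
{
    int bit = highest_bit(v);
    int idx = bit < 0 ? 0 : bit / 10;      /* what bit_count_table looks up */
    double scaled = (double)v * recip[idx];
    return snprintf(buf, n, "%f %s", scaled, unit_name[idx]);
}
```

With a 64-byte buffer, `format_size(buf, sizeof buf, 1536)` produces "1.500000 KBs". One difference worth knowing: FILD loads a signed 64-bit integer, so the asm version would treat values with bit 63 set as negative, while the C cast here is unsigned.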
I worked on getting Mark's code working earlier, I did get it working. Then I took what I learned from that and modified my previous code. Not sure if it's any faster, just another take on it.
Edit: This code is slower than my previous code.
[attachment deleted by admin]
Looks ok, Greg. Re speed: crt_printf seems very slow, and can be replaced by my "dwtoa hack". Same applies to finit - a very slow instruction that can in this case be replaced by an ffree.
Quote from: Greg on August 07, 2008, 01:05:02 AM
lodword:
xor eax, eax ;if the SRC register [edx] is 0, BSR won't correctly update the DEST register
bsr eax, [edx]
jz done ;it's zero
Interesting indeed, and good to know. The zero flag is set correctly though.
Quote from: jj2007 on August 07, 2008, 11:11:05 AM
Quote from: Greg on August 07, 2008, 01:05:02 AM
lodword:
xor eax, eax ;if the SRC register [edx] is 0, BSR won't correctly update the DEST register
bsr eax, [edx]
jz done ;it's zero
Interesting indeed, and good to know. The zero flag is set correctly though.
That's actually from my code :)
Quote from: Greg on August 07, 2008, 01:05:02 AM
I worked on getting Mark's code working earlier, I did get it working. Then I took what I learned from that and modified my previous code. Not sure if it's any faster, just another take on it.
Did you time it at all? I re-posted my new code at the end of page 4. It came out to 20 cycles on my Core 2 Duo, not counting the printf.
Quote from: jj2007 on August 07, 2008, 07:01:33 AM
Looks ok, Greg. Re speed: crt_printf seems very slow, and can be replaced by my "dwtoa hack". Same applies to finit - a very slow instruction that can in this case be replaced by an ffree.
Yeah, I picked up your ffree trick in my code. Thanks.
The majority of the time when I print to the screen it doesn't need to be fast. I print out debugging statistics and other stuff. In general I don't print at all if the code needs to be fast. If you have to have a print, try to move the printfs outside the part that needs to be fast. For instance, I collected statistics on the data I was looking at, saved them in memory, and printed them out later. So in my opinion you should never do printfs in time-critical code. Same with malloc() and free(). I've seen otherwise fast code bog down from constantly having to allocate small pieces of data. Allocate one big buffer at the beginning and break off chunks as you need them.
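The "one big buffer" advice can be sketched in C roughly like this. It is only an illustration under my own assumptions: the `Arena` type and the `arena_*` names are hypothetical, chunks are rounded up to 16 bytes, and individual chunks are never freed, only the whole buffer at the end.

```c
#include <stddef.h>
#include <stdlib.h>

/* One big buffer that chunks are broken off from. */
typedef struct {
    unsigned char *base;  /* start of the buffer */
    size_t size;          /* total bytes available */
    size_t used;          /* bytes handed out so far */
} Arena;

/* Allocate the big buffer once, up front. Returns 1 on success. */
static int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Break off a chunk: just bump the offset, no call into the heap. */
static void *arena_alloc(Arena *a, size_t n)
{
    size_t rounded = (n + 15u) & ~(size_t)15u;   /* keep chunks 16-byte aligned */
    if (rounded < n || rounded > a->size - a->used)
        return NULL;                             /* overflow or out of space */
    void *p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Give everything back in one call. */
static void arena_release(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

The trade-off is that memory is only returned all at once, which fits the "collect statistics now, print them later" pattern described above.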
EDIT: Pbrennick was asking about optimization tricks in a separate thread. I posted my own website with 60 optimization tricks. The funny thing is that I specifically say to use BSR to get the highest power of 2 (tip #17 on the webpage).
http://www.mark.masmcode.com/
Quote from: jj2007
Looks ok, Greg. Re speed: crt_printf seems very slow, and can be replaced by my "dwtoa hack". Same applies to finit - a very slow instruction that can in this case be replaced by an ffree.
If the code was too slow for the application, then I would worry about it. Regarding finit, I think the benefits of using it far outweigh any speed considerations.
xor eax, eax ;if the SRC register [edx] is 0, BSR won't correctly update the DEST register
Quote from: jj2007
Interesting indeed, and good to know. The zero flag is set correctly though.
Yes, that was Mark's idea. I left the comment in my code as a reminder of what it was.
Quote from: Mark Larson
did you time it at all?
No, I took your word for it. I imagine the div in the last code I posted is not helping the speed any.
Mark,
Your Assembly Optimization Tips (http://www.mark.masmcode.com/) are very good, I have a link to them in my favorites.
Quote from: Greg on August 07, 2008, 05:59:44 PM
Regarding finit, I think the benefits of using it far outweigh any speed considerations.
finit has some advantages, such as resetting everything to standard values, e.g. precision to 80 bits. But it is sufficient to use it once at code start; and I have a suspicion that Windows does it for you when you launch the program... inside a loop, ffree st(7) is absolutely sufficient afaik.
jj2007,
Quote from: jj2007
But it is sufficient to use it once at code start;
I agree, that's the way to do it.
Don't count on Windows doing it for you when you launch your program, it doesn't. In later versions of Windows, some API calls leave the FPU in 53-bit precision. Also, some API calls use MMX and don't clean up with emms.
Quote from: Greg on August 07, 2008, 07:36:20 PM
In later versions of Windows, some API calls leave the FPU in 53-bit precision. Also, some API calls use MMX and don't clean up with emms.
Cute. Do you have any links or other references saying which ones misbehave?
Quote from: Greg on August 07, 2008, 06:02:36 PM
Quote from: Mark Larson
did you time it at all?
No, I took your word for it. I imagine the div in the last code I posted is not helping the speed any.
Correct. My latest code will be the fastest. I was thinking we could combine this with Michael's macro, and add a routine that prints out the correct unit of time based on the returned value in clocks.
clocks in edx:eax
use 1000.0 instead of 1024.0
Assume I have a 1 GHz processor (makes it easier). 1 GHz = 1000 MHz (need to convert to MHz).
To get the time in microseconds, take the clocks and divide by the CPU speed in MHz.
If the clocks value is 10000, divide by 1000 for the CPU speed to get microseconds:
10000
----------- = 10 microseconds (10000 nanoseconds)
1000
so we do the same thing. If the value is within a certain range we use different units of time.
nanoseconds
microseconds
milliseconds
seconds
minutes
hours
what do you think?
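A rough C sketch of that proposal, under my own assumptions: `format_duration` is a made-up name, the CPU speed is passed in as MHz rather than measured, and the unit steps are 1000 up to seconds and 60 after that.

```c
#include <stdint.h>
#include <stdio.h>

/* Convert a clock count to the largest time unit that keeps the
   value readable, then format it with two decimals. */
static int format_duration(char *buf, size_t n, uint64_t clocks, double cpu_mhz)
{
    static const char *const names[6] = {
        "nanoseconds", "microseconds", "milliseconds",
        "seconds", "minutes", "hours"
    };
    /* Factor needed to step from names[i] to names[i + 1]. */
    static const double step[5] = { 1000.0, 1000.0, 1000.0, 60.0, 60.0 };

    /* clocks / MHz gives microseconds, so multiply by 1000 for nanoseconds. */
    double t = (double)clocks * 1000.0 / cpu_mhz;
    int i = 0;
    while (i < 5 && t >= step[i]) {
        t /= step[i];
        i++;
    }
    return snprintf(buf, n, "%.2f %s", t, names[i]);
}
```

With 10000 clocks at 1000 MHz this gives "10.00 microseconds", matching the arithmetic above. Unlike the byte-size routine it loops, since the steps are not all the same size.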
jj2007,
I don't have a list of specific APIs. You won't find anything about it on MSDN either.
The MMX issue is mentioned by Raymond in SimplyFPU and I have seen it discussed elsewhere. The issue about the precision has come up in the PowerBASIC forums. PowerBASIC has an 80-bit extended precision data type (EXT) so if you are using EXT variables you would want the FPU to always use 64-bit precision. PowerBASIC has found that some Windows APIs leave the precision set to 53 bits. It's Microsoft's problem, but PowerBASIC is working on work-arounds.
Microsoft has pretty much decided 80-bit extended precision variables don't exist. They removed them from their compilers, they say for compatibility with other CPUs (PowerPC etc.). Big mistake if you ask me. They could have kept them for x86, but they took the easy way out. Other C++ compilers support them, like Borland and Intel.
Mark,
MichaelW's macros will display milliseconds. For different units of time it sounds like a good idea.
Quote
My latest code will be the fastest.
I did some timing on my code. The last code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70299#msg70299) is pretty slow (but it is short and concise). The previous code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70240#msg70240) does pretty well, I get about 19 cycles.
Quote from: Greg on August 07, 2008, 08:17:36 PM
Microsoft has pretty much decided 80-bit extended precision variables don't exist. They removed them from their compilers, they say for compatibility with other CPUs (PowerPC etc.). Big mistake if you ask me. They could have kept them for x86, but they took the easy way out. Other C++ compilers support them, like Borland and Intel.
I agree. The gain in speed and space is not that significant. There is a speed penalty for 80-bit mem to FPU transfers (as compared to real4 and real8), therefore for my roll-your-own FPU lib, I will try to use 64 bit memory variables, keep internal full 80 bit precision, and do as many things as possible inside the FPU without shoving values to memory.
jj2007,
I think you missed my point, I think it was a mistake Microsoft got rid of them (80-bit extended precision variables).
It all depends on what you are doing. If you want accuracy and precision REAL10 is the way to go. If you want your library to be able to be called from Microsoft C/C++, then you need to use REAL8 (or REAL4) variables. If you are writing fast graphics you would probably use REAL4 variables.
I like having the choice to use REAL10 variables if I want to.
Quote from: Greg on August 07, 2008, 08:23:20 PM
Mark,
MichaelW's macros will display milliseconds. For different units of time it sounds like a good idea.
Quote
My latest code will be the fastest.
I did some timing on my code. The last code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70299#msg70299) is pretty slow (but it is short and concise). The previous code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70240#msg70240) does pretty well, I get about 19 cycles.
Time my latest so you can get a comparison.
I'm switching to SSE2 and trying that instead. It should be faster on Intel P4 and later CPUs.
What type and speed of CPU do you have, Greg?
Mark
Quote from: Greg on August 07, 2008, 08:23:20 PM
Mark,
MichaelW's macros will display milliseconds. For different units of time it sounds like a good idea.
Quote
My latest code will be the fastest.
I did some timing on my code. The last code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70299#msg70299) is pretty slow (but it is short and concise). The previous code I posted here (http://www.masm32.com/board/index.php?topic=9585.msg70240#msg70240) does pretty well, I get about 19 cycles.
I tried to incorporate your code - no full success because I used dwtoa for a fixed buffer, while you use crt_printf with a chr$; but when disabling the print part, I get these timings:
BSR, 122 bytes:
Test0 28 cycles
Test1 43 cycles
Test2 40 cycles
Test3 40 cycles
Test4 39 cycles
MMX, 150 bytes:
Test0 31 cycles
Test1 38 cycles
Test2 46 cycles
Test3 53 cycles
Test4 57 cycles
Greg, 221 bytes:
Test0 36 cycles
Test1 35 cycles
Test2 32 cycles
Test3 29 cycles
Test4 26 cycles
[attachment deleted by admin]
Quote from: Greg on August 07, 2008, 11:02:16 PM
I think you missed my point, I think it was a mistake Microsoft got rid of them (80-bit extended precision variables).
It all depends on what you are doing. If you want accuracy and precision REAL10 is the way to go. If you want your library to be able to be called from Microsoft C/C++, then you need to use REAL8 (or REAL4) variables. If you are writing fast graphics you would probably use REAL4 variables.
I like having the choice to use REAL10 variables if I want to.
No disagreement, Greg - you should indeed have the choice. But for my own use, I will default to REAL8 because it's the best compromise. Inaccuracies creep in if you repeatedly convert FPU values to lower precision memory vars - but that can be avoided by keeping the stuff inside the FPU while doing complex calculations, with 80 bits precision. After all, there are eight handy registers...
EDIT: Just checked with fphelp.hlp my statement about speed penalty for using REAL10, and while it's correct for FLD, t=6:
FLD memreal | fld longreal | 486 s=3,l=3,t=6
.. it seems incorrect for FSTP, where l=8 is slowest:
FSTP memreal
fstp longreal | 486 s=7,l=8,t=6
fstp tempreals[bx] | 486 s=7,l=8,t=6
What does the [bx] variant mean?
Mark,
I timed your code, I'm getting around 13 cycles for your latest code (with the bit_count_table), it's definitely faster. I have a Pentium D 940.
Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I just looked at your BSR code, and you do a lot more work than me to get the result. I don't even have one loop. Just two BSRs and one lookup table.
Nor do I. Check the file preceding your post for ShowQwBsr. It looks pretty similar. Main difference is that I use a REAL4 table that has already the mul 100 incorporated. For showing two decimals, REAL4 is already an overkill...
.data
qwEBi REAL4 8.67361737988404E-17
qwPBi REAL4 8.88178419700125E-14
qwTBi REAL4 9.09494701772928E-11
qwGBi REAL4 9.31322574615479E-8
qwMBi REAL4 9.5367431640625E-05
qwKBi REAL4 0.09765625
t6 db 9, "EB", 0
t5 db 9, "PB", 0
t4 db 9, "TB", 0
t3 db 9, "GB", 0
t2 db 9, "MB", 0
t1 db 9, "kB", 0
t0 db 9, "bytes", 0
ShowQwBsr proc uses edi esi ebx pNum:DWORD
mov esi, pNum ; pointer to our 8-byte number
mov edi, offset t0 ; default is bytes
mov ebx, [esi+4] ; test HiDword
bsr eax, ebx ; get first bit from the left
jnz H1 ; HiDword is set
mov ebx, [esi] ; test LoDword
bsr eax, ebx ; get first bit from left
jz IsZero ; number is zero, special treatment ;-)
jmp H0
H1:
add eax, 32 ; HiDword was set, so add 32 to the bit count
H0:
aam ; divide by 10, result in ah, remainder in al
movzx ecx, ah ; counter
mov ebx, ecx ; copy of counter
sal ecx, 2 ; *4
sub edi, ecx ; pointer to the unit and the R4 complements
push [esi] ; for bytes: push the loword only
.if ebx
ffree st(7) ; instead of an "expensive" finit
fild qword ptr [esi] ; push quad on FPU, and multiply with complements*100
fmul dword ptr [edi-24] ; to simulate div 1024 and 2 decimals
fistp dword ptr [esp] ; pop result from FPU stack to CPU stack
.endif
pop eax ; our number as an integer
IsZero:
mov esi, offset FlexByteBuffer ; a dedicated buffer to display number and unit
invoke dwtoa, eax, esi
.if ebx ; add the dot if it's more than bytes
mov edx, [esi+len(esi)-2] ; len returns eax, therefore
mov cl, "." ; we use ecx/cl as accu
mov [esi+eax-2], cl ; insert the dot
mov [esi+eax-1], edx ; and add the two decimals copied above
.endif
invoke lstrcat, esi, edi ; add the xByte unit
ret
ShowQwBsr endp
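To make the "multiply by 100/1024^n, then insert the dot" idea easier to check, here is a hedged C sketch of it. It is not a translation of ShowQwBsr: the name `format_two_decimals` is invented, a shift loop replaces BSR and AAM, a double replaces the REAL4 table, and rounding is done by adding 0.5 instead of relying on the FPU rounding mode.

```c
#include <stdint.h>
#include <stdio.h>

/* Format v with two decimals by scaling to hundredths of the unit
   and splitting the integer result around the decimal point. */
static int format_two_decimals(char *buf, size_t n, uint64_t v)
{
    static const char *const unit[7] = {
        "bytes", "kB", "MB", "GB", "TB", "PB", "EB"
    };
    int idx = 0;
    for (uint64_t t = v; t >= 1024; t >>= 10)
        idx++;                                   /* one step per 10 bits */

    if (idx == 0)                                /* plain bytes: no dot */
        return snprintf(buf, n, "%u\t%s", (unsigned)v, unit[0]);

    /* Same role as the table entries: 100 / 1024^idx. */
    double factor = 100.0 / (double)(1ULL << (10 * idx));
    unsigned hundredths = (unsigned)((double)v * factor + 0.5);

    return snprintf(buf, n, "%u.%02u\t%s",
                    hundredths / 100, hundredths % 100, unit[idx]);
}
```

For 1536 bytes the scaled value is 150 hundredths, which prints as "1.50" followed by a tab and "kB", the same layout the dwtoa plus dot-insertion code builds by hand.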
JJ you have a bug in your code.
In the timing part you don't actually call the procedure to time. That is why all the clocks are close
You need to set the # of loops to 10,000,000
and the priority to REALTIME
it'll give you more accurate results.
Quote from: Mark_Larson on August 08, 2008, 12:55:34 AM
JJ you have a bug in your code.
In the timing part you don't actually call the procedure to time. That is why all the clocks are close
You need to set the # of loops to 10,000,000
and the priority to REALTIME
it'll give you more accurate results.
If you refer to the post above, flexbyte.zip below "Greg, 221 bytes:", no there is no bug, the procedure is being called. But you are right about the loop count: I had set it to 100 for testing, and forgot to reset. With 10,000,000, results are as follows on a P4, 3.4 GHz:
BSR, 124 bytes:
Test0 105 cycles
Test1 116 cycles
Test2 104 cycles
Test3 113 cycles
Test4 105 cycles
MMX, 150 bytes:
Test0 36 cycles
Test1 45 cycles
Test2 59 cycles
Test3 75 cycles
Test4 96 cycles
Greg, 226 bytes:
Test0 53 cycles
Test1 49 cycles
Test2 38 cycles
Test3 36 cycles
Test4 32 cycles
Greg's code is the fastest, but for practical purposes it would be useful to invert the order of the multiple .IF hibit >= tests.
I attach the version with the corrected loop count.
[attachment deleted by admin]
BSR, 122 bytes:
Test0 14 cycles
Test1 19 cycles
Test2 18 cycles
Test3 18 cycles
Test4 19 cycles
MMX, 150 bytes:
Test0 15 cycles
Test1 20 cycles
Test2 25 cycles
Test3 28 cycles
Test4 32 cycles
Greg, 221 bytes:
Test0 20 cycles
Test1 20 cycles
Test2 22 cycles
Test3 25 cycles
Test4 17 cycles
Quote from: Mark_Larson on August 08, 2008, 09:40:18 AM
Strange that your figures are so much lower than mine. Different CPU?
core 2 duo
that is why I was surprised by your #'s
you don't need to use AAM
crt_printf can limit the number of decimal places it prints (for example "%.2f").
aam runs very slowly.
Quote from: Mark_Larson on August 08, 2008, 09:52:40 AM
core 2 duo
that is why I was surprised by your #'s
you don't need to use AAM
crt_printf can limit the number of decimal places it prints (for example "%.2f").
aam runs very slowly.
Attached version with aam switched off - still pretty slow. Would you mind adding your code? I prepared a ShowQwMark slot, just copy & paste.
[attachment deleted by admin]
for some reason, when I add my code into yours, I get a big slow down. But when I cut all 5 of your test data into my program, it runs as expected.
All 5 tests from your code run in 19 cycles on mine.
I am still going to do an SSE2 version I just haven't gotten to it yet.
there's a trick you can also do with shifts instead of doing a MUL
I have an interview today, so I don't know how long I will be on
inspired by Mirno's concept, I've created a function using SSE2.
It is not optimized at all, but it works well.
regards, qWord
[attachment deleted by admin]
Quote from: jj2007
... for practical purposes it would be useful to invert the order of the multiple .IF hibit >= tests.
Yeah, you're right. I reversed it to test for the smallest first and uploaded a new file. Maybe slightly faster.
AMD dual-core x64 4000+ (WinXP x32)
BSR, 128 bytes:
Test0 72 cycles
Test1 74 cycles
Test2 65 cycles
Test3 65 cycles
Test4 65 cycles
MMX, 150 bytes:
Test0 15 cycles
Test1 20 cycles
Test2 26 cycles
Test3 31 cycles
Test4 34 cycles
Greg, 222 bytes:
Test0 42 cycles
Test1 41 cycles
Test2 28 cycles
Test3 25 cycles
Test4 24 cycles
Okay,
Since Mark mentioned BSR, and sinsi used SHRD, I revisited
my code. Cleaned it up a bit first, the algorithm is a bit clearer.
(I hope.) Then used those instructions to do the same thing
using a different algorithm to select the bits to display. There
are two EQUates to select the different options. One selects
either the BSR and a jump table, or the test and branch code.
The other switches between using AAM or a divide to format
the fraction.
Written as mostly 32 bit code, though a few things remain
as 16 bit. The BSR code probably looks like it had an encounter
with a large, flat, programming rock. But it is reasonably
commented, and works. So some beginner may benefit.
Regards,
Steve N.
[attachment deleted by admin]
Quote from: Mirno on August 06, 2008, 05:23:02 PM
The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
Have you tried crt_sprintf? No invoke available, but it does the same as crt_printf except it writes to a buffer:
.data?
f2sBuffer dd 10 dup(?)
.data
Float8 REAL8 1234.5678901234567890
StrFormat db "Float is %.17g", 0
.code
mov eax, offset Float8
push [eax+4]
push [eax]
push OFFSET StrFormat
push OFFSET f2sBuffer
call crt_sprintf
pop edx ; f2sBuffer
add esp, 12
MsgBox 0, edx, "ST$ test:", MB_OK
jj2007,
You can use invoke:
mov eax, OFFSET Float8
INVOKE crt_sprintf, ADDR f2sBuffer, ADDR StrFormat, REAL8 PTR [eax]
Quote from: Greg on August 19, 2008, 04:59:04 AM
jj2007,
You can use invoke:
mov eax, OFFSET Float8
INVOKE crt_sprintf, ADDR f2sBuffer, ADDR StrFormat, REAL8 PTR [eax]
I had not seen that one, thanks a lot Greg
Divisions? FPU? MMX/SSE2?
What's wrong with you guys? Maybe I should post something before you attack the AVX instruction set...
.DATA
ALIGN 16
XBytes_Table BYTE " Ko",0
BYTE " Mo",0
BYTE " Go",0
BYTE " To",0
BYTE " Po",0
BYTE " Eo",0
.CODE
ALIGN 16
;
; syntax :
; mov eax,{low part (31-0 bits) of the value}
; mov edx,{high part (63-32 bits) of the value (or 0 if no high part)}
; mov esi,{OFFSET of the string to create (12 bytes needed)}
; call GetFileSizeString
;
; return :
; eax = string length
;
GetFileSizeString PROC
push ebx
push ecx
push edx
push esi
push edi
CaseEdx:
mov edi,OFFSET XBytes_Table
test edx,edx
jz CaseEax
add edi,DWORD
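shrd eax,edx,10 ; shift edx:eax right by 10 bits (divide by 1024), low half
shr edx,10 ; high half; ZF set when nothing is left above 32 bits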
jz CaseEax
add edi,DWORD
shrd eax,edx,10
shr edx,10
jz CaseEax
add edi,DWORD
shrd eax,edx,10
shr edx,10
jz CaseEax
add edi,DWORD
shrd eax,edx,10
CaseEax:
test eax,11111111111111111111110000000000b
jz CaseBytes
test eax,11111111111100000000000000000000b
jz CaseXBytes
shr eax,10
add edi,DWORD
test eax,11111111111100000000000000000000b
jz CaseXBytes
shr eax,10
add edi,DWORD
CaseXBytes:
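mov ebx,eax
shr eax,10 ; eax = integer part (0..1023)
shl ebx,22 ; ebx = low 10 bits as a 32-bit binary fraction
mov edx,4294968 ; ceil(2^32/1000)
mov ecx,10
mul edx ; edx = thousands digit (CF set if non-zero), eax = remaining fraction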
jc Label01
dec esi
mul ecx
jc Label02
dec esi
mul ecx
jc Label03
dec esi
jmp Label04
Label01: add dl,"0"
mov BYTE PTR [esi],dl
mul ecx
Label02: add dl,"0"
mov BYTE PTR [esi+1],dl
mul ecx
Label03: add dl,"0"
mov BYTE PTR [esi+2],dl
Label04: mul ecx
add dx,".0"
mov WORD PTR [esi+3],dx
mov eax,ebx
test ebx,ebx
jnz Couple
dec esi
xor edx,edx
jmp UniqueZero
Couple:
mul ecx
add dl,"0"
mov BYTE PTR [esi+5],dl
mul ecx
UniqueZero:
mov eax,DWORD PTR [edi]
add dl,"0"
mov BYTE PTR [esi+6],dl
mov DWORD PTR [esi+7],eax
lea eax,[esi+10]
pop edi
pop esi
pop edx
pop ecx
pop ebx
sub eax,esi
ret
CaseBytes:
mov ebx,4294968 ; ceil(2^32/1000)
mov ecx,10
mul ebx ; edx = thousands digit (CF set if non-zero), eax = remaining fraction
jc Label21
dec esi
mul ecx
jc Label22
dec esi
mul ecx
jc Label23
dec esi
jmp Label24
Label21: add dl,"0"
mov BYTE PTR [esi],dl
mul ecx
Label22: add dl,"0"
mov BYTE PTR [esi+1],dl
mul ecx
Label23: add dl,"0"
mov BYTE PTR [esi+2],dl
Label24: mul ecx
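add edx,"yB 0" ; last digit followed by " By" (MASM stores the literal's bytes reversed)
mov DWORD PTR [esi+3],edx
mov DWORD PTR [esi+7],"set" ; "tes" plus the terminating zero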
lea eax,[esi+11]
pop edi
pop esi
pop edx
pop ecx
pop ebx
sub eax,esi
ret
GetFileSizeString ENDP
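The digit extraction in GetFileSizeString may be easier to follow in C. The sketch below is my reading of the technique, not a line-by-line port, and the function names are invented. It assumes the input is already scaled so that the upper bits hold an integer part of 0..1023 and the low 10 bits hold the fraction. The constant 4294968 is ceil(2^32/1000): multiplying by it leaves the thousands digit in the high 32 bits, and each further multiply of the low 32 bits by 10 yields the next digit. One simplification: this version always writes two decimals, while the routine above writes a single "0" when the fraction is zero.

```c
#include <stdint.h>

/* Write n (0..1023) in decimal without leading zeros, using only
   multiplies. Returns the number of characters written. */
static int put_uint_0_1023(char *out, uint32_t n)
{
    uint64_t p = (uint64_t)n * 4294968u;    /* like "mul edx": high = n / 1000 */
    uint32_t digit = (uint32_t)(p >> 32);   /* edx after the multiply */
    uint32_t frac = (uint32_t)p;            /* eax after the multiply */
    int len = 0;

    for (int i = 0; i < 4; i++) {
        /* Skip leading zeros, but always emit the units digit. */
        if (digit != 0 || len > 0 || i == 3)
            out[len++] = (char)('0' + digit);
        p = (uint64_t)frac * 10u;           /* like "mul ecx" with ecx = 10 */
        digit = (uint32_t)(p >> 32);
        frac = (uint32_t)p;
    }
    out[len] = '\0';
    return len;
}

/* x holds a value in 1/1024ths of the unit: integer part in the upper
   bits, fraction in the low 10 bits. Writes "int.dd" (truncated). */
static int format_fixed10(char *out, uint32_t x)
{
    int len = put_uint_0_1023(out, x >> 10);   /* "shr eax,10" */
    uint32_t frac = x << 22;                   /* "shl ebx,22" */

    out[len++] = '.';
    for (int i = 0; i < 2; i++) {
        uint64_t p = (uint64_t)frac * 10u;     /* next decimal digit */
        out[len++] = (char)('0' + (uint32_t)(p >> 32));
        frac = (uint32_t)p;
    }
    out[len] = '\0';
    return len;
}
```

For example, `format_fixed10(buf, 1536)` writes "1.50": the integer part is 1 and the low 10 bits (512) are exactly one half.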