decimal as KiB, MiB, GiB etc.

Started by sinsi, July 29, 2008, 12:38:02 PM


sinsi

Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself
Looking to the future, instructions like AAM aren't legal in 64-bit programming  :(
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on August 06, 2008, 05:21:27 AM
Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself
Looking to the future, instructions like AAM aren't legal in 64-bit programming  :(

No problem, I live in Europe, and our laws are pretty liberal, hehe :bg

Mirno

Seeing as we're all having a go, here's mine (stole the proc format mostly from Greg)...

ShowBytes PROC lo:DWORD, hi:DWORD
    LOCAL tmp:REAL8
    .DATA
        fmt     BYTE "%.2f %s", 0

        ; Do this with a macro so it looks nicer!
        ALIGN 8
        szB     BYTE "bytes",0
        ALIGN 8
                BYTE "kB",0
        ALIGN 8
                BYTE "MB",0
        ALIGN 8
                BYTE "GB",0
        ALIGN 8
                BYTE "TB",0
        ALIGN 8
                BYTE "PB",0
        ALIGN 8
                BYTE "EB",0

    .CODE
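        fild QWORD PTR [lo]             ; load hi:lo as a signed 64-bit integer
        fstp tmp                        ; store it as a REAL8
        mov eax, DWORD PTR [tmp + 4]    ; high dword: sign, exponent, top of mantissa
        mov ecx, (10 SHL 20)            ; divisor 10, aligned with the exponent field
        xor edx,edx
        and eax, (07FFh SHL 20)         ; isolate the 11-bit exponent
        jz @F                           ; exponent field 0 means the input was zero
        sub eax, (1023 SHL 20)          ; remove the bias
        div ecx                         ; eax = exponent / 10, edx = (exponent mod 10) SHL 20

        mov ecx, DWORD PTR [tmp + 4]
        add edx, (1023 SHL 20)          ; re-bias the reduced exponent
        and ecx, NOT (07FFh SHL 20)     ; keep the sign and mantissa bits

        or  edx, ecx                    ; new high dword: value scaled into [1, 1024)
@@:
        lea eax, [offset szB + eax*8]   ; unit strings are 8 bytes apart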
        INVOKE crt_printf, ADDR fmt,  DWORD PTR [tmp], edx, eax
        ret
ShowBytes ENDP


It seems crt_printf doesn't accept REAL4 (float) arguments; they must be REAL8 (double).
So I made the necessary changes (11 bits of exponent, bias by 1023).

It should also deal with zero - which is a corner case when dealing with the floating point numbers.
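For readers who find the exponent trick easier to follow in C, here is a minimal sketch of the same idea. It assumes an IEEE-754 binary64 `double`, and the names `scale_by_exponent` and `unit_names` are made up for illustration. Unlike the `fild` above, it treats the input as unsigned.

```c
#include <stdint.h>
#include <string.h>

/* Unit names, indexed by (unbiased exponent) / 10. */
static const char *const unit_names[7] = {
    "bytes", "kB", "MB", "GB", "TB", "PB", "EB"
};

/* Scales n into [1, 1024) by editing the exponent field of its
   double representation. Returns the scaled value and stores the
   unit index in *unit. Assumes IEEE-754 binary64 doubles. */
static double scale_by_exponent(uint64_t n, int *unit)
{
    double d = (double)n;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* reinterpret the bit pattern */

    uint64_t biased = (bits >> 52) & 0x7FF;   /* 11-bit exponent field */
    if (biased == 0) {                        /* only n == 0 gets here */
        *unit = 0;
        return d;
    }
    unsigned e = (unsigned)(biased - 1023);   /* remove the bias: 0..64 */
    *unit = (int)(e / 10);                    /* which power of 1024 */

    bits &= ~((uint64_t)0x7FF << 52);         /* clear the old exponent */
    bits |= (uint64_t)(e % 10 + 1023) << 52;  /* keep only exponent mod 10 */
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, 1536 comes back as 1.5 with unit index 1, which `printf("%.2f %s", value, unit_names[unit])` would show as "1.50 kB".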

Mirno

jj2007

One more, this time with cycle counts. The BSR version is slightly shorter than the MMX version but seems to be slower, too - I am not quite sure because as you can easily see, the MMX version fails miserably for the TB and PB examples ::)

Furthermore, the MMX version has a considerable rounding error. Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.

BSR, 113 bytes:
Test0   492 cycles      123     bytes
Test1   474 cycles      123.45  MB
Test2   510 cycles      123.45  GB
Test3   460 cycles      123.45  TB
Test4   459 cycles      123.45  PB

MMX, 138 bytes:
Test0   357 cycles      123     bytes
Test1   386 cycles      123.44  MB
Test2   384 cycles      123.44  GB
Test3   324 cycles      .0
Test4   320 cycles      .0

EDIT 2: Attachment removed, see later post.
EDIT 1: Celeron M, 1.6 GHz:

BSR, 113 bytes:
Test0   225 cycles      123     bytes
Test1   253 cycles      123.45  MB
Test2   249 cycles      123.45  GB
Test3   247 cycles      123.45  TB
Test4   249 cycles      123.45  PB

MMX, 138 bytes:
Test0   184 cycles      123     bytes
Test1   236 cycles      123.44  MB
Test2   241 cycles      123.44  GB
Test3   177 cycles      .0
Test4   176 cycles      .0

Mark_Larson

  I am working on my own BSR version but there are some bugs I am working out.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

qWord

Quote from: jj2007 on August 06, 2008, 10:25:12 AM...  MMX version fails miserably for the TB and PB examples  ...

There was a small bug in your code:


...
@@: movq mm3, mm2
    psubq mm3, mm0 ;db 0fh, 0fbh, 0d8h
    pextrw eax, mm3, 3
    test eax, 08000h
    jz @F
    psllq mm2, 10
    psrlq mm1, 10
    dec edi
    jnz @B
@@:
...


Quote from: jj2007 on August 06, 2008, 10:25:12 AM
Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.
Here are my results on a Core 2 Duo (corrected version):


BSR, 113 bytes:
Test0   207 cycles      123     bytes
Test1   243 cycles      123.45  MB
Test2   225 cycles      123.45  GB
Test3   227 cycles      123.45  TB
Test4   223 cycles      123.45  PB

MMX, 141 bytes:
Test0   170 cycles      123     bytes
Test1   206 cycles      123.44  MB
Test2   212 cycles      123.44  GB
Test3   216 cycles      123.44  TB
Test4   213 cycles      123.44  PB
FPU in a trice: SmplMath
It's that simple!

Mirno

The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).
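An integer-only variant of the "it's just shifting" idea can be sketched in C as follows. This is not the two-`fistp` approach described above, only an illustration under stated assumptions: `split_scaled` is a hypothetical name, `shift` is 10 for KB, 20 for MB and so on, and the fraction is truncated rather than rounded.

```c
#include <assert.h>
#include <stdint.h>

/* Splits n / 2^shift into a whole part and two truncated decimals.
   shift must be a multiple of 10 between 10 and 60. */
static void split_scaled(uint64_t n, unsigned shift,
                         uint64_t *whole, unsigned *hundredths)
{
    assert(shift >= 10 && shift <= 60 && shift % 10 == 0);
    *whole = n >> shift;                           /* integer part */
    uint64_t frac10 = (n >> (shift - 10)) & 1023;  /* top 10 fraction bits */
    *hundredths = (unsigned)(frac10 * 100 / 1024); /* 0..99, truncated */
}
```

Because only the top 10 fraction bits are used and the result is truncated, the decimals can come out one hundredth low: 129446707 bytes (about 123.45 MB) yields 123 and 44. The two parts can then be printed with something like `printf("%llu.%02u", ...)` or converted with an integer routine.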

jj2007

Quote from: Mirno on August 06, 2008, 05:23:02 PM
The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
That's why I chose the little dwtoa hack :bg

Quote
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).

I incorporated the two decimals into the complements table... I doubt there is a faster solution.

jj2007

Quote from: qWord on August 06, 2008, 05:20:09 PM
there was some little failure in your code:


...
    test eax, 08000h
@@:
...



You're a darling, qWord, thanxalot :thumbu

My timings on Celeron M:

BSR, 113 bytes:
Test0   225 cycles      123     bytes
Test1   251 cycles      123.45  MB
Test2   248 cycles      123.45  GB
Test3   248 cycles      123.45  TB
Test4   247 cycles      123.45  PB

MMX, 141 bytes:
Test0   189 cycles      123     bytes
Test1   234 cycles      123.44  MB
Test2   243 cycles      123.44  GB
Test3   243 cycles      123.44  TB
Test4   266 cycles      123.44  PB

Corrected version attached.


[attachment deleted by admin]

Mark_Larson

I just looked at your BSR code, and you do a lot more work than I do to get the result.  I don't have a single loop, just two BSRs and one lookup table.  I think I'll go ahead and post my buggy code so that you can see a different approach.

The bug is in the floating point code.

My approach is to use BSR on the upper 32 bits.  If they are 0, then we do it on the lower 32 bits.

I use the BSR bit value as a value into a lookup table to get ONE value to divide by.  You can switch this to a multiply of course, you just need to set up the lookup table that way.  I was going to do that later.

Again, there is no looping.  There is only ONE conditional jump in the code.  If you can find the bug in the floating point code, it should work.  I didn't do extensive testing, but I did test 10 random 64-bit numbers.
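The two-step bit scan described above can be modelled in C like this. It is a sketch only: `bsr32` is a portable stand-in for the BSR instruction (it uses a loop, which the assembly version does not need), and both function names are made up for illustration.

```c
#include <stdint.h>

/* Portable stand-in for BSR: index of the highest set bit,
   or -1 if x == 0 (the real instruction leaves its destination
   undefined in that case). */
static int bsr32(uint32_t x)
{
    int i = -1;
    while (x) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Highest set bit of the 64-bit value hi:lo, or -1 if it is zero. */
static int highest_bit64(uint32_t lo, uint32_t hi)
{
    if (hi != 0)
        return 32 + bsr32(hi);  /* bit lives in the upper dword */
    return bsr32(lo);           /* fall back to the lower dword */
}
```

The single `if` corresponds to the one conditional jump after the first BSR.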

Here is my data lookup table.  I use the string of text to print as a lookup table as well.  Since it's a lookup table, you can divide these values into 1.0 to flip them and then just do a multiply.


align 8

divide_values dq 1
dq 1024
dq 1024*1024
dq 1024*1024*1024
dq 1024*1024*1024*1024
dq 1024*1024*1024*1024*1024
dq 1024*1024*1024*1024*1024*1024
dq 1024*1024*1024*1024*1024*1024*1024


String lookup table.  The strings have to be exactly 4 bytes long for the code to work correctly: 3 bytes of chars and an ASCII 0.  That lets me use a trick later on to make looking up these values quicker.

string_size_table db "BBs",0
db "KBs",0
db "MBs",0
db "GBs",0
db "TBs",0
db "PBs",0
db "EBs",0
db "?Bs",0


  I didn't know what the value was after ExaBytes.  So I used ?bytes as the last entry in the table.


invoke nseed,34521345
invoke nrandom,0ffffffffh
mov ebx,eax ;save in EBX
invoke nrandom,0ffffffffh
;got 64-bit number
mov edx,ebx

pushad
fn crt_printf,"Hex Value: %.8X::%.8X%c%c", edx,eax, 13, 10
popad

;edx:eax already has 64-bit
bsr ebx,edx
jz lower_32_bits

add ebx,4*8 ;say we are in the upper part of the table., *8 because we shift right 3 later

jmp @F

lower_32_bits:
xor ebx,ebx ;if the source register is 0, BSR leaves the destination undefined
bsr ebx,eax

@@:
shr ebx,3 ;divide by 1024 bits ( not bytes)
; fn crt_printf,"BSR: %d%c%c", ebx, 13, 10

;for debugging print the divide table entry we are using
lea ecx,[divide_values + ebx*8]
fn crt_printf,"%I64X%c%c", ecx, 13, 10

lea ecx,[string_size_table + ebx*4]
;for debugging print the string we are using
; fn crt_printf,"String: %s%c%c", ecx, 13, 10



mov dword ptr [fp],eax
mov dword ptr [fp+4],edx
ffree st(7) ;finit
fild [fp] ; st1
fild [divide_values + ebx*8] ; st0
.data?
align 8
fp dq ?
.code
fdivp st(1),st(0)
fstp [fp]
;this is the only value that needs to be printed
fn crt_printf,"%f %s%c%c", [fp], ecx, 13, 10

Invoke ExitProcess, 0


qWord

Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
  I didn't know what the value was after ExaBytes.  So I used ?bytes as the last entry in the table.

=> Zebibyte (ZiB) ==  2^70 => doesn't matter because we are using 64 bit numbers    :bg

regards, qWord


Mark_Larson

Quote from: qWord on August 06, 2008, 10:02:23 PM
Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
  I didn't know what the value was after ExaBytes.  So I used ?bytes as the last entry in the table.

=> Zebibyte (ZiB) ==  2^70 => doesn't matter because we are using 64 bit numbers    :bg

regards, qWord



Is there one between Zebi and Exa?

EDIT:  Never mind, I found it on Wikipedia.

http://en.wikipedia.org/wiki/Exabyte



Mark

Mark_Larson

Found my bug.  It wasn't floating point related.  I needed a 3rd lookup table to convert from a bit count from the BSR to which offset to use in the divide table.  I am still testing it, so I will post it later.
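To see why the bit count needs its own mapping, here is a small C sketch (function names are illustrative) that compares the intended index, one unit per 10 bits, with the `shr ebx,3` from the earlier listing, which advances one unit every 8 bits.

```c
#include <assert.h>

/* Intended mapping: every 10 bits is one power of 1024.
   A 70-entry table indexed by the bit number holds the same values. */
static unsigned unit_from_bit(unsigned bit)
{
    assert(bit < 64);
    return bit / 10;
}

/* What a shift right by 3 computes instead: bit / 8. */
static unsigned unit_from_bit_shr3(unsigned bit)
{
    return bit >> 3;
}
```

With the shift, a value whose highest set bit is bit 8 (256 to 511 bytes) already lands in the KB slot, and bit 63 maps to index 7 instead of 6.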


Mark_Larson

I found one additional bug.  I commented out all the DEBUG crt_printfs.

I changed my divide_values lookup table to REAL8 and stored the reciprocals (1/1024.0 and so on) so I could multiply instead of divide.  Again there is only one conditional branch and no looping.
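The reciprocal table can be sketched in C like this, assuming a C99 compiler for the hexadecimal floating-point literals; the names `reciprocal` and `scale_by_multiply` are illustrative. Because every entry is an exact power of two, multiplying by it gives the same result as dividing by the corresponding power of 1024.

```c
#include <assert.h>
#include <stdint.h>

/* 1/1024^k for k = 0..6, written as exact powers of two. */
static const double reciprocal[7] = {
    1.0,      /* bytes     */
    0x1p-10,  /* kilobytes */
    0x1p-20,  /* megabytes */
    0x1p-30,  /* gigabytes */
    0x1p-40,  /* terabytes */
    0x1p-50,  /* petabytes */
    0x1p-60   /* exabytes  */
};

/* Scales n into the chosen unit with a single multiply. */
static double scale_by_multiply(uint64_t n, unsigned unit)
{
    assert(unit < 7);
    return (double)n * reciprocal[unit];
}
```

For example, `scale_by_multiply(1536, 1)` gives 1.5.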

I didn't do extensive testing, so there still might be a bug or two lurking.  I tested 5 different random numbers.

If you find any bugs, please let me know, thanks :)

The code is very small: 16 lines, not counting labels but including the crt_printf.  Let me know if you find it useful.


.data

align 8

fp REAL8      0.0


; 7 entries
divide_values REAL8 1.0 ;bytes
REAL8 0.0009765625 ;kilobytes
REAL8 0.00000095367431640625 ;megabytes
REAL8 0.000000000931322574615478515625 ;gigabytes
REAL8 9.094947017729282379150390625e-13 ;terabytes
REAL8 8.8817841970012523233890533447266e-16 ;petabytes
REAL8 8.6736173798840354720596224069595e-19 ;exabytes


;70 entries, 10 * 7
bit_count_table dd 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ; the 1st 10 bits belong to the 0th offset in the divide table
dd 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ; the 2nd 10 bits belong to the 1st offset in the divide table
dd 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
dd 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
dd 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
dd 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
dd 6, 6, 6, 6, 6, 6, 6, 6, 6, 6
align 4

string_size_table db "BBs",0
db "KBs",0
db "MBs",0
db "GBs",0
db "TBs",0
db "PBs",0
db "EBs",0


.code

Start:
;edx:eax already has 64-bit
bsr ebx,edx
jz lower_32_bits
add bl,32 ;say we are in the UPPER 32-bits of the 64-bit value
jmp @F
lower_32_bits:
xor ebx,ebx ;if the source register is 0, BSR leaves the destination undefined
bsr ebx,eax
@@:
mov ebx, dword ptr [bit_count_table + ebx*4]
;ecx used later, needs to have this value for the crt_printf
lea ecx,[string_size_table + ebx*4]
mov dword ptr [fp],eax
mov dword ptr [fp+4],edx
ffree st(7) ;finit
fild [fp] ; st1
fld [divide_values + ebx*8] ; st0
fmulp st(1),st(0)
fstp [fp]
;this is the only value that needs to be printed
fn crt_printf,"%f %s%c%c", [fp], ecx, 13, 10


Mark_Larson

  I used Michael's timing macros.  For greater accuracy I commented out the crt_printf, since we are primarily concerned with how we calculate the data for it.  I did 100,000,000 loops and set the priority level to REALTIME.  The code ran in 20 cycles on my Core 2 Duo.