Decimal as KiB, MiB, GiB etc.

sinsi · August 06, 2008, 05:21:27 AM

Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself

Looking to the future, instructions like AAM aren't legal in 64-bit programming :(

jj2007 · August 06, 2008, 06:36:56 AM

Quote from: sinsi on August 06, 2008, 05:21:27 AM
Quote from: Mirno on August 04, 2008, 04:25:56 PM
I wondered about using aam myself
Looking to the future, instructions like AAM aren't legal in 64-bit programming :(

No problem, I live in Europe, and our laws are pretty liberal, hehe :bg

Mirno · August 06, 2008, 10:06:17 AM

Seeing as we're all having a go, here's mine (stole the proc format mostly from Greg)...

ShowBytes PROC lo:DWORD, hi:DWORD
    LOCAL tmp:REAL8
    .DATA
        fmt     BYTE "%.2f %s", 0

        ; Do this with a macro so it looks nicer!
        ALIGN 8
        szB     BYTE "bytes",0
        ALIGN 8
                BYTE "kB",0
        ALIGN 8
                BYTE "MB",0
        ALIGN 8
                BYTE "GB",0
        ALIGN 8
                BYTE "TB",0
        ALIGN 8
                BYTE "PB",0
        ALIGN 8
                BYTE "EB",0

    .CODE
        fild QWORD PTR [lo]
        fstp tmp
        mov eax, DWORD PTR [tmp + 4]
        mov ecx, (10 SHL 20)
        xor edx,edx
        and eax, (07FFh SHL 20)
        jz @F
        sub eax, (1023 SHL 20)
        div ecx

        mov ecx, DWORD PTR [tmp + 4]
        add edx, (1023 SHL 20)
        and ecx, NOT (07FFh SHL 20)

        or  edx, ecx
@@:
        lea eax, [offset szB + eax*8]
        INVOKE crt_printf, ADDR fmt,  DWORD PTR [tmp], edx, eax
        ret
ShowBytes ENDP

It seems crt_printf doesn't accept REAL4 (float); floating-point arguments must be REAL8 (double), because C promotes float to double in variadic calls.
So I made the necessary changes (11 bits of exponent, biased by 1023).

It should also deal with zero - which is a corner case when dealing with the floating point numbers.

Mirno
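
The exponent trick in the routine above can be checked outside the assembler. The following is a minimal Python sketch, not Mirno's code. It assumes a non-negative value below 2^64 and treats it as unsigned (the posted routine loads it with `fild`, which is signed). The names `scale_by_exponent` and `UNITS` are made up for illustration. It splits the unbiased exponent of the double into a quotient (which unit to print) and a remainder (which stays in the number), the same job `div ecx` does on the high dword.

```python
import struct

# Hypothetical unit table, mirroring the 8-byte-aligned strings in the post.
UNITS = ["bytes", "kB", "MB", "GB", "TB", "PB", "EB"]

EXP_MASK = 0x7FF << 52   # 11 exponent bits of an IEEE 754 double
BIAS = 1023


def scale_by_exponent(n):
    """Return (scaled_value, unit) for an integer 0 <= n < 2**64."""
    bits = struct.unpack("<Q", struct.pack("<d", float(n)))[0]
    exp_bits = (bits & EXP_MASK) >> 52
    if exp_bits == 0:
        # Zero has an all-zero exponent field; skip the split (the jz @F case).
        return 0.0, UNITS[0]
    # Quotient selects the unit, remainder stays as the new exponent.
    unit, rem = divmod(exp_bits - BIAS, 10)
    new_bits = (bits & ~EXP_MASK) | ((rem + BIAS) << 52)
    scaled = struct.unpack("<d", struct.pack("<Q", new_bits))[0]
    return scaled, UNITS[unit]


if __name__ == "__main__":
    value, unit = scale_by_exponent(1536)
    print("%.2f %s" % (value, unit))   # 1.50 kB
```

Because only the exponent field is rewritten, the mantissa is untouched and the scaled value is exact; any rounding comes from the integer-to-double conversion and from `%.2f`.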

jj2007 · August 06, 2008, 10:25:12 AM

One more, this time with cycle counts. The BSR version is slightly shorter than the MMX version but seems to be slower, too - I am not quite sure because as you can easily see, the MMX version fails miserably for the TB and PB examples ::)

Furthermore, the MMX version has a considerable rounding error. Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.

BSR, 113 bytes:
Test0 492 cycles 123 bytes
Test1 474 cycles 123.45 MB
Test2 510 cycles 123.45 GB
Test3 460 cycles 123.45 TB
Test4 459 cycles 123.45 PB

MMX, 138 bytes:
Test0 357 cycles 123 bytes
Test1 386 cycles 123.44 MB
Test2 384 cycles 123.44 GB
Test3 324 cycles .0
Test4 320 cycles .0

EDIT 2: Attachment removed, see later post.
EDIT 1: Celeron M, 1.6 GHz:

BSR, 113 bytes:
Test0 225 cycles 123 bytes
Test1 253 cycles 123.45 MB
Test2 249 cycles 123.45 GB
Test3 247 cycles 123.45 TB
Test4 249 cycles 123.45 PB

MMX, 138 bytes:
Test0 184 cycles 123 bytes
Test1 236 cycles 123.44 MB
Test2 241 cycles 123.44 GB
Test3 177 cycles .0
Test4 176 cycles .0

Mark_Larson · August 06, 2008, 04:16:30 PM

I am working on my own BSR version but there are some bugs I am working out.

qWord · August 06, 2008, 05:20:09 PM

Quote from: jj2007 on August 06, 2008, 10:25:12 AM
... MMX version fails miserably for the TB and PB examples ...

There was a small bug in your code:

...
@@: movq mm3, mm2
    psubq mm3, mm0 ;db 0fh, 0fbh, 0d8h
    pextrw eax, mm3, 3
    test eax, 08000h
    jz @F
    psllq mm2, 10
    psrlq mm1, 10
    dec edi
    jnz @B
@@:
...

Quote from: jj2007 on August 06, 2008, 10:25:12 AM
Anybody around with a Core2 for a speed comparison? I have a P4 here with a slow FPU.

Here are my results on a Core 2 Duo (corrected version):

BSR, 113 bytes:
Test0   207 cycles      123     bytes
Test1   243 cycles      123.45  MB
Test2   225 cycles      123.45  GB
Test3   227 cycles      123.45  TB
Test4   223 cycles      123.45  PB

MMX, 141 bytes:
Test0   170 cycles      123     bytes
Test1   206 cycles      123.44  MB
Test2   212 cycles      123.44  GB
Test3   216 cycles      123.44  TB
Test4   213 cycles      123.44  PB

Mirno · August 06, 2008, 05:23:02 PM

The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).
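
One integer-only reading of that idea, as a hedged Python sketch: Mirno's FPU version was not posted, so this is not his code, and the name `split_scaled` is hypothetical. It assumes an unsigned value below 2^64. The integer part comes from one shift, and the two decimals come from the next ten bits below it, so no floating-point formatting is needed. The decimals are truncated, not rounded.

```python
def split_scaled(n):
    """Return (whole, hundredths, unit_index) for an integer 0 <= n < 2**64."""
    # Index of the highest set bit, divided by 10, picks the unit (0 = bytes).
    unit = max(n.bit_length() - 1, 0) // 10
    shift = 10 * unit
    whole = n >> shift
    if unit == 0:
        return whole, 0, 0          # plain bytes have no fractional part
    # The ten bits just below the integer part are the fraction in 1/1024ths.
    frac10 = (n >> (shift - 10)) & 0x3FF
    hundredths = (frac10 * 100) >> 10   # truncating conversion to 1/100ths
    return whole, hundredths, unit


if __name__ == "__main__":
    whole, hundredths, unit = split_scaled(1536)
    print("%d.%02d (unit index %d)" % (whole, hundredths, unit))   # 1.50 (unit index 1)
```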

jj2007 · August 06, 2008, 05:55:23 PM

Quote from: Mirno on August 06, 2008, 05:23:02 PM
The floating point code I posted is pretty compact, but printing out a floating point value is killing it performance wise (crt_sprintf adds 4k clocks to it - ouch).

That's why I chose the little dwtoa hack :bg

Quote from: Mirno on August 06, 2008, 05:23:02 PM
I can probably do some messing around with the mantissa to get the decimal places out of the calculation separately from the integer part (using two fistp instructions, and a bit of jiggery-hackery - we're dealing with powers of two so it's just shifting the mantissa about a bit...).

I incorporated the two decimals into the complements table... I doubt there is a faster solution.

jj2007 · August 06, 2008, 06:02:10 PM

Quote from: qWord on August 06, 2008, 05:20:09 PM
there was some little failure in your code:

... test eax, 08000h @@: ...

You're a darling, qWord, thanxalot :thumbu

My timings on Celeron M:

BSR, 113 bytes:
Test0 225 cycles 123 bytes
Test1 251 cycles 123.45 MB
Test2 248 cycles 123.45 GB
Test3 248 cycles 123.45 TB
Test4 247 cycles 123.45 PB

MMX, 141 bytes:
Test0 189 cycles 123 bytes
Test1 234 cycles 123.44 MB
Test2 243 cycles 123.44 GB
Test3 243 cycles 123.44 TB
Test4 266 cycles 123.44 PB

Corrected version attached.

[attachment deleted by admin]

Mark_Larson · August 06, 2008, 09:25:23 PM

Quote from: jj2007 on August 06, 2008, 06:02:10 PM

Corrected version attached.

I just looked at your BSR code, and you do a lot more work than me to get the result. I don't even have one loop. Just two BSRs and one lookup table. I think I'll go ahead and post my buggy code, so that you can see a different approach.

The bug is in the floating point code.

My approach is to use BSR on the upper 32 bits. If they are zero, I use it on the lower 32 bits instead.

I use the bit index from BSR as an index into a lookup table to get ONE value to divide by. You can switch this to a multiply, of course; you just need to set up the lookup table with reciprocals. I was going to do that later.

Again, there is no looping, and there is only ONE conditional jump in the code. If you can find the bug in the floating point code, it should work. I didn't do extensive testing, but I did test 10 random 64-bit numbers.

Here is my data lookup table. I use the string of text to print as a lookup table as well. Since it's a lookup table, you can divide these values into 1.0 to flip them and then just do a multiply.

align 8

divide_values		dq	1
					dq	1024								
					dq	1024*1024
					dq	1024*1024*1024
					dq	1024*1024*1024*1024
					dq	1024*1024*1024*1024*1024
					dq	1024*1024*1024*1024*1024*1024
					dq	1024*1024*1024*1024*1024*1024*1024

String lookup table. The strings have to be exactly 4 bytes long for the code to work correctly: 3 characters plus a terminating zero byte. That fixed size lets me index the table directly with a scaled register later on.

string_size_table	db	"BBs",0
				db	"KBs",0
				db	"MBs",0
				db	"GBs",0
				db	"TBs",0
				db	"PBs",0
				db	"EBs",0
				db	"?Bs",0

I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.

	invoke	nseed,34521345
	invoke	nrandom,0ffffffffh
	mov		ebx,eax					;save in EBX
	invoke	nrandom,0ffffffffh
;got 64-bit number
	mov		edx,ebx

	pushad
	fn		crt_printf,"Hex Value: %.8X::%.8X%c%c", edx,eax, 13, 10
	popad
	
	;edx:eax already has 64-bit
	bsr		ebx,edx
	jz		lower_32_bits
	
	add		ebx,4*8				;say we are in the upper part of the table., *8 because we shift right 3 later
	
	jmp		@F
	
lower_32_bits:
	xor		ebx,ebx				;if the source operand is 0, BSR won't correctly update the register, so clear it first
	bsr		ebx,eax
	
@@:	
	shr		ebx,3			;divide by 1024 bits ( not bytes)
;	fn		crt_printf,"BSR: %d%c%c", ebx, 13, 10

;for debugging print the divide table entry we are using
	lea		ecx,[divide_values + ebx*8]	
	fn		crt_printf,"%I64X%c%c", ecx, 13, 10

	lea		ecx,[string_size_table + ebx*4]	
;for debugging print the string we are using
;	fn		crt_printf,"String: %s%c%c", ecx, 13, 10
	
	

	mov		dword ptr [fp],eax
	mov		dword ptr [fp+4],edx
	ffree	st(7)		;finit
	fild	[fp]					; st1
	fild	[divide_values + ebx*8] ; st0
.data?
align 8
fp		dq		?
.code
	fdivp	st(1),st(0)
	fstp	[fp]
;this is the only value that needs to be printed
	fn		crt_printf,"%f %s%c%c", [fp], ecx, 13, 10	
	
	Invoke	ExitProcess, 0

qWord · August 06, 2008, 10:02:23 PM

Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.

=> Zebibyte (ZiB) == 2^70 => doesn't matter because we are using 64 bit numbers :bg

regards, qWord
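
A quick check of that remark, as a small Python sketch (the names `PREFIXES` and `largest_unit_index` are made up for illustration): the highest bit of a 64-bit value is bit 63, which falls in the seventh group of ten bits, so the largest unit ever needed is EiB and a ZiB entry can never be reached.

```python
# Hypothetical prefix list: index k stands for 2**(10*k) bytes.
PREFIXES = ["", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"]


def largest_unit_index(bits=64):
    """Unit index of the largest value that fits in `bits` unsigned bits."""
    return (bits - 1) // 10


if __name__ == "__main__":
    k = largest_unit_index()
    print(PREFIXES[k] + "B")          # EiB
    print((2**64 - 1) >> 60)          # 15, so the maximum is just under 16 EiB
```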

Mark_Larson · August 06, 2008, 10:15:53 PM

Quote from: qWord on August 06, 2008, 10:02:23 PM
Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I didn't know what the value was after ExaBytes. So I used ?bytes as the last entry in the table.

=> Zebibyte (ZiB) == 2^70 => doesn't matter because we are using 64 bit numbers :bg

regards, qWord

Is there one between Zebi and Exa?

EDIT: never mind, I found it on Wikipedia

http://en.wikipedia.org/wiki/Exabyte

Mark

Mark_Larson · August 06, 2008, 10:51:42 PM

Found my bug. It wasn't floating point related. I needed a 3rd lookup table to convert from a bit count from the BSR to which offset to use in the divide table. I am still testing it, so I will post it later.
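
To illustrate why a separate bit-count table is needed, here is a hedged Python sketch (not Mark's code; `highest_bit`, `BIT_TO_UNIT`, `unit_index` and `shift_index` are hypothetical names). Each unit covers 10 bits, so the table offset is the bit index divided by 10. A shift right by 3 divides by 8 instead, which picks the wrong unit for many values.

```python
def highest_bit(n):
    """Index of the highest set bit, like BSR; n must be non-zero."""
    return n.bit_length() - 1


# Third lookup table: bit index (0..63) -> offset into the divide/string tables.
BIT_TO_UNIT = [bit // 10 for bit in range(64)]


def unit_index(n):
    """Table-driven mapping; 0 is treated as plain bytes."""
    return 0 if n == 0 else BIT_TO_UNIT[highest_bit(n)]


def shift_index(n):
    """The earlier mapping (bit index shifted right by 3), shown for comparison."""
    return 0 if n == 0 else highest_bit(n) >> 3


if __name__ == "__main__":
    # 256 bytes: highest bit is 8, so it is still in the "bytes" range.
    print(unit_index(256), shift_index(256))   # 0 1
```

Since 10 is not a power of two, no single shift gives the right offset, which is why a table (or a divide) is used here.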

Mark_Larson · August 06, 2008, 11:34:43 PM

I found one additional bug. I commented out all the DEBUG crt_printfs.

I changed my divide_values lookup table to REAL8 and I did 1/1024.0 so I could multiply. Again there is only one conditional branch and no looping.
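
The divide-to-multiply swap can be sanity-checked with a short Python sketch (an illustration under assumptions, not the posted code; `RECIPROCALS`, `SUFFIXES` and `format_size` are hypothetical names, and the value is treated as unsigned and below 2^64). Each table entry is a power of two, so multiplying by the reciprocal gives the same double as dividing.

```python
# Reciprocal table: entry k is 1 / 1024**k, an exact power of two.
RECIPROCALS = [2.0 ** (-10 * k) for k in range(7)]
SUFFIXES = ["BBs", "KBs", "MBs", "GBs", "TBs", "PBs", "EBs"]


def format_size(n):
    """Format an integer 0 <= n < 2**64 the way the "%f %s" call would."""
    k = 0 if n == 0 else (n.bit_length() - 1) // 10
    return "%f %s" % (n * RECIPROCALS[k], SUFFIXES[k])


if __name__ == "__main__":
    print(format_size(1536))   # 1.500000 KBs
```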

I didn't do extensive testing, so there still might be a bug or two lurking. I tested 5 different random numbers.

If you find any bugs, please let me know, thanks :)

The code is very small. 16 lines not including labels but including the crt_printf. Let me know if you find it useful.

.data

align 8

fp	REAL8      0.0


; 7 entries
divide_values		REAL8	1.0											;bytes
					REAL8	0.0009765625								;kilobytes
					REAL8	0.00000095367431640625						;megabytes
					REAL8	0.000000000931322574615478515625			;gigabytes
					REAL8	9.094947017729282379150390625e-13			;terabytes
					REAL8	8.8817841970012523233890533447266e-16		;petabytes
					REAL8	8.6736173798840354720596224069595e-19		;exabytes
					

;70 entries, 10 * 7
bit_count_table		dd	0, 0, 0, 0, 0, 0, 0, 0, 0, 0		; the 1st 10 bits belong to the 0th offset in the divide table
					dd	1, 1, 1, 1, 1, 1, 1, 1, 1, 1		; the 2nd 10 bits belong to the 1st offset in the divide table
					dd	2, 2, 2, 2, 2, 2, 2, 2, 2, 2
					dd	3, 3, 3, 3, 3, 3, 3, 3, 3, 3					
					dd	4, 4, 4, 4, 4, 4, 4, 4, 4, 4			
					dd	5, 5, 5, 5, 5, 5, 5, 5, 5, 5
					dd	6, 6, 6, 6, 6, 6, 6, 6, 6, 6
align 4

string_size_table	db	"BBs",0
					db	"KBs",0
					db	"MBs",0
					db	"GBs",0
					db	"TBs",0
					db	"PBs",0
					db	"EBs",0


.code

Start:
	;edx:eax already has 64-bit
	bsr		ebx,edx
	jz		lower_32_bits
	add		bl,32				;say we are in the UPPER 32-bits of the 64-bit value
	jmp		@F
lower_32_bits:
	xor		ebx,ebx				;if the source operand is 0, BSR won't correctly update the register, so clear it first
	bsr		ebx,eax
@@:	
	mov		ebx, dword ptr [bit_count_table + ebx*4]
;ecx used later, needs to have this value for the crt_printf
	lea		ecx,[string_size_table + ebx*4]	
	mov		dword ptr [fp],eax
	mov		dword ptr [fp+4],edx
	ffree	st(7)		;finit
	fild	qword ptr [fp]			; st1 (explicit integer size, since fp is declared REAL8)
	fld		[divide_values + ebx*8] ; st0
	fmulp	st(1),st(0)
	fstp	[fp]
;this is the only value that needs to be printed
	fn		crt_printf,"%f %s%c%c", [fp], ecx, 13, 10

Mark_Larson · August 06, 2008, 11:43:38 PM

I used Michael's timing macros. For greater accuracy I commented out the crt_printf, since we are primarily concerned with how we calculate the data for it. I did 100,000,000 loops and set the priority level to REALTIME. The code ran in 20 cycles on my Core 2 Duo.
