decimal as KiB, MiB, GiB etc.

sinsi · August 04, 2008, 09:59:42 AM

"mov al, [ecx+3] ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg

jj2007 · August 04, 2008, 11:14:14 AM

Quote from: sinsi on August 04, 2008, 09:59:42 AM
"mov al, [ecx+3] ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg

You make a lot of fuss about one little bit! :toothy

Code Select

  .data
MyQword	dq 12345678
MyReal4	REAL4 0.0
  .code
  int 3			; let Olly say Hi
  mov esi, offset MyQword
  fild MyQword		; convert quad
  fst MyReal4		; to real4

  mov ecx, offset MyReal4
  mov eax, [ecx]	; take the full number
  sar eax, 7+16		; take bits 30 down to 23
  sub al, 127		; sub 127
  aam			; divide al by 10
  add al, 127		; add 127 to the remainder
  sal eax, 7+16		; shift left
  mov edx, [ecx]	; take original, and free the exponent slot
  and edx, 10000000011111111111111111111111b
  add edx, eax		; move bits 30 .. 23 to their old positions
  mov [ecx], edx	; replace the exponent
  fld MyReal4		; show it in Olly
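
To make the "one little bit" concrete, here is a small Python sketch (not from the original posts; the helper names are made up for illustration) that compares what the top byte of a REAL4 holds with the actual exponent field in bits 30..23. It assumes the usual little-endian IEEE 754 single-precision layout.

```python
import struct

def real4_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def top_byte(bits: int) -> int:
    """What `mov al, [ecx+3]` reads: bits 31..24 (sign plus upper 7 exponent bits)."""
    return (bits >> 24) & 0xFF

def exponent_field(bits: int) -> int:
    """The biased exponent proper: bits 30..23, as obtained by shifting right 23."""
    return (bits >> 23) & 0xFF

bits = real4_bits(12345678.0)
# Bit pattern, top byte, biased exponent, unbiased exponent
print(hex(bits), top_byte(bits), exponent_field(bits), exponent_field(bits) - 127)
```

For 12345678 the top byte is only half of the biased exponent (the lowest exponent bit lives in the next byte down), which is why the shift by 7+16 is needed.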

Mirno · August 04, 2008, 04:25:56 PM

Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...

I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.

You also want to save the quotient of the divide (with aam it is in ah; with div it is in eax), as it's the index into your array of strings.

Mirno
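
The shift-free variant suggested above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not anybody's posted code: `split_real4`, `BIAS` and `TEN` are names invented here, and the conversion goes through Python's `struct` module instead of the FPU.

```python
import struct

BIAS = 127 << 23   # 1065353216: the exponent bias, already in bit position 23
TEN = 10 << 23     # 83886080: ten exponent steps (a factor of 1024) in the same position

def split_real4(value: int):
    """Return (unit_index, scaled) so that value is about scaled * 1024**unit_index."""
    bits = struct.unpack("<I", struct.pack("<f", float(value)))[0]
    if bits == 0:                      # zero has no exponent to split
        return 0, 0.0
    # One integer divide: the quotient is exponent // 10, the remainder keeps
    # exponent % 10 in bits 30..23 together with the untouched mantissa.
    unit, rest = divmod(bits - BIAS, TEN)
    scaled = struct.unpack("<f", struct.pack("<I", rest + BIAS))[0]
    return unit, scaled

print(split_real4(12345678))   # unit index 2, scaled value in the range 1..1024
```

The quotient is what would index the array of unit strings; the remainder, with the bias added back, is again a valid REAL4 between 1 and 1024.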

jj2007 · August 04, 2008, 08:08:35 PM

Will look into it. I wonder if there is a faster way to convert qw to real4 and vice versa... CVTDQ2PS and CVTTPS2DQ are SSE2, and not particularly handy.

GregL · August 04, 2008, 10:09:39 PM

jj2007,

This is equivalent to the last code you posted.

Code Select


.DATA

    MyQword QWORD 12345678
    MB      DWORD 1024*1024
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fidiv MB
    fstp MyReal4

Faster.

Code Select


.DATA

    MyQword QWORD 12345678
    MB      REAL8 0.00000095367431640625
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fmul MB
    fstp MyReal4

Later: I just looked at MichaelW's macro, it's pretty much the same code at the core.
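
As a quick sanity check of the constant, the following Python sketch (function names are illustrative only) confirms that 0.00000095367431640625 is exactly 1/(1024*1024), so multiplying by it gives the same result as dividing by MB. Because the divisor is a power of two, the reciprocal is exactly representable; for other divisors the two results could differ in the last bit.

```python
MB = 1024 * 1024
RECIP = 0.00000095367431640625   # the REAL8 constant from the post, i.e. 2**-20

def to_mib_div(n: int) -> float:
    """Scale by dividing, like fidiv MB."""
    return n / MB

def to_mib_mul(n: int) -> float:
    """Scale by multiplying with the reciprocal, like fmul MB."""
    return n * RECIP

print(RECIP == 1.0 / MB, to_mib_div(12345678), to_mib_mul(12345678))
```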

jj2007 · August 04, 2008, 10:51:36 PM

Quote from: Mirno on August 04, 2008, 04:25:56 PM
Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...

I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.

You also want to save the quotient of the divide (with aam it is in ah; with div it is in eax), as it's the index into your array of strings.

Mirno

Thanxalot for the good ideas. New version is attached; there is still the UseMMX option to revert to the old code. Size is 136 bytes for the MMX and 132 bytes for the Mirno version.

[attachment deleted by admin]

jj2007 · August 04, 2008, 11:11:58 PM

Quote from: Greg on August 04, 2008, 10:09:39 PM
jj2007,

This is equivalent to the last code you posted.

Code Select

.DATA

    MyQword QWORD 12345678
    MB      DWORD 1024*1024
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fidiv MB
    fstp MyReal4

Faster.

Code Select

.DATA

    MyQword QWORD 12345678
    MB      REAL8 0.00000095367431640625
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fmul MB
    fstp MyReal4

Later: I just looked at MichaelW's macro, it's pretty much the same code at the core.

What you write is technically correct but no longer relevant... I don't divide reals:

Code Select

ShowQw proc uses edi esi ebx pNum:DWORD
LOCAL TmpR4:REAL4
  ffree st(7)		; instead of expensive finit
  mov esi, pNum		; pointer to our 8-byte number
  lea edi, TmpR4
  fild qword ptr [esi]	; convert quad (qword ptr, so all 8 bytes are loaded)
  fstp dword ptr [edi]	; to real4
  xor ebx, ebx
  mov eax, [edi] 		; our number as real4 - if it's zero, better do nothing
  .if eax
	mov ebx, 1065353216	; 127 shl 23
	sub eax, ebx

	mov ecx, 83886080	; 10 shl 23
	xor edx, edx		; quite useful before a divide ;-)
	div ecx			; divide eax by ecx

	add edx, ebx		; 127 shl 23
	mov ebx, eax		; ebx holds unit
	mov [edi], edx		; replace with new number (exponent in bits 23...30)
	fld dword ptr [edi]	; push Real4 on FPU
	push 100				; always reserve a stack slot, holding the factor 100
	.if ebx
		fimul dword ptr [esp]		; mul ST (0), 100
	.endif
	fistp dword ptr [esp]	; pop result from FPU stack into the reserved slot
	pop eax		; rebalance the CPU stack, and get the result
  .endif

  mov esi, offset FlexByteBuffer
  invoke dwtoa, eax, esi
  mov edx, offset t0				; default is bytes
  .if ebx							; add the dot if it's more than bytes
	mov edx, [esi+len(esi)-2]	; len returns eax; edx takes the last two digits plus terminator
	mov cl, "."
	mov [esi+eax-2], cl
	mov [esi+eax-1], edx
	lea edx, [t1+4*ebx-4]				; our unit
  .endif
  invoke lstrcat, esi, edx			; add the xByte unit
  ret
ShowQw endp

But thanks anyway, Greg.
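
The formatting half of ShowQw (multiply by 100, convert to an integer string, then push a dot in front of the last two digits) can be sketched in Python as follows. This is an illustrative model only: `format_scaled` and the `UNITS` labels are assumptions, since the thread does not show the t0/t1 string table, and the space before the unit is a guess as well.

```python
# Assumed unit labels; the thread's t0/t1 table is not shown.
UNITS = ["bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]

def format_scaled(scaled: float, unit: int) -> str:
    """Format a value already scaled to 1..1024 together with its unit index."""
    if unit == 0:
        # Plain bytes: no decimals, no dot.
        return f"{round(scaled)} {UNITS[0]}"
    # Like fimul by 100 followed by fistp: an integer number of hundredths.
    # Python's round() rounds halves to even, as the FPU does by default.
    digits = str(round(scaled * 100))
    # Like the mov [esi+eax-2], "." trick: the dot goes before the last two digits.
    return f"{digits[:-2]}.{digits[-2:]} {UNITS[unit]}"

print(format_scaled(12345678 / 1048576, 2))
```

Because the scaled value is at least 1 whenever the unit index is non-zero, the digit string always has three or more characters, so there is always something in front of the dot.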

Mark_Larson · August 05, 2008, 03:32:11 PM

Quote from: jj2007 on July 31, 2008, 09:10:21 PM
Quote from: Mark_Larson on July 31, 2008, 03:35:14 PM
There is an x86 instruction you can use to automatically scan for the highest bit that is set in the integer. It is called BSR (bit scan reverse).

Seems to be incredibly slow - 103 cycles for a single instruction??

BSR- Bit Scan Reverse (386+)
Usage: BSR dest,src
Modifies flags: ZF
Scans source operand for first bit set. Clears ZF if a bit is found
set and loads the destination with an index to first set bit. Sets
ZF if no bits are found set. BSF scans forward across bit pattern
(0-n) while BSR scans in reverse (n-0).
                         Clocks                  Size
Operands        808x   286   386      486        Bytes

reg,reg          -      -    10+3n    6-103      3
reg,mem          -      -    10+3n    7-104      3-7
reg32,reg32      -      -    10+3n    6-103      3-7
reg32,mem32      -      -    10+3n    7-104      3-7

it is significantly faster on P4s. I think it is in the 1-2 cycle range. The timing you got was for a 486.
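
For readers who want to see how BSR would fit this task, here is a hedged Python model (the function names are invented for this sketch): the index of the highest set bit, divided by 10, gives the same unit index that the exponent trick produces, because each unit step is a factor of 2^10.

```python
def bsr(n: int) -> int:
    """Index of the highest set bit, like BSR; a zero source is rejected here,
    whereas the instruction sets ZF and leaves the destination undefined."""
    if n <= 0:
        raise ValueError("bsr needs a non-zero, positive source")
    return n.bit_length() - 1

def unit_index(n: int) -> int:
    """0 = bytes, 1 = KiB, 2 = MiB, ... (each step is ten bit positions)."""
    return bsr(n) // 10 if n else 0

print(bsr(12345678), unit_index(12345678))
```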

jj2007 · August 05, 2008, 03:51:03 PM

Quote from: Mark_Larson on August 05, 2008, 03:32:11 PM
it is significantly faster on P4s. I think it is in the 1-2 cycle range. The timing you got was for a 486.

In the meantime, we found another solution, but good to know anyway :bg

Is there a more up-to-date free alternative to opcodes.hlp?

Mirno · August 05, 2008, 04:21:14 PM

Agner Fog has some docs on his site - they're pretty useful but more in-depth (given out-of-order execution, different engines being able to execute different micro-ops, micro-op fusion, latency, and suchlike) than the good old days when instruction X took Y clock cycles no matter what.

http://www.agner.org/optimize/#manuals

jj2007 · August 05, 2008, 07:05:22 PM

Quote from: Mirno on August 05, 2008, 04:21:14 PM
Agner Fog has some docs on his site

Fascinating reading indeed, but often I simply need to find an appropriate opcode, plus some basic info about size and latency. Opcodes.hlp is really handy but apparently very outdated...

GregL · August 05, 2008, 07:20:35 PM

Quote from: jj2007
What you write is technically correct but no longer relevant...

I fail to see how it's no longer relevant.

jj2007 · August 05, 2008, 08:14:28 PM

Quote from: Greg on August 05, 2008, 07:20:35 PM
Quote from: jj2007
What you write is technically correct but no longer relevant...

I fail to see how it's no longer relevant.

Greg, replacing a slow div with a faster mul by the reciprocal will always be relevant, but in the meantime the code took another road, the one proposed by Mirno, and therefore does not divide reals any more. If somebody volunteers to add the bsr variant, we might need it again, of course. Thanks anyway :thumbu

GregL · August 05, 2008, 11:09:49 PM

OK, here's my crack at using BSR and the FPU for this.

edit: fixed a little problem, new file.
edit: spiffed up a little more, new file.
edit: a little faster, new file
edit: reversed the '.IF hibit' code to test for smallest first

[attachment deleted by admin]

jj2007 · August 05, 2008, 11:11:47 PM

Quote from: Greg on August 05, 2008, 11:09:49 PM
OK, here's my crack at using BSR and the FPU for this.

You are faster than the police allows! Here is mine...

[attachment deleted by admin]
