
decimal as KiB, MiB, GiB etc.

Started by sinsi, July 29, 2008, 12:38:02 PM


sinsi

"mov al, [ecx+3]   ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on August 04, 2008, 09:59:42 AM
"mov al, [ecx+3]   ; take the exponent" - that's not taking bits 30 to 23 is it? 31/30/29/28/27/26/25/24, you're missing a little bit... :bg

You make a lot of fuss about one little bit! :toothy

  .data
MyQword dq 12345678
MyReal4 REAL4 0.0
  .code
  int 3 ; let Olly say Hi
  mov esi, offset MyQword
  fild MyQword ; convert quad
  fst MyReal4 ; to real4

  mov ecx, offset MyReal4
  mov eax, [ecx] ; take the full number
  sar eax, 7+16 ; take bits 30 down to 23
  sub al, 127 ; sub 127
  aam ; divide al by 10
  add al, 127 ; add 127 to the remainder
  sal eax, 7+16 ; shift left
  mov edx, [ecx] ; take original, and free the exponent slot
  and edx, 10000000011111111111111111111111b
  add edx, eax ; move bits 30 .. 23 to their old positions
  mov [ecx], edx ; replace the exponent
  fld MyReal4 ; show it in Olly

Mirno

Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...

I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.

You also want to save the result of the divide (using aam - ah, using div edx) as it's the index into your array of strings.

Mirno
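
For readers who prefer to see the arithmetic outside of assembly, here is a minimal C sketch of the shift-free idea described above. It assumes an IEEE 754 single-precision float, a positive value of at least 1.0, and a helper name (`split_units`) that is purely illustrative and not from the thread. Dividing the unbiased bit pattern by (10 SHL 23) yields the exponent divided by 10 as the quotient, while the remainder keeps the exponent modulo 10 together with the untouched mantissa.

```c
#include <stdint.h>
#include <string.h>

/* Split a positive float >= 1.0 into a unit index (0 = bytes, 1 = KiB, ...)
   and a value scaled into [1, 1024), working on the raw REAL4 bits.
   Assumption: float is IEEE 754 binary32 and the sign bit is clear. */
unsigned split_units(float value, float *scaled)
{
    const uint32_t bias = 127u << 23;   /* 1065353216, exponent bias in place */
    const uint32_t ten  = 10u << 23;    /* 83886080, ten exponent steps */
    uint32_t bits;

    memcpy(&bits, &value, sizeof bits); /* read the bit pattern */
    bits -= bias;                       /* exponent field is now unbiased */
    uint32_t unit = bits / ten;         /* exponent / 10 = power of 1024 */
    bits = bits % ten + bias;           /* exponent mod 10, mantissa untouched */
    memcpy(scaled, &bits, sizeof bits); /* write back as a float */
    return unit;
}
```

With the thread's sample value 12345678, the exponent is 23, so the unit index comes out as 2 (MiB) and the scaled value is 12345678 / 2^20, roughly 11.77.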

jj2007

Will look into it. I wonder if there is a faster way to convert qw to real4 and vice versa... CVTDQ2PS and CVTTPS2DQ are SSE2, and not particularly handy.

GregL

jj2007,

This is equivalent to the last code you posted.


.DATA

    MyQword QWORD 12345678
    MB      DWORD 1024*1024
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fidiv MB
    fstp MyReal4


Faster.


.DATA

    MyQword QWORD 12345678
    MB      REAL8 0.00000095367431640625
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fmul MB
    fstp MyReal4



Later: I just looked at MichaelW's macro, it's pretty much the same code at the core.
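
The two variants above give identical results because the constant is exactly 2^-20, which a REAL8 represents without rounding. The following C sketch only illustrates that equivalence; the function names are made up for the example, and it says nothing about the relative speed of FIDIV and FMUL.

```c
/* Scale a byte count to MiB by dividing, as the FIDIV version does. */
double to_mib_div(long long bytes)
{
    return (double)bytes / 1048576.0;
}

/* Same scaling by multiplying with the reciprocal, as the FMUL version does.
   0.00000095367431640625 is exactly 2^-20, so only the exponent changes
   and no extra rounding error is introduced. */
double to_mib_mul(long long bytes)
{
    return (double)bytes * 0.00000095367431640625;
}
```

The same trick does not carry over unchanged to divisors that are not powers of two (1000, for instance), since their reciprocals are not exactly representable.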




jj2007

Quote from: Mirno on August 04, 2008, 04:25:56 PM
Note that you aren't masking out bit 31 either, not that it should matter as you shouldn't be able to load negative values (bit 31 is the sign bit on a REAL4)...

I wondered about using aam myself, but I'm not convinced the hit from the partial register stalls will be worth the saving.
Also if you divide by (10 SHL 23), and subtract/add (127 SHL 23) you can get rid of the shifts.

You also want to save the result of the divide (using aam - ah, using div edx) as it's the index into your array of strings.

Mirno

Thanxalot for the good ideas. New version is attached; there is still the UseMMX option to revert to the old code. Size is 136 bytes for the MMX version and 132 bytes for the Mirno version.

[attachment deleted by admin]

jj2007

Quote from: Greg on August 04, 2008, 10:09:39 PM
jj2007,

This is equivalent to the last code you posted.


.DATA

    MyQword QWORD 12345678
    MB      DWORD 1024*1024
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fidiv MB
    fstp MyReal4


Faster.


.DATA

    MyQword QWORD 12345678
    MB      REAL8 0.00000095367431640625
    MyReal4 REAL4 0.0

.CODE

    fild MyQword
    fmul MB
    fstp MyReal4


Later: I just looked at MichaelW's macro, it's pretty much the same code at the core.


What you write is technically correct but no longer relevant... I don't divide reals:

ShowQw proc uses edi esi ebx pNum:DWORD
LOCAL TmpR4:REAL4
  ffree st(7) ; instead of expensive finit
  mov esi, pNum ; pointer to our 8-byte number
  lea edi, TmpR4
  fild qword ptr [esi] ; convert quad (qword, since pNum points to an 8-byte number)
  fstp dword ptr [edi] ; to real4
  xor ebx, ebx
  mov eax, [edi] ; our number as real4 - if it's zero, better do nothing
  .if eax
    mov ebx, 1065353216 ; 127 shl 23
    sub eax, ebx ; remove the exponent bias

    mov ecx, 83886080 ; 10 shl 23
    xor edx, edx ; quite useful before a divide ;-)
    div ecx ; eax = exponent/10 (unit), edx = remainder with mantissa

    add edx, ebx ; 127 shl 23
    mov ebx, eax ; ebx holds unit
    mov [edi], edx ; replace with new number (exponent in bits 23...30)
    fld dword ptr [edi] ; push Real4 on FPU
    push 100 ; always reserve the stack slot, so the pop below stays balanced
    .if ebx
      fimul dword ptr [esp] ; mul ST (0), 100
    .endif
    fistp dword ptr [esp] ; pop result from FPU stack to CPU stack
    pop eax ; correct the CPU stack, and get the result
  .endif

  mov esi, offset FlexByteBuffer
  invoke dwtoa, eax, esi
  mov edx, offset t0 ; default is bytes
  .if ebx ; add the dot if it's more than bytes
    mov edx, [esi+len(esi)-2] ; len returns eax, so we use ecx as accu
    mov cl, "."
    mov [esi+eax-2], cl
    mov [esi+eax-1], edx
    lea edx, [t1+4*ebx-4] ; our unit
  .endif
  invoke lstrcat, esi, edx ; add the xByte unit
  ret
ShowQw endp


But thanks anyway, Greg.
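
The string handling at the end of ShowQw can be sketched in C as follows. This is only an illustration under assumptions: the contents of the t0/t1 unit strings are not shown in the thread, so the `units` table here is hypothetical, as is the function name. The idea is the same as in the proc: the number arrives as an integer holding the value times 100, and the dot is inserted before the last two digits.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical unit table; the thread's t0/t1 strings are not shown. */
const char *const units[] = { " bytes", " KiB", " MiB", " GiB", " TiB" };

/* Format an integer that holds the value times 100 (unit > 0) or the plain
   byte count (unit == 0). buf must hold at least 24 bytes and unit must be
   a valid index into the table above. */
void format_size(unsigned scaled, unsigned unit, char *buf)
{
    int len = sprintf(buf, "%u", scaled);
    if (unit > 0 && len >= 3) {
        /* shift the last two digits and the terminator right by one */
        memmove(buf + len - 1, buf + len - 2, 3);
        buf[len - 2] = '.';             /* put the dot in the freed slot */
    }
    strcat(buf, units[unit]);           /* add the xByte unit */
}
```

For example, a scaled value of 1177 with unit index 2 gives "11.77 MiB", and 512 with unit index 0 gives "512 bytes".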

Mark_Larson

Quote from: jj2007 on July 31, 2008, 09:10:21 PM
Quote from: Mark_Larson on July 31, 2008, 03:35:14 PM
There is an X86 instruction you can use to automatically scan for the highest bit that is set in the integer.  It is called BSR ( bit scan reverse).

Seems to be incredibly slow - 103 cycles for a single instruction??

BSR- Bit Scan Reverse  (386+)
        Usage:  BSR     dest,src
        Modifies flags: ZF
        Scans source operand for first bit set.  Sets ZF if the source
        operand is zero (the destination is then undefined); otherwise
        clears ZF and loads the destination with the index of the first
        set bit.  BSF scans forward across bit pattern (0-n) while BSR
        scans in reverse (n-0).
                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg,reg           -     -   10+3n  6-103          3
        reg,mem           -     -   10+3n  7-104         3-7
        reg32,reg32       -     -   10+3n  6-103         3-7
        reg32,mem32       -     -   10+3n  7-104         3-7

  it is significantly faster on P4s.  I think it is in the 1-2 cycle range.  The timing you got was for a 486.
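
To make the link between BSR and the unit explicit, here is a small C sketch. It is an assumption-laden illustration rather than anyone's code from this thread: the function names are invented, and a portable loop stands in for the BSR instruction (on a 32-bit CPU a qword would need BSR on the high dword first, then on the low dword). Since each unit step is a factor of 1024 = 2^10, the unit index is simply the index of the highest set bit divided by 10.

```c
#include <stdint.h>

/* Index of the highest set bit (the value BSR would deliver), or -1 for zero. */
int highest_bit(uint64_t n)
{
    int index = -1;
    while (n != 0) {
        n >>= 1;
        ++index;
    }
    return index;
}

/* 0 = bytes, 1 = KiB, 2 = MiB, ... */
int unit_index(uint64_t n)
{
    int bit = highest_bit(n);
    return bit < 0 ? 0 : bit / 10;
}
```

For 12345678 the highest set bit is bit 23, so the unit index is 2 (MiB); 1023 stays at 0 and 1024 moves to 1.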
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

jj2007

Quote from: Mark_Larson on August 05, 2008, 03:32:11 PM
  it is significantly faster on P4s.  I think it is in the 1-2 cycle range.  The timing you got was for a 486.

In the meantime, we found another solution, but good to know anyway :bg

Is there a more up-to-date free alternative to opcodes.hlp?

Mirno

Agner Fog has some docs on his site - they're pretty useful but more in-depth (covering out-of-order execution, different engines being able to execute different micro-ops, micro-op fusion, latency, and suchlike) than the good old days when instruction X took Y clock cycles no matter what.

http://www.agner.org/optimize/#manuals

jj2007

Quote from: Mirno on August 05, 2008, 04:21:14 PM
Agner Fog has some docs on his site

Fascinating reading indeed, but often I simply need to find an appropriate opcode, plus some basic info about size and latency. Opcodes.hlp is really handy but apparently very outdated...

GregL

Quote from: jj2007
What you write is technically correct but no longer relevant...

I fail to see how it's no longer relevant.


jj2007

Quote from: Greg on August 05, 2008, 07:20:35 PM
Quote from: jj2007
What you write is technically correct but no longer relevant...

I fail to see how it's no longer relevant.


Greg, replacing a slow div with a faster mul complement will always be relevant, but in the meantime the code took another road, the one proposed by Mirno, and therefore does not need divs any more. If somebody volunteers to add the bsr variant, we might need it again, of course. Thanks anyway  :thumbu

GregL

OK, here's my crack at using BSR and the FPU for this.

edit: fixed a little problem, new file.
edit: spiffed up a little more, new file.
edit: a little faster, new file
edit: reversed the '.IF hibit' code to test for smallest first

[attachment deleted by admin]

jj2007

Quote from: Greg on August 05, 2008, 11:09:49 PM
OK, here's my crack at using BSR and the FPU for this.

You are faster than the police allows! Here is mine...


[attachment deleted by admin]