decimal as KiB, MiB, GiB etc.

Started by sinsi, July 29, 2008, 12:38:02 PM

jj2007

Quote from: Greg on August 07, 2008, 08:17:36 PM
Microsoft has pretty much decided 80-bit extended precision variables don't exist. They removed them from their compilers, they say for compatibility with other CPUs (PowerPC etc.).  Big mistake if you ask me. They could have kept them for x86, but they took the easy way out. Other C++ compilers support them, like Borland and Intel.
I agree. The gain in speed and space is not that significant. There is a speed penalty for 80-bit mem to FPU transfers (as compared to real4 and real8), therefore for my roll-your-own FPU lib, I will try to use 64 bit memory variables, keep internal full 80 bit precision, and do as many things as possible inside the FPU without shoving values to memory.

GregL

jj2007,

I think you missed my point: I think it was a mistake for Microsoft to get rid of them (80-bit extended precision variables).

It all depends on what you are doing. If you want accuracy and precision REAL10 is the way to go. If you want your library to be able to be called from Microsoft C/C++, then you need to use REAL8 (or REAL4) variables. If you are writing fast graphics you would probably use REAL4 variables.

I like having the choice to use REAL10 variables if I want to.
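
For anyone who wants to check what a particular C compiler does with long double, here is a minimal sketch in C (not MASM). The function names are made up for illustration. With Microsoft's compiler, long double has the same 53-bit significand as double, while GCC targeting x86 typically reports 64, i.e. the REAL10 format.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 when long double carries the 64-bit significand of the x87
   extended format (REAL10 in MASM terms), 0 otherwise.
   Hypothetical helper name, not part of any library. */
int long_double_is_extended(void)
{
    return LDBL_MANT_DIG == 64;
}

/* Prints the significand width of each floating-point type.
   The *_MANT_DIG macros count base-FLT_RADIX digits, which are bits on x86. */
void print_float_formats(void)
{
    printf("float:       %d bits\n", FLT_MANT_DIG);
    printf("double:      %d bits\n", DBL_MANT_DIG);
    printf("long double: %d bits\n", LDBL_MANT_DIG);
}
```

Calling print_float_formats() from main shows at a glance whether REAL10-sized variables are available from C on that toolchain.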



Mark_Larson

Quote from: Greg on August 07, 2008, 08:23:20 PM
Mark,

MichealW's macros will display milliseconds. For different units of time it sounds like a good idea.

Quote: My latest code will be the fastest.

I did some timing on my code. The last code I posted here is pretty slow (but it is short and concise). The previous code I posted here does pretty well, I get about 19 cycles.



Time my latest code so you can get a comparison.

I'm switching to SSE2 and trying that instead :) It will be faster on Intel P4 and later CPUs.

What type and speed of CPU do you have, Greg?
Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

jj2007

Quote from: Greg on August 07, 2008, 08:23:20 PM
Mark,

MichealW's macros will display milliseconds. For different units of time it sounds like a good idea.

Quote: My latest code will be the fastest.

I did some timing on my code. The last code I posted here is pretty slow (but it is short and concise). The previous code I posted here does pretty well, I get about 19 cycles.


I tried to incorporate your code - no full success because I used dwtoa for a fixed buffer, while you use crt_printf with a chr$; but when disabling the print part, I get these timings:

BSR, 122 bytes:
Test0   28 cycles
Test1   43 cycles
Test2   40 cycles
Test3   40 cycles
Test4   39 cycles

MMX, 150 bytes:
Test0   31 cycles
Test1   38 cycles
Test2   46 cycles
Test3   53 cycles
Test4   57 cycles

Greg, 221 bytes:
Test0   36 cycles
Test1   35 cycles
Test2   32 cycles
Test3   29 cycles
Test4   26 cycles



jj2007

Quote from: Greg on August 07, 2008, 11:02:16 PM
I think you missed my point: I think it was a mistake for Microsoft to get rid of them (80-bit extended precision variables).

It all depends on what you are doing. If you want accuracy and precision REAL10 is the way to go. If you want your library to be able to be called from Microsoft C/C++, then you need to use REAL8 (or REAL4) variables. If you are writing fast graphics you would probably use REAL4 variables.

I like having the choice to use REAL10 variables if I want to.

No disagreement, Greg - you should indeed have the choice. But for my own use, I will default to REAL8 because it's the best compromise. Inaccuracies creep in if you repeatedly convert FPU values to lower-precision memory variables - but that can be avoided by keeping the values inside the FPU, at 80 bits of precision, while doing complex calculations. After all, there are eight handy registers...
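
The point about inaccuracies creeping in can be illustrated with a small sketch in C (not MASM). It is only an analogy: a float plays the role of the lower-precision memory variable and a double stands in for the wider FPU register, since 80-bit variables are not available in every C compiler. Both function names are hypothetical.

```c
/* Adds `step` to a running total `count` times, storing the total back into
   a REAL4-sized variable after every addition, so it is rounded each time. */
float sum_stored_as_float(float step, long count)
{
    volatile float total = 0.0f;   /* volatile forces a real store per step */
    for (long i = 0; i < count; i++)
        total = total + step;
    return total;
}

/* Same additions, but the running total stays in a wider type and is
   narrowed only by the caller, if at all. */
double sum_kept_wide(float step, long count)
{
    double total = 0.0;
    for (long i = 0; i < count; i++)
        total += (double)step;
    return total;
}
```

Summing 0.1f a million times with both functions and comparing each result against 100000 shows the float version drifting much further from the expected value than the wide version.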

EDIT: Just checked with fphelp.hlp my statement about speed penalty for using REAL10, and while it's correct for FLD, t=6:

FLD memreal  | fld   longreal         | 486 s=3,l=3,t=6

.. it seems incorrect for FSTP, where l=8 is slowest:
FSTP memreal | fstp   longreal       | 486 s=7,l=8,t=6
             | fstp   tempreals[bx]  | 486 s=7,l=8,t=6

What does the [bx] variant mean?

GregL

Mark,

I timed your code; I'm getting around 13 cycles for your latest code (with the bit_count_table), so it's definitely faster. I have a Pentium D 940.


jj2007

Quote from: Mark_Larson on August 06, 2008, 09:25:23 PM
I just looked at your BSR code, and you do a lot more work than me to get the result.  I don't even have one loop.  Just two BSRs and one lookup table.

Nor do I. Check the file preceding your post for ShowQwBsr. It looks pretty similar. The main difference is that I use a REAL4 table that already has the multiplication by 100 incorporated. For showing two decimals, REAL4 is already overkill...

.data
qwEBi REAL4 8.67361737988404E-17
qwPBi REAL4 8.88178419700125E-14
qwTBi REAL4 9.09494701772928E-11
qwGBi REAL4 9.31322574615479E-8
qwMBi REAL4 9.5367431640625E-05
qwKBi REAL4 0.09765625
t6 db 9, "EB", 0
t5 db 9, "PB", 0
t4 db 9, "TB", 0
t3 db 9, "GB", 0
t2 db 9, "MB", 0
t1 db 9, "kB", 0
t0 db 9, "bytes", 0


ShowQwBsr proc uses edi esi ebx pNum:DWORD
  mov esi, pNum ; pointer to our 8-byte number
  mov edi, offset t0 ; default is bytes
  mov ebx, [esi+4] ; test HiDword
  bsr eax, ebx ; get first bit from the left
  jnz H1 ; HiDword is set
  mov ebx, [esi] ; test LoDword
  bsr eax, ebx ; get first bit from left
  jz IsZero ; number is zero, special treatment ;-)
  jmp H0

H1:
  add eax, 32 ; HiDword was set, so add 32 to the bit count

H0:
  aam ; divide by 10, result in ah, remainder in al
  movzx ecx, ah ; counter
  mov ebx, ecx ; copy of counter
  sal ecx, 2 ; *4
  sub edi, ecx ; pointer to the unit and the R4 complements
  push [esi] ; for bytes: push the loword only
  .if ebx
    ffree st(7) ; instead of an "expensive" finit
    fild qword ptr [esi] ; push quad on FPU, and multiply with complements*100
    fmul dword ptr [edi-24] ; to simulate div 1024 and 2 decimals
    fistp dword ptr [esp] ; pop result from FPU stack to CPU stack
  .endif
  pop eax ; our number as an integer
IsZero:
  mov esi, offset FlexByteBuffer ; a dedicated buffer to display number and unit
  invoke dwtoa, eax, esi
  .if ebx ; add the dot if it's more than bytes
    mov edx, [esi+len(esi)-2] ; len returns eax, therefore
    mov cl, "." ; we use ecx/cl as accu
    mov [esi+eax-2], cl ; insert the dot
    mov [esi+eax-1], edx ; and add the two decimals copied above
  .endif
  invoke lstrcat, esi, edi ; add the xByte unit
  ret
ShowQwBsr endp
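
For readers who don't follow the FPU code, the same idea can be sketched in C. This is not a translation of ShowQwBsr, just a minimal illustration under some assumptions: a portable loop stands in for BSR, the 100/1024^k factors are computed instead of read from a REAL4 table, a space is used instead of a tab before the unit, and halves are rounded up, whereas FISTP uses the FPU's current rounding mode. The names highest_bit and format_bytes are made up.

```c
#include <stdint.h>
#include <stdio.h>

/* Index of the highest set bit, i.e. what BSR would return.
   Returns 0 for n == 0 as well as for n == 1. */
int highest_bit(uint64_t n)
{
    int bit = 0;
    while (n >>= 1)
        bit++;
    return bit;
}

/* Writes n into buf as e.g. "1.50 kB" and returns buf. */
char *format_bytes(uint64_t n, char *buf, size_t size)
{
    static const char *units[] = { "bytes", "kB", "MB", "GB", "TB", "PB", "EB" };

    /* Highest bit divided by 10 selects the unit: bits 0-9 are bytes,
       bits 10-19 are kB, and so on up to bits 60-63 for EB. */
    int k = highest_bit(n) / 10;

    if (k == 0) {
        snprintf(buf, size, "%llu bytes", (unsigned long long)n);
        return buf;
    }

    /* Scale by 100/1024^k so the integer result carries two decimals. */
    double hundredths = (double)n * 100.0 / (double)(1ULL << (10 * k));
    unsigned long long scaled = (unsigned long long)(hundredths + 0.5);

    snprintf(buf, size, "%llu.%02llu %s", scaled / 100, scaled % 100, units[k]);
    return buf;
}
```

With a 32-byte buffer, format_bytes(1536, buf, sizeof buf) yields "1.50 kB" and format_bytes(1023, buf, sizeof buf) yields "1023 bytes".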


Mark_Larson


JJ, you have a bug in your code.

In the timing part you don't actually call the procedure to time. That is why all the clock counts are close.

You need to set the number of loops to 10,000,000 and the priority to REALTIME. It'll give you more accurate results.
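
The two things to check in a timing loop (is the procedure really called, and is the loop count high enough) can be sketched in C as follows. This is only an illustration under stated assumptions: it uses the standard clock() function and reports nanoseconds rather than cycles, it does not raise the process priority, and the names calls_made, sample_work and average_ns_per_call are hypothetical.

```c
#include <time.h>

/* Counts how often the timed procedure really ran, so a harness that
   forgets to call it is easy to spot. */
volatile unsigned long calls_made = 0;

/* Stand-in workload; replace with the procedure under test. */
void sample_work(void)
{
    calls_made++;
}

/* Calls fn() `loops` times and returns the average time per call in
   nanoseconds, based on clock(). Returns 0 when loops is 0. */
double average_ns_per_call(void (*fn)(void), unsigned long loops)
{
    if (loops == 0)
        return 0.0;

    clock_t start = clock();
    for (unsigned long i = 0; i < loops; i++)
        fn();
    clock_t stop = clock();

    double seconds = (double)(stop - start) / (double)CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)loops;
}
```

After average_ns_per_call(sample_work, 10000000UL), calls_made should equal the loop count; if it does not, the timing figures say nothing about the procedure.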

jj2007

Quote from: Mark_Larson on August 08, 2008, 12:55:34 AM

JJ you have a bug in your code.

  In the timing part you don't actually call the procedure to time.  That is why all the clocks are close

  You need to set the # of loops to 10,000,000
  and the priority to REALTIME

it'll give you more accurate results.

If you refer to the post above, flexbyte.zip below "Greg, 221 bytes:", then no, there is no bug; the procedure is being called. But you are right about the loop count: I had set it to 100 for testing and forgot to reset it. With 10,000,000, results are as follows on a P4, 3.4 GHz:

BSR, 124 bytes:
Test0   105 cycles
Test1   116 cycles
Test2   104 cycles
Test3   113 cycles
Test4   105 cycles

MMX, 150 bytes:
Test0   36 cycles
Test1   45 cycles
Test2   59 cycles
Test3   75 cycles
Test4   96 cycles

Greg, 226 bytes:
Test0   53 cycles
Test1   49 cycles
Test2   38 cycles
Test3   36 cycles
Test4   32 cycles


Greg's code is the fastest, but for practical purposes it would be useful to invert the order of the multiple .IF hibit >= tests.

I attach the version with the corrected loop count.


Mark_Larson

BSR, 122 bytes:
Test0   14 cycles
Test1   19 cycles
Test2   18 cycles
Test3   18 cycles
Test4   19 cycles

MMX, 150 bytes:
Test0   15 cycles
Test1   20 cycles
Test2   25 cycles
Test3   28 cycles
Test4   32 cycles

Greg, 221 bytes:
Test0   20 cycles
Test1   20 cycles
Test2   22 cycles
Test3   25 cycles
Test4   17 cycles

jj2007

Quote from: Mark_Larson on August 08, 2008, 09:40:18 AM

Strange that your figures are so much lower than mine. Different CPU?

Mark_Larson

Core 2 Duo.

That is why I was surprised by your numbers.

You don't need to use AAM. crt_printf supports printing only some of the decimal places.

AAM runs very slowly.
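
The printf route can be sketched in C like this: the "%.2f" precision specifier limits the output to two decimal places, so no manual digit splitting is needed. This is a minimal illustration, not the code from the thread; the function name is made up, and the KiB/MiB/GiB labels follow the thread title.

```c
#include <stdio.h>

/* Writes n into buf using binary units and two decimals, e.g. "1.50 KiB".
   Values below 1024 are printed as plain byte counts. Returns buf. */
char *format_binary_units(unsigned long long n, char *buf, size_t size)
{
    static const char *units[] = { "KiB", "MiB", "GiB", "TiB", "PiB", "EiB" };

    if (n < 1024) {
        snprintf(buf, size, "%llu bytes", n);
        return buf;
    }

    /* Divide by 1024 until the value fits the largest suitable unit. */
    double value = (double)n / 1024.0;
    int u = 0;
    while (value >= 1024.0 && u < 5) {
        value /= 1024.0;
        u++;
    }

    /* "%.2f" prints exactly two digits after the decimal point. */
    snprintf(buf, size, "%.2f %s", value, units[u]);
    return buf;
}
```

With a 32-byte buffer, format_binary_units(1536, buf, sizeof buf) gives "1.50 KiB". The trade-off is a floating-point division per unit step, which is convenient but not what you want in a cycle-counting contest.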


jj2007

Quote from: Mark_Larson on August 08, 2008, 09:52:40 AM
core 2 duo

that is why I was surprised by your #'s

you don't need to use AAM

crt_printf supports only printing some of the decimal places.

aam runs very slow,

Attached version with aam switched off - still pretty slow. Would you mind adding your code? I prepared a ShowQwMark slot, just copy & paste.


Mark_Larson

For some reason, when I add my code into yours, I get a big slowdown. But when I cut all 5 of your test data into my program, it runs as expected.

All 5 tests from your code run in 19 cycles on mine.

I am still going to do an SSE2 version, I just haven't gotten to it yet.

There's also a trick you can do with shifts instead of doing a MUL.

I have an interview today, so I don't know how long I will be on.

qWord

Inspired by Mirno's concept, I've created a function using SSE2.
It is not optimized at all, but it works well.

regards, qWord

FPU in a trice: SmplMath
It's that simple!