MASM for FUN - #0.0 [Nested Loops - formatted numbers - msgbox]

Started by frktons, May 05, 2010, 05:56:39 PM

Previous topic - Next topic

jj2007

Quote from: frktons on July 05, 2010, 04:34:51 PM
The strange and impressive thing is that somebody, in Cprogramming
forum where I posted the function, compiled it under Linux with GCC
and got [with the same code] a performace of 50:1 compared with
mine on Win7 and Pelles'C. That was very impressive. Just using
Linux and a different compiler gives the same code an incredible boost.  :eek

Incredible indeed. The snippet below takes 8.9 seconds as compared to 12 seconds for yours. If I take Masm32 str$ instead of MasmBasic Str$, it can be done in just over 4 seconds - Str$() is an allrounder, str$ swallows integers only. Now maybe, if we put Lingo on it, that could be reduced to 2 seconds, but 12/50=0.24 seconds. No way! My advice: Test the output of that miraculous Linux proggie... or rather: run it through Olly. I would not be surprised if the compiler optimised a little bit by putting a fixed string or similar ;-)

include \masm32\MasmBasic\MasmBasic.inc
.data
num dd -1234567890
Init
push Timer()
For_ n=1 To 50000000
void Str$(num)
Next
void Timer()
pop edx
sub eax, edx
Inkey Str$("\nYour puter needed %f seconds for showing TheNum: ", eax/1000), Str$(num)
Exit
end start

frktons

Quote from: clive on July 06, 2010, 04:10:35 PM
Perhaps you should look at the assembler code emitted by GCC to understand what it is doing. (-S option as I recall, generates a .S file at least for my cygwin hosted GCC MIPS compiler)

Unless you are measuring TIME elapsed on the same processor (model, speed), it is not going to be a very effective measure. Your code takes 17 seconds on my system and I didn't even recompile it.

-Clive

In order to reveal some tricks used by GCC, I'm now going
to test the function with random generated numbers, to see if
the Linux/GCC is still capable of similar optimization with -O3 flag
as they said.

Quote from: jj2007 on July 06, 2010, 05:37:13 PM
Incredible indeed. The snippet below takes 8.9 seconds as compared to 12 seconds for yours. If I take Masm32 str$ instead of MasmBasic Str$, it can be done in just over 4 seconds - Str$() is an allrounder, str$ swallows integers only. Now maybe, if we put Lingo on it, that could be reduced to 2 seconds, but 12/50=0.24 seconds. No way! My advice: Test the output of that miraculous Linux proggie... or rather: run it through Olly. I would not be surprised if the compiler optimised a little bit by putting a fixed string or similar ;-)

include \masm32\MasmBasic\MasmBasic.inc
.data
num dd -1234567890
Init
push Timer()
For_ n=1 To 50000000
void Str$(num)
Next
void Timer()
pop edx
sub eax, edx
Inkey Str$("\nYour puter needed %f seconds for showing TheNum: ", eax/1000), Str$(num)
Exit
end start


Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones.  :green
Mind is like a parachute. You know what to do in order to use it :-)

oex

Quote from: frktons on July 06, 2010, 06:29:08 PM
Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones.  :green

To find the same number you just run a cmp LastNum -> je and skip any other calculations

or use a jump table
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

frktons

Quote from: oex on July 06, 2010, 06:32:38 PM
Quote from: frktons on July 06, 2010, 06:29:08 PM
Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones.  :green

To find the same number you just run a cmp LastNum -> je and skip any other calculations

or use a jump table

With random numbers, changing them all the time I call the function, this
trick will not work?  Jump table should be very big for covering all the
numbers, maybe dividing by 1000, and using the remainder for a smaller
jump table could be doable in order to improve the performance.

Mind is like a parachute. You know what to do in order to use it :-)

oex

Sorry i misunderstood your post.... For the same number it would work for random numbers it wouldnt.... My suggestion was to explain how linux could be 50 times faster for the same number....

I didnt read up to the top only based it on your code you (er jj) just posted

For random numbers it would be difficult to make it any faster however for incrementing numbers there is plenty of ability to increase the speed if it is known the number is simply incrementing

str$ runs dwtoa function however 40000-40009 needs only one byte incremented etc not the full DW to A function run

ie cmp byte 2 (from right), 9 -> jne inc byte
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

jj2007

As Clive already wrote, try to get hold of the assembler file generated. Or just use the loop counter to eliminate the "optimisation":
   For_ n=1 To 50000000
      void Str$(n)
   Next

clive

If the compiler can roll the MOD,DIV operations into a single operation it would double the speed.

I don't think your code handles the ZERO/SIGN cases particularly well, or the NUL termination of the string. You could eliminate the 'zero' flag if you are not using it anywhere else, and probably address the initialization of the string better. The 'neg' flag could be held locally.

Here was a quick hack of the code

void int_format(int num)
{
  int x = 0;
  int y = 0;
  int remain = 0;     // integer variable to store the remainder
  int count = 0;                    // integer counter for positioning the separator
  const int len_str = sizeof(buffer) - 1;                 // string length less the terminator NULL
  const int ten = 10;

  x = len_str;

  if (num != 0)
  {
    zero = FALSE;

    if (num < 0)
    {
      neg = TRUE;
      num = -num; // transform number to positive if negative
    }
    else
      neg = FALSE;

    while(num)
    {
      if (count == 3)
    {
      count = 0;
      buffer[x--] = sep;
      }

      buffer[x--] = '0' + (char)(num % ten);

      num = num / ten;

      count++;
    }

    if (sign == TRUE)
    {
      if (neg == TRUE)
        buffer[x] = '-';
      else
        buffer[x] = '+';
    }
    else
       x++;
  }
  else
  {
    buffer[x] = '0';

    zero = TRUE;
  }

  if (alignment  == 'L')
  {
    if (x == 0)
     return;

    for (y = 0; y < sizeof(buffer); y++, x++)
    {
      buffer[y] = buffer[x];

      if (buffer[y] == '\0')
        return;
     } // end for
  }

  return;
}
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on July 06, 2010, 09:05:03 PM
If the compiler can roll the MOD,DIV operations into a single operation it would double the speed.

I don't think your code handles the ZERO/SIGN cases particularly well, or the NUL termination of the string. You could eliminate the 'zero' flag if you are not using it anywhere else, and probably address the initialization of the string better. The 'neg' flag could be held locally.

Here was a quick hack of the code

void int_format(int num)
{
  int x = 0;
  int y = 0;
  int remain = 0;     // integer variable to store the remainder
  int count = 0;                    // integer counter for positioning the separator
  const int len_str = sizeof(buffer) - 1;                 // string length less the terminator NULL
  const int ten = 10;

  x = len_str;

  if (num != 0)
  {
    zero = FALSE;

    if (num < 0)
    {
      neg = TRUE;
      num = -num; // transform number to positive if negative
    }
    else
      neg = FALSE;

    while(num)
    {
      if (count == 3)
    {
      count = 0;
      buffer[x--] = sep;
      }

      buffer[x--] = '0' + (char)(num % ten);

      num = num / ten;

      count++;
    }

    if (sign == TRUE)
    {
      if (neg == TRUE)
        buffer[x] = '-';
      else
        buffer[x] = '+';
    }
    else
       x++;
  }
  else
  {
    buffer[x] = '0';

    zero = TRUE;
  }

  if (alignment  == 'L')
  {
    if (x == 0)
     return;

    for (y = 0; y < sizeof(buffer); y++, x++)
    {
      buffer[y] = buffer[x];

      if (buffer[y] == '\0')
        return;
     } // end for
  }

  return;
}

Thanks clive, it looks a little bit faster with your hack.  :U

Yes something is to refine a little to make it work properly.  :P

I have a strange display, with L after the formatted numbers,
do you know what they are? Errors or normal signs for Long numbers?


                         Testing version : 0.60

                         ----------------------

The value of num is: -1234567890

The formatted value of num is: -1.234.567.890LL
Elapsed ClickTime: 1.404L
to perform 10.000.000L cycles of the formatting function


Pelles' C likes TRUE/FALSE in lower letters: true/false, and
I don't know why the stdbool.h was written this way.

This is faster than the lookup table?
      buffer[x--] = '0' + (char)(num % ten);

In a later version I initialize the buffer this way:
    char buffer[15] = {'\0'};        // string array for the formatted number
                                              // 10 digits max + 3 separators max + sign + NULL


Is it better this way?
Mind is like a parachute. You know what to do in order to use it :-)

clive

Microsoft C uses BOOL/TRUE/FALSE, I don't use Pelle

The 'L' are probably because you don't terminate the string. Most of the code is predicated on the fact that buffer[14] is NUL, but is never set as such. The "char buffer[15] = {'\0'};" does not achieve this, the first character is a NUL, but you are filling the buffer from the back.


  x = len_str;

  buffer[x--] = 0; // ADD THIS

  if (num != 0)


A simple register/immediate addition should be more efficient than a memory read. A table would help above base 10 due to the non-linearity of the ASCII translation. The overall cost of the table might be hidden. The most time consuming operation is the divide and modulus function. While DIV will provide both, most C compilers aren't going to fold the '/' and '%' pair together.

Microsoft C/C++ 12.00 optimized, 1 IDIV, 1 reciprocal IMUL (12 seconds, 50M iterations)

; 143  :       buffer[x--] = '0' + (char)(num % ten);

  00232 8b c1 mov eax, ecx
  00234 bd 0a 00 00 00 mov ebp, 10 ; 0000000aH
  00239 99 cdq
  0023a f7 fd idiv ebp

NOTE EAX here is num / ten

; 144  :
; 145  :       num = num / ten;

  0023c b8 67 66 66 66 mov eax, 1717986919 ; 66666667H
  00241 80 c2 30 add dl, 48 ; 00000030H
  00244 88 96 00 00 00
00 mov BYTE PTR _buffer[esi], dl
  0024a f7 e9 imul ecx
  0024c c1 fa 02 sar edx, 2
  0024f 8b c2 mov eax, edx
  00251 4e dec esi
  00252 c1 e8 1f shr eax, 31 ; 0000001fH
  00255 03 d0 add edx, eax

; 146  :
; 147  :       count++;

  00257 47 inc edi
  00258 8b ca mov ecx, edx
  0025a 85 c9 test ecx, ecx
  0025c 75 c6 jne SHORT $L52929


Microsoft C/C++ 12.00 unoptimized, 2 IDIV (25 seconds, 50M iterations)

; 143  :       buffer[x--] = '0' + (char)(num % ten);

  003b1 8b 45 08 mov eax, DWORD PTR _num$[ebp]
  003b4 99 cdq
  003b5 f7 7d fc idiv DWORD PTR _ten$[ebp]
  003b8 0f be ca movsx ecx, dl
  003bb 83 c1 30 add ecx, 48 ; 00000030H
  003be 8b 55 f0 mov edx, DWORD PTR _x$[ebp]
  003c1 88 8a 00 00 00
00 mov BYTE PTR _buffer[edx], cl
  003c7 8b 45 f0 mov eax, DWORD PTR _x$[ebp]
  003ca 83 e8 01 sub eax, 1
  003cd 89 45 f0 mov DWORD PTR _x$[ebp], eax

; 144  :
; 145  :       num = num / ten;

  003d0 8b 45 08 mov eax, DWORD PTR _num$[ebp]
  003d3 99 cdq
  003d4 f7 7d fc idiv DWORD PTR _ten$[ebp]
  003d7 89 45 08 mov DWORD PTR _num$[ebp], eax

; 146  :
; 147  :       count++;

  003da 8b 4d f8 mov ecx, DWORD PTR _count$[ebp]
  003dd 83 c1 01 add ecx, 1
  003e0 89 4d f8 mov DWORD PTR _count$[ebp], ecx

; 148  :     }

  003e3 eb a1 jmp SHORT $L52885
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on July 07, 2010, 03:06:15 AM
Microsoft C uses BOOL/TRUE/FALSE, I don't use Pelle

The 'L' are probably because you don't terminate the string. Most of the code is predicated on the fact that buffer[14] is NUL, but is never set as such. The "char buffer[15] = {'\0'};" does not achieve this, the first character is a NUL, but you are filling the buffer from the back.


  x = len_str;

  buffer[x--] = 0; // ADD THIS

  if (num != 0)


Well, according to what I've read so far, I had the naive assumption
that char buffer[15] = {'\0'} was referring to all the bytes
of the string array, not only the first.  ::)

I'll add the code you suggest and see what happens as soon as I get
home and put my hands on my pc.  :P

Edit: OK clive, it worked, thanks  :U
Mind is like a parachute. You know what to do in order to use it :-)

frktons

And the new version, smarter one, posted by malcom on
cprogramming forum:

int iMalcsUltimateFormat(long snum, char s[])
{
    char *ps = s;
    unsigned long num1 = snum, num2, num3, div;
    if (snum < 0) {
        *ps++ = '-';
        num1 = -snum;
    }
    if (num1 < 10000) {
        if (num1 < 10) goto L1;
        if (num1 < 100) goto L2;
        if (num1 < 1000) goto L3;
    } else {
        num2 = num1 / 10000;
        num1 -= num2 * 10000;
        if (num2 < 10000) {
            if (num2 < 10) goto L5;
            if (num2 < 100) goto L6;
            if (num2 < 1000) goto L7;
        } else {
            num3 = num2 / 10000;
            num2 -= num3 * 10000;
            if (num3 >= 10) {
                *ps++ = '0' + (char)(div = (num3*6554UL)>>16); num3 -= div*10;
                *ps++ = ',';
            }
            *ps++ = '0' + (char)(num3);
        }
        *ps++ = '0' + (char)(div = (num2*8389UL)>>23); num2 -= div*1000;
L7:
        *ps++ = '0' + (char)(div = (num2*5243UL)>>19); num2 -= div*100;
        *ps++ = ',';
L6:
        *ps++ = '0' + (char)(div = (num2*6554UL)>>16); num2 -= div*10;
L5:
        *ps++ = '0' + (char)(num2);
    }
    *ps++ = '0' + (char)(div = (num1*8389UL)>>23); num1 -= div*1000;
    *ps++ = ',';
L3:
    *ps++ = '0' + (char)(div = (num1*5243UL)>>19); num1 -= div*100;
L2:
    *ps++ = '0' + (char)(div = (num1*6554UL)>>16); num1 -= div*10;
L1:
    *ps++ = '0' + (char)(num1);
    *ps = '\0';
    return ps - s;
}


is about 3 times faster than the unoptimized, crumpled solution I
posted and clive corrected a little.

The entire pgm for testing pourpose source and exe are attached.

This is the output it gives:

            Testing version : 0.60

            --------------------------

The value of num is: -1234567890

The formatted value of num is: -1,234,567,890

Elapsed ClickTime: 463
to perform 10,000,000 cycles of the formatting function


It looks like a better algorithm, nearly as fast as assembly code
somebody tested here.  :U
Mind is like a parachute. You know what to do in order to use it :-)

drizz

Quote from: frktons on July 07, 2010, 04:13:45 PMIt looks like a better algorithm, nearly as fast as assembly code
Here is my fastest int2str routine in assembly (Int3264ToStr.asm in attachment)
http://www.masm32.com/board/index.php?topic=9906.msg93198#msg93198

The funny thing is that my C translation produces almost identical code when compiled. :)
Though, I did carefully cater the compiler. 

uint32_t uint2str(uint32_t num, uint8_t *str)
{
uint8_t *p = str;
uint32_t val = num, first5, tmp;
uint64_t w = num;

w *= 0xA7C5AC47UL;
w += 0xA7C5AC47ULL;
first5 = (w >> 48) & 0xFFFFFFFFUL;
tmp = (int32_t)first5 * 100000L;
w = first5;
val -= tmp;
w *= 0x68DB9UL;

if (first5 == 0) goto NEXT5;
if (first5 > 9999) goto DIGIT0;
if (first5 > 999) goto DIGIT1;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (first5 > 99) goto DIGIT2;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (first5 > 9) goto DIGIT3;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
goto DIGIT4;
NEXT5: w=val;
w*=0x68DB9UL;
if (val > 9999) goto DIGIT5;
if (val > 999) goto DIGIT6;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (val > 99) goto DIGIT7;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (val > 9) goto DIGIT8;
*((uint16_t *)p)=(uint16_t)val + '\x30';
return 1;
DIGIT0: *p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT1: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT2: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT3: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT4: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
w=val;
w*=0x68DB9UL;
DIGIT5: *p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT6: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT7: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT8: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT9: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
*p = '\0';
return p-str;
}

PS No formatting!
The truth cannot be learned ... it can only be recognized.

frktons

Quote from: drizz on July 08, 2010, 06:21:10 AM
Quote from: frktons on July 07, 2010, 04:13:45 PMIt looks like a better algorithm, nearly as fast as assembly code
Here is my fastest int2str routine in assembly (Int3264ToStr.asm in attachment)
http://www.masm32.com/board/index.php?topic=9906.msg93198#msg93198

The funny thing is that my C translation produces almost identical code when compiled. :)
Though, I did carefully cater the compiler. 

uint32_t uint2str(uint32_t num, uint8_t *str)
{
uint8_t *p = str;
uint32_t val = num, first5, tmp;
uint64_t w = num;

w *= 0xA7C5AC47UL;
w += 0xA7C5AC47ULL;
first5 = (w >> 48) & 0xFFFFFFFFUL;
tmp = (int32_t)first5 * 100000L;
w = first5;
val -= tmp;
w *= 0x68DB9UL;

if (first5 == 0) goto NEXT5;
if (first5 > 9999) goto DIGIT0;
if (first5 > 999) goto DIGIT1;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (first5 > 99) goto DIGIT2;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (first5 > 9) goto DIGIT3;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
goto DIGIT4;
NEXT5: w=val;
w*=0x68DB9UL;
if (val > 9999) goto DIGIT5;
if (val > 999) goto DIGIT6;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (val > 99) goto DIGIT7;
w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
if (val > 9) goto DIGIT8;
*((uint16_t *)p)=(uint16_t)val + '\x30';
return 1;
DIGIT0: *p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT1: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT2: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT3: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT4: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
w=val;
w*=0x68DB9UL;
DIGIT5: *p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT6: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT7: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT8: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT9: w = (w & 0xFFFFFFFFUL) * 10;
*p++ = (uint8_t) (w >> 32) + 0x30;
*p = '\0';
return p-str;
}

PS No formatting!

Thanks drizz, I'll have a look  :U
Mind is like a parachute. You know what to do in order to use it :-)

frktons

With random numbers the peformance is now pretty similar
among different compilers, so using the same number over and
over again gave the opportunity to the compiler to play some
nasty trick, and not doing the formatting when not necessary
that means all along the iteration except for the first time.

If some of you are interested in a faster solution to translate
into Assembly for your libraries, have a look Here.

See you
Mind is like a parachute. You know what to do in order to use it :-)