MASM for FUN - #0.0 [Nested Loops - formatted numbers - msgbox]

jj2007 · July 06, 2010, 05:37:13 PM

Quote from: frktons on July 05, 2010, 04:34:51 PM
The strange and impressive thing is that somebody, in Cprogramming
forum where I posted the function, compiled it under Linux with GCC
and got [with the same code] a performace of 50:1 compared with
mine on Win7 and Pelles'C. That was very impressive. Just using
Linux and a different compiler gives the same code an incredible boost. :eek

Incredible indeed. The snippet below takes 8.9 seconds as compared to 12 seconds for yours. If I take Masm32 str$ instead of MasmBasic Str$, it can be done in just over 4 seconds - Str$() is an allrounder, str$ swallows integers only. Now maybe, if we put Lingo on it, that could be reduced to 2 seconds, but 12/50=0.24 seconds. No way! My advice: Test the output of that miraculous Linux proggie... or rather: run it through Olly. I would not be surprised if the compiler optimised a little bit by putting a fixed string or similar ;-)

Code Select

include \masm32\MasmBasic\MasmBasic.inc
.data
num	dd -1234567890
	Init
	push Timer()
	For_ n=1 To 50000000
		void Str$(num)
	Next
	void Timer()
	pop edx
	sub eax, edx
	Inkey Str$("\nYour puter needed %f seconds for showing TheNum: ", eax/1000), Str$(num)
	Exit
end start

frktons · July 06, 2010, 06:29:08 PM

Quote from: clive on July 06, 2010, 04:10:35 PM
Perhaps you should look at the assembler code emitted by GCC to understand what it is doing. (-S option as I recall, generates a .S file at least for my cygwin hosted GCC MIPS compiler)

Unless you are measuring TIME elapsed on the same processor (model, speed), it is not going to be a very effective measure. Your code takes 17 seconds on my system and I didn't even recompile it.

-Clive

In order to reveal some tricks used by GCC, I'm now going
to test the function with random generated numbers, to see if
the Linux/GCC is still capable of similar optimization with -O3 flag
as they said.

Quote from: jj2007 on July 06, 2010, 05:37:13 PM
Incredible indeed. The snippet below takes 8.9 seconds as compared to 12 seconds for yours. If I take Masm32 str$ instead of MasmBasic Str$, it can be done in just over 4 seconds - Str$() is an allrounder, str$ swallows integers only. Now maybe, if we put Lingo on it, that could be reduced to 2 seconds, but 12/50=0.24 seconds. No way! My advice: Test the output of that miraculous Linux proggie... or rather: run it through Olly. I would not be surprised if the compiler optimised a little bit by putting a fixed string or similar ;-)

Code Select Expand
include \masm32\MasmBasic\MasmBasic.inc .data num dd -1234567890 Init push Timer() For_ n=1 To 50000000 void Str$(num) Next void Timer() pop edx sub eax, edx Inkey Str$("\nYour puter needed %f seconds for showing TheNum: ", eax/1000), Str$(num) Exit end start

Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones. :green

oex · July 06, 2010, 06:32:38 PM

Quote from: frktons on July 06, 2010, 06:29:08 PM
Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones. :green

To find the same number you just run a cmp LastNum -> je and skip any other calculations

or use a jump table

frktons · July 06, 2010, 08:00:22 PM

Quote from: oex on July 06, 2010, 06:32:38 PM
Quote from: frktons on July 06, 2010, 06:29:08 PM
Well, I was really impressed, so let's find out if there is a trick of the tail
somewhere for using always the same number, changing it with random ones. :green

To find the same number you just run a cmp LastNum -> je and skip any other calculations

or use a jump table

With random numbers, changing them all the time I call the function, this
trick will not work? Jump table should be very big for covering all the
numbers, maybe dividing by 1000, and using the remainder for a smaller
jump table could be doable in order to improve the performance.

oex · July 06, 2010, 08:12:59 PM

Sorry i misunderstood your post.... For the same number it would work for random numbers it wouldnt.... My suggestion was to explain how linux could be 50 times faster for the same number....

I didnt read up to the top only based it on your code you (er jj) just posted

For random numbers it would be difficult to make it any faster however for incrementing numbers there is plenty of ability to increase the speed if it is known the number is simply incrementing

str$ runs dwtoa function however 40000-40009 needs only one byte incremented etc not the full DW to A function run

ie cmp byte 2 (from right), 9 -> jne inc byte

jj2007 · July 06, 2010, 08:20:45 PM

As Clive already wrote, try to get hold of the assembler file generated. Or just use the loop counter to eliminate the "optimisation":
   For_ n=1 To 50000000
      void Str$(n)
   Next

clive · July 06, 2010, 09:05:03 PM

If the compiler can roll the MOD,DIV operations into a single operation it would double the speed.

I don't think your code handles the ZERO/SIGN cases particularly well, or the NUL termination of the string. You could eliminate the 'zero' flag if you are not using it anywhere else, and probably address the initialization of the string better. The 'neg' flag could be held locally.

Here was a quick hack of the code

Code Select

void int_format(int num)
{
  int x = 0;
  int y = 0;
  int remain = 0;     // integer variable to store the remainder
  int count = 0;                    // integer counter for positioning the separator
  const int len_str = sizeof(buffer) - 1;                 // string length less the terminator NULL
  const int ten = 10;

  x = len_str;

  if (num != 0)
  {
    zero = FALSE;

    if (num < 0)
    {
      neg = TRUE;
      num = -num; // transform number to positive if negative
    }
    else
      neg = FALSE;

    while(num)
    {
      if (count == 3)
	    {
	      count = 0;
	      buffer[x--] = sep;
      }

      buffer[x--] = '0' + (char)(num % ten);

      num = num / ten;

      count++;
    }

    if (sign == TRUE)
    {
      if (neg == TRUE)
        buffer[x] = '-';
      else
        buffer[x] = '+';
    }
    else
       x++;
  }
  else
  {
    buffer[x] = '0';

    zero = TRUE;
  }

  if (alignment  == 'L')
  {
    if (x == 0)
     return;

    for (y = 0; y < sizeof(buffer); y++, x++)
    {
      buffer[y] = buffer[x];

      if (buffer[y] == '\0')
        return;
     } // end for
  }

  return;
}

frktons · July 06, 2010, 11:13:22 PM

Quote from: clive on July 06, 2010, 09:05:03 PM
If the compiler can roll the MOD,DIV operations into a single operation it would double the speed.

I don't think your code handles the ZERO/SIGN cases particularly well, or the NUL termination of the string. You could eliminate the 'zero' flag if you are not using it anywhere else, and probably address the initialization of the string better. The 'neg' flag could be held locally.

Here was a quick hack of the code

Code Select Expand
void int_format(int num) { int x = 0; int y = 0; int remain = 0; // integer variable to store the remainder int count = 0; // integer counter for positioning the separator const int len_str = sizeof(buffer) - 1; // string length less the terminator NULL const int ten = 10; x = len_str; if (num != 0) { zero = FALSE; if (num < 0) { neg = TRUE; num = -num; // transform number to positive if negative } else neg = FALSE; while(num) { if (count == 3) { count = 0; buffer[x--] = sep; } buffer[x--] = '0' + (char)(num % ten); num = num / ten; count++; } if (sign == TRUE) { if (neg == TRUE) buffer[x] = '-'; else buffer[x] = '+'; } else x++; } else { buffer[x] = '0'; zero = TRUE; } if (alignment == 'L') { if (x == 0) return; for (y = 0; y < sizeof(buffer); y++, x++) { buffer[y] = buffer[x]; if (buffer[y] == '\0') return; } // end for } return; }

Thanks clive, it looks a little bit faster with your hack. :U

Yes something is to refine a little to make it work properly. :P

I have a strange display, with L after the formatted numbers,
do you know what they are? Errors or normal signs for Long numbers?

Code Select



                         Testing version : 0.60

                         ----------------------

 The value of num is: -1234567890

 The formatted value of num is: -1.234.567.890LL
 Elapsed ClickTime: 1.404L
 to perform 10.000.000L cycles of the formatting function

Pelles' C likes TRUE/FALSE in lower letters: true/false, and
I don't know why the stdbool.h was written this way.

This is faster than the lookup table?

Code Select

buffer[x--] = '0' + (char)(num % ten);

In a later version I initialize the buffer this way:

Code Select

    char buffer[15] = {'\0'};        // string array for the formatted number
                                              // 10 digits max + 3 separators max + sign + NULL

Is it better this way?

clive · July 07, 2010, 03:06:15 AM

Microsoft C uses BOOL/TRUE/FALSE, I don't use Pelle

The 'L' are probably because you don't terminate the string. Most of the code is predicated on the fact that buffer[14] is NUL, but is never set as such. The "char buffer[15] = {'\0'};" does not achieve this, the first character is a NUL, but you are filling the buffer from the back.

Code Select


  x = len_str;

  buffer[x--] = 0; // ADD THIS

  if (num != 0)

A simple register/immediate addition should be more efficient than a memory read. A table would help above base 10 due to the non-linearity of the ASCII translation. The overall cost of the table might be hidden. The most time consuming operation is the divide and modulus function. While DIV will provide both, most C compilers aren't going to fold the '/' and '%' pair together.

Microsoft C/C++ 12.00 optimized, 1 IDIV, 1 reciprocal IMUL (12 seconds, 50M iterations)

Code Select

; 143  :       buffer[x--] = '0' + (char)(num % ten);

  00232	8b c1		 mov	 eax, ecx
  00234	bd 0a 00 00 00	 mov	 ebp, 10			; 0000000aH
  00239	99		 cdq
  0023a	f7 fd		 idiv	 ebp

NOTE EAX here is num / ten

; 144  :
; 145  :       num = num / ten;

  0023c	b8 67 66 66 66	 mov	 eax, 1717986919		; 66666667H
  00241	80 c2 30	 add	 dl, 48			; 00000030H
  00244	88 96 00 00 00
	00		 mov	 BYTE PTR _buffer[esi], dl
  0024a	f7 e9		 imul	 ecx
  0024c	c1 fa 02	 sar	 edx, 2
  0024f	8b c2		 mov	 eax, edx
  00251	4e		 dec	 esi
  00252	c1 e8 1f	 shr	 eax, 31			; 0000001fH
  00255	03 d0		 add	 edx, eax

; 146  :
; 147  :       count++;

  00257	47		 inc	 edi
  00258	8b ca		 mov	 ecx, edx
  0025a	85 c9		 test	 ecx, ecx
  0025c	75 c6		 jne	 SHORT $L52929

Microsoft C/C++ 12.00 unoptimized, 2 IDIV (25 seconds, 50M iterations)

Code Select

; 143  :       buffer[x--] = '0' + (char)(num % ten);

  003b1	8b 45 08	 mov	 eax, DWORD PTR _num$[ebp]
  003b4	99		 cdq
  003b5	f7 7d fc	 idiv	 DWORD PTR _ten$[ebp]
  003b8	0f be ca	 movsx	 ecx, dl
  003bb	83 c1 30	 add	 ecx, 48			; 00000030H
  003be	8b 55 f0	 mov	 edx, DWORD PTR _x$[ebp]
  003c1	88 8a 00 00 00
	00		 mov	 BYTE PTR _buffer[edx], cl
  003c7	8b 45 f0	 mov	 eax, DWORD PTR _x$[ebp]
  003ca	83 e8 01	 sub	 eax, 1
  003cd	89 45 f0	 mov	 DWORD PTR _x$[ebp], eax

; 144  :
; 145  :       num = num / ten;

  003d0	8b 45 08	 mov	 eax, DWORD PTR _num$[ebp]
  003d3	99		 cdq
  003d4	f7 7d fc	 idiv	 DWORD PTR _ten$[ebp]
  003d7	89 45 08	 mov	 DWORD PTR _num$[ebp], eax

; 146  :
; 147  :       count++;

  003da	8b 4d f8	 mov	 ecx, DWORD PTR _count$[ebp]
  003dd	83 c1 01	 add	 ecx, 1
  003e0	89 4d f8	 mov	 DWORD PTR _count$[ebp], ecx

; 148  :     }

  003e3	eb a1		 jmp	 SHORT $L52885

frktons · July 07, 2010, 12:14:24 PM

Quote from: clive on July 07, 2010, 03:06:15 AM
Microsoft C uses BOOL/TRUE/FALSE, I don't use Pelle

The 'L' are probably because you don't terminate the string. Most of the code is predicated on the fact that buffer[14] is NUL, but is never set as such. The "char buffer[15] = {'\0'};" does not achieve this, the first character is a NUL, but you are filling the buffer from the back.

Code Select Expand
x = len_str; buffer[x--] = 0; // ADD THIS if (num != 0)

Well, according to what I've read so far, I had the naive assumption
that char buffer[15] = {'\0'} was referring to all the bytes
of the string array, not only the first. ::)

I'll add the code you suggest and see what happens as soon as I get
home and put my hands on my pc. :P

Edit: OK clive, it worked, thanks :U

frktons · July 07, 2010, 04:13:45 PM

And the new version, smarter one, posted by malcom on
cprogramming forum:

Code Select


int iMalcsUltimateFormat(long snum, char s[])
{
    char *ps = s;
    unsigned long num1 = snum, num2, num3, div;
    if (snum < 0) {
        *ps++ = '-';
        num1 = -snum;
    }
    if (num1 < 10000) {
        if (num1 < 10) goto L1;
        if (num1 < 100) goto L2;
        if (num1 < 1000) goto L3;
    } else {
        num2 = num1 / 10000;
        num1 -= num2 * 10000;
        if (num2 < 10000) {
            if (num2 < 10) goto L5;
            if (num2 < 100) goto L6;
            if (num2 < 1000) goto L7;
        } else {
            num3 = num2 / 10000;
            num2 -= num3 * 10000;
            if (num3 >= 10) {
                *ps++ = '0' + (char)(div = (num3*6554UL)>>16); num3 -= div*10;
                *ps++ = ',';
            }
            *ps++ = '0' + (char)(num3);
        }
        *ps++ = '0' + (char)(div = (num2*8389UL)>>23); num2 -= div*1000;
L7:
        *ps++ = '0' + (char)(div = (num2*5243UL)>>19); num2 -= div*100;
        *ps++ = ',';
L6:
        *ps++ = '0' + (char)(div = (num2*6554UL)>>16); num2 -= div*10;
L5:
        *ps++ = '0' + (char)(num2);
    }
    *ps++ = '0' + (char)(div = (num1*8389UL)>>23); num1 -= div*1000;
    *ps++ = ',';
L3:
    *ps++ = '0' + (char)(div = (num1*5243UL)>>19); num1 -= div*100;
L2:
    *ps++ = '0' + (char)(div = (num1*6554UL)>>16); num1 -= div*10;
L1:
    *ps++ = '0' + (char)(num1);
    *ps = '\0';
    return ps - s;
}

is about 3 times faster than the unoptimized, crumpled solution I
posted and clive corrected a little.

The entire pgm for testing pourpose source and exe are attached.

This is the output it gives:

Code Select


            Testing version : 0.60

            --------------------------

 The value of num is: -1234567890

 The formatted value of num is: -1,234,567,890

 Elapsed ClickTime: 463
 to perform 10,000,000 cycles of the formatting function

It looks like a better algorithm, nearly as fast as assembly code
somebody tested here. :U

drizz · July 08, 2010, 06:21:10 AM

Quote from: frktons on July 07, 2010, 04:13:45 PMIt looks like a better algorithm, nearly as fast as assembly code

Here is my fastest int2str routine in assembly (Int3264ToStr.asm in attachment)
http://www.masm32.com/board/index.php?topic=9906.msg93198#msg93198

The funny thing is that my C translation produces almost identical code when compiled. :)
Though, I did carefully cater the compiler.

Code Select

uint32_t uint2str(uint32_t num, uint8_t *str)
{
		uint8_t *p = str;
		uint32_t val = num, first5, tmp;
		uint64_t w = num;
		
		w *= 0xA7C5AC47UL;
		w += 0xA7C5AC47ULL;
		first5 = (w >> 48) & 0xFFFFFFFFUL;
		tmp = (int32_t)first5 * 100000L;
		w = first5; 
		val -= tmp;
		w *= 0x68DB9UL;

		if (first5 == 0) goto NEXT5;
		if (first5 > 9999) goto DIGIT0;
		if (first5 > 999) goto DIGIT1;
		w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
		if (first5 > 99) goto DIGIT2;
		w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
		if (first5 > 9) goto DIGIT3;
		w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
		goto DIGIT4;
NEXT5:	w=val;
		w*=0x68DB9UL;
		if (val > 9999) goto DIGIT5;
		if (val > 999) goto DIGIT6;
		w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
		if (val > 99) goto DIGIT7;
		w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL;
		if (val > 9) goto DIGIT8;
		*((uint16_t *)p)=(uint16_t)val + '\x30';
		return 1;
DIGIT0:	*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT1:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT2:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT3:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT4:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
		w=val;
		w*=0x68DB9UL;
DIGIT5:	*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT6:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT7:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT8:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
DIGIT9:	w = (w & 0xFFFFFFFFUL) * 10;
		*p++ = (uint8_t) (w >> 32) + 0x30;
		*p = '\0'; 
		return p-str;
}

PS No formatting!

frktons · July 08, 2010, 08:13:03 AM

Quote from: drizz on July 08, 2010, 06:21:10 AM
Quote from: frktons on July 07, 2010, 04:13:45 PMIt looks like a better algorithm, nearly as fast as assembly code
Here is my fastest int2str routine in assembly (Int3264ToStr.asm in attachment)
http://www.masm32.com/board/index.php?topic=9906.msg93198#msg93198

The funny thing is that my C translation produces almost identical code when compiled. :)
Though, I did carefully cater the compiler.

Code Select Expand
uint32_t uint2str(uint32_t num, uint8_t *str) { uint8_t *p = str; uint32_t val = num, first5, tmp; uint64_t w = num; w *= 0xA7C5AC47UL; w += 0xA7C5AC47ULL; first5 = (w >> 48) & 0xFFFFFFFFUL; tmp = (int32_t)first5 * 100000L; w = first5; val -= tmp; w *= 0x68DB9UL; if (first5 == 0) goto NEXT5; if (first5 > 9999) goto DIGIT0; if (first5 > 999) goto DIGIT1; w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL; if (first5 > 99) goto DIGIT2; w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL; if (first5 > 9) goto DIGIT3; w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL; goto DIGIT4; NEXT5: w=val; w*=0x68DB9UL; if (val > 9999) goto DIGIT5; if (val > 999) goto DIGIT6; w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL; if (val > 99) goto DIGIT7; w = (uint32_t)(w & 0xFFFFFFFFUL) * 10UL; if (val > 9) goto DIGIT8; *((uint16_t *)p)=(uint16_t)val + '\x30'; return 1; DIGIT0: *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT1: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT2: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT3: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT4: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; w=val; w*=0x68DB9UL; DIGIT5: *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT6: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT7: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT8: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; DIGIT9: w = (w & 0xFFFFFFFFUL) * 10; *p++ = (uint8_t) (w >> 32) + 0x30; *p = '\0'; return p-str; }
PS No formatting!

Thanks drizz, I'll have a look :U

frktons · July 10, 2010, 10:12:40 AM

With random numbers the peformance is now pretty similar
among different compilers, so using the same number over and
over again gave the opportunity to the compiler to play some
nasty trick, and not doing the formatting when not necessary
that means all along the iteration except for the first time.

If some of you are interested in a faster solution to translate
into Assembly for your libraries, have a look Here.

See you

News:

MASM for FUN - #0.0 [Nested Loops - formatted numbers - msgbox]