MASM for FUN - #0.0 [Nested Loops - formatted numbers - msgbox]

Started by frktons, May 05, 2010, 05:56:39 PM

Previous topic - Next topic

Ghandi

It is also important to note that the routine i posted above does not preserve registers but should, so i would change the function header with:


FormatString PROC uses ebx edx esi edi pInput:DWORD,pOutput:DWORD

frktons

Quote from: Ghandi on May 06, 2010, 02:05:01 PM
It is also important to note that the routine i posted above does not preserve registers but should, so i would change the function header with:


FormatString PROC uses ebx edx esi edi pInput:DWORD,pOutput:DWORD


Thanks Ghandi,

as I have time I'll give it a try.  :U
Mind is like a parachute. You know what to do in order to use it :-)

frktons

While I was travelling for work, I came up with a basic idea for
formatting huge numbers [qword ones] or 32 bit integer numbers
with basic MASM OPCODES.

Let's start with the simpler ones, 32 bit numbers.

The idea is to allocate an array of string bytes [10 elements]
with value: "0" ... "9", and DIVide the 32 bit value by ten
using the remainder to get the corresponding byte of the array,
and placing it into the buiding formatted string, starting from right to left.
The quotient of the DIV is used for the next DIV until it
becomes zero.
Counting three bytes, before the 4th we place a "," and the counter
is zeroed.

What do you think?

Am I using the worst of all possible algos?

Could it be done just for exercising some basic opcodes?

Is it better to use binary or BCD format numbers for this task?

Should I use three 32 bits general registers for dividend, divisor, quotient?

For 64 bit numbers is it better to use MMX registers?

Have they any particular opcodes I have to be aware of?

Thanks

Frank
Mind is like a parachute. You know what to do in order to use it :-)

clive

Here's a code fragment using basic opcodes, the IDIV's are quite expensive, but it should be easy enough to follow.
I make no claims that it is super efficient.
It uses the stack to deal with the fact digits come out backward from the division.
I divide by 1000 to get groups of three digits, and then divide that remainder by 10 three times to get the digits out.
It basically tests for 0-9, 0-99, or 0-999 to determine how many digits it needs when then current value is less than 1000
Personally I'd use a pair of IDIV's to handle 64-bit values.

; EAX = 32-bit number
; ESI = string buffer for NUL terminated ASCII
; Uses ESI,EDI,EAX,ECX,EDX

    push 0 ; Mark stack end with NUL

divloop:

    mov  ecx,1000 ; Divide into 3 digit groups
    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx ; eax = edx:eax / ecx, edx = edx:eax % ecx

    mov edi,eax ; Save division result

    mov ecx,10 ; Subdivide in 10's
    mov eax,edx ; Get remainder

    or edi,edi ; Still number left, so at least 3 digits in remainder
    jnz digit000

    cmp eax,10 ; remainder has one digit
    jb  digit0

    cmp eax,100 ; remainder has two digits
    jb  digit00

digit000: ; 3 digits

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx ; Stack

digit00: ; 2 digits

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx    ; Stack

digit0: ; 1 digit

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx    ; Stack

    mov eax,edi ; Recover remaining number

    or eax,eax ; Zero?
    jz poploop

    push  2Ch ; Comma added to groups of three digits
    jmp divloop

poploop:
    pop eax ; Recover next digit
    mov [esi],al ; Add to string
    inc esi
    or  eax,eax ; Was it a NUL?
    jnz poploop


1
12
128
1,000
10,000
100,000
256,256,256
1,234,567,890
0
4,294,967,295
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on May 14, 2010, 05:28:49 PM
Here's a code fragment using basic opcodes, the IDIV's are quite expensive, but it should be easy enough to follow.
I make no claims that it is super efficient.
It uses the stack to deal with the fact digits come out backward from the division.
I divide by 1000 to get groups of three digits, and then divide that remainder by 10 three times to get the digits out.
It basically tests for 0-9, 0-99, or 0-999 to determine how many digits it needs when then current value is less than 1000
Personally I'd use a pair of IDIV's to handle 64-bit values.

; EAX = 32-bit number
; ESI = string buffer for NUL terminated ASCII
; Uses ESI,EDI,EAX,ECX,EDX

    push 0 ; Mark stack end with NUL

divloop:

    mov  ecx,1000 ; Divide into 3 digit groups
    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx ; eax = edx:eax / ecx, edx = edx:eax % ecx

    mov edi,eax ; Save division result

    mov ecx,10 ; Subdivide in 10's
    mov eax,edx ; Get remainder

    or edi,edi ; Still number left, so at least 3 digits in remainder
    jnz digit000

    cmp eax,10 ; remainder has one digit
    jb  digit0

    cmp eax,100 ; remainder has two digits
    jb  digit00

digit000: ; 3 digits

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx ; Stack

digit00: ; 2 digits

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx    ; Stack

digit0: ; 1 digit

    xor edx,edx ; Clear high order 32-bit for divide
    idiv ecx    ; eax = edx:eax / ecx, edx = edx:eax % ecx
    add edx,30h ; += '0'
    push edx    ; Stack

    mov eax,edi ; Recover remaining number

    or eax,eax ; Zero?
    jz poploop

    push  2Ch ; Comma added to groups of three digits
    jmp divloop

poploop:
    pop eax ; Recover next digit
    mov [esi],al ; Add to string
    inc esi
    or  eax,eax ; Was it a NUL?
    jnz poploop


1
12
128
1,000
10,000
100,000
256,256,256
1,234,567,890
0
4,294,967,295


Thanks Clive, this routine will help me a lot
to understand how to use the stack and the DIV/IDIV opcodes.  :U
Mind is like a parachute. You know what to do in order to use it :-)

clive

Actually, the last divide by 10 isn't necessary, as eax will be 0-9 at this point.

digit0:  ; 1 digit

    add eax,30h ; += '0'
    push eax    ; Stack

    mov eax,edi ; Recover remaining number
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on May 15, 2010, 01:09:47 AM
Actually, the last divide by 10 isn't necessary, as eax will be 0-9 at this point.

digit0:  ; 1 digit

    add eax,30h ; += '0'
    push eax    ; Stack

    mov eax,edi ; Recover remaining number


Very good auto-optimization  :clap:

Quote from: clive on May 14, 2010, 05:28:49 PM
It uses the stack to deal with the fact digits come out backward from the division.

This is also a good use of the stack structure  :U

Quote from: clive on May 14, 2010, 05:28:49 PM
Personally I'd use a pair of IDIV's to handle 64-bit values.

You are not talking about 64 bit MMX registers, are you?

Actually I am not sure how to perform 64 bits DIV/IDIV operations.  :red
Would you be so kind to show me an example?

Thanks

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

DIV and IDIV are slow instructions
if you are trying to make a faster routine, there are a few tricks that can be used

one is to multiply by the reciprocal of the divisor (MUL instructions are fairly fast)
this is called "multiply-to-divide" or sometimes called "magic number division"
so, instead of dividing by 10, multiply the value by n^2/10, where n^2 is some power of 2 (convenient to use a register width if practical)

another trick is to use a larger divisor to extract more than one digit at a time
dividing by 10,000 will get you 4 decimal digits at a time, for example
this can also prevent divide overflow problems encountered when dividing large values by small ones

then, there is Horner's Rule or, as i like to call it, Ling Long Kai Fang
it is a little bit involved to explain, but it is really a simple math identity that is worth knowing
some time ago, i wrote a few routines to convert large integers to ASCII decimal strings
the routines combine both Ling Long Kai Fang and Multiply-to-Divide techniques

http://www.masm32.com/board/index.php?topic=12363.msg94779#msg94779

download the LLKF9_1 attachment from that thread for some code
i am not a super-duper programmer, but the routines work fairly well   :lol
they use no SSE or FPU instructions, so there is plenty of room for improvement

clive

Dividing a 64-bit number with a 32-bit number, getting at 64-bit quotient,  and 32-bit remainder.

  ; edx:eax = edx:eax / ecx
  ; ebx     = edx:eax % ecx

    mov ebx,eax   ; save lower bits
    mov eax,edx   ; move high order bits into lower bits
    xor edx,edx   ; clear upper bits
    div ecx       ; long division, handle high order first
    xchg eax,ebx  ; swap low order with quotient
    div ecx       ; then low order using high order remainder
    xchg edx,ebx  ; switch high order quotient with remainder
It could be a random act of randomness. Those happen a lot as well.

clive

Equivalent to previous, using reciprocals.

; EAX = 32-bit number
; ESI = string buffer for NUL terminated ASCII
; Uses ESI,EDI,EAX,ECX,EDX

        push    0

divloop:

        mov     edx,04189374Ch       ; 2^40/1,000
        mov     ecx,eax              ; save the dividend
        mul     edx                  ; multiply to divide
        shr     edx,8                ; adjust the quotient
        imul    eax,edx,1000         ; find the product
        sub     ecx,eax              ; subtract the difference

        ; edx = quotient
        ; ecx = remainder

        mov     eax,ecx
        mov     edi,edx

        or      edi,edi
        jnz     digit000

        cmp     eax,10
        jb      digit0

        cmp     eax,100
        jb      digit00

digit000:

        mov     edx,0CCCCCCCDh       ; 2^35/10
        mov     ecx,eax              ; save the dividend
        mul     edx                  ; multiply to divide
        shr     edx,3                ; adjust the quotient
        imul    eax,edx,10           ; find the product
        sub     ecx,eax              ; subtract the difference

    ;edx = quotient
    ;ecx = remainder

        mov     eax,edx

        add     ecx,30h
        push    ecx

digit00:

        mov     edx,0CCCCCCCDh       ; 2^35/10
        mov     ecx,eax              ; save the dividend
        mul     edx                  ; multiply to divide
        shr     edx,3                ; adjust the quotient
        imul    eax,edx,10           ; find the product
        sub     ecx,eax              ; subtract the difference

    ;edx = quotient
    ;ecx = remainder

        mov     eax,edx

        add     ecx,30h
        push    ecx

digit0:

        add     eax,030h
        push    eax

        mov     eax,edi

        or      eax,eax
        jz      poploop

        push    2Ch

        jmp     divloop

poploop:

        pop     eax
        mov     [esi],al
        inc     esi
        or      eax,eax
        jnz     poploop
It could be a random act of randomness. Those happen a lot as well.

clive

As Dave indicates MUL instructions are much faster, they can get pipelined to produce a result in 5-10 cycles, verses 75-100 cycles of divides. This is an order of magnitude comparison, and will obviously vary between CPU architectures.

The divide by 1,000 was to get the comma spacing efficiently, we could divide by 1,000,000 and then subdivide that. There are many ways to do it, and there is a trade off between compactness, clarity, and efficiency.

Divide by 10,000 works well if you don't want commas and can handle 4 digits at a time.
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: dedndave on May 15, 2010, 12:42:13 PM
DIV and IDIV are slow instructions
if you are trying to make a faster routine, there are a few tricks that can be used

Hi Dave, I suspect there are lots of ways to optimize a slow DIV operation.
Actually I'm just trying to understand how DIV/IDIV work. After that, I'll
jump into speeding up the routine with "Ling Long Kai Fang and Multiply-to-Divide techniques"
and the like.  :P

Quote from: clive on May 15, 2010, 01:33:19 PM
Dividing a 64-bit number with a 32-bit number, getting at 64-bit quotient,  and 32-bit remainder.

  ; edx:eax = edx:eax / ecx
  ; ebx     = edx:eax % ecx

    mov ebx,eax   ; save lower bits
    mov eax,edx   ; move high order bits into lower bits
    xor edx,edx   ; clear upper bits
    div ecx       ; long division, handle high order first
    xchg eax,ebx  ; swap low order with quotient
    div ecx       ; then low order using high order remainder
    xchg edx,ebx  ; switch high order quotient with remainder


As I can see for a 64 bit DIV we are using a pair of 32 bit registers  ::)
Does it mean the 64 bit MMX registers are slower or not fitted for the task?
::)

Quote from: clive on May 15, 2010, 01:37:17 PM
Equivalent to previous, using reciprocals.

; EAX = 32-bit number
; ESI = string buffer for NUL terminated ASCII
; Uses ESI,EDI,EAX,ECX,EDX

        push    0

divloop:

        mov     edx,04189374Ch       ; 2^40/1,000
        mov     ecx,eax              ; save the dividend
        mul      edx                  ; multiply to divide
        shr       edx,8                ; adjust the quotient
        imul    eax,edx,1000         ; find the product
        sub      ecx,eax              ; subtract the difference

        ; edx = quotient
        ; ecx = remainder

        mov     eax,ecx
        mov     edi,edx

        or      edi,edi
        jnz     digit000

        cmp     eax,10
        jb      digit0

        cmp     eax,100
        jb      digit00

digit000:

        mov     edx,0CCCCCCCDh       ; 2^35/10
        mov     ecx,eax              ; save the dividend
        mul     edx                  ; multiply to divide
        shr     edx,3                ; adjust the quotient
        imul    eax,edx,10           ; find the product
        sub     ecx,eax              ; subtract the difference

    ;edx = quotient
    ;ecx = remainder

        mov     eax,edx

        add     ecx,30h
        push    ecx

digit00:

        mov     edx,0CCCCCCCDh       ; 2^35/10
        mov     ecx,eax              ; save the dividend
        mul     edx                  ; multiply to divide
        shr     edx,3                ; adjust the quotient
        imul    eax,edx,10           ; find the product
        sub     ecx,eax              ; subtract the difference

    ;edx = quotient
    ;ecx = remainder

        mov     eax,edx

        add     ecx,30h
        push    ecx

digit0:

        add     eax,030h
        push    eax

        mov     eax,edi

        or      eax,eax
        jz      poploop

        push    2Ch

        jmp     divloop

poploop:

        pop     eax
        mov     [esi],al
        inc     esi
        or      eax,eax
        jnz     poploop


All right, I have enough stuff to meditate upon.  :P

Quote
As Dave indicates MUL instructions are much faster, they can get pipelined to produce a result in 5-10 cycles, verses 75-100 cycles of divides. This is an order of magnitude comparison, and will obviously vary between CPU architectures.

The divide by 1,000 was to get the comma spacing efficiently, we could divide by 1,000,000 and then subdivide that. There are many ways to do it, and there is a trade off between compactness, clarity, and efficiency.

Divide by 10,000 works well if you don't want commas and can handle 4 digits at a time

As a matter of fact I think the routines you proposed are exactly what I need
for the time being, the optimization task can wait some days more  :U

Thanks to both of you for your help.  :bg

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

ok - it's good to know how to use DIV, too   :bg
as Clive suggested, you can "cascade" or "chain" DIV instructions together to divide larger quotients
the real limiting factor is the size of the divisor
if it is larger than machine-size (32 bits for many of us), you have to find some alternative to DIV/IDIV

Randy Hyde gave a pretty nice treatment on the subject in AoA
http://oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_9/CH09-4.html

frktons

With the little knowledge of C syntax I learned recently,
I assembled a routine, a function, to format integer numbers
with sign, comma or point or other separators, not only
british/american style, and the possibility to align the
resulting string towards right or left.

Now I'm trying to optimize it, so maybe a better algorithm
or some Assembly could help.

In the attached file there is the source code and the exe
compiled with Pelles' C.

The routine tests 50 million cycles of formatting an integer:
-1234567890
and takes about 6-7 ''.

The strange and impressive thing is that somebody, in Cprogramming
forum where I posted the function, compiled it under Linux with GCC
and got [with the same code] a performace of 50:1 compared with
mine on Win7 and Pelles'C. That was very impressive. Just using
Linux and a different compiler gives the same code an incredible boost.  :eek

I'm not able to evaluate these results, but you have far more
experience than me so you could help me to understand better
what happened, and suggest me a better algorithm or some Assembly
to improve the performance.

At least I'd be happy to get the same speed Linux-guys have on
their systems with GCC optimization.

Any idea to improve the performance?

Thanks
Mind is like a parachute. You know what to do in order to use it :-)

clive

Perhaps you should look at the assembler code emitted by GCC to understand what it is doing. (-S option as I recall, generates a .S file at least for my cygwin hosted GCC MIPS compiler)

Unless you are measuring TIME elapsed on the same processor (model, speed), it is not going to be a very effective measure. Your code takes 17 seconds on my system and I didn't even recompile it.

-Clive
It could be a random act of randomness. Those happen a lot as well.