QW to ASCII?

Ian_B · October 24, 2005, 03:50:05 PM

OK, here's a request for help. I'm posting it here 'cos this is where all the mathematically clever people hang out! :bg

I've added below P Dixon's code for converting a positive-only DWORD to the ASCII representation. This is from the old forum and I believe was unchallenged for speed. Unfortunately, my application must be able to handle potential file-sizes of up to a QWORD (2^62-1)... :eek I can't do anything about these specs except support them, and as I will need to write out filesizes and offsets in ASCII, I need a bigger version of this which will manage QWORDs if I am unlucky enough to encounter them. Unfortunately I can't get my head round where to start modifying this, so I'd appreciate any help or suggestions.

IanB

Code Select


.data

        NumText              BYTE 20 DUP (0)

.code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

DWtoASCII    proc
        
 comment    * ------------------------------------------------------------
            Convert positive DWORD to ASCII string (no sign) by Paul Dixon
            with specific references/optimisations added by IanB.

            On entry:
                EAX = value to convert
            On exit:
                EAX = address of result buffer
                NumText buffer contains the ASCII string representation,
                null-terminated.
            EBX/EBP/EDI/ESI preserved
            ------------------------------------------------------------ *

        push    ebx
        push    edi
        push    eax                 ; save absolute value of number
                                    ; call it 1234567890
        mov     ecx, 2814749768
        lea     edi, NumText        ; pointer to answer string

        mul     ecx                 ; fast div by 100000
        mov     ecx, 100000         ; prepare to multiply up the top 5 digits
        shr     edx, 16             ; shift top 5 digits into place, EDX=12345
        mov     eax, edx            ; copy to eax to multiply up to real size again
        mov     ebx, edx            ; save a copy of top 5 digits
        mul     ecx                 ; EAX=1234500000
        sub     [esp], eax          ; sub from original number leaves 0000067890

        mov     eax, ebx            ; get back high digits
        mov     ecx, 429497         ; about to do div 10000 by reciprocal multiply
        mov     ebx, 10

        or      eax, eax            ; if top 5 digit = 0 then skip that part
        jz      SkipTop5

        mul     ecx                 ; div top 5 digits by 10000
        jc      digit1              ; if digit is not zero then process it
        mul     ebx                 ; else multiply by 10 to get next digit into EDX
        jc      digit2
        mul     ebx
        jc      digit3
        mul     ebx
        jc      digit4
        mul     ebx
        jc      digit5

SkipTop5:
        pop     eax                 ; retrieve lower 5 digits
        mul     ecx                 ; div 10000
        jc      digit6
        mul     ebx                 ; multiply by 10 to get next digit in EDX
        jc      digit7
        mul     ebx
        jc      digit8
        mul     ebx
        jc      digit9
        mul     ebx
        jmp     digit10

digit1:
        add     edx, 30h            ; top digit is left in EDX, convert to ascii
        mov     [edi], dl           ; store top digit
        mul     ebx                 ; multiply by 10 to get next digit into EDX
        add     edi, 1

digit2:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit3:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit4:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit5:
        add     edx, 30h
        mov     [edi], dl

        pop     eax                 ; retrieve lower 5 digits
        mul     ecx                 ; div 10000
        add     edi, 1

digit6:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit7:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit8:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit9:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit10:
        add     edx, 30h
        lea     eax, NumText        ; return buffer pointer
        mov     [edi], dx           ; last digit, store DX not DL to give a 
                                    ; zero termination for the result string
        pop     edi
        pop     ebx
        ret

DWtoASCII    endp

Codewarp · October 24, 2005, 06:54:09 PM

Why not just call one of these, in the cruntime --

char *_i64toa( __int64 value, char *string, int radix );

char * _ui64toa( unsigned _int64 value, char *string, int radix );

Ian_B · October 25, 2005, 04:34:48 PM

Er, is that a macro, C, Pascal or what? It doesn't look like any ASM I recognise... :eek

PBrennick · October 26, 2005, 01:01:26 AM

Perhaps he thinks you are calling your converter from a c program? That would be the only reason to include the c runtime module, unless I am getting confused, also. What's up, Codewarp?

Paul

GregL · October 26, 2005, 03:12:15 AM

Here's an example:

Code Select


.586
.model flat, stdcall
option casemap:none

include c:\masm32\include\windows.inc

include c:\masm32\include\kernel32.inc
include c:\masm32\include\user32.inc
include c:\masm32\include\masm32.inc
include c:\masm32\include\msvcrt.inc

include c:\masm32\macros\macros.asm

includelib c:\masm32\lib\kernel32.lib
includelib c:\masm32\lib\user32.lib
includelib c:\masm32\lib\masm32.lib
includelib c:\masm32\lib\msvcrt.lib

main   PROTO

.CONST

    qwTest     QWORD 0DEADBEEFDEADBEEFh

.DATA?

    szBuffer   BYTE  32 dup(?)

.CODE

start:

    call   main
    invoke ExitProcess, 0
    
;==============================================    
main PROC    

    ; unsigned    
    print uqword$(qwTest)
    print chr$(13,10)
    
    ; signed    
    print sqword$(qwTest)
    print chr$(13,10)
    
    ; hex
    print xqword$(qwTest)
    print chr$(13,10)
    
    ret
    
main ENDP    
;==============================================

end start

Ian_B · October 26, 2005, 04:29:58 PM

Thanks, Greg, but I can't tell which module those functions/macros/whatever are in so I can't find the documentation for them, and I'm not a big macro fan. I like to be able to see and play with and therefore try to understand the code I'm calling. :'(

This ain't pretty or short, but by doing repeated subtractions at least half the time it should be faster than doing any code that relies on MUL to produce each digit. (How many SUB/ADD/MOV/JMP can you do instead of a 16-clock MUL?) By finding the most significant digit first it also avoids unnecessary processing. The problem is the upper DWORD, any operation you do to that affects the remainder in the lower one, so I can't see an easier way of approaching this than by using the dreaded SBB there. That was the thrust of my original query, to see if there was another mathematical way to approach this.

However, given how unlikely it is in my particular application that a file of that size would be encountered (a single file greater than 4Gb on Usenet? Only if someone has posted an entire DVD image!) I'm not desperately concerned about the speed of processing that. Actually doing the decoding/verification of such a monster is going to be the time-critical problem, not a few SBB ops.

IanB

EDIT: This code does not work properly, don't use it!

[attachment deleted by admin]

raymond · October 27, 2005, 02:45:48 AM

In case you may be interested, attached is my version for converting a signed qword to ASCII. It uses the FPU and is written in MASM syntax.

(It cannot be altered to take an unsigned qword because of the use of the FPU).

Raymond

[attachment deleted by admin]

GregL · October 27, 2005, 04:49:45 AM

Ian_B,

Yeah, those are macros, they use the sprintf function from the C Run-Time Library. They are in macros.asm. If you're looking for maximum speed, they're not the way to go. If you are not too concerned about speed, they work very well.

Ian_B · October 27, 2005, 09:32:08 AM

OK, here's a modified version of the previous attempt which ought to be an improvement, even though it's even longer... :eek Compact code this is not. :(

There are two critical fixes. One is that for some of the higher DWORD digits it's possible to use a much faster CDQ rather than SBB to make the carry bit become a -1 in EDX that you can add directly. This has meant reversing the master/work registers as EAX/EDX are needed for the CDQ. Unfortunately not all the digits can be done this way because some of the subtractors are too large to leave a top bit set in the result, which is what CDQ is copying, not the carry bit itself. But even with the following unavoidable register stalls, CDQ/ADD EDX/SUB EDX should be much faster than the SBB.

The most substantial improvement is in the more general lower DWORD routine. To double the chances of getting out of the loop early, I've used a spare register to remove 9x(10^digit) first and work upwards at the same time. So one half of the loop is subtracting, testing output digits from 0 up to 4, and the other adds to the lower bound result, testing digits from 9 down to 5. The register references are widely enough spaced that these fast instructions may even get quad-pumped on P4, although I don't know how the jumps will affect that, so hopefully all this extra code will actually execute faster overall, as well as quitting the loops sooner on average.

IanB

EDIT: bad code removed

Ian_B · October 27, 2005, 06:30:48 PM

Many apologies to the two people who bothered to download that. Here's the massively bug-fixed and cleaned-up version with slightly more useful exit parameters. My bad... :red

IanB

EDIT: bad code removed

Ian_B · October 29, 2005, 03:19:46 AM

I've done some timetesting and checking of outputs of this code and the available comparisons. Sadly, my optimisations using CDQ to help with the upper DWORD can't work at all, because I was assuming (wrongly) that for all the subtractions the sign would always match the carry bit. so it's back to SBB there unless anyone can work out another way. For the lower DWORD, the opposite was true, and I found that testing for the carry was not working and I should have been testing for the sign instead. :red

So here is the final, guaranteed absolutely working code. :green2

I've also attached a time-testing app that compares the outputs of three routines and confirms they are correct, as well as comparing the number of clocks they take. If someone can find another QWORD routine (I couldn't find the macro Greg mentioned in my copy of MASM) I'll add it. But the result on my P4 shows that apart from the worst-case scenario for the 9-digit DWORD, which is marginally slower than PDixon's code because it requires the maximum number of loops for every digit, this routine is significantly faster in all cases, and especially for smaller values. The wsprintf function (used in the MASM library version) is in all cases at least 10 times slower than either PDixon's code or mine. :dance:

If you don't mind the extra length of the code, this is a significant improvement on the available DW to ASCII conversions and therefore the likely QW to ASCII ones too.

[attachment deleted by admin]

Ian_B · October 29, 2005, 10:03:39 AM

Out of interest, I made a slightly modified version of my code that converted signed QWORDs instead to see how it compared against Ray's version using FPU code.

Ray, I don't think you're going to like these timings. Your code is actually slower than the "general" wsprintf function for converting DWORDs most of the time, and between 5 and 20 times slower than mine in all cases. It's nice to have another winner. :cheekygreen:

It does bear out my hunch that doing small, fast integer operations well-organised can outperform "clever" code using floating point etc. After all, my code is really doing division by the "obvious" method of repeated subtraction, just like any schoolkid learns. I don't know where the bottleneck is in Ray's code, whether it's in the FPU setup or the conversion, but it's substantial enough to seriously drag the performance back. Unless there's an issue with using the FPU repeatedly without some sort of resetting as this code does, which I'm not aware of.

IanB

[attachment deleted by admin]

MichaelW · October 29, 2005, 10:27:48 PM

I added a test of the MSVCRT _i64toa function to SQW2Atimetest.asm. Here are the results:

Code Select


QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 790
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 520
MSVCRT _i64toa   :   -9199999999444444444 ; clocks: 3599
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 493
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 519
MSVCRT _i64toa   :   -9087654321987654321 ; clocks: 3599
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 145
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 518
MSVCRT _i64toa   :   -1000000000999999999 ; clocks: 3602
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 583
qwtoa - Raymond F:   -999999444444444 ; clocks: 493
MSVCRT _i64toa   :   -999999444444444 ; clocks: 2873
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 412
qwtoa - Raymond F:   -555555123456789 ; clocks: 493
MSVCRT _i64toa   :   -555555123456789 ; clocks: 2874
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 112
qwtoa - Raymond F:   -100000999999999 ; clocks: 494
MSVCRT _i64toa   :   -100000999999999 ; clocks: 2890

At least when running on my P3, and when converting a QWORD, Raymond's code is reasonably competitive with yours.

[attachment deleted by admin]

hutch-- · October 30, 2005, 02:24:43 AM

Somewhere along the line I think there is a simple conversion from signed to unsigned and back but it would involve 64 bit capcity to add and subtract.

Ian_B · October 30, 2005, 12:50:55 PM

On my 2+ year-old P4:

Code Select


DWORD to ASCII tests:

Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1664
QWtoASCII - IanB:    3444444444 ; clocks: 181
DWtoASCII - PDixon:  3444444444 ; clocks: 147

Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1664
QWtoASCII - IanB:    2345678901 ; clocks: 117
DWtoASCII - PDixon:  2345678901 ; clocks: 157

Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1671
QWtoASCII - IanB:    1999999999 ; clocks: 54
DWtoASCII - PDixon:  1999999999 ; clocks: 145

Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 901
QWtoASCII - IanB:    44444 ; clocks: 88
DWtoASCII - PDixon:  44444 ; clocks: 127

Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 898
QWtoASCII - IanB:    12345 ; clocks: 67
DWtoASCII - PDixon:  12345 ; clocks: 127

Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 898
QWtoASCII - IanB:    99999 ; clocks: 26
DWtoASCII - PDixon:  99999 ; clocks: 134

Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 317
QWtoASCII - IanB:    1 ; clocks: 13
DWtoASCII - PDixon:  1 ; clocks: 138

and for the signed version:

Code Select


QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 574
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 2313
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 378
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 2305
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 113
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 2302
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 442
qwtoa - Raymond F:   -999999444444444 ; clocks: 2210
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 289
qwtoa - Raymond F:   -555555123456789 ; clocks: 2199
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 85
qwtoa - Raymond F:   -100000999999999 ; clocks: 2210

Those are startling differences between P3 and P4. The clocks aren't that different for my code, but the FPU code is running around 5 times slower. :eek

News:

QW to ASCII?

Ian_B

Codewarp

Ian_B

PBrennick

GregL

Ian_B

raymond

GregL

Ian_B

Ian_B

Ian_B

Ian_B

MichaelW

hutch--

Ian_B