The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Ian_B on October 24, 2005, 03:50:05 PM

Title: QW to ASCII?
Post by: Ian_B on October 24, 2005, 03:50:05 PM
OK, here's a request for help. I'm posting it here 'cos this is where all the mathematically clever people hang out!  :bg

I've added below P Dixon's code for converting a positive-only DWORD to the ASCII representation. This is from the old forum and I believe was unchallenged for speed. Unfortunately, my application must be able to handle potential file-sizes of up to a QWORD (2^62-1)...  :eek  I can't do anything about these specs except support them, and as I will need to write out filesizes and offsets in ASCII, I need a bigger version of this which will manage QWORDs if I am unlucky enough to encounter them. Unfortunately I can't get my head round where to start modifying this, so I'd appreciate any help or suggestions.

IanB


.data

        NumText              BYTE 20 DUP (0)

.code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

DWtoASCII    proc
       
comment    * ------------------------------------------------------------
            Convert positive DWORD to ASCII string (no sign) by Paul Dixon
            with specific references/optimisations added by IanB.

            On entry:
                EAX = value to convert
            On exit:
                EAX = address of result buffer
                NumText buffer contains the ASCII string representation,
                null-terminated.
            EBX/EBP/EDI/ESI preserved
            ------------------------------------------------------------ *

        push    ebx
        push    edi
        push    eax                 ; save absolute value of number
                                    ; call it 1234567890
        mov     ecx, 2814749768
        lea     edi, NumText        ; pointer to answer string

        mul     ecx                 ; fast div by 100000
        mov     ecx, 100000         ; prepare to multiply up the top 5 digits
        shr     edx, 16             ; shift top 5 digits into place, EDX=12345
        mov     eax, edx            ; copy to eax to multiply up to real size again
        mov     ebx, edx            ; save a copy of top 5 digits
        mul     ecx                 ; EAX=1234500000
        sub     [esp], eax          ; sub from original number leaves 0000067890

        mov     eax, ebx            ; get back high digits
        mov     ecx, 429497         ; about to do div 10000 by reciprocal multiply
        mov     ebx, 10

        or      eax, eax            ; if top 5 digit = 0 then skip that part
        jz      SkipTop5

        mul     ecx                 ; div top 5 digits by 10000
        jc      digit1              ; if digit is not zero then process it
        mul     ebx                 ; else multiply by 10 to get next digit into EDX
        jc      digit2
        mul     ebx
        jc      digit3
        mul     ebx
        jc      digit4
        mul     ebx
        jc      digit5

SkipTop5:
        pop     eax                 ; retrieve lower 5 digits
        mul     ecx                 ; div 10000
        jc      digit6
        mul     ebx                 ; multiply by 10 to get next digit in EDX
        jc      digit7
        mul     ebx
        jc      digit8
        mul     ebx
        jc      digit9
        mul     ebx
        jmp     digit10

digit1:
        add     edx, 30h            ; top digit is left in EDX, convert to ascii
        mov     [edi], dl           ; store top digit
        mul     ebx                 ; multiply by 10 to get next digit into EDX
        add     edi, 1

digit2:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit3:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit4:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit5:
        add     edx, 30h
        mov     [edi], dl

        pop     eax                 ; retrieve lower 5 digits
        mul     ecx                 ; div 10000
        add     edi, 1

digit6:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit7:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit8:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit9:
        add     edx, 30h
        mov     [edi], dl
        mul     ebx
        add     edi, 1

digit10:
        add     edx, 30h
        lea     eax, NumText        ; return buffer pointer
        mov     [edi], dx           ; last digit, store DX not DL to give a
                                    ; zero termination for the result string
        pop     edi
        pop     ebx
        ret

DWtoASCII    endp

Title: Re: QW to ASCII?
Post by: Codewarp on October 24, 2005, 06:54:09 PM
Why not just call one of these, in the cruntime --

char *_i64toa( __int64 value, char *string, int radix );

char * _ui64toa( unsigned _int64 value, char *string, int radix );

Title: Re: QW to ASCII?
Post by: Ian_B on October 25, 2005, 04:34:48 PM
Er, is that a macro, C, Pascal or what? It doesn't look like any ASM I recognise...  :eek
Title: Re: QW to ASCII?
Post by: PBrennick on October 26, 2005, 01:01:26 AM
Perhaps he thinks you are calling your converter from a c program?  That would be the only reason to include the c runtime module, unless I am getting confused, also.  What's up, Codewarp?

Paul

Title: Re: QW to ASCII?
Post by: GregL on October 26, 2005, 03:12:15 AM
Here's an example:

.586
.model flat, stdcall
option casemap:none

include c:\masm32\include\windows.inc

include c:\masm32\include\kernel32.inc
include c:\masm32\include\user32.inc
include c:\masm32\include\masm32.inc
include c:\masm32\include\msvcrt.inc

include c:\masm32\macros\macros.asm

includelib c:\masm32\lib\kernel32.lib
includelib c:\masm32\lib\user32.lib
includelib c:\masm32\lib\masm32.lib
includelib c:\masm32\lib\msvcrt.lib

main   PROTO

.CONST

    qwTest     QWORD 0DEADBEEFDEADBEEFh

.DATA?

    szBuffer   BYTE  32 dup(?)

.CODE

start:

    call   main
    invoke ExitProcess, 0
   
;==============================================   
main PROC   

    ; unsigned   
    print uqword$(qwTest)
    print chr$(13,10)
   
    ; signed   
    print sqword$(qwTest)
    print chr$(13,10)
   
    ; hex
    print xqword$(qwTest)
    print chr$(13,10)
   
    ret
   
main ENDP   
;==============================================

end start

   
Title: Re: QW to ASCII?
Post by: Ian_B on October 26, 2005, 04:29:58 PM
Thanks, Greg, but I can't tell which module those functions/macros/whatever are in so I can't find the documentation for them, and I'm not a big macro fan. I like to be able to see and play with and therefore try to understand the code I'm calling.  :'(

This ain't pretty or short, but by doing repeated subtractions at least half the time it should be faster than doing any code that relies on MUL to produce each digit. (How many SUB/ADD/MOV/JMP can you do instead of a 16-clock MUL?) By finding the most significant digit first it also avoids unnecessary processing. The problem is the upper DWORD, any operation you do to that affects the remainder in the lower one, so I can't see an easier way of approaching this than by using the dreaded SBB there. That was the thrust of my original query, to see if there was another mathematical way to approach this.

However, given how unlikely it is in my particular application that a file of that size would be encountered (a single file greater than 4Gb on Usenet? Only if someone has posted an entire DVD image!) I'm not desperately concerned about the speed of processing that. Actually doing the decoding/verification of such a monster is going to be the time-critical problem, not a few SBB ops.

IanB

EDIT: This code does not work properly, don't use it!

[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: raymond on October 27, 2005, 02:45:48 AM
In case you may be interested, attached is my version for converting a signed qword to ASCII. It uses the FPU and is written in MASM syntax.

(It cannot be altered to take an unsigned qword because of the use of the FPU).

Raymond


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: GregL on October 27, 2005, 04:49:45 AM
Ian_B,

Yeah, those are macros, they use the sprintf function from the C Run-Time Library. They are in macros.asm. If you're looking for maximum speed, they're not the way to go. If you are not too concerned about speed, they work very well.

Title: Re: QW to ASCII?
Post by: Ian_B on October 27, 2005, 09:32:08 AM
OK, here's a modified version of the previous attempt which ought to be an improvement, even though it's even longer...  :eek  Compact code this is not.  :(

There are two critical fixes. One is that for some of the higher DWORD digits it's possible to use a much faster CDQ rather than SBB to make the carry bit become a -1 in EDX that you can add directly. This has meant reversing the master/work registers as EAX/EDX are needed for the CDQ. Unfortunately not all the digits can be done this way because some of the subtractors are too large to leave a top bit set in the result, which is what CDQ is copying, not the carry bit itself. But even with the following unavoidable register stalls, CDQ/ADD EDX/SUB EDX should be much faster than the SBB.

The most substantial improvement is in the more general lower DWORD routine. To double the chances of getting out of the loop early, I've used a spare register to remove 9x(10^digit) first and work upwards at the same time. So one half of the loop is subtracting, testing output digits from 0 up to 4, and the other adds to the lower bound result, testing digits from 9 down to 5. The register references are widely enough spaced that these fast instructions may even get quad-pumped on P4, although I don't know how the jumps will affect that, so hopefully all this extra code will actually execute faster overall, as well as quitting the loops sooner on average.

IanB

EDIT: bad code removed
Title: Re: QW to ASCII?
Post by: Ian_B on October 27, 2005, 06:30:48 PM
Many apologies to the two people who bothered to download that. Here's the massively bug-fixed and cleaned-up version with slightly more useful exit parameters. My bad...  :red

IanB

EDIT: bad code removed
Title: Re: QW to ASCII?
Post by: Ian_B on October 29, 2005, 03:19:46 AM
I've done some timetesting and checking of outputs of this code and the available comparisons. Sadly, my optimisations using CDQ to help with the upper DWORD can't work at all, because I was assuming (wrongly) that for all the subtractions the sign would always match the carry bit. so it's back to SBB there unless anyone can work out another way. For the lower DWORD, the opposite was true, and I found that testing for the carry was not working and I should have been testing for the sign instead.  :red

So here is the final, guaranteed absolutely working code.  :green2

I've also attached a time-testing app that compares the outputs of three routines and confirms they are correct, as well as comparing the number of clocks they take. If someone can find another QWORD routine (I couldn't find the macro Greg mentioned in my copy of MASM) I'll add it. But the result on my P4 shows that apart from the worst-case scenario for the 9-digit DWORD, which is marginally slower than PDixon's code because it requires the maximum number of loops for every digit, this routine is significantly faster in all cases, and especially for smaller values. The wsprintf function (used in the MASM library version) is in all cases at least 10 times slower than either PDixon's code or mine.  :dance:

If you don't mind the extra length of the code, this is a significant improvement on the available DW to ASCII conversions and therefore the likely QW to ASCII ones too.


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: Ian_B on October 29, 2005, 10:03:39 AM
Out of interest, I made a slightly modified version of my code that converted signed QWORDs instead to see how it compared against Ray's version using FPU code.

Ray, I don't think you're going to like these timings. Your code is actually slower than the "general" wsprintf function for converting DWORDs most of the time, and between 5 and 20 times slower than mine in all cases. It's nice to have another winner.  :cheekygreen:

It does bear out my hunch that doing small, fast integer operations well-organised can outperform "clever" code using floating point etc. After all, my code is really doing division by the "obvious" method of repeated subtraction, just like any schoolkid learns. I don't know where the bottleneck is in Ray's code, whether it's in the FPU setup or the conversion, but it's substantial enough to seriously drag the performance back. Unless there's an issue with using the FPU repeatedly without some sort of resetting as this code does, which I'm not aware of.

IanB


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: MichaelW on October 29, 2005, 10:27:48 PM
I added a test of the MSVCRT _i64toa function to SQW2Atimetest.asm. Here are the results:

QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 790
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 520
MSVCRT _i64toa   :   -9199999999444444444 ; clocks: 3599
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 493
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 519
MSVCRT _i64toa   :   -9087654321987654321 ; clocks: 3599
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 145
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 518
MSVCRT _i64toa   :   -1000000000999999999 ; clocks: 3602
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 583
qwtoa - Raymond F:   -999999444444444 ; clocks: 493
MSVCRT _i64toa   :   -999999444444444 ; clocks: 2873
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 412
qwtoa - Raymond F:   -555555123456789 ; clocks: 493
MSVCRT _i64toa   :   -555555123456789 ; clocks: 2874
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 112
qwtoa - Raymond F:   -100000999999999 ; clocks: 494
MSVCRT _i64toa   :   -100000999999999 ; clocks: 2890


At least when running on my P3, and when converting a QWORD, Raymond's code is reasonably competitive with yours.



[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: hutch-- on October 30, 2005, 02:24:43 AM
Somewhere along the line I think there is a simple conversion from signed to unsigned and back but it would involve 64 bit capcity to add and subtract.
Title: Re: QW to ASCII?
Post by: Ian_B on October 30, 2005, 12:50:55 PM
On my 2+ year-old P4:


DWORD to ASCII tests:

Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1664
QWtoASCII - IanB:    3444444444 ; clocks: 181
DWtoASCII - PDixon:  3444444444 ; clocks: 147

Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1664
QWtoASCII - IanB:    2345678901 ; clocks: 117
DWtoASCII - PDixon:  2345678901 ; clocks: 157

Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1671
QWtoASCII - IanB:    1999999999 ; clocks: 54
DWtoASCII - PDixon:  1999999999 ; clocks: 145

Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 901
QWtoASCII - IanB:    44444 ; clocks: 88
DWtoASCII - PDixon:  44444 ; clocks: 127

Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 898
QWtoASCII - IanB:    12345 ; clocks: 67
DWtoASCII - PDixon:  12345 ; clocks: 127

Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 898
QWtoASCII - IanB:    99999 ; clocks: 26
DWtoASCII - PDixon:  99999 ; clocks: 134

Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 317
QWtoASCII - IanB:    1 ; clocks: 13
DWtoASCII - PDixon:  1 ; clocks: 138


and for the signed version:


QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 574
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 2313
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 378
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 2305
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 113
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 2302
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 442
qwtoa - Raymond F:   -999999444444444 ; clocks: 2210
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 289
qwtoa - Raymond F:   -555555123456789 ; clocks: 2199
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 85
qwtoa - Raymond F:   -100000999999999 ; clocks: 2210


Those are startling differences between P3 and P4. The clocks aren't that different for my code, but the FPU code is running around 5 times slower.  :eek
Title: Re: QW to ASCII?
Post by: Ian_B on October 31, 2005, 07:36:09 PM
By adding a little more code  :eek I've managed to improve the speed of my routine by a third (times reduce by 25%) for the worst case. This is simply done by checking if we are subtracting more than 5*10^digit for the high DWORDs and removing that first, so on average those most time-consuming loops will exit twice as fast with just a little extra overhead. That should remove the difference between Ray's code and mine except for the very worst case longest QWORDs on your P3. Whether the codelength is worth the speed increase is of course a tradeoff people will want to consider for themselves.

IanB


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: Ian_B on November 01, 2005, 06:54:34 PM
I finally managed to get a CDQ version going. The trick was to realise that when SUBbing from the low DWORD, the bottom byte wasn't affected by the values being subtracted, so I could shift that out and use the top byte as a place to hold the sign of the subtraced result for the CDQ to work, you just need to keep clearing it between operations. Then before working on the low DWORD only section, you shift the value back and replace the low byte.

Unfortunately I was underwhelmed.  :'(  I didn't manage to improve on the speed of the SBB version at all, and in some cases it got slower. And the code got even longer, needless to say.  :lol  The few extra instructions needed obviously soak up any latency advantage CDQ has over SBB. For completeness' sake I've attached the time comparison if anyone's interested, but I am disappointed that I couldn't improve on the SBB this way, for P4 anyway.

As the processing code is exactly the same for both the signed/unsigned versions of my routine, I'll post a combined version with two entry points shortly.


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: MichaelW on November 02, 2005, 09:19:31 AM
QW2Atimetest running on my P3:

QWORD to ASCII tests:

QWTest 1 - value expected 18399999999444444444
QWtoASCII - IanB:    18399999999444444444 ; clocks: 514
QWTest 2 - value expected 12345678901234567890
QWtoASCII - IanB:    12345678901234567890 ; clocks: 352
QWTest 3 - value expected 10000000000999999999
QWtoASCII - IanB:    10000000000999999999 ; clocks: 166
QWTest 4 - value expected 999999444444444
QWtoASCII - IanB:    999999444444444 ; clocks: 395
QWTest 5 - value expected 555555123456789
QWtoASCII - IanB:    555555123456789 ; clocks: 205
QWTest 6 - value expected 100000999999999
QWtoASCII - IanB:    100000999999999 ; clocks: 117


DWORD to ASCII tests:

Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1022
QWtoASCII - IanB:    3444444444 ; clocks: 209
DWtoASCII - PDixon:  3444444444 ; clocks: 62

Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1021
QWtoASCII - IanB:    2345678901 ; clocks: 144
DWtoASCII - PDixon:  2345678901 ; clocks: 62

Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1022
QWtoASCII - IanB:    1999999999 ; clocks: 74
DWtoASCII - PDixon:  1999999999 ; clocks: 63

Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 581
QWtoASCII - IanB:    44444 ; clocks: 114
DWtoASCII - PDixon:  44444 ; clocks: 48

Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 581
QWtoASCII - IanB:    12345 ; clocks: 89
DWtoASCII - PDixon:  12345 ; clocks: 48

Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 581
QWtoASCII - IanB:    99999 ; clocks: 47
DWtoASCII - PDixon:  99999 ; clocks: 48

Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 242
QWtoASCII - IanB:    1 ; clocks: 23
DWtoASCII - PDixon:  1 ; clocks: 48


SQW2Atimetest running on my P3:

QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 501
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 519
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 338
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 519
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 170
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 519
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 393
qwtoa - Raymond F:   -999999444444444 ; clocks: 492
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 207
qwtoa - Raymond F:   -555555123456789 ; clocks: 492
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 126
qwtoa - Raymond F:   -100000999999999 ; clocks: 492


DWORD to ASCII tests:

Test 1 - value expected -1444444444
wsprintf conversion: -1444444444 ; clocks: 1038
SQWtoASCII - IanB:   -1444444444 ; clocks: 199
qwtoa - Raymond F:   -1444444444 ; clocks: 441

Test 2 - value expected -1234567890
wsprintf conversion: -1234567890 ; clocks: 1024
SQWtoASCII - IanB:   -1234567890 ; clocks: 149
qwtoa - Raymond F:   -1234567890 ; clocks: 441

Test 3 - value expected -1999999999
wsprintf conversion: -1999999999 ; clocks: 1025
SQWtoASCII - IanB:   -1999999999 ; clocks: 78
qwtoa - Raymond F:   -1999999999 ; clocks: 441

Test 4 - value expected -44444
wsprintf conversion: -44444 ; clocks: 591
SQWtoASCII - IanB:   -44444 ; clocks: 110
qwtoa - Raymond F:   -44444 ; clocks: 424

Test 5 - value expected -12345
wsprintf conversion: -12345 ; clocks: 590
SQWtoASCII - IanB:   -12345 ; clocks: 87
qwtoa - Raymond F:   -12345 ; clocks: 424

Test 6 - value expected -99999
wsprintf conversion: -99999 ; clocks: 591
SQWtoASCII - IanB:   -99999 ; clocks: 51
qwtoa - Raymond F:   -99999 ; clocks: 425

Test 7 - value expected -1
wsprintf conversion: -1 ; clocks: 251
SQWtoASCII - IanB:   -1 ; clocks: 27
qwtoa - Raymond F:   -1 ; clocks: 395


Just as you anticipated :U
Title: Re: QW to ASCII?
Post by: Ian_B on November 02, 2005, 07:17:08 PM
Thanks for those timings, MichaelW. It again shows a clear and fascinating discrepancy between P3 and P4, in that the code from PDixon is consistently and significantly faster than mine on P3. The severe increase in latency of the MUL operation from 4 clocks to 14 clocks (according to Agner's info) is the most likely suspect.

This is clearly a routine that is well-optimised for P4 only, and PDixon's code for converting DWORDs is much better suited for anyone knowingly targetting P3. However, as a general case that can manage QWORDs if necessary, and where the possibility/likelihood is that the end user will have a P4, this does seem to be the winning approach. Anyone sufficiently concerned about processor-specific code will no doubt have already got some sort of switch-test set up that could take advantage of this.

It's just a shame that the code cannot easily be contracted into master loops because of the changing "magic numbers". There is one more register to play with...  ::)
Title: Re: QW to ASCII?
Post by: Ian_B on November 03, 2005, 01:21:24 PM
Here's a final version of the routine, with entry points for signed and unsigned QWORDs and signed DWORDs. This is a "package" which uses the same core code for all the conversions. It can be improved slightly if your code module uses OPTION NOSCOPED, as this will allow a jump to a label inside a procedure from another procedure (normally internal labels are not visible from outside the procedure) which means the work routine can be neatly merged into the QWtoASCII one with just a jump label to define it. However this is not suitable for most people's code as OPTION SCOPED is somewhat safer while allowing limited label reuse.

There are some more optimisations to this version, I managed to get another 10% or so improvement in speed generally on my P4. The change to the ADD/ADC at the very top of the SQWtoASCII code made the most difference for me, almost halving the clocks taken to produce the single-digit output.

This hasn't been a Laboratory project that's drawn much attention, which in a way is surprising as a QWORD/DWORD to ASCII routine has to be the single conversion most people are likely to need, where they have any code which outputs the values of numbers on-screen to the user or to file, whether it's file sizes (my concern), loop counts, counts of errors found, line-numbers, or any manner of numerical output. That's why I've felt it was worth pursuing a fastest possible version, however unwieldy, as in profiling terms this is likely to be an important interface component for many.

IanB


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: dioxin on November 05, 2005, 02:32:20 AM
Ian,
<<Here's a final version of the routine>>

How many times do we hear that only for it to be followed a day or two later with an update!

<<This hasn't been a Laboratory project that's drawn much attention>>

Perhaps it's too similar to the DWORD version which was well thrashed out previously.

I haven't had time to look into it in detail so I can't comment on the code in this thread but I do recall making significant improvements to various routines on my AMD system only to find they slowed things down on a Pentium so lately I've been put off optimising code to squeeze the last clock cycle out of it. Instead, I just stop when I get a good implementation of an efficient algorithm. If someone else then wants to tweak the code for a specific CPU then they can get on with it.

I have had a go at this one and tried a few methods. My current best is doing the basic 20 digit conversion in around 150clks on an AthlonXP using a mix of FPU and ALU (no MMX or SSE). It doesn't do the sign bit yet or try to short circuit the short numbers and it still needs leading zero removal but I hit the same problem just mentioned. I put in a bit of time to make it a little quicker and saved 7% only to find that when I got access to a PIV system that the original was 20% quicker than the newer "faster" version.

For what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).

I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.

It's not in MASM format so I haven't posted it.

Paul.
Title: Re: QW to ASCII?
Post by: Ian_B on November 06, 2005, 04:13:13 PM
Quote from: dioxin on November 05, 2005, 02:32:20 AM
Ian,
<<Here's a final version of the routine>>

How many times do we hear that only for it to be followed a day or two later with an update!

Forgive me, I was under the impresson that...
QuoteThis is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful.

Clearly I was wrong.  :'(  I'd love to be able to write perfectly optimised and perfectly working code first time every time, but I'm still learning. Otherwise I wouldn't have to come here and ask questions.  :eek

I'm well aware how mundane my code is. That's why I asked if there was a better way, but in the process I found that mundane appears to be in some cases significantly better than "clever". That's worth discovering and sharing. And the optimisation "trick" I've been posting about recently, ie. the use of CDQ or making a register sign-extend to avoid jumping, isn't always obvious and takes some tinkering and lateral thought to use effectively, and even then it seems to be far less effective than I'd hoped. That's also worth knowing, as it reminds me (and perhaps others) that using Agner's PentOpf guide just to compare latency info of different instructions in order to replace one with an apparently better one isn't always a reliable guide to the end timing result.

I only just came back here and found MichaelW's new timing macros - you might recall that I was previously having trouble timing effectively from the old forum - so this sort of proper comparison timing is still a new toy for me and is throwing up some surprising results, for me anyway. Please forgive my innocent enthusiasm, but there might just be some other people new to ASM coding here interested in those results; I don't recall this forum being restricted only to "experienced" programmers such as you for whom this is obviously all blasé.

QuoteFor what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).

I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.

If I'm right, it was your code that I posted at the top of this thread and which inspired me to ask for help with a QWORD version of this routine. It sounds like this should be well worth testing as a comparison to the other routines here and I for one would be very interested to see it. There's always going to be a trade-off finding a routine that works acceptably fast on all possible target processors, and this code happens to have shown (to my surprise, thanks to the timings posted by MichaelW) enormous discrepancies between P3 and P4 of up to a factor of 20, in the FPU functions anyway. The price for my rather P4-specific code is unfortunately excessive loop unrolling.

QuoteIt's not in MASM format so I haven't posted it.

Please do. I'm sure someone here will be able to help convert it, even if you are unable to spare the time.
Title: Re: QW to ASCII?
Post by: dioxin on November 08, 2005, 12:31:32 AM
Ian_B,
   I sense that you've taken my remarks more as a rebuke and not as the intended neutral comment. Ah well, like my code I can't write perfect posts everytime either!

   I had another look at this. The FPU/ALU code I had got stuck. I couldn't see how to improve the speed as the FPU was just too slow when converting to integers, the set-up and restoring was taking too long, there was no quick way to handle signed numbers and there was no clear way to take a short cut for smaller values so I abandoned that approach.

   All it did was use the FPU to break the 64 bit (20 digit) number into blocks of 4 digits in size and, while the FPU got on with the second block, the ALU would break up the first 4 digit block into 4 bytes of ascii.
   I couldn't see the time being reduced below about 150clks (pushing 300 clks on a P-IV) which was too slow so I gave up.


   I had a think about alternatives and I've just about got one sorted. Although it's faster on an AthlonXP, it relies heavily on the MUL instruction which I think you said has a latency of 14 on the P-IV(?) so it's not  so good on a P-IV.

   It's not finished yet, but I'll post a couple of versions when it is.
   Currently the DWORD version is running inline in under 20 clks for values under 10000 and around 40-45clks for larger values. This translates to roughly 40 and 100 clks on a P-IV.
   The routine is limited by the MUL. There are 7 MULS with a latency of 5 on an Athlon that give 35clks so 40 is approaching optimal for this method. If P-IV MUL has a latency of 14 then that would give 7x14=98 so it's also doing as well as could be expected on an P-IV.

   
   Although not finished, the unsigned QWORD version appears to be doing the whole 20 digits in around 120 clks but I can see room for improvement. It takes 20 MULS so I'm expecting to get the time down to near 100 Athlon clks (about 280 P-IV clks) and, eventually, with shorter times for small values.

   
   The method for both QWORD and DWORD versions is:
   Break the number up into chunks by reciprocal multiplies. It turns out that 6 digit chunks are good.
   Each chunk is then broken up by the same method into 2 digit chunks.
   The next bit is why it's faster than the previous version..
   The 2 digit chunk is used as an index into a 2char x 100 entry table to look up the 2 ascii characters.

   The lookup takes place while waiting for the MUL so it takes no aditional time and it halves the number of MULs needed.

Paul.
Title: Re: QW to ASCII?
Post by: dioxin on November 08, 2005, 05:41:25 PM
Here's the latest version of an unsigned DWORD to ASCII conversion. The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.

This one does values <10000 in 12-17 Athlon clks and higher values in 32 to 36 clks. It's been tested with all possible input values.

The code assumes the value to convert is in EAX on entry and that pans contains the pointer to the answer. This must be an area of 12 bytes into which the result will be written and doesn't need to be initialised.

Code size is around 350 bytes.

Sorry, it's not in MASM format.. but I ever did get around to installing MASM in a usable way!

It's actually in PowerBASIC format which is almost the same but &Hxxx means hex instead of xxxh

Make sure the lookup table is aligned for best speed.

For signed numbers, include these lines where indicated:

!or eax,eax
!jns skp

!neg eax
!mov [esi],"-"
!inc esi

skp:



   I hope the conversion to MASM is not troublesome. Feel free to encapsulate it anyway you like to make it more useful.

Paul.


[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: dioxin on November 15, 2005, 10:26:15 PM
<<The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.>>


How time flies. It's a week since I said it'd be a few days and it's still not ready!

While working on the QWORD version I found some improvements to the DWORD version so the DWORD version is now faster, now taking around 30 clks and the QWORD version is looking like it'll take under 100clks worst case.
The DWORD version is attached to this message. It'll probably be my last attempt 'cos I haven't got time to try again.


My reasoning for MUL taking 6clks was flawed. It does take 6 but 2 of them can be overlapped with another MUL so repeated MULS can be done in a little over 4clks/MUL. That's how 7 MULs can be done in 30clks.

I also managed to save a few clks by:
   using the faster IMUL where possible,
   preloading numbers well in advance so the CPU isn't left waiting for them
   changing the lookup table to DWORDS (this wastes 200 bytes but because all data is now aligned it averages 5 clks faster and makes the timing very consistent)


   I hope the QWORD version will be along shortly, subject to getting the time to finish it off.

Paul.

[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: dioxin on November 16, 2005, 03:22:27 AM
Attached is my lastest QWORDtoASCII program.
It's combined with the DWORDtoASCII code as they both use the same techniques and lookup table so it can be shared.
For numbers with the top 32 bits zero the QWORD routine will take a short cut and run the faster DWORD routine instead.

Timings are around 90 Athlon clks for all large numbers, much shorter if the number fits in a single DWORD.
This equates to about 228clks on the P4 I have access to.

The DWORD routine was 100% tested.
The QWORD routine was tested as well as I could manage and I believe it is 100% but I haven't verified it with all inputs as that would take too long.
If anyone finds an error then I'll look into it but otherwise I'll leave it as it is even though I know I can get the QWORD version a little faster! I just can't spare the time.
Still, under 100 clks was my target which I've acheived although I'm a little disappointed that the size is 1106 bytes. I was expecting under 1k.
200 bytes could be saved by using 16 bit table entries but this adds about 0.5-1 clk per digit to the timing.

Feel free to use it and test it or encapsulate it as you wish and to convert it from PowerBASIC format (I really ought to get MASM installed properly on a working computer!)

   I have no idea if the effort was worth it, perhaps IanB's is smaller and faster!

Paul.

[attachment deleted by admin]
Title: Re: QW to ASCII?
Post by: Rockphorr on June 01, 2006, 09:20:14 AM
There is common template of algoritm for convertion DX:AX or EDX:EAX or RDX:RAX to ascii string.