OK, here's a request for help. I'm posting it here 'cos this is where all the mathematically clever people hang out! :bg
I've added below P Dixon's code for converting a positive-only DWORD to the ASCII representation. This is from the old forum and I believe was unchallenged for speed. Unfortunately, my application must be able to handle potential file-sizes of up to a QWORD (2^62-1)... :eek I can't do anything about these specs except support them, and as I will need to write out filesizes and offsets in ASCII, I need a bigger version of this which will manage QWORDs if I am unlucky enough to encounter them. Unfortunately I can't get my head round where to start modifying this, so I'd appreciate any help or suggestions.
IanB
.data
NumText BYTE 20 DUP (0)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
DWtoASCII proc
comment * ------------------------------------------------------------
Convert positive DWORD to ASCII string (no sign) by Paul Dixon
with specific references/optimisations added by IanB.
On entry:
EAX = value to convert
On exit:
EAX = address of result buffer
NumText buffer contains the ASCII string representation,
null-terminated.
EBX/EBP/EDI/ESI preserved
------------------------------------------------------------ *
push ebx
push edi
push eax ; save absolute value of number
; call it 1234567890
mov ecx, 2814749768
lea edi, NumText ; pointer to answer string
mul ecx ; fast div by 100000
mov ecx, 100000 ; prepare to multiply up the top 5 digits
shr edx, 16 ; shift top 5 digits into place, EDX=12345
mov eax, edx ; copy to eax to multiply up to real size again
mov ebx, edx ; save a copy of top 5 digits
mul ecx ; EAX=1234500000
sub [esp], eax ; sub from original number leaves 0000067890
mov eax, ebx ; get back high digits
mov ecx, 429497 ; about to do div 10000 by reciprocal multiply
mov ebx, 10
or eax, eax ; if top 5 digit = 0 then skip that part
jz SkipTop5
mul ecx ; div top 5 digits by 10000
jc digit1 ; if digit is not zero then process it
mul ebx ; else multiply by 10 to get next digit into EDX
jc digit2
mul ebx
jc digit3
mul ebx
jc digit4
mul ebx
jc digit5
SkipTop5:
pop eax ; retrieve lower 5 digits
mul ecx ; div 10000
jc digit6
mul ebx ; multiply by 10 to get next digit in EDX
jc digit7
mul ebx
jc digit8
mul ebx
jc digit9
mul ebx
jmp digit10
digit1:
add edx, 30h ; top digit is left in EDX, convert to ascii
mov [edi], dl ; store top digit
mul ebx ; multiply by 10 to get next digit into EDX
add edi, 1
digit2:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit3:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit4:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit5:
add edx, 30h
mov [edi], dl
pop eax ; retrieve lower 5 digits
mul ecx ; div 10000
add edi, 1
digit6:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit7:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit8:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit9:
add edx, 30h
mov [edi], dl
mul ebx
add edi, 1
digit10:
add edx, 30h
lea eax, NumText ; return buffer pointer
mov [edi], dx ; last digit, store DX not DL to give a
; zero termination for the result string
pop edi
pop ebx
ret
DWtoASCII endp
Why not just call one of these, in the cruntime --
char *_i64toa( __int64 value, char *string, int radix );
char * _ui64toa( unsigned _int64 value, char *string, int radix );
Er, is that a macro, C, Pascal or what? It doesn't look like any ASM I recognise... :eek
Perhaps he thinks you are calling your converter from a c program? That would be the only reason to include the c runtime module, unless I am getting confused, also. What's up, Codewarp?
Paul
Here's an example:
.586
.model flat, stdcall
option casemap:none
include c:\masm32\include\windows.inc
include c:\masm32\include\kernel32.inc
include c:\masm32\include\user32.inc
include c:\masm32\include\masm32.inc
include c:\masm32\include\msvcrt.inc
include c:\masm32\macros\macros.asm
includelib c:\masm32\lib\kernel32.lib
includelib c:\masm32\lib\user32.lib
includelib c:\masm32\lib\masm32.lib
includelib c:\masm32\lib\msvcrt.lib
main PROTO
.CONST
qwTest QWORD 0DEADBEEFDEADBEEFh
.DATA?
szBuffer BYTE 32 dup(?)
.CODE
start:
call main
invoke ExitProcess, 0
;==============================================
main PROC
; unsigned
print uqword$(qwTest)
print chr$(13,10)
; signed
print sqword$(qwTest)
print chr$(13,10)
; hex
print xqword$(qwTest)
print chr$(13,10)
ret
main ENDP
;==============================================
end start
Thanks, Greg, but I can't tell which module those functions/macros/whatever are in so I can't find the documentation for them, and I'm not a big macro fan. I like to be able to see and play with and therefore try to understand the code I'm calling. :'(
This ain't pretty or short, but by doing repeated subtractions at least half the time it should be faster than doing any code that relies on MUL to produce each digit. (How many SUB/ADD/MOV/JMP can you do instead of a 16-clock MUL?) By finding the most significant digit first it also avoids unnecessary processing. The problem is the upper DWORD, any operation you do to that affects the remainder in the lower one, so I can't see an easier way of approaching this than by using the dreaded SBB there. That was the thrust of my original query, to see if there was another mathematical way to approach this.
However, given how unlikely it is in my particular application that a file of that size would be encountered (a single file greater than 4Gb on Usenet? Only if someone has posted an entire DVD image!) I'm not desperately concerned about the speed of processing that. Actually doing the decoding/verification of such a monster is going to be the time-critical problem, not a few SBB ops.
IanB
EDIT: This code does not work properly, don't use it!
[attachment deleted by admin]
In case you may be interested, attached is my version for converting a signed qword to ASCII. It uses the FPU and is written in MASM syntax.
(It cannot be altered to take an unsigned qword because of the use of the FPU).
Raymond
[attachment deleted by admin]
Ian_B,
Yeah, those are macros, they use the sprintf function from the C Run-Time Library. They are in macros.asm. If you're looking for maximum speed, they're not the way to go. If you are not too concerned about speed, they work very well.
OK, here's a modified version of the previous attempt which ought to be an improvement, even though it's even longer... :eek Compact code this is not. :(
There are two critical fixes. One is that for some of the higher DWORD digits it's possible to use a much faster CDQ rather than SBB to make the carry bit become a -1 in EDX that you can add directly. This has meant reversing the master/work registers as EAX/EDX are needed for the CDQ. Unfortunately not all the digits can be done this way because some of the subtractors are too large to leave a top bit set in the result, which is what CDQ is copying, not the carry bit itself. But even with the following unavoidable register stalls, CDQ/ADD EDX/SUB EDX should be much faster than the SBB.
The most substantial improvement is in the more general lower DWORD routine. To double the chances of getting out of the loop early, I've used a spare register to remove 9x(10^digit) first and work upwards at the same time. So one half of the loop is subtracting, testing output digits from 0 up to 4, and the other adds to the lower bound result, testing digits from 9 down to 5. The register references are widely enough spaced that these fast instructions may even get quad-pumped on P4, although I don't know how the jumps will affect that, so hopefully all this extra code will actually execute faster overall, as well as quitting the loops sooner on average.
IanB
EDIT: bad code removed
Many apologies to the two people who bothered to download that. Here's the massively bug-fixed and cleaned-up version with slightly more useful exit parameters. My bad... :red
IanB
EDIT: bad code removed
I've done some timetesting and checking of outputs of this code and the available comparisons. Sadly, my optimisations using CDQ to help with the upper DWORD can't work at all, because I was assuming (wrongly) that for all the subtractions the sign would always match the carry bit. so it's back to SBB there unless anyone can work out another way. For the lower DWORD, the opposite was true, and I found that testing for the carry was not working and I should have been testing for the sign instead. :red
So here is the final, guaranteed absolutely working code. :green2
I've also attached a time-testing app that compares the outputs of three routines and confirms they are correct, as well as comparing the number of clocks they take. If someone can find another QWORD routine (I couldn't find the macro Greg mentioned in my copy of MASM) I'll add it. But the result on my P4 shows that apart from the worst-case scenario for the 9-digit DWORD, which is marginally slower than PDixon's code because it requires the maximum number of loops for every digit, this routine is significantly faster in all cases, and especially for smaller values. The wsprintf function (used in the MASM library version) is in all cases at least 10 times slower than either PDixon's code or mine. :dance:
If you don't mind the extra length of the code, this is a significant improvement on the available DW to ASCII conversions and therefore the likely QW to ASCII ones too.
[attachment deleted by admin]
Out of interest, I made a slightly modified version of my code that converted signed QWORDs instead to see how it compared against Ray's version using FPU code.
Ray, I don't think you're going to like these timings. Your code is actually slower than the "general" wsprintf function for converting DWORDs most of the time, and between 5 and 20 times slower than mine in all cases. It's nice to have another winner. :cheekygreen:
It does bear out my hunch that doing small, fast integer operations well-organised can outperform "clever" code using floating point etc. After all, my code is really doing division by the "obvious" method of repeated subtraction, just like any schoolkid learns. I don't know where the bottleneck is in Ray's code, whether it's in the FPU setup or the conversion, but it's substantial enough to seriously drag the performance back. Unless there's an issue with using the FPU repeatedly without some sort of resetting as this code does, which I'm not aware of.
IanB
[attachment deleted by admin]
I added a test of the MSVCRT _i64toa function to SQW2Atimetest.asm. Here are the results:
QWORD to ASCII tests:
QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB: -9199999999444444444 ; clocks: 790
qwtoa - Raymond F: -9199999999444444444 ; clocks: 520
MSVCRT _i64toa : -9199999999444444444 ; clocks: 3599
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB: -9087654321987654321 ; clocks: 493
qwtoa - Raymond F: -9087654321987654321 ; clocks: 519
MSVCRT _i64toa : -9087654321987654321 ; clocks: 3599
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB: -1000000000999999999 ; clocks: 145
qwtoa - Raymond F: -1000000000999999999 ; clocks: 518
MSVCRT _i64toa : -1000000000999999999 ; clocks: 3602
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB: -999999444444444 ; clocks: 583
qwtoa - Raymond F: -999999444444444 ; clocks: 493
MSVCRT _i64toa : -999999444444444 ; clocks: 2873
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB: -555555123456789 ; clocks: 412
qwtoa - Raymond F: -555555123456789 ; clocks: 493
MSVCRT _i64toa : -555555123456789 ; clocks: 2874
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB: -100000999999999 ; clocks: 112
qwtoa - Raymond F: -100000999999999 ; clocks: 494
MSVCRT _i64toa : -100000999999999 ; clocks: 2890
At least when running on my P3, and when converting a QWORD, Raymond's code is reasonably competitive with yours.
[attachment deleted by admin]
Somewhere along the line I think there is a simple conversion from signed to unsigned and back but it would involve 64 bit capcity to add and subtract.
On my 2+ year-old P4:
DWORD to ASCII tests:
Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1664
QWtoASCII - IanB: 3444444444 ; clocks: 181
DWtoASCII - PDixon: 3444444444 ; clocks: 147
Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1664
QWtoASCII - IanB: 2345678901 ; clocks: 117
DWtoASCII - PDixon: 2345678901 ; clocks: 157
Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1671
QWtoASCII - IanB: 1999999999 ; clocks: 54
DWtoASCII - PDixon: 1999999999 ; clocks: 145
Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 901
QWtoASCII - IanB: 44444 ; clocks: 88
DWtoASCII - PDixon: 44444 ; clocks: 127
Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 898
QWtoASCII - IanB: 12345 ; clocks: 67
DWtoASCII - PDixon: 12345 ; clocks: 127
Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 898
QWtoASCII - IanB: 99999 ; clocks: 26
DWtoASCII - PDixon: 99999 ; clocks: 134
Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 317
QWtoASCII - IanB: 1 ; clocks: 13
DWtoASCII - PDixon: 1 ; clocks: 138
and for the signed version:
QWORD to ASCII tests:
QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB: -9199999999444444444 ; clocks: 574
qwtoa - Raymond F: -9199999999444444444 ; clocks: 2313
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB: -9087654321987654321 ; clocks: 378
qwtoa - Raymond F: -9087654321987654321 ; clocks: 2305
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB: -1000000000999999999 ; clocks: 113
qwtoa - Raymond F: -1000000000999999999 ; clocks: 2302
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB: -999999444444444 ; clocks: 442
qwtoa - Raymond F: -999999444444444 ; clocks: 2210
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB: -555555123456789 ; clocks: 289
qwtoa - Raymond F: -555555123456789 ; clocks: 2199
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB: -100000999999999 ; clocks: 85
qwtoa - Raymond F: -100000999999999 ; clocks: 2210
Those are startling differences between P3 and P4. The clocks aren't that different for my code, but the FPU code is running around 5 times slower. :eek
By adding a little more code :eek I've managed to improve the speed of my routine by a third (times reduce by 25%) for the worst case. This is simply done by checking if we are subtracting more than 5*10^digit for the high DWORDs and removing that first, so on average those most time-consuming loops will exit twice as fast with just a little extra overhead. That should remove the difference between Ray's code and mine except for the very worst case longest QWORDs on your P3. Whether the codelength is worth the speed increase is of course a tradeoff people will want to consider for themselves.
IanB
[attachment deleted by admin]
I finally managed to get a CDQ version going. The trick was to realise that when SUBbing from the low DWORD, the bottom byte wasn't affected by the values being subtracted, so I could shift that out and use the top byte as a place to hold the sign of the subtraced result for the CDQ to work, you just need to keep clearing it between operations. Then before working on the low DWORD only section, you shift the value back and replace the low byte.
Unfortunately I was underwhelmed. :'( I didn't manage to improve on the speed of the SBB version at all, and in some cases it got slower. And the code got even longer, needless to say. :lol The few extra instructions needed obviously soak up any latency advantage CDQ has over SBB. For completeness' sake I've attached the time comparison if anyone's interested, but I am disappointed that I couldn't improve on the SBB this way, for P4 anyway.
As the processing code is exactly the same for both the signed/unsigned versions of my routine, I'll post a combined version with two entry points shortly.
[attachment deleted by admin]
QW2Atimetest running on my P3:
QWORD to ASCII tests:
QWTest 1 - value expected 18399999999444444444
QWtoASCII - IanB: 18399999999444444444 ; clocks: 514
QWTest 2 - value expected 12345678901234567890
QWtoASCII - IanB: 12345678901234567890 ; clocks: 352
QWTest 3 - value expected 10000000000999999999
QWtoASCII - IanB: 10000000000999999999 ; clocks: 166
QWTest 4 - value expected 999999444444444
QWtoASCII - IanB: 999999444444444 ; clocks: 395
QWTest 5 - value expected 555555123456789
QWtoASCII - IanB: 555555123456789 ; clocks: 205
QWTest 6 - value expected 100000999999999
QWtoASCII - IanB: 100000999999999 ; clocks: 117
DWORD to ASCII tests:
Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1022
QWtoASCII - IanB: 3444444444 ; clocks: 209
DWtoASCII - PDixon: 3444444444 ; clocks: 62
Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1021
QWtoASCII - IanB: 2345678901 ; clocks: 144
DWtoASCII - PDixon: 2345678901 ; clocks: 62
Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1022
QWtoASCII - IanB: 1999999999 ; clocks: 74
DWtoASCII - PDixon: 1999999999 ; clocks: 63
Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 581
QWtoASCII - IanB: 44444 ; clocks: 114
DWtoASCII - PDixon: 44444 ; clocks: 48
Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 581
QWtoASCII - IanB: 12345 ; clocks: 89
DWtoASCII - PDixon: 12345 ; clocks: 48
Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 581
QWtoASCII - IanB: 99999 ; clocks: 47
DWtoASCII - PDixon: 99999 ; clocks: 48
Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 242
QWtoASCII - IanB: 1 ; clocks: 23
DWtoASCII - PDixon: 1 ; clocks: 48
SQW2Atimetest running on my P3:
QWORD to ASCII tests:
QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB: -9199999999444444444 ; clocks: 501
qwtoa - Raymond F: -9199999999444444444 ; clocks: 519
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB: -9087654321987654321 ; clocks: 338
qwtoa - Raymond F: -9087654321987654321 ; clocks: 519
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB: -1000000000999999999 ; clocks: 170
qwtoa - Raymond F: -1000000000999999999 ; clocks: 519
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB: -999999444444444 ; clocks: 393
qwtoa - Raymond F: -999999444444444 ; clocks: 492
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB: -555555123456789 ; clocks: 207
qwtoa - Raymond F: -555555123456789 ; clocks: 492
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB: -100000999999999 ; clocks: 126
qwtoa - Raymond F: -100000999999999 ; clocks: 492
DWORD to ASCII tests:
Test 1 - value expected -1444444444
wsprintf conversion: -1444444444 ; clocks: 1038
SQWtoASCII - IanB: -1444444444 ; clocks: 199
qwtoa - Raymond F: -1444444444 ; clocks: 441
Test 2 - value expected -1234567890
wsprintf conversion: -1234567890 ; clocks: 1024
SQWtoASCII - IanB: -1234567890 ; clocks: 149
qwtoa - Raymond F: -1234567890 ; clocks: 441
Test 3 - value expected -1999999999
wsprintf conversion: -1999999999 ; clocks: 1025
SQWtoASCII - IanB: -1999999999 ; clocks: 78
qwtoa - Raymond F: -1999999999 ; clocks: 441
Test 4 - value expected -44444
wsprintf conversion: -44444 ; clocks: 591
SQWtoASCII - IanB: -44444 ; clocks: 110
qwtoa - Raymond F: -44444 ; clocks: 424
Test 5 - value expected -12345
wsprintf conversion: -12345 ; clocks: 590
SQWtoASCII - IanB: -12345 ; clocks: 87
qwtoa - Raymond F: -12345 ; clocks: 424
Test 6 - value expected -99999
wsprintf conversion: -99999 ; clocks: 591
SQWtoASCII - IanB: -99999 ; clocks: 51
qwtoa - Raymond F: -99999 ; clocks: 425
Test 7 - value expected -1
wsprintf conversion: -1 ; clocks: 251
SQWtoASCII - IanB: -1 ; clocks: 27
qwtoa - Raymond F: -1 ; clocks: 395
Just as you anticipated :U
Thanks for those timings, MichaelW. It again shows a clear and fascinating discrepancy between P3 and P4, in that the code from PDixon is consistently and significantly faster than mine on P3. The severe increase in latency of the MUL operation from 4 clocks to 14 clocks (according to Agner's info) is the most likely suspect.
This is clearly a routine that is well-optimised for P4 only, and PDixon's code for converting DWORDs is much better suited for anyone knowingly targetting P3. However, as a general case that can manage QWORDs if necessary, and where the possibility/likelihood is that the end user will have a P4, this does seem to be the winning approach. Anyone sufficiently concerned about processor-specific code will no doubt have already got some sort of switch-test set up that could take advantage of this.
It's just a shame that the code cannot easily be contracted into master loops because of the changing "magic numbers". There is one more register to play with... ::)
Here's a final version of the routine, with entry points for signed and unsigned QWORDs and signed DWORDs. This is a "package" which uses the same core code for all the conversions. It can be improved slightly if your code module uses OPTION NOSCOPED, as this will allow a jump to a label inside a procedure from another procedure (normally internal labels are not visible from outside the procedure) which means the work routine can be neatly merged into the QWtoASCII one with just a jump label to define it. However this is not suitable for most people's code as OPTION SCOPED is somewhat safer while allowing limited label reuse.
There are some more optimisations to this version, I managed to get another 10% or so improvement in speed generally on my P4. The change to the ADD/ADC at the very top of the SQWtoASCII code made the most difference for me, almost halving the clocks taken to produce the single-digit output.
This hasn't been a Laboratory project that's drawn much attention, which in a way is surprising as a QWORD/DWORD to ASCII routine has to be the single conversion most people are likely to need, where they have any code which outputs the values of numbers on-screen to the user or to file, whether it's file sizes (my concern), loop counts, counts of errors found, line-numbers, or any manner of numerical output. That's why I've felt it was worth pursuing a fastest possible version, however unwieldy, as in profiling terms this is likely to be an important interface component for many.
IanB
[attachment deleted by admin]
Ian,
<<Here's a final version of the routine>>
How many times do we hear that only for it to be followed a day or two later with an update!
<<This hasn't been a Laboratory project that's drawn much attention>>
Perhaps it's too similar to the DWORD version which was well thrashed out previously.
I haven't had time to look into it in detail so I can't comment on the code in this thread but I do recall making significant improvements to various routines on my AMD system only to find they slowed things down on a Pentium so lately I've been put off optimising code to squeeze the last clock cycle out of it. Instead, I just stop when I get a good implementation of an efficient algorithm. If someone else then wants to tweak the code for a specific CPU then they can get on with it.
I have had a go at this one and tried a few methods. My current best is doing the basic 20 digit conversion in around 150clks on an AthlonXP using a mix of FPU and ALU (no MMX or SSE). It doesn't do the sign bit yet or try to short circuit the short numbers and it still needs leading zero removal but I hit the same problem just mentioned. I put in a bit of time to make it a little quicker and saved 7% only to find that when I got access to a PIV system that the original was 20% quicker than the newer "faster" version.
For what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).
I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.
It's not in MASM format so I haven't posted it.
Paul.
Quote from: dioxin on November 05, 2005, 02:32:20 AM
Ian,
<<Here's a final version of the routine>>
How many times do we hear that only for it to be followed a day or two later with an update!
Forgive me, I was under the impresson that...
QuoteThis is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful.
Clearly I was wrong. :'( I'd love to be able to write perfectly optimised and perfectly working code first time every time, but I'm still learning. Otherwise I wouldn't have to come here and ask questions. :eek
I'm well aware how mundane my code is. That's why I asked if there was a better way, but in the process I found that mundane appears to be
in some cases significantly better than "clever". That's worth discovering and sharing. And the optimisation "trick" I've been posting about recently, ie. the use of CDQ or making a register sign-extend to avoid jumping, isn't always obvious and takes some tinkering and lateral thought to use effectively, and even then it seems to be far less effective than I'd hoped. That's also worth knowing, as it reminds me (and perhaps others) that using Agner's PentOpf guide just to compare latency info of different instructions in order to replace one with an apparently better one isn't always a reliable guide to the end timing result.
I only just came back here and found MichaelW's new timing macros - you might recall that I was previously having trouble timing effectively from the old forum - so this sort of proper comparison timing is still a new toy for me and is throwing up some surprising results, for me anyway. Please forgive my innocent enthusiasm, but there might just be some other people new to ASM coding here interested in those results; I don't recall this forum being restricted only to "experienced" programmers such as you for whom this is obviously all blasé.
QuoteFor what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).
I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.
If I'm right, it was your code that I posted at the top of this thread and which inspired me to ask for help with a QWORD version of this routine. It sounds like this should be well worth testing as a comparison to the other routines here and I for one would be very interested to see it. There's always going to be a trade-off finding a routine that works acceptably fast on all possible target processors, and this code happens to have shown (to my surprise, thanks to the timings posted by MichaelW) enormous discrepancies between P3 and P4 of up to a factor of 20, in the FPU functions anyway. The price for my rather P4-specific code is unfortunately excessive loop unrolling.
QuoteIt's not in MASM format so I haven't posted it.
Please do. I'm sure someone here will be able to help convert it, even if you are unable to spare the time.
Ian_B,
I sense that you've taken my remarks more as a rebuke and not as the intended neutral comment. Ah well, like my code I can't write perfect posts everytime either!
I had another look at this. The FPU/ALU code I had got stuck. I couldn't see how to improve the speed as the FPU was just too slow when converting to integers, the set-up and restoring was taking too long, there was no quick way to handle signed numbers and there was no clear way to take a short cut for smaller values so I abandoned that approach.
All it did was use the FPU to break the 64 bit (20 digit) number into blocks of 4 digits in size and, while the FPU got on with the second block, the ALU would break up the first 4 digit block into 4 bytes of ascii.
I couldn't see the time being reduced below about 150clks (pushing 300 clks on a P-IV) which was too slow so I gave up.
I had a think about alternatives and I've just about got one sorted. Although it's faster on an AthlonXP, it relies heavily on the MUL instruction which I think you said has a latency of 14 on the P-IV(?) so it's not so good on a P-IV.
It's not finished yet, but I'll post a couple of versions when it is.
Currently the DWORD version is running inline in under 20 clks for values under 10000 and around 40-45clks for larger values. This translates to roughly 40 and 100 clks on a P-IV.
The routine is limited by the MUL. There are 7 MULS with a latency of 5 on an Athlon that give 35clks so 40 is approaching optimal for this method. If P-IV MUL has a latency of 14 then that would give 7x14=98 so it's also doing as well as could be expected on an P-IV.
Although not finished, the unsigned QWORD version appears to be doing the whole 20 digits in around 120 clks but I can see room for improvement. It takes 20 MULS so I'm expecting to get the time down to near 100 Athlon clks (about 280 P-IV clks) and, eventually, with shorter times for small values.
The method for both QWORD and DWORD versions is:
Break the number up into chunks by reciprocal multiplies. It turns out that 6 digit chunks are good.
Each chunk is then broken up by the same method into 2 digit chunks.
The next bit is why it's faster than the previous version..
The 2 digit chunk is used as an index into a 2char x 100 entry table to look up the 2 ascii characters.
The lookup takes place while waiting for the MUL so it takes no aditional time and it halves the number of MULs needed.
Paul.
Here's the latest version of an unsigned DWORD to ASCII conversion. The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.
This one does values <10000 in 12-17 Athlon clks and higher values in 32 to 36 clks. It's been tested with all possible input values.
The code assumes the value to convert is in EAX on entry and that pans contains the pointer to the answer. This must be an area of 12 bytes into which the result will be written and doesn't need to be initialised.
Code size is around 350 bytes.
Sorry, it's not in MASM format.. but I ever did get around to installing MASM in a usable way!
It's actually in PowerBASIC format which is almost the same but &Hxxx means hex instead of xxxh
Make sure the lookup table is aligned for best speed.
For signed numbers, include these lines where indicated:
!or eax,eax
!jns skp
!neg eax
!mov [esi],"-"
!inc esi
skp:
I hope the conversion to MASM is not troublesome. Feel free to encapsulate it anyway you like to make it more useful.
Paul.
[attachment deleted by admin]
<<The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.>>
How time flies. It's a week since I said it'd be a few days and it's still not ready!
While working on the QWORD version I found some improvements to the DWORD version so the DWORD version is now faster, now taking around 30 clks and the QWORD version is looking like it'll take under 100clks worst case.
The DWORD version is attached to this message. It'll probably be my last attempt 'cos I haven't got time to try again.
My reasoning for MUL taking 6clks was flawed. It does take 6 but 2 of them can be overlapped with another MUL so repeated MULS can be done in a little over 4clks/MUL. That's how 7 MULs can be done in 30clks.
I also managed to save a few clks by:
using the faster IMUL where possible,
preloading numbers well in advance so the CPU isn't left waiting for them
changing the lookup table to DWORDS (this wastes 200 bytes but because all data is now aligned it averages 5 clks faster and makes the timing very consistent)
I hope the QWORD version will be along shortly, subject to getting the time to finish it off.
Paul.
[attachment deleted by admin]
Attached is my lastest QWORDtoASCII program.
It's combined with the DWORDtoASCII code as they both use the same techniques and lookup table so it can be shared.
For numbers with the top 32 bits zero the QWORD routine will take a short cut and run the faster DWORD routine instead.
Timings are around 90 Athlon clks for all large numbers, much shorter if the number fits in a single DWORD.
This equates to about 228clks on the P4 I have access to.
The DWORD routine was 100% tested.
The QWORD routine was tested as well as I could manage and I believe it is 100% but I haven't verified it with all inputs as that would take too long.
If anyone finds an error then I'll look into it but otherwise I'll leave it as it is even though I know I can get the QWORD version a little faster! I just can't spare the time.
Still, under 100 clks was my target which I've acheived although I'm a little disappointed that the size is 1106 bytes. I was expecting under 1k.
200 bytes could be saved by using 16 bit table entries but this adds about 0.5-1 clk per digit to the timing.
Feel free to use it and test it or encapsulate it as you wish and to convert it from PowerBASIC format (I really ought to get MASM installed properly on a working computer!)
I have no idea if the effort was worth it, perhaps IanB's is smaller and faster!
Paul.
[attachment deleted by admin]
There is common template of algoritm for convertion DX:AX or EDX:EAX or RDX:RAX to ascii string.