News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

QW to ASCII?

Started by Ian_B, October 24, 2005, 03:50:05 PM

Previous topic - Next topic

Ian_B

By adding a little more code  :eek I've managed to improve the speed of my routine by a third (times reduce by 25%) for the worst case. This is simply done by checking if we are subtracting more than 5*10^digit for the high DWORDs and removing that first, so on average those most time-consuming loops will exit twice as fast with just a little extra overhead. That should remove the difference between Ray's code and mine except for the very worst case longest QWORDs on your P3. Whether the codelength is worth the speed increase is of course a tradeoff people will want to consider for themselves.

IanB


[attachment deleted by admin]

Ian_B

I finally managed to get a CDQ version going. The trick was to realise that when SUBbing from the low DWORD, the bottom byte wasn't affected by the values being subtracted, so I could shift that out and use the top byte as a place to hold the sign of the subtraced result for the CDQ to work, you just need to keep clearing it between operations. Then before working on the low DWORD only section, you shift the value back and replace the low byte.

Unfortunately I was underwhelmed.  :'(  I didn't manage to improve on the speed of the SBB version at all, and in some cases it got slower. And the code got even longer, needless to say.  :lol  The few extra instructions needed obviously soak up any latency advantage CDQ has over SBB. For completeness' sake I've attached the time comparison if anyone's interested, but I am disappointed that I couldn't improve on the SBB this way, for P4 anyway.

As the processing code is exactly the same for both the signed/unsigned versions of my routine, I'll post a combined version with two entry points shortly.


[attachment deleted by admin]

MichaelW

QW2Atimetest running on my P3:

QWORD to ASCII tests:

QWTest 1 - value expected 18399999999444444444
QWtoASCII - IanB:    18399999999444444444 ; clocks: 514
QWTest 2 - value expected 12345678901234567890
QWtoASCII - IanB:    12345678901234567890 ; clocks: 352
QWTest 3 - value expected 10000000000999999999
QWtoASCII - IanB:    10000000000999999999 ; clocks: 166
QWTest 4 - value expected 999999444444444
QWtoASCII - IanB:    999999444444444 ; clocks: 395
QWTest 5 - value expected 555555123456789
QWtoASCII - IanB:    555555123456789 ; clocks: 205
QWTest 6 - value expected 100000999999999
QWtoASCII - IanB:    100000999999999 ; clocks: 117


DWORD to ASCII tests:

Test 1 - value expected 3444444444
wsprintf conversion: 3444444444 ; clocks: 1022
QWtoASCII - IanB:    3444444444 ; clocks: 209
DWtoASCII - PDixon:  3444444444 ; clocks: 62

Test 2 - value expected 2345678901
wsprintf conversion: 2345678901 ; clocks: 1021
QWtoASCII - IanB:    2345678901 ; clocks: 144
DWtoASCII - PDixon:  2345678901 ; clocks: 62

Test 3 - value expected 1999999999
wsprintf conversion: 1999999999 ; clocks: 1022
QWtoASCII - IanB:    1999999999 ; clocks: 74
DWtoASCII - PDixon:  1999999999 ; clocks: 63

Test 4 - value expected 44444
wsprintf conversion: 44444 ; clocks: 581
QWtoASCII - IanB:    44444 ; clocks: 114
DWtoASCII - PDixon:  44444 ; clocks: 48

Test 5 - value expected 12345
wsprintf conversion: 12345 ; clocks: 581
QWtoASCII - IanB:    12345 ; clocks: 89
DWtoASCII - PDixon:  12345 ; clocks: 48

Test 6 - value expected 99999
wsprintf conversion: 99999 ; clocks: 581
QWtoASCII - IanB:    99999 ; clocks: 47
DWtoASCII - PDixon:  99999 ; clocks: 48

Test 7 - value expected 1
wsprintf conversion: 1 ; clocks: 242
QWtoASCII - IanB:    1 ; clocks: 23
DWtoASCII - PDixon:  1 ; clocks: 48


SQW2Atimetest running on my P3:

QWORD to ASCII tests:

QWTest 1 - value expected -9199999999444444444
SQWtoASCII - IanB:   -9199999999444444444 ; clocks: 501
qwtoa - Raymond F:   -9199999999444444444 ; clocks: 519
QWTest 2 - value expected -9087654321987654321
SQWtoASCII - IanB:   -9087654321987654321 ; clocks: 338
qwtoa - Raymond F:   -9087654321987654321 ; clocks: 519
QWTest 3 - value expected -1000000000999999999
SQWtoASCII - IanB:   -1000000000999999999 ; clocks: 170
qwtoa - Raymond F:   -1000000000999999999 ; clocks: 519
QWTest 4 - value expected -999999444444444
SQWtoASCII - IanB:   -999999444444444 ; clocks: 393
qwtoa - Raymond F:   -999999444444444 ; clocks: 492
QWTest 5 - value expected -555555123456789
SQWtoASCII - IanB:   -555555123456789 ; clocks: 207
qwtoa - Raymond F:   -555555123456789 ; clocks: 492
QWTest 6 - value expected -100000999999999
SQWtoASCII - IanB:   -100000999999999 ; clocks: 126
qwtoa - Raymond F:   -100000999999999 ; clocks: 492


DWORD to ASCII tests:

Test 1 - value expected -1444444444
wsprintf conversion: -1444444444 ; clocks: 1038
SQWtoASCII - IanB:   -1444444444 ; clocks: 199
qwtoa - Raymond F:   -1444444444 ; clocks: 441

Test 2 - value expected -1234567890
wsprintf conversion: -1234567890 ; clocks: 1024
SQWtoASCII - IanB:   -1234567890 ; clocks: 149
qwtoa - Raymond F:   -1234567890 ; clocks: 441

Test 3 - value expected -1999999999
wsprintf conversion: -1999999999 ; clocks: 1025
SQWtoASCII - IanB:   -1999999999 ; clocks: 78
qwtoa - Raymond F:   -1999999999 ; clocks: 441

Test 4 - value expected -44444
wsprintf conversion: -44444 ; clocks: 591
SQWtoASCII - IanB:   -44444 ; clocks: 110
qwtoa - Raymond F:   -44444 ; clocks: 424

Test 5 - value expected -12345
wsprintf conversion: -12345 ; clocks: 590
SQWtoASCII - IanB:   -12345 ; clocks: 87
qwtoa - Raymond F:   -12345 ; clocks: 424

Test 6 - value expected -99999
wsprintf conversion: -99999 ; clocks: 591
SQWtoASCII - IanB:   -99999 ; clocks: 51
qwtoa - Raymond F:   -99999 ; clocks: 425

Test 7 - value expected -1
wsprintf conversion: -1 ; clocks: 251
SQWtoASCII - IanB:   -1 ; clocks: 27
qwtoa - Raymond F:   -1 ; clocks: 395


Just as you anticipated :U
eschew obfuscation

Ian_B

Thanks for those timings, MichaelW. It again shows a clear and fascinating discrepancy between P3 and P4, in that the code from PDixon is consistently and significantly faster than mine on P3. The severe increase in latency of the MUL operation from 4 clocks to 14 clocks (according to Agner's info) is the most likely suspect.

This is clearly a routine that is well-optimised for P4 only, and PDixon's code for converting DWORDs is much better suited for anyone knowingly targetting P3. However, as a general case that can manage QWORDs if necessary, and where the possibility/likelihood is that the end user will have a P4, this does seem to be the winning approach. Anyone sufficiently concerned about processor-specific code will no doubt have already got some sort of switch-test set up that could take advantage of this.

It's just a shame that the code cannot easily be contracted into master loops because of the changing "magic numbers". There is one more register to play with...  ::)

Ian_B

Here's a final version of the routine, with entry points for signed and unsigned QWORDs and signed DWORDs. This is a "package" which uses the same core code for all the conversions. It can be improved slightly if your code module uses OPTION NOSCOPED, as this will allow a jump to a label inside a procedure from another procedure (normally internal labels are not visible from outside the procedure) which means the work routine can be neatly merged into the QWtoASCII one with just a jump label to define it. However this is not suitable for most people's code as OPTION SCOPED is somewhat safer while allowing limited label reuse.

There are some more optimisations to this version, I managed to get another 10% or so improvement in speed generally on my P4. The change to the ADD/ADC at the very top of the SQWtoASCII code made the most difference for me, almost halving the clocks taken to produce the single-digit output.

This hasn't been a Laboratory project that's drawn much attention, which in a way is surprising as a QWORD/DWORD to ASCII routine has to be the single conversion most people are likely to need, where they have any code which outputs the values of numbers on-screen to the user or to file, whether it's file sizes (my concern), loop counts, counts of errors found, line-numbers, or any manner of numerical output. That's why I've felt it was worth pursuing a fastest possible version, however unwieldy, as in profiling terms this is likely to be an important interface component for many.

IanB


[attachment deleted by admin]

dioxin

Ian,
<<Here's a final version of the routine>>

How many times do we hear that only for it to be followed a day or two later with an update!

<<This hasn't been a Laboratory project that's drawn much attention>>

Perhaps it's too similar to the DWORD version which was well thrashed out previously.

I haven't had time to look into it in detail so I can't comment on the code in this thread but I do recall making significant improvements to various routines on my AMD system only to find they slowed things down on a Pentium so lately I've been put off optimising code to squeeze the last clock cycle out of it. Instead, I just stop when I get a good implementation of an efficient algorithm. If someone else then wants to tweak the code for a specific CPU then they can get on with it.

I have had a go at this one and tried a few methods. My current best is doing the basic 20 digit conversion in around 150clks on an AthlonXP using a mix of FPU and ALU (no MMX or SSE). It doesn't do the sign bit yet or try to short circuit the short numbers and it still needs leading zero removal but I hit the same problem just mentioned. I put in a bit of time to make it a little quicker and saved 7% only to find that when I got access to a PIV system that the original was 20% quicker than the newer "faster" version.

For what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).

I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.

It's not in MASM format so I haven't posted it.

Paul.

Ian_B

Quote from: dioxin on November 05, 2005, 02:32:20 AM
Ian,
<<Here's a final version of the routine>>

How many times do we hear that only for it to be followed a day or two later with an update!

Forgive me, I was under the impresson that...
QuoteThis is the place to post assembler algorithms and code design for discussion, optimisation and any other improvements that can be made on it. Post code here to be beaten to death to make it better, smaller, faster or more powerful.

Clearly I was wrong.  :'(  I'd love to be able to write perfectly optimised and perfectly working code first time every time, but I'm still learning. Otherwise I wouldn't have to come here and ask questions.  :eek

I'm well aware how mundane my code is. That's why I asked if there was a better way, but in the process I found that mundane appears to be in some cases significantly better than "clever". That's worth discovering and sharing. And the optimisation "trick" I've been posting about recently, ie. the use of CDQ or making a register sign-extend to avoid jumping, isn't always obvious and takes some tinkering and lateral thought to use effectively, and even then it seems to be far less effective than I'd hoped. That's also worth knowing, as it reminds me (and perhaps others) that using Agner's PentOpf guide just to compare latency info of different instructions in order to replace one with an apparently better one isn't always a reliable guide to the end timing result.

I only just came back here and found MichaelW's new timing macros - you might recall that I was previously having trouble timing effectively from the old forum - so this sort of proper comparison timing is still a new toy for me and is throwing up some surprising results, for me anyway. Please forgive my innocent enthusiasm, but there might just be some other people new to ASM coding here interested in those results; I don't recall this forum being restricted only to "experienced" programmers such as you for whom this is obviously all blasé.

QuoteFor what it's worth, my "fastest" code takes about 500 bytes and does the basic 20 digit conversion on an AthlonXP in about 136clks for all numbers (it produces 20 digits output regardless). It takes 276 clks on a PIV-M CPU. The "not so fast" code took 144clks (Athlon) and 228clks(PIV-M).

I'm sure I could do better.. but for which CPU? I don't think it's worth the effort.

If I'm right, it was your code that I posted at the top of this thread and which inspired me to ask for help with a QWORD version of this routine. It sounds like this should be well worth testing as a comparison to the other routines here and I for one would be very interested to see it. There's always going to be a trade-off finding a routine that works acceptably fast on all possible target processors, and this code happens to have shown (to my surprise, thanks to the timings posted by MichaelW) enormous discrepancies between P3 and P4 of up to a factor of 20, in the FPU functions anyway. The price for my rather P4-specific code is unfortunately excessive loop unrolling.

QuoteIt's not in MASM format so I haven't posted it.

Please do. I'm sure someone here will be able to help convert it, even if you are unable to spare the time.

dioxin

Ian_B,
   I sense that you've taken my remarks more as a rebuke and not as the intended neutral comment. Ah well, like my code I can't write perfect posts everytime either!

   I had another look at this. The FPU/ALU code I had got stuck. I couldn't see how to improve the speed as the FPU was just too slow when converting to integers, the set-up and restoring was taking too long, there was no quick way to handle signed numbers and there was no clear way to take a short cut for smaller values so I abandoned that approach.

   All it did was use the FPU to break the 64 bit (20 digit) number into blocks of 4 digits in size and, while the FPU got on with the second block, the ALU would break up the first 4 digit block into 4 bytes of ascii.
   I couldn't see the time being reduced below about 150clks (pushing 300 clks on a P-IV) which was too slow so I gave up.


   I had a think about alternatives and I've just about got one sorted. Although it's faster on an AthlonXP, it relies heavily on the MUL instruction which I think you said has a latency of 14 on the P-IV(?) so it's not  so good on a P-IV.

   It's not finished yet, but I'll post a couple of versions when it is.
   Currently the DWORD version is running inline in under 20 clks for values under 10000 and around 40-45clks for larger values. This translates to roughly 40 and 100 clks on a P-IV.
   The routine is limited by the MUL. There are 7 MULS with a latency of 5 on an Athlon that give 35clks so 40 is approaching optimal for this method. If P-IV MUL has a latency of 14 then that would give 7x14=98 so it's also doing as well as could be expected on an P-IV.

   
   Although not finished, the unsigned QWORD version appears to be doing the whole 20 digits in around 120 clks but I can see room for improvement. It takes 20 MULS so I'm expecting to get the time down to near 100 Athlon clks (about 280 P-IV clks) and, eventually, with shorter times for small values.

   
   The method for both QWORD and DWORD versions is:
   Break the number up into chunks by reciprocal multiplies. It turns out that 6 digit chunks are good.
   Each chunk is then broken up by the same method into 2 digit chunks.
   The next bit is why it's faster than the previous version..
   The 2 digit chunk is used as an index into a 2char x 100 entry table to look up the 2 ascii characters.

   The lookup takes place while waiting for the MUL so it takes no aditional time and it halves the number of MULs needed.

Paul.

dioxin

Here's the latest version of an unsigned DWORD to ASCII conversion. The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.

This one does values <10000 in 12-17 Athlon clks and higher values in 32 to 36 clks. It's been tested with all possible input values.

The code assumes the value to convert is in EAX on entry and that pans contains the pointer to the answer. This must be an area of 12 bytes into which the result will be written and doesn't need to be initialised.

Code size is around 350 bytes.

Sorry, it's not in MASM format.. but I ever did get around to installing MASM in a usable way!

It's actually in PowerBASIC format which is almost the same but &Hxxx means hex instead of xxxh

Make sure the lookup table is aligned for best speed.

For signed numbers, include these lines where indicated:

!or eax,eax
!jns skp

!neg eax
!mov [esi],"-"
!inc esi

skp:



   I hope the conversion to MASM is not troublesome. Feel free to encapsulate it anyway you like to make it more useful.

Paul.


[attachment deleted by admin]

dioxin

<<The QWORD to ASCII will be along in a few days when I get time to test it fully and tidy it up.>>


How time flies. It's a week since I said it'd be a few days and it's still not ready!

While working on the QWORD version I found some improvements to the DWORD version so the DWORD version is now faster, now taking around 30 clks and the QWORD version is looking like it'll take under 100clks worst case.
The DWORD version is attached to this message. It'll probably be my last attempt 'cos I haven't got time to try again.


My reasoning for MUL taking 6clks was flawed. It does take 6 but 2 of them can be overlapped with another MUL so repeated MULS can be done in a little over 4clks/MUL. That's how 7 MULs can be done in 30clks.

I also managed to save a few clks by:
   using the faster IMUL where possible,
   preloading numbers well in advance so the CPU isn't left waiting for them
   changing the lookup table to DWORDS (this wastes 200 bytes but because all data is now aligned it averages 5 clks faster and makes the timing very consistent)


   I hope the QWORD version will be along shortly, subject to getting the time to finish it off.

Paul.

[attachment deleted by admin]

dioxin

Attached is my lastest QWORDtoASCII program.
It's combined with the DWORDtoASCII code as they both use the same techniques and lookup table so it can be shared.
For numbers with the top 32 bits zero the QWORD routine will take a short cut and run the faster DWORD routine instead.

Timings are around 90 Athlon clks for all large numbers, much shorter if the number fits in a single DWORD.
This equates to about 228clks on the P4 I have access to.

The DWORD routine was 100% tested.
The QWORD routine was tested as well as I could manage and I believe it is 100% but I haven't verified it with all inputs as that would take too long.
If anyone finds an error then I'll look into it but otherwise I'll leave it as it is even though I know I can get the QWORD version a little faster! I just can't spare the time.
Still, under 100 clks was my target which I've acheived although I'm a little disappointed that the size is 1106 bytes. I was expecting under 1k.
200 bytes could be saved by using 16 bit table entries but this adds about 0.5-1 clk per digit to the timing.

Feel free to use it and test it or encapsulate it as you wish and to convert it from PowerBASIC format (I really ought to get MASM installed properly on a working computer!)

   I have no idea if the effort was worth it, perhaps IanB's is smaller and faster!

Paul.

[attachment deleted by admin]

Rockphorr

There is common template of algoritm for convertion DX:AX or EDX:EAX or RDX:RAX to ascii string.
Strike while the iron is hot - Бей утюгом, пока он горячий