Print Page - Faster Bin2ASCII

Title: Faster Bin2ASCII
Post by: lingo on March 26, 2008, 01:07:37 PM

I rewrote speed optimized
"Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros" algo
by AMD (see AMD Software Optimization Guide 25112.pdf)
As a result I have 88 instructions vs 101 instructions from AMD and my time is:

Code Select


P4 3.6 GHz, 2GB RAM,  Vista Ultimate with SP1  
156 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
Press any key to exit...

P4 3.6 GHz, 2GB RAM,  Windows XP Pro with SP2  
123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
Press any key to exit...

Pls, delete your old zip and download new one again.

Regards,
Lingo

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: thomas_remkus on March 26, 2008, 02:55:17 PM

Code Select

134 cycles, bin2ascii-AMD
126 cycles, bin2ascii-lingo

XP Pro SP2
Intel 2.8GHz
1.25GB RAM

Title: Re: Faster Bin2ASCII
Post by: lingo on March 26, 2008, 03:06:28 PM

try again and again until you receive better results... :wink

Title: Re: Faster Bin2ASCII
Post by: jj2007 on March 26, 2008, 05:17:24 PM

Me trying hard, too :wink

Code Select

134 cycles, bin2ascii-AMD
127 cycles, bin2ascii-lingo

XP SP1, 512 MB Ram

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 26, 2008, 05:40:40 PM

Intel P3, Win2k SP4:

Code Select


72 cycles, bin2ascii-AMD
62 cycles, bin2ascii-lingo

Title: Re: Faster Bin2ASCII
Post by: Ossa on March 26, 2008, 06:14:02 PM

Code Select

51 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

Ossa

Title: Re: Faster Bin2ASCII
Post by: hutch-- on March 27, 2008, 11:11:39 AM

3.2 gig Prescott 800 meg fsb.

Code Select


123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
Press any key to exit...

LATER

Here is the timing on my 2.8 gig Northwood.

Code Select


133 cycles, bin2ascii-AMD
121 cycles, bin2ascii-lingo
Press any key to exit...

It appears the code in your latest version favours the later Intel hardware, I would be interested to see how it performed on both late AMD and Intel Core2 Duos.

Title: Re: Faster Bin2ASCII
Post by: lingo on March 27, 2008, 12:39:13 PM

Thanks Hutch,
My CPU is Prescott too, so
I receive the similar results :wink

To All: Pls, reload the last one zip file

Title: Re: Faster Bin2ASCII
Post by: jj2007 on March 27, 2008, 02:44:19 PM

Compliments, Lingo!

XP SP2, Pentium 4 3.4 GHz, 1.5 G RAM

221 cycles, bin2ascii-AMD
147 cycles, bin2ascii-lingo

149 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo

149 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo

149 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo

149 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo

Title: Re: Faster Bin2ASCII
Post by: Ossa on March 27, 2008, 05:46:45 PM

Code Select

51 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo

(same setup as before).

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

that's worse than before for me (it was 43 cycles last time).

Ossa

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 27, 2008, 06:21:26 PM

The new version is a little slower on my P3:

Code Select


72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo

Title: Re: Faster Bin2ASCII
Post by: GregL on March 27, 2008, 07:40:28 PM

Pentium D 940 (3.2 GHz)

123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo

Title: Re: Faster Bin2ASCII
Post by: thomas_remkus on March 27, 2008, 10:05:56 PM

123 - AMD
96 - lingo

Title: Re: Faster Bin2ASCII
Post by: jj2007 on March 27, 2008, 10:19:35 PM

120 cycles, bin2ascii-AMD
96 cycles, bin2ascii-lingo

Celeron 2.4 GHz, 448 MB RAM, XP SP1

The P4 on my office box (SP2) is slower for AMD, see above

Title: Re: Faster Bin2ASCII
Post by: u on March 28, 2008, 10:41:37 PM

Sempron 3000+

Code Select


51 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 29, 2008, 03:33:23 AM

Quote from: hutch-- on March 27, 2008, 11:11:39 AM
I would be interested to see how it performed on both late AMD and Intel Core2 Duos.

on vista, core2 duo 2ghz

Code Select

92 cycles, bin2ascii-AMD
60 cycles, bin2ascii-lingo

i can't compile it, but i'd like to see the speed with this modified version of the algo from michealw/pdixon :

Code Select

ALIGN 16
;
; syntax :
; eax = la valeur
; edi = adresse de la chaine de caractères à créer
; call uDw2Ascii
;
uDw2Ascii PROC
		push ebx									;; empiler ebx
		push edi									;; empiler edi

		mov ecx,eax								;; copier eax dans ecx
; on va diviser eax par 100000 de manière optimisée
		mov edx,2814749768							;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
		sub edi,5									;; enlever 5 caractères à edi
		mul edx									;; multiplier eax par 2814749768
		shr edx,16								;; décaler edx de 16 bits à droite
		mov eax,100000								;; placer 100000 dans eax
		mov ebx,edx								;; copier edx dans ebx
		mul edx									;; multiplier edx par 100000
		sub ecx,eax								;; soustraire le résultat en eax à ecx
		test ebx,ebx								;; fixer les flags de ebx
		mov edx,ebx								;; replacer ebx dans edx
		mov ebx,10								;; placer 10 (notre multiplicateur décimale) dans ebx
		jz Label01								;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
		mov eax,429497								;; on remplace eax par 429497
		add edi,5									;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
		mul edx									;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
		jc Label02								;; s'il existe un dépassement, aller Label02
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label03								;; s'il existe un dépassement, aller Label03
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label04								;; s'il existe un dépassement, aller Label04
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label05								;; s'il existe un dépassement, aller Label05
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label06								;; s'il existe un dépassement, aller Label06
		dec edi									;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01:	mov eax,429497								;; on remplace eax par 429497
		mul ecx									;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
		jc Label07								;; s'il existe un dépassement, aller Label07
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label08								;; s'il existe un dépassement, aller Label08
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label09								;; s'il existe un dépassement, aller Label09
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label10								;; s'il existe un dépassement, aller Label10
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jmp Label11								;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label03:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+1],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label04:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+2],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label05:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+3],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label06:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+4],dl						;; placer l'octet en dl à l'adresse en edi
		mov eax,429497								;; on remplace eax par 429497
		mul ecx									;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+5],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label08:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+6],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label09:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+7],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label10:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+8],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label11:	add dx,30h								;; ajouter le caractère de base "0" à la valeur en dx
		mov WORD PTR [edi+9],dx						;; placer l'octet en dl et le 0 final en dh à l'adresse en edi

		pop edi									;; désempiler edi
		pop ebx									;; désempiler ebx
	ret											;; retourner (sortir de la procédure)
uDw2Ascii ENDP

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 29, 2008, 04:56:53 AM

P3:

Code Select


123456789
123456789
123456789
72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
56 cycles, uDw2Ascii

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: sinsi on March 29, 2008, 05:39:10 AM

Q6600

Code Select


83 cycles, bin2ascii-AMD
72 cycles, bin2ascii-lingo

123456789
123456789
123456789
84 cycles, bin2ascii-AMD
70 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 29, 2008, 06:47:01 AM

This version sets the process affinity mask after the first test, hopefully to limit the second test to a single core.

Requires Windows 2000 or later.

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: sinsi on March 29, 2008, 07:29:46 AM

Code Select


123456789
123456789
123456789
00001111
00001111
83 cycles, bin2ascii-AMD
73 cycles, bin2ascii-lingo
33 cycles, uDw2Ascii
84 cycles, bin2ascii-AMD
83 cycles, bin2ascii-lingo
47 cycles, uDw2Ascii

I remember having trouble with a timer overflow in a speed test, but can't test this because of 'timers.asm' - is that a part of masm32?
I don't know what version I've got - how can I tell?

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 29, 2008, 07:47:11 AM

Timers.asm is in TimingMacros.zip available here (http://www.masm32.com/board/index.php?topic=770.0). IIRC it was included in the macros directory for one release of the MASM32 package. Timers.asm was updated some time after the first post, but it has been stable for more than two years now.

Title: Re: Faster Bin2ASCII
Post by: sinsi on March 29, 2008, 08:37:45 AM

These are with a counter of 10000000

Code Select


bin2ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo

b2ascii_
123456789
123456789
123456789
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
31 cycles, uDw2Ascii

b2ascii_pam
123456789
123456789
123456789
00001111
00001111
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 29, 2008, 09:19:47 AM

So it would seem that the higher cycle counts with the processor affinity mask set to 1 were due to some temporary effect, and the code was already running on only one core?

Thinking about this more, it makes sense that the code would run on only one core because there is only one thread involved. It would be interesting to see the effects of splitting a suitable algorithm across two or more cores.

Title: Re: Faster Bin2ASCII
Post by: sinsi on March 29, 2008, 09:35:14 AM

Doesn't the mask limit a thread from running on more than one CPU?
Then again, 55/47/32 matches 55/47/32 (ref b2ascii_pam)

I've run these from a batch file (without the inkey) and get similar results (+/- a unicycle).
Q6600,2GB,XPHSP2, 20 processes - although having NFS ProStreet minimised didn't change it by more than 5.

I think we need to change timing code (e.g. different TSC's - lots of games have been caught out by that one).

Title: Re: Faster Bin2ASCII
Post by: lingo on March 29, 2008, 01:28:37 PM

Many thanks guys for testing.

NightWare,

"to see the speed with this modified version of the algo from michealw/pdixon"

I don't know anything about michealw/pdixon algo, but ...

your version is not tested and works wrong!!! :lol

Example: for 0BBC8D09Fh result is 31505è9998 vs 3150499999 from my algo

MichaelW,

"The new version is a little slower on my P3:"

Yes, because I have a few new instructions (cmp eax, -1; je @f; )
for properly computing of just one digit -> 0FFFFFFFFh. :wink

Regards,
Lingo

Title: Re: Faster Bin2ASCII
Post by: Jimg on March 29, 2008, 06:02:25 PM

I added a validity printout, converted uDw2Ascii to english (My apologies to NightWare) using babelfish so I could read it.
I fixed the error (I think), converted it to the same call as the other routines for fairness, and saved as uDw2Ascii2.

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
31505Φ9998 uDw2Ascii
3150499999 uDw2Ascii2

61 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
52 cycles, uDw2Ascii
59 cycles, uDw2Ascii2

Not much to choose from on an AMD. Surprising how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: lingo on March 29, 2008, 09:24:50 PM

Jimg,

"I fixed the error (I think)..."

Your "tested" and "free of errors" algo is wrong too...
Example: for 186A0h-> result is 0000 vs 100000 from my algo
Regards,
Lingo

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 29, 2008, 10:32:12 PM

Quote from: lingo on March 29, 2008, 01:28:37 PM
your version is not tested and works wrong!!! :lol

:wink ok, originally it's a signed algo... i'm going to see if it's possible to fix it...

Quote from: Jimg on March 29, 2008, 06:02:25 PM
Surprising how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.

yep, invoke is always slower :wink

Title: Re: Faster Bin2ASCII
Post by: hutch-- on March 29, 2008, 11:26:09 PM

NightWare,

> yep, invoke is always slower

I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.

Title: Re: Faster Bin2ASCII
Post by: Ossa on March 29, 2008, 11:51:17 PM

Quote from: hutch-- on March 29, 2008, 11:26:09 PM
> yep, invoke is always slower

I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.

Well, one is using the stack, the other is passing in registers (eax, edi). Naturally the stack method is slower:

more ops to get them off the stack into regs
using memory accesses where none are needed before

I do think it is surprising that you get *7* cycles slowdown, though.

[edit] Well, I guess there is little more slowdown for "ret 8" to sort the stack out too. [/edit]

Title: Re: Faster Bin2ASCII
Post by: Jimg on March 30, 2008, 12:18:23 AM

Nightware- Can you reference a thread where this was originally presented? I can't seem to find it anywhere.

Lingo- Hi, how's it going? Nice to hear from you.

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 30, 2008, 12:37:24 AM

Quote from: hutch-- on March 29, 2008, 11:26:09 PM
I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.

push from mem = 3 cycles, vs mov reg,mem = 2 cycles
beside after that, you have to read again the values (1 or 2 cycles to put them in reg) in invoke method, and the values are already in the registers with "call", it's why i don't like invoke, even with esp use...

Quote from: Jimg on March 30, 2008, 12:18:23 AM
Nightware- Can you reference a thread where this was originally presented? I can't seem to find it anywhere.

if you speak about the original algo, if i remember well, i found it in donkey's library. but donkey's page appear not being accessible now...

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 30, 2008, 02:52:58 AM

I have only a vague memory of the thread. AFAIK Paul Dixon's current user name is dioxin. There is a version of the code here (http://www.masm32.com/board/index.php?topic=3051.0). I think the code was originally done as PowerBASIC inline asm, and I think this is the version that I started with, dated 2/14/2004:

Code Select


num&=1234567890 
k&=2^32/100000 
k2&=2^32/10000 

!mov esi,ptrans&    ;point to answer 
!mov ebx,10 

!mov eax,num&       ;get the number 
!mov ecx,k&         ;get 2^32/100000 
!mul ecx            ;div 100000 
!mov edi,edx        ;save top 5 digits 

!mov eax,100000 
!mul edx 

!sub eax,num&         ;get the number 
!neg eax              ;get the low 5 digits 



!mov ecx,k2&        ;about to divide 5 digit number by 10000 
!mul ecx 

!add edx,&h30       ;first digit is now in DL, convert to ASCII 
!mov [esi+5],dl     ;and store 

!mul ebx            ;multiply remainder by 10 to get next digit 
!add edx,&h30       ;convert to ASCII 
!mov [esi+6],dl     ;and store 

!mul ebx         ;..etc.. 
!add edx,&h30 
!mov [esi+7],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+8],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+9],dl 


!mov eax,edi     ;first 5 digits done, now get next 5 and do the same.. 
!mul ecx 

!add edx,&h30 
!mov [esi+0],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+1],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+2],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+3],dl 

!mul ebx 
!add edx,&h30 
!mov [esi+4],dl

I think this code from 2/16/2004 was his MASM version:

Code Select


_dwtoa proc dwValue:DWORD, lpBuffer:DWORD
    
    push ebx
    push esi
    push edi
    mov eax,dwValue
    mov esi,lpBuffer
    
    k4 = 429497             ;2^32/10000  rounded up
    k9 = 2814749768         ;2^43/5^5 = 2814749767.28

    mov ecx,k9              ;prepare to fast div by 100000
    or eax,eax              ;is number negative?
    jns skip                ;no, just leave it
    neg eax                 ;yes, make it positive
    mov byte ptr [esi],"-"  ;put sign in place
    inc esi                 ;leave room for sign
skip:
    mov edi,eax             ;save absolute value of number
                            ; let's call it 1234567890
    ;#####fast div by 100000
    mul ecx                 ;div 100000
    mov ecx,100000          ;prepare to multiply up the top 5 digits
    shr edx,16              ;shift top 5 digits into place edx=12345
    mov eax,edx             ;copy to eax to multiply up to real size again
    mov ebx,edx             ;save a copy of top 5 digits
    mul ecx                 ;eax=1234500000
    sub edi,eax             ;subtract orignal number  edi=0000067890
    ;##end fast div
    mov eax,ebx             ;get back high digits
    mov ecx,k4              ;about to do div 10000 by reciprocal multiply
    mov ebx,10              ;useful constant

    or eax,eax              ;if top 5 digit = 0 then skip that part
    jz SkipTop5

    ;####################part 1
    mul ecx                 ;div top 5 digits by 10000
    jc digit1               ;if digit is not zero then process it,
                            ; else try next digit
    mul ebx                 ;multiply by 10 to get next digit into edx
    jc digit2

    mul ebx
    jc digit3

    mul ebx
    jc digit4

    mul ebx
    jc digit5
SkipTop5:
    mov eax,edi             ;retrieve lower 5 digits
    mul ecx                 ;div 10000
    jc digit6

    mul ebx                 ;multiply by 10 to get next digit in edx
    jc digit7

    mul ebx                 ;3rd digit
    jc digit8

    mul ebx                 ;4th digit
    jc digit9

    mul ebx                 ;5th digit
    jmp digit10

    ;###############end part1

digit1:
    add edx,30h             ;top digit is left in edx, convert to ascii
    mov [esi],dl            ;store top digit

    mul ebx                 ;multiply by 10 to get next digit into edx
    inc esi
digit2:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit3:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit4:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit5:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mov eax,edi             ;retrieve lower 5 digits
    mul ecx
    inc esi                 ;div 10000
digit6:
    add edx,30h             ;convert 1st digit to ascii
    mov [esi],dl            ;and store

    mul ebx                 ;multiply by 10 to get next digit in edx
    inc esi
digit7:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx                 ;3rd digit
    inc esi
digit8:
    add edx,30h
    mov [esi],dl

    mul ebx                 ;4th digit
    inc esi
digit9:
    add edx,30h
    mov [esi],dl

    mul ebx                 ;5th digit
    inc esi
digit10:
    add edx,30h
    mov [esi],dx            ;last digit, store dx not dl to give a
                            ; zero termination for the result string
    pop edi
    pop esi
    pop ebx
    ret
_dwtoa endp

Title: Re: Faster Bin2ASCII
Post by: Jimg on March 30, 2008, 03:03:27 AM

I couldn't get that one to work immediately, but here's results from another Paul Dixon routine from that thread-

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword

100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword

4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword


timing tests:
61 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
37 cycles, udword

Press any key to exit...

Of course speed costs, in this case, an extra 200 bytes for a lookup table.

He had one later that he claimed was faster still, but I couldn't get it to work.

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 30, 2008, 03:15:14 AM

Code Select


GenuineIntel  Family 6  Model 7  Stepping 3
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword

100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword

4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword


timing tests:
72 cycles, bin2ascii-AMD
62 cycles, bin2ascii-lingo
51 cycles, udword

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 30, 2008, 04:07:50 AM

here, it should be fixed, (i haven't tested strings, just the divisions), and i know... i'm a lazy coder to fix this like that... but why makes things complicate, when there is easy ways... :bg

Code Select

ALIGN 16
;
; syntax :
; mov eax,value
; mov edi,StringPointer
; call uDw2Ascii
;
uDw2Ascii PROC
		push ebx									;; empiler ebx
		push edi									;; empiler edi

		mov ecx,eax								;; copier eax dans ecx
inc eax
jz SpecialCase
; on va diviser eax par 100000 de manière optimisée
;		mov edx,2814749768							;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
mov edx,2814749767
		sub edi,5									;; enlever 5 caractères à edi
		mul edx									;; multiplier eax par 2814749768
		shr edx,16								;; décaler edx de 16 bits à droite
		mov eax,100000								;; placer 100000 dans eax
		mov ebx,edx								;; copier edx dans ebx
		mul edx									;; multiplier edx par 100000
		sub ecx,eax								;; soustraire le résultat en eax à ecx
		test ebx,ebx								;; fixer les flags de ebx
		mov edx,ebx								;; replacer ebx dans edx
		mov ebx,10								;; placer 10 (notre multiplicateur décimale) dans ebx
		jz Label01								;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
		mov eax,429497								;; on remplace eax par 429497
		add edi,5									;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
		mul edx									;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
		jc Label02								;; s'il existe un dépassement, aller Label02
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label03								;; s'il existe un dépassement, aller Label03
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label04								;; s'il existe un dépassement, aller Label04
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label05								;; s'il existe un dépassement, aller Label05
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label06								;; s'il existe un dépassement, aller Label06
		dec edi									;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01:	mov eax,429497								;; on remplace eax par 429497
		mul ecx									;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
		jc Label07								;; s'il existe un dépassement, aller Label07
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label08								;; s'il existe un dépassement, aller Label08
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label09								;; s'il existe un dépassement, aller Label09
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jc Label10								;; s'il existe un dépassement, aller Label10
		dec edi									;; décrémenter l'adresse
		mul ebx									;; multiplier eax par 10
		jmp Label11								;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label03:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+1],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label04:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+2],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label05:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+3],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label06:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+4],dl						;; placer l'octet en dl à l'adresse en edi
		mov eax,429497								;; on remplace eax par 429497
		mul ecx									;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+5],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label08:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+6],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label09:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+7],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label10:	add dl,30h								;; ajouter le caractère de base "0" à la valeur en edx
		mov BYTE PTR [edi+8],dl						;; placer l'octet en dl à l'adresse en edi
		mul ebx									;; multiplier eax par 10
Label11:	add dx,30h								;; ajouter le caractère de base "0" à la valeur en dx
		mov WORD PTR [edi+9],dx						;; placer l'octet en dl et le 0 final en dh à l'adresse en edi

		pop edi									;; désempiler edi
		pop ebx									;; désempiler ebx
	ret											;; retourner (sortir de la procédure)

SpecialCase:
mov DWORD PTR [edi],"4924"
mov DWORD PTR [edi+4],"2769"
mov WORD PTR [edi+8],"59"
mov BYTE PTR [edi+10],0

		pop edi									;; désempiler edi
		pop ebx									;; désempiler ebx
	ret											;; retourner (sortir de la procédure)
uDw2Ascii ENDP

Title: Re: Faster Bin2ASCII
Post by: lingo on March 30, 2008, 05:50:54 AM

NightWare,

"but why makes things complicate, when there is easy ways... "

Your easy ways are just to get lingo's decision of the problem... :lol
but your algo is still slow..

Jimg,
Nice code, and fastest... :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Windows XP Pro with SP2  

123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
 92   cycles, bin2ascii-lingo
 61   cycles, bi2ascii-lingo 2
 83   cycles, uDw2Ascii - NightWare
 47   cycles, udword-p.dixon
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: hutch-- on March 30, 2008, 06:02:58 AM

I wasn't aware that nightware was contrasting stack parameters to register passing so in part the comment makes sense as long as the algo the regs are passed to can use the data in the regs that are passed without running out of regs to do the work. I have rarely ever found this to be the case and while it starts to make sense on a 64 bit machine with twice as many registers, it is problematic on a 32 bit machine with 8 gp registers where the stack pointer is rarely ever available and EBP is only available with no stack frame.

Title: Re: Faster Bin2ASCII
Post by: lingo on March 30, 2008, 03:55:55 PM

Nice thread :lol
I optimized AMD algo and two P.Dixon's algos and achieved the best time again: :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Windows XP Pro with SP2  

123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
37 cycles, udwordl-lingo 3
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 30, 2008, 04:07:11 PM

P3:

Code Select


72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
58 cycles, bi2ascii-lingo 2
56 cycles, uDw2Ascii - NightWare
50 cycles, udword-p.dixon
45 cycles, udwordl-lingo 3

Title: Re: Faster Bin2ASCII
Post by: hutch-- on March 30, 2008, 04:09:51 PM

Result on my old PIV. 2.8g Northwood, 800 meg fsb, ddr400.

Code Select


123456789
123456789
123456789
123456789
123456789
123456789
145 cycles, bin2ascii-AMD
122 cycles, bin2ascii-lingo
143 cycles, bi2ascii-lingo 2
128 cycles, uDw2Ascii - NightWare
77 cycles, udword-p.dixon
74 cycles, udwordl-lingo 3
Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: six_L on March 30, 2008, 04:46:10 PM

Quote123456789
123456789
123456789
123456789
123456789
123456789
72 cycles, bin2ascii-AMD
61 cycles, bin2ascii-lingo
49 cycles, bi2ascii-lingo 2
48 cycles, uDw2Ascii - NightWare
42 cycles, udword-p.dixon
41 cycles, udwordl-lingo 3
Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: Jimg on March 30, 2008, 04:50:38 PM

Nice job Lingo :U

Code Select

123456789
123456789
123456789
123456789
123456789
123456789
63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

I have a question however. This is the end of your proc-

Code Select

	movzx	edx, word ptr chartab[edx*2]	; look up 2 digits
	mov	[esi],	dh		; but only write the 1 we need, supress the leading zero
	add	esi,	1		; update pointer by 1
	jnz	ZS2			; continue with pairs of digits to the end
udwordl endp    


align 16 
udword proc lpbuf:DWORD, nNumber:DWORD    ;unsigned DWORD to ASCII
    push edi                   ;save registers that need to be saved
    push esi

    mov  esi, [esp+12]      ; sptr
    mov  eax, [esp+16]      ; x

Clearly esi will never be zero. Is jnz faster than a jmp?

Title: Re: Faster Bin2ASCII
Post by: Ossa on March 30, 2008, 05:40:08 PM

Code Select

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

51 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
23 cycles, udwordl-lingo 3

Very nice.

Ossa

Title: Re: Faster Bin2ASCII
Post by: Jimg on March 30, 2008, 06:39:31 PM

lingo-

I like your stack handling (now why didn't I think of that?).

Applying your stack handling to the original p.dixon routine cuts off 2 cycles with no other changes.

We're getting down into the noise level here, however.

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

timing tests:
60 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
38 cycles, udword-p.dixon
35 cycles, udwordl-lingo3
36 cycles, udwordx2-p.dixon3

Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 30, 2008, 09:29:05 PM

Quote from: lingo on March 30, 2008, 05:50:54 AM
Your easy ways are just to get lingo's decision of the problem... :lol

in fact, i speaking of HOW i've fixed the code :lol

Code Select

123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
37 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
19 cycles, udwordl-lingo 3
Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: MichaelW on March 31, 2008, 10:01:29 PM

For what little it's worth, I tested a version that used the FPU to do the conversion to decimal, and while it worked OK, it was *very* slow. On my P3, just the necessary fild and fbstp instructions took 168 cycles.

Title: Re: Faster Bin2ASCII
Post by: NightWare on March 31, 2008, 11:52:23 PM

fild not really slow, but look at fbstp in agner fog's instructions tables... :lol

and it's not only on p3

Title: Re: Faster Bin2ASCII
Post by: MichaelW on April 01, 2008, 01:26:10 AM

I tested them together because I was guessing that I could not get a valid cycle count from a thousand iterations of fbstp. Now that I try I get 106 cycles for fbstp alone, and 106 cycles for fild alone, so I'll assume that my guess was right.

Title: Re: Faster Bin2ASCII
Post by: NightWare on April 01, 2008, 02:11:00 AM

:eek 106 cycles for fild alone ? for fbstp yes, it use the (slow) bcd technic on a ten byte value (it's why fpu is slow, everything is made with ten byte values...) but for fild it's abnormal, it should be between 2 and 12

Title: Re: Faster Bin2ASCII
Post by: MichaelW on April 01, 2008, 02:36:35 AM

I assume it's abnormal because for either one of the instructions used alone, 992 of the 1000 iterations generate an exception (in this case internal to the FPU).

Title: Re: Faster Bin2ASCII
Post by: thomas_remkus on April 01, 2008, 03:00:33 AM

1) Is there a collection of procs like this where the fastest get put inside a header for maybe the next version of MASM?

2) I see that MMX is mostly ignored from performance routines like this. Is this because of the challenge or because it's not practical to test MMX versions of these procs?

Oh, an my tests were:

Code Select

GenuineIntel Family 15 Model 4 Stepping 7
FPU TSC CX8 CMOV CLFSH FXSR HTT MXX SSE SSE2 SSE3
Pentium D 2.8, 3.5gb RAM

122 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo
64 cycles, bi2ascii-lingo 2
85 cycles, uDw2Ascii - NighwWare
48 cycles, udword-p.dixon
39 cycles, udwordl-lingo 3

Title: Re: Faster Bin2ASCII
Post by: Ian_B on April 01, 2008, 03:27:26 AM

Heh, I pulled out my own version of this from the old thread the other day to see if I could optimise it any more. That was a repeated-subtraction approach, hideous code length but had the advantage of being fastest for small numbers compared to Paul Dixon's approach, because it processes only the digits it needs to. To my chagrin, every trick I tried couldn't improve on the code speed at all and generally made it far worse! I tried passing params in registers to allow a "JMP proc ... JMP ebx" type of pseudo-call to remove the repeated loops, but this was terrible speedwise too. However bad the code looks, it was still beating Pauls' MUL-based approach for most values. I'll be interested to try the new Lingo code, tho.

FWIW, no-one has yet posted a QWORD-based solution (value to convert held in EDX:EAX). My old code, however bad, is still a benchmark there... I'd be interested to see if anyone can come up with a better algorithm for that.

Title: Re: Faster Bin2ASCII
Post by: NightWare on April 01, 2008, 10:06:25 PM

thomas remkus,
MMX or SSE+ are usefull for parallel ops, but here you need to know a value, depending of another value, depending of another value, etc... so it's inappropriate here, (it's possible if you cut the value in 2*5 digits, but) after that, you will have to test/shift the bytes to remove useless zeros, and all thoses ops on 64/128 bits registers... not really efficient... :(

michaelw,
if i understand well, you try to rework your timing macro ? if you ask me, you loose your time, because of cpu hardware features... i explain : on p4- we had to do things by ourself, but on p4 and + the automatic harware data prefetcher has been implemented, so all thoses +1 extra uops due to mem access are removed, for example, if you take an algo with 40 cycles (where there is +9 uops due to mem access) and 1000 iterations in loop test you will obtain 40 cycles on P3 and 31.009 cycles on P4+ (not exactly that, but you've understood the concept). the cache is heavilly used, and automatic, on P4+. it depend of the usage you make of your algo. if the algo is used once with many other algos, then you will prefere the algo with the best result on P3. in the other hand, if like me you group the tasks to use recent cpu features, you will prefere the algo with the best result on P4+... it really depend of the usage. now make your own opinion...

Title: Re: Faster Bin2ASCII
Post by: lingo on April 03, 2008, 01:57:12 PM

a bit cosmetics and new time again: :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Windows Vista with SP1  

123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: Jimg on April 03, 2008, 02:26:57 PM

lingo-

Your latest is consistently 5 cycles slower on my

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

Previous version:

Code Select

63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

New Version:

Code Select

62 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
40 cycles, udwordl-lingo 3
Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: Ossa on April 03, 2008, 07:29:44 PM

Code Select

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

52 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3

3 cycles faster than last time.

Title: Re: Faster Bin2ASCII
Post by: NightWare on April 03, 2008, 10:03:57 PM

Code Select

123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
36 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...

aproximatively same speed than the previous one, for me

Title: Re: Faster Bin2ASCII
Post by: lingo on April 11, 2008, 12:56:28 AM

"FWIW, no-one has yet posted a QWORD-based solution"

Here is P.Dixon's algo which combined 32 and 64 bit code with my modest attempt for improvement... :wink

Code Select


P4 3.6 GHz, 2GB RAM,  Vista Ultimate SP1  

18446744073709551614
18446744073709551614
4294967294
4294967294
137 cycles, uqword-with 64 bit number - p.dixon
125 cycles, b2a3264-with 64 bit number - lingo
45 cycles, uqword-with 32 bit number - p.dixon
39 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...

Regards,
Lingo

[attachment deleted by admin]

Title: Re: Faster Bin2ASCII
Post by: six_L on April 11, 2008, 03:14:41 AM

Quote18446744073709551614
18446744073709551614
4294967294
4294967294
87 cycles, uqword-with 64 bit number - p.dixon
77 cycles, b2a3264-with 64 bit number - lingo
43 cycles, uqword-with 32 bit number - p.dixon
44 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...

Title: Re: Faster Bin2ASCII
Post by: Rainstorm on April 18, 2008, 02:48:54 PM

core2duo 2.66Ghz 1333mhz FSB/4M - Winxp Pro - sp2

bin2ascii
------------

74 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
Press any key to exit...

lastbin2asci
-----------------
74 cycles, bin2ascii-AMD
63 cycles, bin2ascii-lingo
47 cycles, bi2ascii-lingo 2
46 cycles, uDw2Ascii - NightWare
29 cycles, udword-p.dixon
26 cycles, udwordl-lingo 3
Press any key to exit...

b2a3264
-------------
86 cycles, uqword-with 64 bit number - p.dixon
76 cycles, b2a3264-with 64 bit number - lingo
32 cycles, uqword-with 32 bit number - p.dixon
29 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit..

Title: Re: Faster Bin2ASCII
Post by: FairLight on April 23, 2008, 08:41:51 AM

Intel Core2Duo E6850, 3Ghz, 4 GB installed RAM, XP Pro 32bit SP3

---------------------------------------------
1st run:

123456789
123456789
123456789
123456789
123456789
123456789
56 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
37 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...

---------------------------------------------

2nd run:

123456789
123456789
123456789
123456789
123456789
123456789
84 cycles, bin2ascii-AMD
71 cycles, bin2ascii-lingo
54 cycles, bi2ascii-lingo 2
49 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...

---------------------------------------------

3rd run:

123456789
123456789
123456789
123456789
123456789
123456789
84 cycles, bin2ascii-AMD
73 cycles, bin2ascii-lingo
53 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...

---------------------------------------------

CUL8'er.

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: lingo on March 26, 2008, 01:07:37 PM