Entry ECX = Unsigned value of duration in milliseconds
EDX = Pointer to ASCII output.
Divisors dd 3600000, 60000, 1
HrsFmt db '%d:', 0
MinFmt db '%02d:', 0
SecFmt db '%06d', 0
Formats dd HrsFmt, MinFmt, SecFmt
Bin2Time
0 57 push edi
1 56 push esi
2 53 push ebx
; Epilog set up for applications loop
3 52 push edx ; Preserve initial value
4 BE <-> mov esi, Divisors
9 8BFA mov edi, edx
B BB <-> mov ebx, Formats
10 8BD1 mov edx, ecx
12 33C9 xor ecx, ecx
14 880F mov [edi], cl ; Nullify previous contents
16 B1 03 mov cl, 3
18 51 push ecx
19 AD lodsd ; Get next divisor
1A 8BC8 mov ecx, eax
1C 8BC2 mov eax, edx
1E 33D2 xor edx, edx
20 F7F9 idiv ecx
22 8BC8 mov ecx, eax
24 87F3 xchg ebx, esi
26 AD lodsd ; Get pointer to next format string
27 87F3 xchg ebx, esi
29 23C9 and ecx, ecx
2B 75 05 jnz 32 ; Was quotient null
2D 803F 00 cmp byte ptr [edi], 0 ; Has anything been written to output
30 74 0F je 59
Write next segment of string with applicable format
32 52 push edx
33 51 push ecx
34 50 push eax
35 57 push edi
36 E8 <-> call wsprintf ; Make this segment of string
3B 83C4 0C add esp, 0C
3E 5A pop edx
3F 03F8 add edi, eax
41 59 pop ecx
42 E2 D4 loopd 19
The last part of output always has 6 characters and positions 1 & 2 must be moved to 0 & 1 and 2
replaced with the decimal point
44 8BCF mov ecx, edi
46 2BF8 sub edi, eax
48 66:8B47 01 mov ax, [edi+1]
4C 66:AB stosw
4E C607 2E mov byte ptr [edi], 2E
51 5F pop edi
52 2BCF sub ecx, edi
54 803F 30 cmp byte ptr [edi], 30
57 75 0A jnz 63
The possibility that either hours or seconds can have a leading zero and they must
be removed from output by shifting everything upward by 1
59 8BF7 mov esi, edi
5B 51 push ecx
5C 57 push edi
5D 46 inc esi
5E F3:A4 rep movsb
60 5F pop edi
61 59 pop ecx
62 49 dec ecx
Epilog ECX = Number of characters in output
EDX = Unchanged, pointer to ASCII string
63 8BD7 mov edx, edi
65 33C0 xor eax, eax
67 5B pop ebx
68 5E pop esi
69 5F pop edi
6A C3 ret
6B = 107 bytes
lods,movs and idiv are all pretty slow but since your using wsprintf you are not really looking for any conversion speed. I use the following function I wrote for WinExplorer...
UpTime FRAME pszTimeString
uses edx
LOCAL Minutes :D
LOCAL Hours :D
LOCAL Days :D
xor ebx,ebx
invoke GetTickCount
xor edx,edx
mov ecx,86400000 ; Milliseconds in one day
div ecx
mov [Days],eax
mov eax,edx
xor edx,edx
mov ecx,3600000 ; Milliseconds in one hour
div ecx
mov [Hours],eax
mov eax,edx
xor edx,edx
mov ecx,60000 ; Milliseconds in one minute
div ecx
mov [Minutes],eax
; convert to seconds with fraction
push edx
pop eax
xor edx,edx
mov ecx,1000
div ecx
CInvoke(wsprintf,[pszTimeString],offset szUptimeFmt,[Days],[Hours],[Minutes],eax,edx)
RET
szUptimeFmt: DB "%u days %02u:%02u:%02u.%u",0
ENDF
Interesting to see as to how you functionally did the same thing in a completely different way in an extra 2 bytes. Of course you don't trim null information from your output string, but I don't think that would take anymore than 20 to 30 bytes. I had contemplated accounting for days, but decided for my purposes hours would suffice.
"lods,movs and idiv are all pretty slow"
div too.. :lol
option prologue:none ; turn it off
option epilogue:none ;
align 16 ;
d2dtl proc lpointTimeBuffer : DWORD, ntime:DWORD
mov eax, [esp+2*4] ; eax-> time in miliseconds
mov ecx, 31B5D4h ; 2^48 / 86400000
mov [esp+2*4], esi ; saving esi
mul ecx ; high 16 bits of edx = days, dx:eax = fractional days(hours,mins,sec)
mov esi, [esp+1*4] ; esi->lpTimeBuffer
mov ecx, edx ; high 16 bits of ecx-> days
mov [esp+1*4], ebx ; saving ebx
shr ecx, 16 ; edx-> days
mov [esi], ecx ; saving days
and edx, 0FFFFh ; edx-> fractional days
mov ebx, eax ;
mov ecx, edx ;
add eax, eax ;
adc edx, edx ;
add eax, ebx ;
adc edx, ecx ;
shrd eax, edx, 13 ;
shr edx, 13 ; edx = hours, eax = fractions(min,sec)
mov ebx, eax ;
shr ebx, 4 ;
mov [esi+4], edx ; saving hours
sub eax, ebx ;
mov ebx, eax ;
and eax, 03FFFFFFh ; eax-> low 26 bits;(min:sec)
shr ebx, 26 ;
mov edx, eax ;
shr edx, 4 ;
mov [esi+8], ebx ; saving minutes
sub eax, edx ;
mov ebx, [esp+1*4] ; restoring ebx
shr eax, 20 ;
mov [esi+12], eax ; saving seconds
mov esi, [esp+2*4] ; restoring esi
ret 2*4 ;
d2dtl endp ;
option prologue:prologuedef ; turn back on the defaults
option epilogue:epiloguedef ;
Regards,
Lingo
tight coder ex,
if you choose to use idiv, you should use "cdq" instruction instead of "xor edx,edx" (just a question of logic...).
another thing, xchg is also a well known slow instruction, so you should better use :
xor ebx,esi
xor esi,ebx
xor ebx,esi
here speed-ups are not essentials, but if you start to code like that, you will obtain very slow result on heavily used algo.
lingo,
:lol, nice, but you spent time for a proc that is only used once per global loop (so, few clock cycles more or less doesn't really matter here...)
Quote from: NightWare on February 09, 2008, 02:01:37 AM
tight coder ex,
if you choose to use idiv, you should use "cdq" instruction instead of "xor edx,edx" (just a question of logic...).
Only problem is that EAX is an unsigned value therefore CDQ sign extending into EDX produces inaccurate results.
Quote from: NightWare on February 09, 2008, 02:01:37 AM
another thing, xchg is also a well known slow instruction, so you should better use :
xor ebx,esi
xor esi,ebx
xor ebx,esi
here speed-ups are not essentials, but if you start to code like that, you will obtain very slow result on heavily used algo.
Do you have a resource where the timing of each instruction is depicted. I've read a bit about the techs behind dual core and wonder if that has any significance in chip design. I somehow have a problem thinking Intel would design something that is more efficient using 6 bytes versus one
The cycle counts disappeared from the Intel manuals somewhere between the P1 and the P4. Agner Fog's optimization manuals, available here (http://www.agner.org/optimize/), list cycle counts for some of the instructions on some of the processors, but the manuals deal more with optimization techniques, technical details, problematic instructions, etc.
I attempted to time your code against Donkey's and lingo's, but it appears to have a stack imbalance when it reaches the return instruction. Although I did not try to debug the code, I did determine that adding two pops ahead of the return instruction allows it to return successfully with most input values, but there are still some input values < 86400000 (24 hours in ms) that will cause it to fail. The stack is not balanced within the loop, with 4 pushes and, effectively, 5 pops.
MichealW:
Maybe due to the new technologies the specific clocks for each cycle are irrelevant due to pipelining etc. This may be why Intel dropped that information.
I'm going to look through the code and see if I can spot the possible failure you depicted. Stack imbalances are one of the things I use to my advantage or at least I like to thing it's an advantage.
invoke BeginPaint, hWnd, addr Ps
sub esp, 8
... Whatever code
call EndPaint
As you can see, the values I need for EndPaint are already there so I just adjust the stack and use call instead of invoke.
push edi
push esi
push ebx
enter 20, 0
..... code that may or may not corrupt stack
leave
pop ebx
pop esi
pop edi
ret
A lot of my algos use additional stack space dependent upon conditionals. So one time the procedure might need 1024 bytes and then another time 4096. This method guaranttees me that the essential registers will always be visible at epilogue.
Another I'll do is push 12 parameters onto stack, call GetWindowEx and instead of pushing 12 params again, I'll just subtract 48 from stack, change what needs to be changed and then call CallWindowEx again.
I'm not trying to advocate these methods are sound. There are a few API's where this can't be done as the stack is not returned with the same values as they were called with. If I remember correctly, DrawFrame is one of them. You can imagine what kind of fiasco that would be if M$ tweaked one of thier API's in a new version, it could render all the software I've done useless.
This may be why, whatever algo your using to detect stack imbalances is detecting them.
MichealW: I found the problem you encountered and corrected it in this new revision. My algo was sensitive to leading null, but once a value greater than 1 was encountered it also suppressed trailing nulls. Therefore anything evenly divisible by 60 was generating an error.
I wrote a test application of 96,000,000 iterations and they average out to 465 nanoseconds / call
000 57 push edi
001 56 push esi
002 53 push ebx
003 52 push edx
004 BE <-> mov esi, Divisors
009 89D7 mov edi, edx
00B BB <-> mov ebx, Formats
010 89CA mov edx, ecx
012 31C9 xor ecx, ecx
014 880F mov [edi], cl
016 B1 03 mov cl, 3
018 51 push ecx
019 AD lodsd
01A 89C1 mov ecx, eax
01C 89D0 mov eax, edx
01E 31D2 xor edx, edx
020 F7F9 idiv ecx
022 59 pop ecx
023 21C0 and eax, eax
025 75 05 jnz 02C
027 F6C5 01 test ch, 1
02A 74 15 je 041
02C 80CD 01 or ch, 1
02F 51 push ecx
030 52 push edx
031 50 push eax
032 FF33 push dword ptr [ebx]
034 57 push edi
035 E8 <.> call wsprintfA
03A 83C4 0C add esp, 0C
03D 01C7 add edi, eax
03F 5A pop edx
040 59 pop ecx
041 83C3 04 add ebx, 4
044 FEC9 dec cl
046 75 D0 jnz 018
048 5A pop edx
049 29C7 sub edi, eax
04B 89FE mov esi, edi
04D 46 inc esi
04E 66:AD lodsw
050 66:AB stosw
052 B0 2E mov al, 2E
054 AA stosb
055 803A 30 cmp byte ptr [edx], 30
058 75 01 jnz 05B
05A 42 inc edx
05B 5B pop ebx
05C 5E pop esi
05D 5F pop edi
05E C3 ret
05F = 95 bytes
The corrected code works OK. Unlike yours and Donkey's code, lingo's code stores the hour, minute, and second values in the buffer. Running on a P3, the cycle count for yours and Donkey's code, with the calls to wsprintf, are around 1400 and 1800. With the calls to wsprintf commented out the cycle counts are 200 and 164, and for lingo's code 24, all with no run to run variation observed.
[attachment deleted by admin]
this speed test is a bit unfair... an asm beginner (he hasn't understood yet, he has to unroll his code when it's possible) and someone who said it's not an optimised algo, against a guy well know for the speed of his algos...
so here, donkey's approach (without the div instructions) added... lingo's algo still the fastest for exactly the same work, but it's a bit less unfair...
[attachment deleted by admin]
Hi,
I never really worried about speed for my algorithm, as a matter of fact I had mentioned that "using wsprintf you are not really looking for any conversion speed". But it is interesting, when I have a bit of time to spend on the problem I may revisit it, for now I am occupied with other programming stuff.
Donkey
The test wasn't intended to be unfair, it was intended to show how big of a difference optimization could make in the code that calculates the component times. I included the cycle counts for the code with the wsprintf calls to show why it makes little sense to optimize the code that calculates the component times. If I were doing this I would have done something similar to Donkey's code, straightforward and easy to understand.
QuoteIf I were doing this I would have done something similar to Donkey's code, straightforward and easy to understand.
I agree, fast code is great but it's not everything. There's a lot to be said for stable, straightforward and easy to understand code. Once in a while, you really need to optimize for speed, but it's not that often.
Quote from: MichaelW on February 12, 2008, 11:18:03 AM
The test wasn't intended to be unfair
perfectly understood, but if you show to a beginner an algo x time faster than the one he has developped, be sure he gonna try to use some pratice seen in the optimised algo (it's one of the goals of the forum after all...). here lingo has a long practice behind him, he made some choice concerning his coding technics (for example esp+ use), and there is, imho, things that should be avoided. so it's essential to show there is alternatives...
Quote from: Greg on February 12, 2008, 08:30:13 PM
I agree, fast code is great but it's not everything. There's a lot to be said for stable, straightforward and easy to understand code. Once in a while, you really need to optimize for speed, but it's not that often.
i didn't say the contrary :P
For a beginner,i will show GetTimeFormat that do all the job with NULL for required you don't want.
Then a messagebox can display the buffer,that all.
Quote from: ToutEnMasm on February 13, 2008, 01:54:56 PM
For a beginner,i will show GetTimeFormat that do all the job with NULL for required you don't want.
Then a messagebox can display the buffer,that all.
I also prefer GetTimeFormat as it will format the time for the proper locale but it does not accept the tick counter as input, it only accepts SYSTEMTIME structures, that is the only reason I didn't use it.
updated donkey's approach... this time it's faster than lingo's algo on my core2, and it's also possible to optimise it a bit...
; diviser par 1000 (pour réduire la valeur à jours/heures/minutes/secondes)
mov eax,2199023256 ;; ) diviser par le nombre de millisecondes (1000) (vérifié de 0 à FFFFFFFFh)
mul ecx ;; )
shr edx,9 ;; )
mov ecx,edx ;; copier le résultat dans ecx
; obtenir le nombre de jours et d'heures
mov eax,2443359173 ;; ) diviser par le nombre d'heures (3600) (vérifié de 0 à 4320000)
mul edx ;; )
shr edx,11 ;; )
mov ebx,edx ;; placer le nombre d'heures et de jours dans ebx
mov eax,3600 ;; ) soustraire le nombre de jours et d'heures obtenus à notre sauvegarde
mul edx ;; )
sub ecx,eax ;; placer le nombre de minutes et de secondes dans ecx
; obtenir le nombre de jours
mov eax,178956971 ;; ) diviser par le nombre d'heures (24) (vérifié de 0 à 1200)
mul ebx ;; )
mov ElapsDays,edx ;; sauvegarder le nombre de jours
; obtenir le nombre d'heures
mov eax,24 ;; ) soustraire le nombre de jours obtenus à notre sauvegarde
mul edx ;; )
sub ebx,eax ;; )
mov ElapsHours,ebx ;; sauvegarder le nombre d'heures
; obtenir le nombre de minutes
mov eax,71582789 ;; ) diviser par le nombre de secondes (60) (vérifié de 0 à 3600)
mul ecx ;; )
mov ElapsMinutes,edx ;; sauvegarder le nombre de minutes
; obtenir le nombre de secondes
mov eax,60 ;; ) soustraire le nombre de minutes obtenues à notre sauvegarde
mul edx ;; )
sub ecx,eax ;; )
mov ElapsSeconds,ecx ;; sauvegarder le nombre de secondes
edit: :wink
[attachment deleted by admin]
"it's faster than lingo's algo on my core2"
Hence, you can post your speed testing prog to prove it (as Michael did).... :lol
Later:
Michael's code is correct but your "test prog" isn't
because you compare apples with bananas... :lol
:bg i see, then this one... apples and bananas are really near... :lol
edit : ok, i didn't see there was register preservation in your code, apologies :(
edit2 : here it should be ok, but you push me to my limits man :lol
[attachment deleted by admin]
I took a bit of time to look at another way of attacking the problem and came up with this, using The Svin's magic divider, the AAM trick to convert 0-99, and a little trick for inserting the digits in a string, I came up with this...
DATA SECTION
szTimeFormat:
szDays DB "00 Days "
szHours DB "00:"
szMin DB "00:"
szSec DB "00."
szFrac DB 0,0,0,0,0,0
CODE SECTION
converttime:
invoke GetTickCount
mov esi,eax
// Days
mov edx, 3335999724
mul edx
shr edx, 26
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szDays],ax
// Hours
mov eax,86400000
mul edx
sub esi,eax
mov eax,esi
mov edx, 2501999793
mul edx
mov eax,edx
shr edx, 21
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szHours],ax
// Minutes
mov eax,3600000
mul edx
sub esi,eax
mov eax,esi
mov edx, 2345624806
mul edx
mov eax,edx
shr edx, 15
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szMin],ax
// Seconds
mov eax,60000
mul edx
sub esi,eax
mov eax,esi
mov edx, 2199023256
mul edx
mov eax,edx
shr edx, 9
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szSec],ax
// Fraction
mov eax,1000
mul edx
sub esi,eax
mov eax,esi
mov edx, 3435973837
mul edx
shr edx, 3
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szFrac],ax
ret
If someone wants to time it I would appreciate it, I am curious to see where it ends up on the time scale.
Donkey
EDIT: Sorry, left out one multiply and reformatted it a bit so it looks a little clearer.
Mmmm, not too fast, based on 100 itterations it comes in at around 207 milliseconds, oh well it was a good idea while it lasted. I don't run MASM so I can't use MichaelW's test program, I wrote one to do basic timing though I did not bother with thread priority etc...
Donkey
The procedures are so different that it's difficult to do valid comparisons. For timing without the string conversions Bin2Time and ConvertTime are at a disadvantage because you cannot reasonably remove the effects of the string conversion code (for ConvertTime I didn't even try to do so). For d2dtl and UpTime2 the method of passing the parameters are different, and for timing with the string conversions almost all of the time is spent in wsprintf. So for what it's worth, running on a P3 this is what I get without the string conversions:
200 cycles, Bin2Time
164 cycles, UpTime
24 cycles, d2dtl
30 cycles, UpTime2
And this is what I get with the string conversions, with significant variation in the cycle counts for all except ConvertTime, and using wsprintf to do the conversion for all except ConvertTime:
1401 cycles, Bin2Time
1786 cycles, UpTime
1249 cycles, d2dtl
1277 cycles, UpTime2
198 cycles, ConvertTime
[attachment deleted by admin]
to be fair, all data should be aligned :wink
now the results are differents on my cpu, :'( what a deception to see the results on another cpu...
Hi MichaelW,
If you comment out the conversion sections of converttime, you get the following for 100 itterations...
Line 128: eax = 8618
Line 147: eax = 8754
Line 166: eax = 8831
Line 185: eax = 8558
Line 204: eax = 8813
Line 223: eax = 8866
Line 242: eax = 8670
Line 261: eax = 8775
Or ~87.4 milliseconds, ofcourse the algorithm is pretty useless without conversion, as are all of the ones posted so I see these figures as meaningless. The only benchmark that means anything is the complete conversion from a binary value to a formatted string and for that I am pretty slow because of the choice of AAM and BSWAP, coming in as I said at around 207 milliseconds. Note that I tested with GetTickCount commented out as you did in your tests.
I should also note that my algorithm fails once the tick counter passes 99 days 23:59:59.99, this is because it can only convert 2 digit values.
Donkey
Quoteto be fair, all data should be aligned
Or at least all data except byte-size data that is accessed as bytes, and I failed to do that for yours and lingo's data, and Donkey's data that is accessed as words.
For an operation that has a slow resolution the whole concept of squeezing extra microseconds seems rather academic.
I would go for a simple GetTimeFormat and SYSTEMTIME struc and save a whole lot of my time. :U
Chris
Quote from: ChrisLeslie on February 16, 2008, 08:47:41 AM
For an operation that has a slow resolution the whole concept of squeezing extra microseconds seems rather academic.
I would go for a simple GetTimeFormat and SYSTEMTIME struc and save a whole lot of my time. :U
Chris
I guess it would be helpful if I was a little clearer as to the scope an purpose of the original post.
This is part of a algo to show typing speed in words per minute and total elapsed times as one enters text into a multi line control character by character. Values in SYSTEMTIME would have to be converted and if someone is using the application at the end of the day (midnight) then additional calculation would be required. In this particular case speed is definitely not a concern as the average person has a maximum speed of 1 character every 145,000,000 nano seconds and the procedure processes every 450. Formatting of output was another priority, as just for personal preferences I like to omit extraneous output whenever possible, ie: 0:00:03.415 to 3.415.
This thread did however open my mind to possibilities that I had not contemplated, especially using multiplication instead of division. If I ever get involved with embedded systems that operate at considerably lower frequencies and less memory, these examples will serve as a good benchmark to optimize space and time.
Quote from: ChrisLeslie on February 16, 2008, 08:47:41 AM
For an operation that has a slow resolution the whole concept of squeezing extra microseconds seems rather academic.
I would go for a simple GetTimeFormat and SYSTEMTIME struc and save a whole lot of my time. :U
Chris
Sometimes it is interesting just to solve the problem, not worrying about whether you can find an application for the solution.
Quote from: Tight_Coder_Ex on February 16, 2008, 04:04:23 PM
This thread did however open my mind to possibilities that I had not contemplated, especially using multiplication instead of division. If I ever get involved with embedded systems that operate at considerably lower frequencies and less memory, these examples will serve as a good benchmark to optimize space and time.
tight coder ex,
then we didn't loose our time :wink
donkey,
few instructions removed, if it was a signature then forget this post :
mov esi,eax
; Days
mov edx, 3335999724
mul edx
shr edx, 26
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov WORD PTR [szDays],ax
; Hours
mov eax,86400000
mul edx
sub esi,eax
mov eax, 2501999793
mul esi
shr edx, 21
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov WORD PTR [szHours],ax
; Minutes
mov eax,3600000
mul edx
sub esi,eax
mov eax, 2345624806
mul esi
shr edx, 15
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov WORD PTR [szMin],ax
; Seconds
mov eax,60000
mul edx
sub esi,eax
mov eax, 2199023256
mul esi
shr edx, 9
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov WORD PTR [szSec],ax
; Fraction
mov eax,1000
mul edx
sub esi,eax
mov eax, 429496730
mul esi
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov WORD PTR [szFrac],ax
ret
Quote from: NightWare on February 17, 2008, 02:34:29 AM
donkey,
few instructions removed, if it was a signature then forget this post :
...
Thanks NightWare,
I have to admit that I didn't spend a lot of time on the algorithm so didn't do too much to optimize it or reduce it, I threw it together and tested it. By the way there is a bug in it if the fraction is less than 100, you have to add the following lines to the fraction section...
// Fraction
mov eax,1000
mul edx
sub esi,eax
mov eax,esi
mov edx,eax
cmp eax,99 ; <<<<<<<<<<<<<<<< Add lines from here
jbe >
mov edx, 3435973837
mul edx
shr edx, 3
: ; <<<<<<<<<<<<<<<<<<<<<<<< to here
mov eax,edx
aam
add eax,3030h
bswap eax
shr eax,16
mov [szFrac],ax
Donkey
Thanks Michael,
Here is something faster...
[attachment deleted by admin]
Thanks lingo,
I would like to see the results, but apparently the synthesized paddq, pmuludq, and psubq are SSE2 instructions that will not run on my P3, and my P4 is not available.
Quote from: MichaelW on February 16, 2008, 02:51:27 AM
Or at least all data except byte-size data that is accessed as bytes, and I failed to do that for yours and lingo's data, and Donkey's data that is accessed as words.
:lol understood later... (i know, it tooks me some times :()
edit :
lingo's algo, without aam :
.data
ALIGN 16
Mask01 DWORD 000000000h,0FFFF0000h
Mask03 DWORD 000000000h,0FFFFFFFFh
Mask04 DWORD 0FFFFFFFFh,000000000h
Mask05 DWORD 0FFFF0000h,0FFFFFFFFh
Mask07 DWORD 003FFFFFFh,000000000h
PdDiv DWORD 00031B5D4h,000000000h ; 31B5D4
PwMul10 DWORD 0000A000Ah,0000A000Ah
PwDiv10 DWORD 0199A199Ah,0199A199Ah ; tester de 199a à 199d
Mask0Fh DWORD 0000F000Fh,0000F000Fh
MaskF0h DWORD 00F000F00h,00F000F00h
Mask30h DWORD 030303030h,030303030h
.code
ALIGN 16
;
Routine5 PROC
mov eax,[esp+1*4] ; eax-> time in miliseconds
pxor MM0,MM0
movd MM1,eax
pmuludq MM1,QWORD PTR [PdDiv]
movq MM0,MM1
pand MM0,QWORD PTR [Mask01]
pxor MM1,MM0
psrlq MM1,13
movq MM2,MM1
paddq MM1,MM1
paddq MM1,MM2
por MM0,MM1
pand MM1,QWORD PTR [Mask04]
pand MM0,QWORD PTR [Mask03]
movq MM2,MM1
psrlq MM2,4
psubq MM1,MM2
movq MM2,MM1
psrlq MM2,10
por MM0,MM2
pand MM0,QWORD PTR [Mask05]
pand MM1,QWORD PTR [Mask07]
movq MM2,MM1
psrlq MM2,4
psubq MM1,MM2
psrlq MM1,20
por MM0,MM1
movq MM1,MM0
pmulhuw MM0,QWORD PTR [PwDiv10] ; div 10 (vérifié de 0 à 99)
movq MM2,MM0
pmullw MM2,QWORD PTR [PwMul10]
psllw MM0,8
psubw MM1,MM2
pand MM1,QWORD PTR [Mask0Fh]
pand MM0,QWORD PTR [MaskF0h]
por MM0,MM1
paddw MM0,QWORD PTR [Mask30h]
movd eax,MM0
bswap eax
mov WORD PTR [szMinL], ax
shr eax,16
mov WORD PTR [szSecL], ax
punpckhdq MM0,MM0
movd eax,MM0
bswap eax
mov WORD PTR [szDaysL], ax
shr eax,16
mov WORD PTR [szHoursL], ax
ret
Routine5 ENDP
"...paddq, pmuludq, and psubq are SSE2 instructions that will not run on my P3, and my P4 is not available."
OK, the SSE2 instructions are slower here... :lol
So, I rewrote my "general" assembly proc d2dtl and I have new time: :lol
2604 cycles, Bin2Time
3362 cycles, UpTime
67 cycles, d2dtl-lingo
222 cycles, ConvertTime-NightWare
Press any key to exit...
Regards,
Lingo
[attachment deleted by admin]
Quote from: lingo on March 16, 2008, 03:08:37 AM
ConvertTime-NightWare
? donkey gonna be very happy here :lol
here it's the classic alternative (35 clock cycles with my cpu) :
.DATA
ALIGN 16
szDays DB "00 Days "
szHours DB "00:"
szMins DB "00:"
szSecs DB "00."
szFrac DB 0,0,0,0,0,0
.CODE
ALIGN 16
;
; syntax :
; mov eax,Ticks
; call GetElapsTime
;
GetElapsTime PROC
push esi
push edi
; div by 1000 (to reduce the value to days/hours/mins/secs)
mov edx,2199023256 ;; ) div 1000 (verified from 0 to FFFFFFFFh included)
mul edx ;; )
shr edx,9 ;; )
mov edi,edx ;; save it in edi
; get number of days+hours
mov eax,2443359173 ;; ) div 3600 (mins+secs) (verified from 0 to 4320000 included)
mul edx ;; )
shr edx,11 ;; )
mov esi,edx ;; save days+hours in esi
; get number of mins+secs
mov eax,3600 ;; ) mul 3600
mul edx ;; )
sub edi,eax ;; substract days+hours (mins+secs in edi)
; get number of days
mov eax,178956971 ;; ) div 24 (hours) (verified from 0 to 1200)
mul esi ;; )
mov ecx,edx ;; save days in a unused reg
; get number of hours
lea edx,[edx+edx*2] ;; ) mul 24
shl edx,3 ;; )
sub esi,edx ;; substract days (hours in esi)
mov eax,429496730 ;; ) div 10
mul ecx ;; )
lea eax,[edx+edx*4] ;; ) mul 10
add eax,eax ;; )
add edx,3030h ;; +"00"
sub ecx,eax ;; keep units
shl ecx,8 ;; << units
add ecx,edx ;; units+tens
mov WORD PTR szDays,cx ;; write days
mov eax,429496730 ;; ) div 10
mul esi ;; )
lea eax,[edx+edx*4] ;; ) mul 10
add eax,eax ;; )
add edx,3030h ;; +"00"
sub esi,eax ;; keep units
shl esi,8 ;; << units
add esi,edx ;; units+tens
mov WORD PTR szHours,si ;; write hours
; get number of mins
mov eax,71582789 ;; ) div 60 (secs) (verified from 0 to 3600)
mul edi ;; )
mov ecx,edx ;; save mins in a unused reg
; get number of secs
lea edx,[edx+edx*4] ;; ) mul 60
lea edx,[edx+edx*2] ;; )
shl edx,2 ;; )
sub edi,edx ;; substract mins (secs in edi)
mov eax,429496730 ;; ) div 10
mul ecx ;; )
lea eax,[edx+edx*4] ;; ) mul 10
add eax,eax ;; )
add edx,3030h ;; +"00"
sub ecx,eax ;; keep units
shl ecx,8 ;; << units
add ecx,edx ;; units+tens
mov WORD PTR szMins,cx ;; write mins
mov eax,429496730 ;; ) div 10
mul edi ;; )
lea eax,[edx+edx*4] ;; ) mul 10
add eax,eax ;; )
add edx,3030h ;; +"00"
sub edi,eax ;; keep units
shl edi,8 ;; << units
add edi,edx ;; units+tens
mov WORD PTR szSecs,di ;; write secs
pop edi
pop esi
ret
GetElapsTime ENDP
as you can see, there is operations repeated 4 times before writting the values, 4 times ? hmm... it sound like a simd work, and here the sse2 Code (19 clock cycles with my cpu) :
.DATA
ALIGN 16
Simd_Dw_Div_24 DWORD 178956971,0
Simd_Dw_Div_60 DWORD 71582789,0
Simd_Dw_Div_1000 DWORD 2199023256,0
Simd_Dw_Div_3600 DWORD 2443359173,0
Simd_Dw_Mul_24 DWORD 24,0
Simd_Dw_Mul_60 DWORD 60,0
Simd_Dw_Mul_3600 DWORD 3600,0
Simd_Wds_Div_10 DWORD 199A199Ah,199A199Ah
Simd_Wds_Mul_10 DWORD 000A000Ah,000A000Ah
Simd_Wds_Val_30h DWORD 30303030h,30303030h
szDays DB "00 Days "
szHours DB "00:"
szMins DB "00:"
szSecs DB "00."
szFrac DB 0,0,0,0,0,0
.CODE
ALIGN 16
;
; syntax :
; mov eax,Ticks
; call Sse2_GetElapsTime
;
Sse2_GetElapsTime PROC
; push eax ;; empiler eax
; diviser par 1000 (pour réduire la valeur à jours/heures/minutes/secondes)
movd MM0,eax ;; placer le nombre de ticks dans MM0
pmuludq MM0,QWORD PTR Simd_Dw_Div_1000 ;; ) diviser par le nombre de millisecondes (1000) (vérifié de 0 à FFFFFFFFh)
psrlq MM0,9+32 ;; ) + >> 32
movq MM3,MM0 ;; copier le résultat dans MM3
; obtenir le nombre de jours et d'heures
pmuludq MM0,QWORD PTR Simd_Dw_Div_3600 ;; ) diviser par le nombre d'heures (3600) (vérifié de 0 à 4320000)
psrlq MM0,11+32 ;; ) + >> 32
movq MM1,MM0 ;; placer le nombre d'heures et de jours dans MM2
; obtenir le nombre de minutes et de secondes
pmuludq MM0,QWORD PTR Simd_Dw_Mul_3600 ;; multiplier par 3600 (60*60)
psubd MM3,MM0 ;; placer le nombre de minutes et de secondes dans MM4
; obtenir le nombre de jours
movq MM4,QWORD PTR Simd_Dw_Div_24 ;; ) diviser par le nombre d'heures (24) (vérifié de 0 à 1200)
pmuludq MM4,MM1 ;; )
psrlq MM4,32 ;; >> 32
movq MM2,MM4 ;; MM2 = nombre de jours
; obtenir le nombre d'heures
pmullw MM4,QWORD PTR Simd_Dw_Mul_24 ;; multiplier par le nombre d'heures
psubd MM1,MM4 ;; MM1 = nombre d'heures
; obtenir le nombre de minutes
movq MM5,QWORD PTR Simd_Dw_Div_60 ;; ) diviser par le nombre d'heures (24) (vérifié de 0 à 1200)
pmuludq MM5,MM3 ;; )
psrlq MM5,32 ;; >> 32
movq MM4,MM5 ;; MM4 = nombre de minutes
; obtenir le nombre de secondes
pmullw MM5,QWORD PTR Simd_Dw_Mul_60 ;; multiplier par le nombre de secondes
psubd MM3,MM5 ;; MM3 = nombre de secondes
; ici on va traiter les valeurs en parallèle
punpckldq MM4,MM3 ;; MM4 = 0,secondes,0,minutes
punpckldq MM2,MM1 ;; MM2 = 0,heures,0,jours
packuswb MM2,MM4 ;; MM2 = secondes+minutes+heures+jours
movq MM0,MM2 ;; MM0 = secondes+minutes+heures+jours
pmulhuw MM2,QWORD PTR Simd_Wds_Div_10 ;; diviser par 10
movq MM1,MM2 ;; copier les dizaines dans MM1
pmullw MM2,QWORD PTR Simd_Wds_Mul_10 ;; multiplier les dizaines par 10 (pour les soustraire après)
paddw MM1,QWORD PTR Simd_Wds_Val_30h ;; MM1 = _x_x_x_x (dizaines) + "00000000"
psubw MM0,MM2 ;; MM0 = _x_x_x_x (unitées)
psllw MM0,8 ;; << 8 pour les unitées
paddw MM0,MM1 ;; ajouter les dizaines +3030h aux unitées
; il ne reste plus qu'a écrire les valeurs
movd eax,MM0 ;; placer heures+jours dans eax
mov WORD PTR szDays,ax ;; sauvegarder le nombre de jours
shr eax,16 ;; >> 16
mov WORD PTR szHours,ax ;; sauvegarder le nombre d'heures
psrlq MM0,32 ;; >> 32 MM0 = 0,0,minutes,secondes
movd eax,MM0 ;; placer secondes+minutes dans eax
mov WORD PTR szMins,ax ;; sauvegarder le nombre de minutes
shr eax,16 ;; >> 16
mov WORD PTR szSecs,ax ;; sauvegarder le nombre de secondes
; pop eax ;; désempiler eax
ret ;; retourner (sortir de la procédure)
Sse2_GetElapsTime ENDP
now, this topic is really a lesson, not because of the speed (it really doesn't matter, especially for an algo like this one...), not because of the mul/div substitutions, not because of the bits manipulations, not because of the simd usage or any other optimization hint. but because when you code clear algos, you are ABLE TO SEE ALL THE POSSIBILITIES. and it's the most essential thing, far before any optimization hint. :8)
OK :wink
Regards
Lingo
i fully agree, but it's the result returned by the speed test on my core2...
now, the alternative was (like is name say) another way, and i didn't said it was faster than your algo... now the sse2 "slower" version IS the final algo for speed test :lol no manipulation from me here... every people can include the algos in the test... and you can also count the clock cycles for this one too :lol