I am trying to get a somewhat faster and handier version of cat$. Here are some timings on a Celeron:
672 clocks CAT$a 113 bytes 7143 LAMPs
853 clocks CAT$aHA 113 bytes 9068 LAMPs
710 clocks CAT$b 127 bytes 8001 LAMPs
728 clocks CAT$bHA 127 bytes 8204 LAMPs
1115 clocks old cat$ 99 bytes 11094 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
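As a quick check of the LAMPs formula, the figures in the table above can be recomputed. The sketch is in Python rather than MASM purely so it runs anywhere; the function name `lamps` is made up for this illustration, and rounding to the nearest integer is an assumption that happens to match the posted values.

```python
import math

def lamps(cycles: int, size_bytes: int) -> int:
    """Lean And Mean Points: cycles * sqrt(size), rounded to the nearest integer."""
    return round(cycles * math.sqrt(size_bytes))

# (clocks, bytes) pairs copied from the Celeron timings above
celeron = {
    "CAT$a":    (672, 113),
    "CAT$aHA":  (853, 113),
    "CAT$b":    (710, 127),
    "CAT$bHA":  (728, 127),
    "old cat$": (1115, 99),
}

for name, (clocks, size) in celeron.items():
    print(f"{clocks:5d} clocks  {name:9s} {size:4d} bytes  {lamps(clocks, size):6d} LAMPs")
```

Because size enters only as a square root, a 12% larger routine (127 vs. 113 bytes) costs about 6% in LAMPs, so speed dominates the score.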
The "a" version is typically 33% faster than the old cat$, but it slows down drastically if there is a high-ASCII byte, as in " with thä new CAT$ macro", at the beginning of a string; those runs are marked "HA" above.
The "b" version is typically 20-25% faster than the old cat$ and does not have the high ascii problem described inter alia in this post (http://www.masm32.com/board/index.php?topic=1589.msg12333#msg12333).
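The thread does not show why CAT$a slows down on high-ASCII bytes. One well-known way this kind of slowdown arises, offered here only as an assumption about the cause, is a dword-at-a-time zero test that omits the `~v` term: it is cheaper, but it also fires on bytes of 0x81 and above, which forces a byte-by-byte recheck. A Python sketch (not MASM) with hypothetical function names:

```python
def fast_zero_test(v: int) -> bool:
    """Cheap test: can report a zero byte that is not there (false positive)."""
    return ((v - 0x01010101) & 0x80808080) != 0

def exact_zero_test(v: int) -> bool:
    """Exact test: true only if one of the four bytes of v is zero."""
    return ((v - 0x01010101) & ~v & 0x80808080) != 0

# Four bytes at a time, loaded little-endian as an x86 dword would be
plain = int.from_bytes(b" the", "little")
high = int.from_bytes(b" th\xe4", "little")      # 0xE4 is the ANSI code for a-umlaut
with_zero = int.from_bytes(b"ab\x00c", "little")

for label, v in (("plain", plain), ("high", high), ("with_zero", with_zero)):
    print(f"{label:9s} fast={fast_zero_test(v)} exact={exact_zero_test(v)}")
```

If this is indeed the mechanism, the "b" version would be paying for the extra `~v` step on every dword in exchange for not stumbling over characters such as ä.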
Grateful for comments and timings, especially on non-Celerons.
[attachment deleted by admin]
JJ,
The library cat$ must be able to do this.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
call main
inkey
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
main proc
LOCAL hMem :DWORD
LOCAL flen :DWORD
LOCAL buff :DWORD
mov hMem, InputFile("\masm32\include\windows.inc")
mov flen, ecx
mov eax, flen
add eax, eax
add eax, eax
add eax, eax
add eax, eax ; flen * 16
mov buff, alloc$(eax) ; allocate buffer
invoke GetTickCount
push eax
; -------------------------------
; cat 16 copies of file to "buff"
; -------------------------------
mov buff, cat$(buff,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem,hMem)
invoke GetTickCount
pop ecx
sub eax, ecx
print str$(eax),13,10
free$ buff
ret
main endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Here are the timings on my old PIV; it's typical of most late PIVs.
696 clocks CAT$a 113 bytes 7399 LAMPs
909 clocks CAT$aHA 113 bytes 9663 LAMPs
743 clocks CAT$b 127 bytes 8373 LAMPs
755 clocks CAT$bHA 127 bytes 8508 LAMPs
863 clocks old cat$ 99 bytes 8587 LAMPs
Quote from: hutch-- on June 28, 2008, 02:41:02 AM
JJ,
The library cat$ must be able to do this.
You are really demanding! See attachment.
[attachment deleted by admin]
JJ-
What did you use to assemble your last version? When I assemble the source, it is consistently slower than the exe you provided.
Quote from: Jimg on June 28, 2008, 04:15:15 PM
JJ-
What did you use to assemble your last version? When I assemble the source, it is consistently slower than the exe you provided.
I used jjTurboAsm with the option /AfterBurner=ON
Seriously, I just downloaded the zip, and the time stamps are 10:53:10 for the exe and 10:52:50 for the asm, so it is unlikely that I made drastic improvements in 20 seconds. This is somewhat mysterious. I use polink, but that shouldn't affect speed afaik. Can you post your version, maybe with some timings?
This is the typical difference in execution-
The one you assembled-
----------------------- CAT$ timings: -------------
566 clocks CAT$a 113 bytes 6017 LAMPs
775 clocks CAT$aHA 113 bytes 8238 LAMPs
571 clocks CAT$b 127 bytes 6435 LAMPs
645 clocks CAT$bHA 127 bytes 7269 LAMPs
759 clocks old cat$ 99 bytes 7552 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
The one I assembled-
----------------------- CAT$ timings: -------------
584 clocks CAT$a 113 bytes 6208 LAMPs
772 clocks CAT$aHA 113 bytes 8206 LAMPs
587 clocks CAT$b 127 bytes 6615 LAMPs
657 clocks CAT$bHA 127 bytes 7404 LAMPs
830 clocks old cat$ 99 bytes 8258 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
[attachment deleted by admin]
The files look pretty similar in OllyDbg, although there is a slight variation at address 4016BE - polink?? But apart from that, your timings are almost identical - differences are within the normal "statistical noise".
I have good and bad news on the CAT$ macro.
First, the bad news: StdOut teases me with non-standard chars; where I expect an ä, I get õ; any hints why?
Ciao bellissimo, this is test B with thõ new CAT$ macro
Second bad news: the previous version had a bug. It would not accept mov eax, CAT$(addr MyBuffer, ..); only CAT$("Here I am", ) or CAT$(0, addr MyBuffer...) worked. OK, that is fixed now.
Now the good news: I taught the beast something common in BASIC, i.e. having the destination string as one or more of the sources:
mov eax, CAT$(addr mcDefBuffer, "this is test X", addr strN)
print CAT$(addr mcDefBuffer, "Ciao caro, ", addr mcDefBuffer, CrLf$, "Fantastic ", str$(127), " bytes short")
Ciao caro, this is test X with the new CAT$ macro
A cute little routine allowing you to concatenate
multiple strings of different origins
Fantastic 127 bytes short
Etc., full source attached. Suggestions for improving it?
Cheers, jj
----------------------- CAT$ usage: -----------------------
1. Like old cat$ (but no zero-ing of MyBuffer needed):
mov eax, CAT$(addr MyBuffer, "Test", addr Src2)
2. Write strings to a default buffer:
invoke MessageBox, 0, CAT$(0, "Ciao ", addr YourName),
chr$("Title"), MB_OK
invoke MessageBox, 0, CAT$("Ciao ", addr YourName),
chr$("Title"), MB_OK
(Oops, no zero after CAT$? The macro knows
that "Ciao" is not a destination ...)
mov eax, CAT$(0, "Test1: ", addr Src2, str$(eax), "bytes")
3. Append to last position after a CAT$(0) or CAT$(addr MyBuffer):
mov eax, CAT$(1, "Test3", addr Src4)
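Usage pattern 3 implies that the macro remembers where the previous concatenation ended, so an append does not have to rescan the buffer for its terminating zero. A rough Python model of that idea follows; the class and method names are invented for illustration, and the real macro's bookkeeping may well differ.

```python
class CatBuffer:
    """Toy model of a zero-terminated buffer that remembers its end position."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)
        self.end = 0  # index of the terminating zero

    def cat(self, *sources: bytes) -> bytes:
        """Like CAT$(0, ...): start again at the beginning of the buffer."""
        self.end = 0
        return self.append(*sources)

    def append(self, *sources: bytes) -> bytes:
        """Like CAT$(1, ...): continue at the remembered end, no rescan."""
        for src in sources:
            if self.end + len(src) + 1 > len(self.buf):
                raise ValueError("buffer too small")
            self.buf[self.end:self.end + len(src)] = src
            self.end += len(src)
        self.buf[self.end] = 0  # keep the result zero-terminated
        return bytes(self.buf[:self.end])

b = CatBuffer(64)
print(b.cat(b"Test1: ", b"abc"))
print(b.append(b"Test3", b"def"))
```

Keeping the end position avoids the repeated length scan that makes chained appends quadratic in the total string length.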
----------------------- CAT$ timings: ---------------------
973 clocks CAT$a 113 bytes 10343 LAMPs
870 clocks CAT$aHA 113 bytes 9248 LAMPs
706 clocks CAT$b 127 bytes 7956 LAMPs
764 clocks CAT$bHA 127 bytes 8610 LAMPs
849 clocks old cat$ 99 bytes 8447 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
EDIT: New version attached, allows destination in sources with this "lazy" syntax:
mov eax, CAT$("this is test X", addr strN) ; concat 2 strings into mcDefBuffer
print CAT$("Ciao caro, ", 0, CrLf$, "Fantastic ", str$(127), " bytes short!", CrLf$, CrLf$)
0 is the current content of the destination.
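Allowing the destination to appear among the sources means its old content has to be captured before the buffer is overwritten. The Python sketch below models this with a hypothetical `DEST` marker standing in for the `0` placeholder; how the actual macro preserves the old content is not shown in the thread, so the snapshot approach is an assumption.

```python
DEST = object()  # stands in for the "0" placeholder: current destination content

def cat_into(dest: bytearray, *sources) -> bytearray:
    """Concatenate sources into dest; DEST entries expand to dest's old content."""
    old = bytes(dest)                       # snapshot before anything is overwritten
    parts = [old if s is DEST else s for s in sources]
    dest[:] = b"".join(parts)               # replace the buffer content in place
    return dest

buf = bytearray()
cat_into(buf, b"this is test X", b" with the new CAT$ macro")
cat_into(buf, b"Ciao caro, ", DEST, b"\r\n", b"Fantastic ", b"127", b" bytes short!")
print(buf.decode("ascii"))
```

Without the snapshot, writing "Ciao caro, " at the start of the buffer would destroy the very text that the second argument is supposed to supply.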
[attachment deleted by admin]
jj,
Regarding the non-standard characters, it has to do with the code page and font you are using for the console. You can use the CHCP command at the command-prompt to get and set the code page. It's kind of a mess.
Keep your eye on the code page (http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx)
Console Code Pages (http://msdn.microsoft.com/en-us/library/ms682064(VS.85).aspx)
CHCP (http://technet2.microsoft.com/windowsserver/en/library/6556a0bb-29ba-4489-876e-852344661cbe1033.mspx?mfr=true)
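The ä/õ swap is consistent with the program writing byte 0xE4 (ä in Windows-1252) to a console that interprets it under OEM code page 850, where 0xE4 is õ. That the console was on code page 850 is an inference from the symptom, not something stated in the thread. A short Python check of the two mappings:

```python
import unicodedata

ansi_byte = "ä".encode("cp1252")   # b'\xe4': what an ANSI source file stores for the character
seen = ansi_byte.decode("cp850")   # the same byte read under OEM code page 850
oem_byte = "ä".encode("cp850")     # b'\x84': the byte a cp850 console needs for the character

print(ansi_byte.hex(), "is shown as", unicodedata.name(seen))
print("cp850 byte for", unicodedata.name("ä"), "is", oem_byte.hex())
```

Switching the console with `chcp 1252` (together with a TrueType console font) is one way to make the two sides agree.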
jj-
May I have a copy of the timers.asm you are using? It seems to call ultoa from msvcrt where mine doesn't. That seems to produce alignment differences that account for the timing differences I am seeing.
Quote from: Jimg on June 29, 2008, 05:19:50 AM
may I have a copy of the timers.asm you are using?
JimG: Here they are.
Greg: Thanks for the code page hint.
[attachment deleted by admin]
Thanks jj. I found the actual difference: I'm using the older macros.asm.
In the newer one, ustr$ was changed to use the C runtime (the ultoa call from msvcrt). The older macros call dwtoa, which adds just enough bytes to shift the code to a place that runs about 5% slower on my machine.
Amazing how touchy these AMD's are about placement in memory. I can often make one routine faster than another just by moving it up in memory.
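To see how a few extra bytes earlier in the image can move a hot loop, it helps to look at its address modulo 16, since many x86 cores fetch code in aligned 16-byte blocks (which block size matters on a given CPU is an assumption here). The addresses and byte counts below are invented for illustration; in MASM, an `align 16` directive before the loop label is the usual way to pin the position down.

```python
def offset_in_block(address: int, block: int = 16) -> int:
    """Offset of an address inside its aligned fetch block."""
    return address % block

def padding_to_align(address: int, block: int = 16) -> int:
    """Bytes of padding needed to push address up to the next block boundary."""
    return -address % block

loop_start = 0x401690       # hypothetical address of a timed loop
for extra in (0, 5, 14):    # hypothetical extra bytes pulled in by a different macro
    addr = loop_start + extra
    print(f"{addr:#x}: offset {offset_in_block(addr):2d}, pad {padding_to_align(addr):2d}")
```

Comparing these offsets for the two builds in a debugger is a quick way to tell whether a timing difference comes from placement rather than from the code itself.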
Quote from: Jimg on June 29, 2008, 04:42:46 PM
Amazing how touchy these AMD's are about placement in memory. I can often make one routine faster than another just by moving it up in memory.
Check in Olly if calls and jumps and memory accesses change from near to far...
Quote from: jj2007 on July 03, 2008, 08:45:08 PM
Quote from: Jimg on June 29, 2008, 04:42:46 PM
Amazing how touchy these AMD's are about placement in memory. I can often make one routine faster than another just by moving it up in memory.
Check in Olly if calls and jumps and memory accesses change from near to far...
No, it's not the jumps. I've seen so many cases where just one extra byte makes a large difference that I had to give up on presenting my best optimizations, because the results are totally different on an Intel chip. Sad, because I really enjoy it.
New version of CAT$ attached. The macro has become pretty flexible but also awfully complex. Usage is simple but debugging is a nightmare... let me know where it crashes please.
MsgBox CAT$(\
"We found 'FR_MatchAlefHamza' in", CrLf$, LastPath$, CrLf$,\
"at pos ", str$(ecx), CrLf$, CrLf$, FifRet$, CrLf$,\
"Case-insensitive search took only ", str$(esi), " ms"),\
addr AppName, MB_OK
----------------------- CAT$ usage: -----------------------
1. Like old cat$ (but no zero-ing of MyBuffer needed):
mov eax, CAT$(addr MyBuffer, "Test", addr Src2)
2. Write strings to a default buffer:
invoke MessageBox, 0, CAT$(0, "Ciao ", addr YourName),
chr$("Title"), MB_OK
invoke MessageBox, 0, CAT$("Ciao ", addr YourName),
chr$("Title"), MB_OK
(Oops, no zero after CAT$? Well, an intelligent macro
knows that "Ciao" is not a destination buffer ...)
mov eax, CAT$(0, "Test1: ", addr Src2, str$(eax), "bytes")
3. Append to last position after a CAT$(0) or CAT$(addr MyBuffer):
mov eax, CAT$(1, "Test3", addr Src4)
[attachment deleted by admin]