This demo scans a text file, extracts a list of unique words, sorts them in either ascending or descending order and writes the result to a disk file. I have not used STDIO because it is too slow, preferring to chop up the input and output for speed reasons. It uses an unbalanced iterative tree to test for duplicates, the string sorts from the MASM32 library and the documented file IO macros, so it should be a useful example for anyone who is interested in this type of work.
[attachment deleted by admin]
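For anyone reading without the attachment, here is a rough C sketch of the general approach described above: load the file in one read, scan words into an unbalanced binary tree to drop duplicates, then sort the survivors and write them out. All names are illustrative, and the original's buffered output and MASM32 string sorts are replaced here with plain qsort and fprintf, so treat this as a sketch of the idea rather than the attached code.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char *word;
    struct Node *left, *right;
} Node;

/* Iterative insert into an unbalanced binary tree; returns 1 if the word was new. */
static int insert_unique(Node **link, const char *word)
{
    while (*link) {
        int c = strcmp(word, (*link)->word);
        if (c == 0) return 0;                       /* duplicate, ignore it */
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(Node));
    (*link)->word = strdup(word);
    return 1;
}

/* Flatten the tree into an array so it can be sorted ascending or descending. */
static void collect(Node *n, char **list, int *count)
{
    if (!n) return;
    collect(n->left, list, count);
    list[(*count)++] = n->word;
    collect(n->right, list, count);
}

static int cmp_asc(const void *a, const void *b)
{
    return strcmp(*(char * const *)a, *(char * const *)b);
}

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: uwords infile outfile\n"); return 1; }

    /* Load the whole source file in one read rather than character by character. */
    FILE *in = fopen(argv[1], "rb");
    if (!in) return 1;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);
    char *buf = malloc(size + 1);
    fread(buf, 1, size, in);
    buf[size] = '\0';
    fclose(in);

    Node *root = NULL;
    char word[256];
    int len = 0, unique = 0;

    /* Scan: letters extend the current word, anything else terminates it. */
    for (long i = 0; i <= size; i++) {
        unsigned char ch = buf[i];
        if (isalpha(ch) && len < 255) {
            word[len++] = (char)ch;
        } else if (len) {
            word[len] = '\0';
            unique += insert_unique(&root, word);
            len = 0;
        }
    }

    /* Sort the unique words ascending and write them one per line. */
    char **list = malloc(unique * sizeof(char *));
    int count = 0;
    collect(root, list, &count);
    qsort(list, count, sizeof(char *), cmp_asc);

    FILE *out = fopen(argv[2], "wb");
    if (!out) return 1;
    for (int i = 0; i < count; i++)
        fprintf(out, "%s\r\n", list[i]);
    fclose(out);

    printf("%d unique words\n", unique);
    return 0;
}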
Nice job :thumbu.
I tried running a 6 MB text file through it, which only took approximately 1 second.
As a comparison I threw together 10 lines of perl using stdin/stdout and a regular expression, which took approximately 2 seconds on the same file.
Jibz,
Thanks for running the test, I have nothing to test it against. Is it possible to load the file in one read in the Perl interpreter and write the result directly back to disk, since std IO is slow? It also means that the Perl interpreter is performing well on the word scanning and duplicate detection.
Jibz,
<<I tried running a 6 MB text file through it, which only took approximately 1 second.>>
Approximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.
Hutch,
<< I have nothing to test it against>>
You seem to be doing a very similar job to the PowerBASIC contest of a few weeks ago. My contest entry, modified to do the same as your code, and your code run in about the same time (within the limits of the timer resolution).
Paul.
dioxin
So it is also a good example of converting a PowerBASIC utility to assembly. Good ideas can come from many sources.
Paul
:U I like it. I have been after a function like this for a long time.
Hutch,
One thing, and not the only thing, I like about this utility is that the resulting list is case sensitive. Other versions I have tried never seemed to get this right and, to me, this is an important feature. Also, I was surprised to learn that the KJV has only 13,748 unique words in a file that is 4,432,363 bytes long, and 176 of them are verse numbers, so the result is even smaller.
Paul
dioxin ,
I must be getting slow in my old age; I did not see the competition in PB.
Paul (Brennick),
It's very easy to make it case insensitive by just bashing the source with a case algo, which would probably reduce the word count a bit more. It depends on what result is required, but a case sensitive version is probably more useful.
dioxin,
<<Approximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.>>
Yes, we have different hardware .. what's your point?
5712944 Source file length in bytes
43906 unique words in file
15 MS file input
282 MS scan and list unique words
15 MS load and sort array of unique words
625 MS write output to file
for a grand total of 937 ms plus some time spent outside the counting parts .. approximately 1 second on this file for a full run on my box.
I have been playing with the file, tweaking a few bits here and there, but the scan/load tree part will not drop in timing, so it appears to be an algo limit.
Jibz,
The sample text I have has about one third of the word count of the sample you used, so I would expect your file output time to be about six times your input time, roughly between 90 and 100 ms on your box. Is there any reason why your file output time would be so much slower? I am interested in what box you ran it on, or whether it is just slow on file output.
This has been my typical timing with the sample I have.
4375977 Source file length in bytes
13680 unique words in file
15 MS file input
172 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file
Jibz,
<<what's your point?>>
Just that there seems to be something slowing yours down unexpectedly.
From the timings your file write time is way out of line with the other folk who have posted times. Perhaps this points to a problem on your system?
Hutch,
I assume you're aware that the timings are very coarse. The resolution of the timer is only 15-16 ms, so most of your timings are 0, 1 or 2 timer ticks. A finer resolution timer would give more representative results.
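As a rough illustration of the finer-resolution timer suggestion (not taken from either program), a minimal C sketch using QueryPerformanceCounter instead of GetTickCount might look like this:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);       /* counter ticks per second */

    QueryPerformanceCounter(&start);
    /* ... code being timed goes here ... */
    QueryPerformanceCounter(&stop);

    /* convert elapsed ticks to milliseconds */
    double ms = (double)(stop.QuadPart - start.QuadPart) * 1000.0
                / (double)freq.QuadPart;
    printf("%.3f ms\n", ms);
    return 0;
}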
For comparison:
4947038 Source file length in bytes
13798 unique words in file
15 MS file input
188 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file
Paul.
Oops, forgot to post the results...
4432363 Source file length in bytes
13747 unique words in file
41 MS file input
460 MS scan and list unique words
10 MS load and sort array of unique words
50 MS write output to file
Paul
Paul (D),
I know that GetTickCount() has poor resolution, but so do cycle counts from ring3 access. I just don't take much notice of GetTickCount under about 150 - 200 ms. I could not find the link to the PB competition you mentioned; if you have it handy, could you post a link here?
Jibz,
Is there any way to get a timing from the Perl script without counting the std IO? I get the impression that the processing time for getting the list of unique words is a lot faster than the test shows because of delays in the file IO method.
Hutch,
the initial thread is here:
http://www.powerbasic.com/support/forums/Forum12/HTML/001601.html
The main "here are the rules" thread is here:
http://www.powerbasic.com/support/forums/Forum12/HTML/001733.html
It has also come up in other threads during/since including:
http://www.powerbasic.com/support/forums/Forum12/HTML/001612.html
http://www.powerbasic.com/support/forums/Forum12/HTML/001700.html
http://www.powerbasic.com/support/forums/Forum12/HTML/001816.html
and the latest thread (still current), about releasing the results to everyone can be found here:
http://www.powerbasic.com/support/forums/Forum12/HTML/002193.html
Paul.
Paul(D),
I did read the lot, but after the "no inline asm" rule I stopped taking much notice of it; when the criteria became things like style, readability, no gotos and the like, it started to sound like ancient Chinese floral arrangement to me.
If you kept the modification that duplicated the unique word listing you posted about here, I would be interested to see what it did.
Hutch,
I didn't try to do it in ASM but I expected the main processing to be probably 3 to 4 times faster in ASM than the PB version.
The file read, file write and sort times I'd expect to be very similar in ASM or PB.
The file is attached, it's a PBCC program.
Paul.
edit:
see later for a newer version of the attachment
Paul,
It looks like it is fast, but I cannot get it to output any results. It does get the word count right.
H:\asm2\uwords\pdixon>contest asvbible.txt /a /n
Processing asvbible.txt
Total number of countable words= 833411
Unique countable words= 13680
Done in .172sec.
Hutch,
the original contest was just to count the words so the output of words to file is an optional extra.
From the console prompt type:
contest myfile.txt /a >outputfile.txt
This will process "myfile.txt" and sort it either alphabetically (/a) or in order of frequency (/1). The default output is to the console, so if you want it in a file, redirect it to the file using >outputfile.txt
Paul.
Paul,
I may be too tired to properly understand what you mean, but when I run that command line I get the same data as I pasted in before in the file for the redirected output.
Processing asvbible.txt
Total number of countable words= 833411
Unique countable words= 13680
Done in .172sec.
press a key to continue.
Have I messed this up, or have I misunderstood its capability?
Oops, sorry. When I cut it down from the original I accidentally cut out the "output to file" option!
Attached is one that will output.
Paul.
edit:
I replaced the version here with a (hopefully!) working one.
[attachment deleted by admin]
Thanks Paul,
This one will redirect to stdout fine. I am not getting any timings at all with it though.
H:\asm2\uwords\pdixon>contesth asvbible.txt /a > tst.txt
Input file= 0
Scan and list unique words= 0
sort words= 0
Total time= 0
It looks fast though, but when I try to run it with both /a and /n I get no output. The results and word count are correct.
Hutch,
I'll investigate tomorrow.. but the times will be the same as the first time you ran it.
Paul.
Thanks Paul.
This much I have digested so far: the tree structure you are using looks like it is very fast. The one I used in the uwords example is actually designed for another purpose and has a matching retrieval algo that tests for words already in the tree. It is a lot faster with sorted order entries, both writing and reading the table, and I need it that way for its original purpose of being able to stack a tree with a set of keywords and dump the user defined words onto it later. It takes a user defined DWORD variable which can either be an ID number or a pointer to other data stored elsewhere.
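To make that idea concrete, here is a rough C sketch of that kind of node: each entry carries a user-defined DWORD that can hold an ID number or a pointer to data kept elsewhere, a keyword set is stacked into the tree first and user-defined words are dumped onto it later. The names, the DWORD typedef and the tiny keyword example are all illustrative; the actual MASM32 structure and procedures will look different.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long DWORD;        /* stand-in for the Win32 DWORD type */

typedef struct KeyNode {
    char  *key;                     /* the word itself */
    DWORD  user;                    /* user-defined ID number or pointer */
    struct KeyNode *left, *right;
} KeyNode;

/* Add a key with its user value; an existing key is left untouched. */
static void add_key(KeyNode **link, const char *key, DWORD user)
{
    while (*link) {
        int c = strcmp(key, (*link)->key);
        if (c == 0) return;                         /* already in the tree */
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(KeyNode));
    (*link)->key  = strdup(key);
    (*link)->user = user;
}

/* Retrieval: returns the stored DWORD, or 0 if the key is not present. */
static DWORD find_key(const KeyNode *n, const char *key)
{
    while (n) {
        int c = strcmp(key, n->key);
        if (c == 0) return n->user;
        n = (c < 0) ? n->left : n->right;
    }
    return 0;
}

int main(void)
{
    KeyNode *root = NULL;
    /* stack the tree with a set of keywords first ... */
    add_key(&root, "mov", 1);
    add_key(&root, "invoke", 2);
    /* ... then dump user-defined words onto it later */
    add_key(&root, "MyProc", 100);

    printf("invoke -> %lu\n", find_key(root, "invoke"));   /* 2 */
    printf("eax    -> %lu\n", find_key(root, "eax"));      /* 0, not found */
    return 0;
}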
Hutch,
sorry for the problem, Wednesday is a bad day for me!
I've replaced the above attachment with a working version which shows times and the word list and includes the exe file.
Although I just use a straightforward sort to get the alphabetic list, there is a much quicker way: the data is already stored in an alphabetically organised linked list, so the list could be scanned to extract the data a lot more quickly. That would take a bit more programming effort, though.
<<the tree structure you are using looks like it is very fast>>
The tree structure used is the fastest I could think of!
Possible improvements to the code to make it faster include:
1) Extracting words from the tree properly by scanning instead of sorting the whole tree (sketched below).
2) Changing all the "if..then..else" structures in the main loop to a single jump table.
3) Using a separate thread to read in the file so I can start processing while the file is loading.
4) Maybe processing the file in cache sized chunks so the data is always in cache when I need it.
5) Using ASM!
I might get around to doing those to see how fast I can really get it,
Paul.
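As a rough illustration of point 1) above, here is a tiny C fragment showing how an ordered structure can be walked to produce the alphabetic list directly, with no separate sort pass. dioxin's code actually uses an alphabetically organised linked list rather than this Node layout, so this only sketches the principle; the Node definition from the earlier sketch is repeated so the fragment stands alone.

#include <stdio.h>

typedef struct Node {
    char *word;
    struct Node *left, *right;
} Node;

/* In-order walk: left subtree first, then the node, then the right subtree,
   so the words come out in ascending order without sorting. */
static void dump_sorted(const Node *n, FILE *out)
{
    if (!n) return;
    dump_sorted(n->left, out);
    fprintf(out, "%s\r\n", n->word);
    dump_sorted(n->right, out);
}

int main(void)
{
    /* tiny hand-built tree: "mov" at the root, "add" left, "xor" right */
    Node a = { "add", NULL, NULL };
    Node x = { "xor", NULL, NULL };
    Node m = { "mov", &a, &x };
    dump_sorted(&m, stdout);    /* prints add, mov, xor */
    return 0;
}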