This demo scans a text file, extracts a list of unique words, sorts them in either ascending or descending order and writes the result to a disk file. I have not used STDIO because it is too slow, preferring to chop up the input and output for speed reasons. It uses an unbalanced iterative tree to test for duplicates, the string sorts from the MASM32 library and the documented file IO macros, so it should be a useful example for anyone who is interested in this type of work.
[attachment deleted by admin]
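For anyone reading without the attachment, here is a rough C sketch of the general approach described above: load the file in one read, scan words into an unbalanced binary tree to drop duplicates, then sort the survivors and write them out. All names are illustrative, and the original's buffered output and MASM32 string sorts are replaced here with plain qsort and fprintf, so treat this as a sketch of the idea rather than the attached code.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char *word;
    struct Node *left, *right;
} Node;

/* Iterative insert into an unbalanced binary tree; returns 1 if the word was new. */
static int insert_unique(Node **link, const char *word)
{
    while (*link) {
        int c = strcmp(word, (*link)->word);
        if (c == 0) return 0;                       /* duplicate, ignore it */
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(Node));
    (*link)->word = strdup(word);
    return 1;
}

/* Flatten the tree into an array so it can be sorted ascending or descending. */
static void collect(Node *n, char **list, int *count)
{
    if (!n) return;
    collect(n->left, list, count);
    list[(*count)++] = n->word;
    collect(n->right, list, count);
}

static int cmp_asc(const void *a, const void *b)
{
    return strcmp(*(char * const *)a, *(char * const *)b);
}

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: uwords infile outfile\n"); return 1; }

    /* Load the whole source file in one read rather than character by character. */
    FILE *in = fopen(argv[1], "rb");
    if (!in) return 1;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);
    char *buf = malloc(size + 1);
    fread(buf, 1, size, in);
    buf[size] = '\0';
    fclose(in);

    Node *root = NULL;
    char word[256];
    int len = 0, unique = 0;

    /* Scan: letters extend the current word, anything else terminates it. */
    for (long i = 0; i <= size; i++) {
        unsigned char ch = buf[i];
        if (isalpha(ch) && len < 255) {
            word[len++] = (char)ch;
        } else if (len) {
            word[len] = '\0';
            unique += insert_unique(&root, word);
            len = 0;
        }
    }

    /* Sort the unique words ascending and write them one per line. */
    char **list = malloc(unique * sizeof(char *));
    int count = 0;
    collect(root, list, &count);
    qsort(list, count, sizeof(char *), cmp_asc);

    FILE *out = fopen(argv[2], "wb");
    if (!out) return 1;
    for (int i = 0; i < count; i++)
        fprintf(out, "%s\r\n", list[i]);
    fclose(out);

    printf("%d unique words\n", unique);
    return 0;
}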
Nice job :thumbu.
I tried running a 6 MB text file through it, which only took approximately 1 second.
As a comparison I threw together 10 lines of perl using stdin/stdout and a regular expression, which took approximately 2 seconds on the same file.
Jibz,
Thanks for running the test, I have nothing to test it against. Is it possible to load the file in one read in the Perl interpreter and write the result directly back to disk, since std IO is slow? It also means that the Perl interpreter is performing well on the word scanning and duplicate detection.
Jibz,
<<I tried running a 6 MB text file through it, which only took approximately 1 second.>>
Approximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.
Hutch,
<< I have nothing to test it against>>
You seem to be doing a very similar job to the PowerBASIC contest of a few weeks ago. My contest entry, modified to do the same as your code, and your code run in about the same time (within the limits of the timer resolution).
Paul.
dioxin
So it is also a good example of converting a PowerBASIC utility to assembly. Good ideas can come from many sources.
Paul
:U I like it. I have been after a function like this for a long time.
Hutch,
One thing, and not the only thing, I like about this utility is that the resulting list is case sensitive. Other versions I have tried never seemed to get this right and, to me, this is an important feature. Also, I was surprised to learn that the KJV has only 13,748 unique words in a file that is 4,432,363 bytes long, and 176 of them are verse numbers, so the result is even smaller.
Paul
dioxin ,
I must be getting slow in my old age; I did not see the competition in PB.
Paul (Brennick),
It's very easy to make it case insensitive by just bashing the source with a case algo, which would probably reduce the word count a bit more. It depends on what result is required, but a case sensitive version is probably more useful.
dioxin,
<<Approximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.>>
Yes, we have different hardware .. what's your point?
5712944 Source file length in bytes
43906 unique words in file
15 MS file input
282 MS scan and list unique words
15 MS load and sort array of unique words
625 MS write output to file
for a grand total of 937 ms plus some time spent outside the counting parts .. approximately 1 second on this file for a full run on my box.
I have been playing with the file, tweaking a few bits here and there, but the scan/load tree part will not drop in timing, so it appears to be an algo limit.
Jibz,
The sample text I have has about one third of the word count of the sample you used, so I would expect your file output time to be about six times your input time, roughly between 90 and 100 ms on your box. Is there any reason why your file output time would be so much slower? I am interested in what box you ran it on, or whether it is just slow on file output.
This has been my typical timing with the sample I have.
4375977 Source file length in bytes
13680 unique words in file
15 MS file input
172 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file
Jibz,
<<what's your point?>>
Just that there seems to be something slowing yours down unexpectedly.
From the timings your file write time is way out of line with the other folk who have posted times. Perhaps this points to a problem on your system?
Hutch,
I assume you're aware that the timings are very coarse. The resolution of the timer is only 15-16 ms, so most of your timings are 0, 1 or 2 timer ticks. A finer resolution timer would give more representative results.
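As a rough illustration of the finer-resolution timer suggestion (not taken from either program), a minimal C sketch using QueryPerformanceCounter instead of GetTickCount might look like this:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);       /* counter ticks per second */

    QueryPerformanceCounter(&start);
    /* ... code being timed goes here ... */
    QueryPerformanceCounter(&stop);

    /* convert elapsed ticks to milliseconds */
    double ms = (double)(stop.QuadPart - start.QuadPart) * 1000.0
                / (double)freq.QuadPart;
    printf("%.3f ms\n", ms);
    return 0;
}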
For comparison:
4947038 Source file length in bytes
13798 unique words in file
15 MS file input
188 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file
Paul.
Oops, forgot to post the results...
4432363 Source file length in bytes
13747 unique words in file
41 MS file input
460 MS scan and list unique words
10 MS load and sort array of unique words
50 MS write output to file
Paul
Paul (D),
I know that GetTickCount() has poor resolution, but so do cycle counts from ring3 access. I just don't take much notice of GetTickCount under about 150 - 200 ms. I could not find the link to the PB competition you mentioned; if you have it handy, could you post a link here?
Jibz,
Is there any way to get a timing from the Perl script without counting the std IO? I get the impression that the processing time for getting the list of unique words is a lot faster than the test shows because of delays in the file IO method.
Hutch,
the initial thread is here:
http://www.powerbasic.com/support/forums/Forum12/HTML/001601.html
The main "here are the rules" thread is here:
http://www.powerbasic.com/support/forums/Forum12/HTML/001733.html
It has also come up in other threads during/since including:
http://www.powerbasic.com/support/forums/Forum12/HTML/001612.html
http://www.powerbasic.com/support/forums/Forum12/HTML/001700.html
http://www.powerbasic.com/support/forums/Forum12/HTML/001816.html
and the latest thread (still current), about releasing the results to everyone can be found here:
http://www.powerbasic.com/support/forums/Forum12/HTML/002193.html
Paul.
Paul(D),
I did read the lot, but after the "no inline asm" rule I stopped taking much notice of it; when the criteria became things like style, readability, no gotos and the like, it started to sound like ancient Chinese floral arrangement to me.
If you kept the modification that duplicated the unique word listing you posted about here, I would be interested to see what it did.
Hutch,
I didn't try to do it in ASM but I expected the main processing to be probably 3 to 4 times faster in ASM than the PB version.
The file read, file write and sort times I'd expect to be very similar in ASM or PB.
The file is attached, it's a PBCC program.
Paul.
edit:
see later for a newer version of the attachment
Paul,
It looks like it is fast, but I cannot get it to output any results. It does get the word count right.
H:\asm2\uwords\pdixon>contest asvbible.txt /a /n
Processing asvbible.txt
Total number of countable words= 833411
Unique countable words= 13680
Done in .172sec.
Hutch,
the original contest was just to count the words so the output of words to file is an optional extra.
From the console prompt type:
contest myfile.txt /a >outputfile.txt
This will process "myfile.txt" and sort it either alphabetically (/a) or in order of frequency (/1). The default output is to the console, so if you want it in a file, redirect it to the file using >outputfile.txt
Paul.
Paul,
I may be too tired to properly understand what you mean, but when I run that command line I get the same data as I pasted in before in the file for the redirected output.
Processing asvbible.txt
Total number of countable words= 833411
Unique countable words= 13680
Done in .172sec.
press a key to continue.
Have I messed this up, or have I misunderstood its capability?
Oops, sorry. When I cut it down from the original I accidentally cut out the "output to file" option!
Attached is one that will output.
Paul.
edit:
I replaced the version here with a (hopefully!) working one.
[attachment deleted by admin]
Thanks Paul,
This one will redirect to stdout fine. I am not getting any timings at all with it though.
H:\asm2\uwords\pdixon>contesth asvbible.txt /a > tst.txt
Input file= 0
Scan and list unique words= 0
sort words= 0
Total time= 0
It looks fast though, but when I try to run it with both /a and /n I get no output. The results and word count are correct.
Hutch,
I'll investigate tomorrow.. but the times will be the same as the first time you ran it.
Paul.
Thanks Paul.
This much I have digested so far: the tree structure you are using looks like it is very fast. The one I used in the uwords example is actually designed for another purpose and has a matching retrieval algo that tests for words already in the tree. It is a lot faster with sorted order entries, both writing and reading the table, and I need it that way for its original purpose of being able to stack a tree with a set of keywords and dump the user defined words onto it later. It takes a user defined DWORD variable which can either be an ID number or a pointer to other data stored elsewhere.
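To make that idea concrete, here is a rough C sketch of that kind of node: each entry carries a user-defined DWORD that can hold an ID number or a pointer to data kept elsewhere, a keyword set is stacked into the tree first and user-defined words are dumped onto it later. The names, the DWORD typedef and the tiny keyword example are all illustrative; the actual MASM32 structure and procedures will look different.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long DWORD;        /* stand-in for the Win32 DWORD type */

typedef struct KeyNode {
    char  *key;                     /* the word itself */
    DWORD  user;                    /* user-defined ID number or pointer */
    struct KeyNode *left, *right;
} KeyNode;

/* Add a key with its user value; an existing key is left untouched. */
static void add_key(KeyNode **link, const char *key, DWORD user)
{
    while (*link) {
        int c = strcmp(key, (*link)->key);
        if (c == 0) return;                         /* already in the tree */
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(KeyNode));
    (*link)->key  = strdup(key);
    (*link)->user = user;
}

/* Retrieval: returns the stored DWORD, or 0 if the key is not present. */
static DWORD find_key(const KeyNode *n, const char *key)
{
    while (n) {
        int c = strcmp(key, n->key);
        if (c == 0) return n->user;
        n = (c < 0) ? n->left : n->right;
    }
    return 0;
}

int main(void)
{
    KeyNode *root = NULL;
    /* stack the tree with a set of keywords first ... */
    add_key(&root, "mov", 1);
    add_key(&root, "invoke", 2);
    /* ... then dump user-defined words onto it later */
    add_key(&root, "MyProc", 100);

    printf("invoke -> %lu\n", find_key(root, "invoke"));   /* 2 */
    printf("eax    -> %lu\n", find_key(root, "eax"));      /* 0, not found */
    return 0;
}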
Hutch,
sorry for the problem, Wednesday is a bad day for me!
I've replaced the above attachment with a working version which shows times and the word list and includes the exe file.
Although I just use a straightforward sort to get the alphabetic list, there is a much quicker way: the data is already stored in an alphabetically organised linked list, so the list could be scanned to extract the data a lot more quickly. That would take a bit more programming effort, though.
<<the tree structure you are using looks like it is very fast>>
The tree structure used is the fastest I could think of!
Possible improvements to the code to make it faster include:
1) Extracting words from the tree properly by scanning instead of sorting the whole tree (sketched below).
2) Changing all the "if..then..else" structures in the main loop to a single jump table.
3) Using a separate thread to read in the file so I can start processing while the file is loading.
4) Maybe processing the file in cache sized chunks so the data is always in cache when I need it.
5) Using ASM!
I might get around to doing those to see how fast I can really get it,
Paul.
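As a rough illustration of point 1) above, here is a tiny C fragment showing how an ordered structure can be walked to produce the alphabetic list directly, with no separate sort pass. dioxin's code actually uses an alphabetically organised linked list rather than this Node layout, so this only sketches the principle; the Node definition from the earlier sketch is repeated so the fragment stands alone.

#include <stdio.h>

typedef struct Node {
    char *word;
    struct Node *left, *right;
} Node;

/* In-order walk: left subtree first, then the node, then the right subtree,
   so the words come out in ascending order without sorting. */
static void dump_sorted(const Node *n, FILE *out)
{
    if (!n) return;
    dump_sorted(n->left, out);
    fprintf(out, "%s\r\n", n->word);
    dump_sorted(n->right, out);
}

int main(void)
{
    /* tiny hand-built tree: "mov" at the root, "add" left, "xor" right */
    Node a = { "add", NULL, NULL };
    Node x = { "xor", NULL, NULL };
    Node m = { "mov", &a, &x };
    dump_sorted(&m, stdout);    /* prints add, mov, xor */
    return 0;
}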