News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Toy that extracts sorted unique words from a file

Started by hutch--, July 25, 2005, 08:33:08 AM

Previous topic - Next topic

hutch--

This demo scans a text file and extracts a list of unique words, sorts them in either ascending or decending oder and writes the result to a disk file. I have not used STDIO because it is too slow and preferred to chop up the input and output for speed reasons. It uses an unbalanced iterative tree to test for duplicates, the string sorts from the MASM32 lbrary and the documented file IO macros so it should be a useful example for anyone who is interested in this type of work.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jibz

Nice job :thumbu.

I tried running a 6 mb text file through it, which only took approximately 1 second.

As a comparison I threw together 10 lines of perl using stdin/stdout and a regular expression, which took approximately 2 seconds on the same file.

hutch--

Jibz,

Thanks for running the test, I have nothing to test it against. Is it posible to load the file in one read in the perl interpreter and write the result directly back to disk as std IO is slow. It also means that the perl interpreter is performing well on the word scanning and duplicate detection.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dioxin

Jibz,
   <<I tried running a 6 mb text file through it, which only took approximately 1 second.>>

   Approximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.


Hutch,
   << I have nothing to test it against>>

   You seem to be doing a very similar job to the PowerBASIC contest of a few weeks ago. My contest entry, modified to do the same as your code, and your code run in about the same time (within the limits of the timer resolution).

   
Paul.

PBrennick

dioxin
So it is also a good example of converting a PowerBasic Utility to Assembly.  Good ideas can come from many sources.

Paul
The GeneSys Project is available from:
The Repository or My crappy website

Farabi

 :U I like it. I have find this function a long time ago.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

PBrennick

Hutch,
One thing, and not the only thing, I like about this utility is the resulting list is case sensitive.  Other versions I have tried never seemed to get this right and, to me, this is an important feature.  Also, I was surprised to learn that the KJV has only 13,748 unique words in a file that is 4,432,363 bytes wide and 176 of them are verse numbers so the result is even smaller.

Paul
The GeneSys Project is available from:
The Repository or My crappy website

hutch--

dioxin ,

I must be getting slow in my old age,  did not see the competition in PB.

Paul (Brennick),

Its very easy to make it case inensitive by just bashing the source with a case algo which would probably reduce the average word count a bit more. Depends on what result is required but a case sensitive version is probably more use.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jibz

Quote from: dioxinApproximately 1 second? It only takes 0.2s for a 4.9MB file here. Athlon 2600+ CPU.

Yes, we have different hardware .. what's your point?

5712944 Source file length in bytes
43906 unique words in file
15 MS file input
282 MS scan and list unique words
15 MS load and sort array of unique words
625 MS write output to file


for a grand total of 937 ms plus some time spent outside the counting parts .. approximately 1 second on this file for a full run on my box.

hutch--

I have been playing with the file tweaking a few bits here and there but the scan/load tree part will not drop in timing so it appears to be an algo limit.

Jibz,

The sample text I have had about 1 third of the word count for the sample you have used so I would expect the ratio of file input time to output time being about 6 times the input time which would roughly be between 90 and 100 MS on your box. Is there any reason why you file output time wold be so much slower. I am interested in what box you ran it on or if it is slow on file output.

This has been my typical timing with the sample I have.

4375977 Source file length in bytes
13680 unique words in file
15 MS file input
172 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dioxin

Jibz,
   <<what's your point?>>

   Just that there seems to be something slowing yours down unexpectedly.
   From the timings your file write time is way out of line with the other folk who have posted times. Perhaps this points to a problem on your system?



Hutch,
    I assume you're aware that the timings are very coarse. Resolution of the timer is only 15-16ms so most of your timings are 0,1 or 2 timer ticks. A finer resolution timer would give more representative results.


For comparison:

4947038 Source file length in bytes
13798 unique words in file
15 MS file input
188 MS scan and list unique words
0 MS load and sort array of unique words
31 MS write output to file


Paul.

PBrennick

Oops, forgot to post the results...

4432363 Source file length in bytes
13747 unique words in file
41 MS file input
460 MS scan and list unique words
10 MS load and sort array of unique words
50 MS write output to file

Paul
The GeneSys Project is available from:
The Repository or My crappy website

hutch--

Paul (D),

I know that GetTickCount() has poor resolution but so does cycle counts from ring3 access. I just don't take much notice of GetTickCount under about 150 - 200 MS. I couled not find the link to the PB competition you mentioned, if you have it handy could you post a link here ?

Jibz,

Is there any way to get a timing from the PERL script without counting the std IO ? I get the impression that the processing time for getting the list of unique words is a lot faster than the test is showing because of delays in the file IO method.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php


hutch--

Paul(D),

I did read the lot but after the "no inline asm" rule, I stopped taking much notice of it, when the criterion bacame things like style, readability, no gotos and the like, it started to sound like ancient chinese floral arrangements to me.

If you kept te modification that duplicaed the unique word toy posted here, I would be interested to see what it did.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php