Toy that extracts sorted unique words from a file

Started by hutch--, July 25, 2005, 08:33:08 AM

dioxin

#15
Hutch,
   I didn't try to do it in ASM but I expected the main processing to be probably 3 to 4 times faster in ASM than the PB version.
  The file read, file write and sort times I'd expect to be very similar in ASM or PB.

  The file is attached, it's a PBCC program.

Paul.

edit:
see later for a newer version of the attachment

hutch--

Paul,

It looks like it is fast, but I cannot get it to output any results. It does get the word count right.


H:\asm2\uwords\pdixon>contest asvbible.txt /a /n
Processing asvbible.txt

Total number of countable words= 833411
Unique countable words= 13680

Done in  .172sec.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dioxin

Hutch,
   the original contest was just to count the words so the output of words to file is an optional extra.

  From the console prompt type:
  contest myfile.txt  /a >outputfile.txt

This will process "myfile.txt" and sort the words: /a = alphabetically, /1 = in order of frequency. The default output is to the console, so if you want it in a file, redirect it using  >outputfile.txt
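
For anyone following along without the attachment, here is a rough Python sketch of what the two sort options amount to. It is not the PBCC program: the names count_words, sorted_words and report are made up for illustration, a "word" is assumed to be a run of ASCII letters, and /1 is assumed to mean most frequent first.

```python
import re
import sys
from collections import Counter


def count_words(text):
    """Count words, case-insensitively. A 'word' is assumed to be a run of a-z."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sorted_words(counts, mode="/a"):
    """Return (word, count) pairs: /a alphabetical, /1 by descending frequency."""
    if mode == "/1":
        # Assumption: most frequent first, ties broken alphabetically.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(counts.items())


def report(text, mode="/a", out=sys.stdout):
    """Write one 'word count' line per unique word; return (total, unique)."""
    counts = count_words(text)
    for word, n in sorted_words(counts, mode):
        out.write(f"{word} {n}\n")
    return sum(counts.values()), len(counts)
```

Because the list goes to standard output, redirecting with > on the command line captures it in a file, which is the behaviour described above.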

Paul.


hutch--

Paul,

I may be too tired to properly understand what you mean, but when I run that command line the redirected output file contains the same data as I pasted in before.


Processing asvbible.txt

Total number of countable words= 833411
Unique countable words= 13680

Done in  .172sec.

press a key to continue.


Have I messed this up, or have I misunderstood what it can do?

dioxin

#19
Oops, sorry. When I cut it down from the original I accidentally cut out the "output to file" option!

Attached is one that will output.

Paul.

edit:
I replaced the version here with a (hopefully!) working one.

[attachment deleted by admin]

hutch--

Thanks Paul,

This one redirects its output fine. I am not getting any timings at all with it though.


H:\asm2\uwords\pdixon>contesth asvbible.txt /a > tst.txt
Input file= 0
Scan and list unique words= 0
sort words= 0
Total time= 0


It looks fast, though when I run it with both /a and /n I get no output. The results and word count are correct.

dioxin

Hutch,
    I'll investigate tomorrow, but the times will be the same as the first time you ran it.

Paul.

hutch--

Thanks Paul.

This much I have digested so far: the tree structure you are using looks like it is very fast. The one I used in the uwords example is actually designed for another purpose and has a matching retrieval algorithm that tests words already in the tree. It is a lot faster with entries in sorted order, both writing and reading the table, and I need it that way for its original purpose: stacking a tree with a set of keywords and dumping the user-defined words onto it later. It takes a user-defined DWORD value, which can be either an ID number or a pointer to data stored elsewhere.
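
To make the idea of a word tree carrying a user value concrete, here is a minimal Python sketch. It is not the uwords code, whose internals are not shown in this thread: a plain character trie stands in for it, and the names WordTree, insert and lookup are hypothetical. The integer value plays the role of the DWORD (an ID here, since Python has no raw pointers).

```python
_VALUE = ""  # sentinel key; it can never collide with a one-character child key


class WordTree:
    """Character trie mapping words to a user-supplied integer (assumed design)."""

    def __init__(self):
        self.root = {}

    def insert(self, word, value):
        """Store value for word. Returns False if the word was already present."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        if _VALUE in node:
            return False
        node[_VALUE] = value
        return True

    def lookup(self, word):
        """Return the stored value, or None if the word is not in the tree."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_VALUE)
```

Loading the keywords first and the user-defined words afterwards then looks like this:

```python
tree = WordTree()
for ident, kw in enumerate(["mov", "add", "ret"]):
    tree.insert(kw, ident)          # keywords get IDs 0, 1, 2
tree.insert("myproc", 1000)         # user-defined word added later
```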

dioxin

Hutch,   
   sorry for the problem, Wednesday is a bad day for me!
   I've replaced the above attachment with a working version which shows times and the word list and includes the exe file.

   Although I just use a straightforward sort to get the alphabetic list, there is a much quicker way. The data is already stored in an alphabetically organised linked list, so the list could be scanned to extract the data a lot more quickly, but that would take a bit more programming effort.
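
As an illustration of scanning instead of sorting, here is a small Python sketch. It assumes a letter tree with 26 fixed child slots per node and lowercase a-z input only; the actual layout in the attachment may well differ, and Node, add_word and walk are made-up names. Visiting the slots in index order yields the words alphabetically with no sort step.

```python
class Node:
    """One tree node: 26 child slots (a-z) and a count of words ending here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * 26
        self.count = 0


def add_word(root, word):
    """Insert a lowercase a-z word, creating nodes as needed, and bump its count."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.count += 1


def walk(node, prefix=""):
    """Yield (word, count) pairs in alphabetical order by visiting slots 0..25."""
    if node.count:
        yield prefix, node.count
    for i, child in enumerate(node.children):
        if child is not None:
            yield from walk(child, prefix + chr(ord("a") + i))
```

A node is reported before its children, so a short word such as "app" comes out ahead of "apple", which matches dictionary order.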

   <<the tree structure you are using looks like it is very fast>>

   The tree structure used is the fastest I could think of!

   Possible improvements to the code to make it faster include:
   1)Extracting words from the tree properly by scanning instead of sorting the whole tree.
   2)Changing all the "if..then..else" structures in the main loop to a single jump table.
   3)Using a separate thread to read in the file so I can start processing while the file is loading.
   4)Maybe processing the file in cache sized chunks so the data is always in cache when I need it.
   5)Using ASM!
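
   Point 2 can be sketched in Python as a 256-entry table indexed by byte value, so the main loop does one lookup and one call per byte instead of a chain of tests. This is only an illustration under assumptions: scan_words and the three handlers are hypothetical names, words are taken to be runs of ASCII letters, and in ASM the table would hold code addresses rather than Python functions.

```python
def scan_words(data):
    """Split bytes into lowercase words using a per-byte dispatch table."""
    words = []
    current = bytearray()

    def lower(b):
        current.append(b)            # a-z: keep as is

    def upper(b):
        current.append(b + 32)       # A-Z: fold to lowercase

    def separator(b):
        if current:                  # anything else ends the current word
            words.append(current.decode("ascii"))
            current.clear()

    # The "jump table": one handler per possible byte value.
    table = [separator] * 256
    for b in range(ord("a"), ord("z") + 1):
        table[b] = lower
    for b in range(ord("A"), ord("Z") + 1):
        table[b] = upper

    for b in data:                   # main loop: no if/else on the byte class
        table[b](b)
    separator(0)                     # flush a word that ends at end of data
    return words
```

The table is rebuilt on every call here to keep the sketch short; a real implementation would build it once.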
   
   I might get around to doing those to see how fast I can really get it.

Paul.