News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Sorting a table of strings, or a text file

Started by jj2007, October 04, 2009, 12:07:43 AM

Previous topic - Next topic

jj2007

How would one translate strings into an integer value that can be sorted with one of the standard algos? Here is one way I could imagine, but maybe there are faster/more elegant ways to do this:

- create a table of structures for each string:
  StringInfo STRUCT
  initialCounter   DWORD
  initialPointer   DWORD
  sumofbytes   REAL10
  StringInfo ENDS

- take (the Ascii value of) byte 1 and multiply it with 256^16 (3.4e38)
- store result on the FPU, ST(1)
.Repeat
   - take byte 2, and multiply it with 256^15
   - add result to ST(1)
.Until ct==16 or byte==zero
- save as sumofbytes
- compare to current highest value, exchange if higher
- compare to current lowest value, exchange if higher
- when all strings are done,
- recalculate sumofbytes=(sumofbytes-lowest)/(highest-lowest)*2^31
- sort the structures according to sumofbytes, using an integer sort

Naturally, this will be inaccurate for string that are identical on the first 16 bytes and longer than 16 bytes.

I am aware of this thread but admit that I haven't fully understood the example (maybe because it's not my Basic dialect :bg)
Test strings from Hutch's routine:

vrmyfyozd
zfrrjbvl
uolcyqxd
zlegxnvwxtu
ghtvbn
bcglffa
ikfuhgjbzi
cmoz
pbadtebjur

Let's forget umlauts etc for the time being...

hutch--

JJ,

The basics of sorting lines in a text file is to parse it on a line by line basis into an array of pointers, the rest is simple. there is a routine in the masm32 library to parse a text file in place without having to rewrite all of the data into an array first. Once you have an array of pointers to the text objects, just shove it through your favourite text sorting algo.

LATER: The attached example should come close to doing what you need.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL flen      :DWORD          ; file length
    LOCAL pMem      :DWORD          ; file memory pointer
    LOCAL pcl       :DWORD          ; pointer to command line buffer
    LOCAL buf[260]  :BYTE           ; 260 byte buffer
    LOCAL pArray    :DWORD
    LOCAL acnt      :DWORD          ; array member count

    mov pcl, ptr$(buf)              ; load address of buffer into pointer

    .if rv(GetCL, 1, pcl) != 1      ; check if there is a command line
      jmp clerror
    .endif

    .if rv(exist,pcl) != 1          ; check if the file exists
      jmp no_file
    .endif

    mov pMem, InputFile(pcl)        ; load file into memory
    mov flen, ecx                   ; get the file length

  ; -----------------------------------------------
  ; tokenise the lines in the text file and produce
  ; an array of pointers to the lines of text
  ; -----------------------------------------------
    mov acnt, rv(ltok,pMem,ADDR pArray)

    invoke assort,pArray,acnt,100   ; sort the lines ascending

  ; ---------------------------------
  ; dump the sorted results to STDOUT
  ; ---------------------------------
    push ebx
    push esi

    mov ebx, acnt
    mov esi, pArray

  @@:
    print [esi],13,10
    add esi, 4
    sub ebx, 1
    jnz @B

    pop esi
    pop ebx
  ; ---------------------------------

    free pArray                     ; free the pointer array memory
    free pMem                       ; release the memory when finished

    ret

  no_file:
    print "Hmmmm, I can't find that file",13,10
    ret

  clerror:
    print "It helps if you load a valid file from the command line",13,10
    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

Hutch, for speed I use my optimized Red Black Tree->InsertTopDown_String algo
The sweet thing is both average and worst-case Search/Insert/Delete node time is O(lg N).
For every string I have the following Red Black Tree structure:
; struc of Red Black Tree String Node
; [eax+0*4] -> parent address
; [eax+1*4] -> left child address
; [eax+2*4] -> right child address
; [eax+3*4] -> address of the string
; [eax+4*4] -> color Red=0 or Black=1
; [eax+5*4] -> address of the linked list of string nodes with Equal Size or 0
Price of the speed is additional 6*4 bytes for every string but who cares, I have a disk with 1.5TB too... :lol

hutch--

lingo,

I have vaguely heard of it but never seen code for it or know enough about its design. The fastest string sort algo I have seen and coded is a Robert Sedgewick hybrid Radix Quicksort that had some interesting characteristics, I absolutely could not get it faster than a version written in C and that is after doing nearly every possible optimisation to it. The algo was limited by memory speed, not code execution but it was still faster than any quick sort I had.

Have you seen a red/black sort coded in assembler ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

"Have you seen a red/black sort coded in assembler?"

No,just in C/C++ and Java :'(

jj2007

Quote from: hutch-- on October 04, 2009, 01:26:49 AM

LATER: The attached example should come close to doing what you need.


It does, it does :thumbu

I was basically wondering how the part done by masm32\m32lib\asqsort.asm could be accelerated. Doing a string comparison for a million strings must be kind of slow, therefore my attempt to scan the strings once and assign a "point score" for each of them, then do an integer sort. But then, why reinvent the wheel if your library has it already...?
:bg

lingo

"The fastest string sort algo I have seen and coded is a Robert Sedgewick hybrid Radix Quicksort"
He has something about red-black trees too but I prefer Cormen's book.

The books are here

In the attached example I read all names of files and folders from local disk c:
and use red-black trees to sort them when the user presses the column's header.
Please, pay attention of the number of the files and folders and the sorting speed. :wink



hutch--

Thanks, I ran it an it seems to be fast. I did a CHKDSK on my boot drive and it has 51000 files so I gather the number at the top on the titlebar is the item count it sorts.

I get 2 numbers on the titlebar, 44864 and 1484, what are they ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

bomz

#8
Quoteinclude \masm32\include\masm32rt.inc

.data
buffer      db '4444444',13,10,0
      db '6666666',13,10,0
      db '5555555',13,10,0
      db '2222222',13,10,0
      db '3333333',13,10,0
      db '1111111',13,10,0
stringpointer   dd offset buffer, offset [buffer+10],\
      offset [buffer+20], offset [buffer+30],\
      offset [buffer+40], offset [buffer+50]

.code
start:
   invoke assort,ADDR stringpointer,6,10
   lea ebx, stringpointer
@@:
   invoke StdOut, [ebx]
   add ebx, 4
   cmp ebx, offset [stringpointer+24]
   jb @B
   inkey
   invoke ExitProcess,0
end start

Quoteinclude \masm32\include\masm32rt.inc

.data
string   dd '5555','1111','3333','2222','4444',0

.code
start:
   invoke CombSortA,ADDR string,5
   invoke MessageBox,0,addr string,0,0
   invoke ExitProcess,0
end start

KeepingRealBusy

Quote from: hutch-- on October 04, 2009, 04:59:22 AM
lingo,

I have vaguely heard of it but never seen code for it or know enough about its design. The fastest string sort algo I have seen and coded is a Robert Sedgewick hybrid Radix Quicksort that had some interesting characteristics, I absolutely could not get it faster than a version written in C and that is after doing nearly every possible optimisation to it. The algo was limited by memory speed, not code execution but it was still faster than any quick sort I had.

Have you seen a red/black sort coded in assembler ?

Hutch, I have one.

Dave.