News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Masmlib Quick Sort

Started by Jimg, November 21, 2007, 06:55:37 PM

Previous topic - Next topic

Jimg

I've been playing with the nrQsortA routine in Masmlib.  Does anyone know the worst case possible, or any failure cases for this routine?
I've test presorted, reverse sorted, constant, alternating high/low, and random input data.  The slowest is the random, so no suprises.   But I'm really interested in any "quadratic" behavior.

hutch--

Jim,

It a recode of a very old design from the days when stack memory was very limited but it should be its fastest on random data and get slower with more ordered data. No quick sort is safe on highly ordered data or a simple mistake where all the data is identical. Normally when you have to process fully unknown data you use a hybrid that will handle extreme behaviour so you don't get quadratic behaviour.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

That's what I thought too, but testing shows otherwise.
I generated 1,000,000 random number using nrandom.  This ran the slowest.
I tried a million sequential numbers, which ran in less than half the time of the random number pass.
I tried a it reversed, 1000000 to 1 which ran about the same as the sequential number run.
I tried the sequence -500000,500000,-499999,499999,-499998,499998, etc. which ran in about 11/16ths the time of the random pass.
I tried 1000000 numbers all identical, equal to 9999.  It ran only slightly slower than the sequential number run.
This seems to be a very robust version of quicksort, which is why I asked for a worst case.

I'll try to put together a test program that fit for human consumption shortly.  In the meantime, here's a gif of my results-



[attachment deleted by admin]

Jimg

Well, I've found one really bad sequence.  Tapering down from outside in,  eg. 10 8 6 4 2 1 3 5 7 9
Previous to this, the worst was random numbers.  One million random numbers take 132 milliseconds to sort.
One million numbers in this tapering sequence take 33770 milliseconds, or over 250 times longer.  Ugh.

Rockoon

thats why I dont use a quick sort

Its BEST runtime is O(n log n) which is not an acceptable runtime in some cases
Its WORSE runtime is O(N^2) which is NEVER an acceptable runtime

For arbitrary data I use merge sort, which has both a BEST and WORST case of O(n log n)
For mostly sorted data I use an insertion sort

For small lists I always use bubble or a network sort

Remember also that if the sort routine will be used in any sort of public setting, you should EXPECT that it will be handed data that produces worst-case behavior. I recall a public chess server that had been open source which used quicksort. A disgruntled user was causing the server to bog down, effectively a Denial Of Service attack, by handing it worst-case data.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

hutch--

I basically agree with ths view, an unprotected quick sort is a free kick to deliberately structured data. It is clearly the fastest on truly random data, (not necessarily the nrqsort in the masm32 lib) but it fails badly on ordered data or a large count of identical data.

To use it on open data i prefer to set a recursion depth indicator and if it exceeds that depth process the rest with a sort that is insensitive to data order. My own choice is a double ended comb sort (cocktail shaker but on a comb sort) then use an insertion sort when the gap reduces to one (1). A comb sort is highly predictable with the number of passes it makes and its time varies only on the number of swaps to perform.

A tri-median quick sort usually has some legs on a normal one but fails on ordered data. I found an Incerpi - Sedgewick shell sort that worked OK but it was easily beaten by the quick sort on random data and would still fail on some types of ordered data.

Comparing an insertion sort to a bubble sort, the insertion sort was always faster on near ordered data, usually about twice as fast.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Rockoon


Quick is often prefered over Merge because Merge typically requires O(N) elements of auxilary space whereas Quick only requires O(Log N) elements of auxilary space.

However, Merge has the advantage that it is a stable sort (if two equal sort keys are present, merge sort will preserve their ordering..) Quick cannot make such a guarantee unless it is implimented in a way that offers even more avenues of algorithmic attack.

I just can't get around the fact that when Quick is bad, it can be really really bad. Merge will always take the same amount of time over all permutations of inputs. It is unattackable from an algorithmic perspective; bullet proof.

Special-case alternatives include Radix sort, which is truely one of the fastest sorts you will find. Runtime can approach O(n). Unfortunately it requires a lot of extra storage space, so in practice it is even more prone to clever attacks on the algorithm.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Jimg

I agree.  I'm going back to combsort.  But it's been fun playing.  If anyone is interested, here's my mods to the masmlib quicksort (qsorta.asm).  On my Amd computer, it runs 20 to 30 percent faster.  On my celeron, only 5-10 percent faster.
nrQsortAx4 proc Arr:DWORD,count:DWORD
    LOCAL First,Last,cntr,bLen,hMem,temp
    ;note: this first section can be smaller, but need to kill
    ;      some bytes to get good 16 byte alignment at MainLoop anyway
    push ebx
    push esi
    push edi
    mov esi,Arr    ; source address kept in esi througout
    mov cntr,0     ; pointer into heap

    ;inv GetProcessHeap ; for testing, will always be known
    ;mov hHeap,eax
    mov eax,count       ; allocate temporary buffer
    add eax,40
    mov bLen,eax
    invoke SysAllocStringByteLen,0,eax
    ;inv HeapAlloc,hHeap,0,eax  ; better if you already have the heap handle
    mov hMem,eax
    mov edi,eax ; hMem  ; keep buffer address in edi thoroughout
    mov First,0         ; set First and Last reference values
    mov eax,count
    sub eax,1
    mov Last,eax        ; Last, when zero based array
   
    align 16
MainLoop:
    mov edx,Last
MainLoop2:              ; second entry into Main loop
    mov ecx,First
MainLoop3:
    mov eax,ecx          ; calculate midpoint
    add eax,edx
    shr eax,1            ; \2  cannot be combined with *4 in next instruction
    mov ebx,[esi+eax*4] ; value at midpoint kept in ebx throughout
    mov temp,ebx
    dec ecx         ; setup for tight loop
    .repeat
        inc ecx     ; look for value greater than midpoint on left side of group
InnerLoop2:
    .until sdword ptr [esi+ecx*4]>=ebx  ; else will match at midpoint and fall out
    inc edx         ; found one, or midpoint itself
    .repeat
        dec edx ; look for value less than midpoint on right side
    .until sdword ptr [esi+edx*4]<=ebx  ; will match at midpoint and fall out
    .if ecx<=edx
        mov eax,[esi+ecx*4]
        mov ebx,[esi+edx*4]     ; swap elements
        mov [esi+ecx*4], ebx
        mov [esi+edx*4], eax
   
        mov ebx,temp            ; restore midpoint
        inc ecx
        dec edx
        cmp ecx,edx
        jle InnerLoop2          ; go back and look for more values to move
    .endif

    .if sdword ptr ecx<Last  ; changed from <= to <
        mov eax, cntr
        mov [edi+eax*4], ecx
        mov ebx, Last
        mov [edi+eax*4+4], ebx
        add cntr, 2
    .endif
    mov Last, edx             ; new Last point
    cmp edx, First            ; compare Last & First
    jg MainLoop2
   
    mov eax,cntr
    .if eax!=0      ; are we done?
        sub eax,2   ; no, get previous First and Last
        mov cntr,eax
        mov ecx, [edi+eax*4]
        mov First, ecx
        mov edx, [edi+eax*4+4]
        mov Last, edx
        jmp MainLoop3
    .endif

    invoke SysFreeString,hMem
    ;inv HeapFree,hHeap,0,hMem

    pop edi
    pop esi
    pop ebx

    ret

nrQsortAx4 endp

Attached is the interactive test program I use...


[attachment deleted by admin]

Mark Jones

Interesting. Here's my results with an AMD XP 2500+ (1800MHz)

[attachment deleted by admin]
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Jimg

Rockoon-

I hate to admit it, but you're right.  I hacked together a non-recursive mergesort, (I just hate recursion), or at least what I would call a merge sort even though it's not much like any descriptions I read.  There was no particular attempt at optimizing.
The first time out of the box, it ran better than twice as fast as combsort.  Only 15% slower than quicksort on random data.  I don't really know what O(Log N)  means, but if N is array size, it uses at most an N/2 scratch array.
I can hardly wait to see what a little optimizing can do!

Rockoon

Quote from: Jimg on November 25, 2007, 07:50:06 PM
I don't really know what O(Log N)  means, but if N is array size, it uses at most an N/2 scratch array.
I can hardly wait to see what a little optimizing can do!

Big-O notation is used in computer science to classify algorithms based on how their runtime or memory footprint grows with input size (approximately)

For instance a bubble sort is O(N^2) meaning that its runtime growth is exponential. It isnt an exact count of the iterations but instead how iteration counts, given multiple N, will proportionally relate to each other. A naive bubble and an optimized bubble are both classified equally as O(N^2)

Most 'divide and conquer' algorithms have at their root a runtime of O(Log N) .. for instance a single binary search or a probe into binary space partition is O(Log N)

(Log is almost always a base-2 logarithm in Big-O notation .. which is approximately the number of bits needed to encode N)

In the case of Merge sort, there are N such 'divide and conquer' stages so the runtime is said to be O(N log N)

Probing a hash table is O(1) meaning it runs in ~constant time

For any specific algorithm, you have actual constant costs such as the cost of each iteration so an algorithm with a 'worse' Big-O runtime may perform better with sufficiently small N. An obvious case of this is bubble sort which has a very small per-iteration cost so when N is very small, bubble trumps most other sorts.

There exist a 'sweet spot' where given a specific N, bubble performs the same as merge in practice. For any N larger, bubble is inferior and very rapidly becomes worthless as N grows even larger.

Common Big-O runtimes you will encounter, from best to worst, are:


O(1)
    ex: Hash lookups
O(log N)   
    ex: Binary searches
O(N)
    ex: Linear searches
O(N log N)
    ex: Merge sort
O(N^2)
    ex: N-body problems
O(N^3)
    ..can't remember (traveling salesman?)
O(B^N)
    MiniMax/AlphaBeta gametree searches and such

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

hutch--

Jim,

This much, what use do you have for an integer sort, there are tons of examples of different types but in most instances an integer sort is useless. The most common application is string sorts with slightly more specialised tasks being array sorts where a number of fields or elements are grouped together.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Thanks Rockoon.

Hutch-
I agree, it's just for proof of concept.  A string sort will follow.  Sorting indexes to strings is very close to sorting integers, just have to free up an extra register or two for the comparison...
As far as arrays, the beauty of the mergesort, as Rockoon said above, is it's "stable", so multi element sorts can be done with consecutive sorts (though not very efficient that way).

...   Ok, so I'm a little excited.  Mergesort seems to be very efficient, and, unlike quicksort, easy to understand how and why it works.

cork

Guys we should have a contest!

How does this sound...

Someone could produce a 50 MB file of strings between 5 and 40 characters long (length is random). Then have a general routine to load this file into memory that everyone can use. Use as much memory as you want for data structures or a scratch buffer.

The platform would be Windows. The routine would be locked to a single processor and run as high priority. No multi-threaded code. No making any calls to outside routines, the code must be all your own x86 ASM code. The code would be 1 file that assembles in MASM32.

Then see who can code the fastest version.

We'd have to have a reference machine... on which the code would be run and judged.

Anybody interested in making an x86 Assembly Language sorting contest happen? It could be a clash of the titans for supremacy and bragging rights, and perhaps a small prize.

hutch--

I have got a genuinely fast Sedgewick hybrid Quick sort for string data but truly its no joy coded in masm.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php