Merge Sort

Jimg · January 02, 2008, 02:58:31 PM

That's very impressive on the random numbers, I'll have to give it a try.

Jimg · January 03, 2008, 03:28:04 AM

Ok, I wrote a radix sort the best way I could conceive of from the description of what it was.

Even though mine seems to be slower than the one tested by Neo (just looking at the numbers, I don't have his code to directly compare), I'm still impressed with how much faster it is on random numbers-

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


 Sort Timing tests, sorting 1048575 integers

 Input data    Routine      Cycles   Millisec

  Random:      qSortdx       439014   122.6
               MergeSort     387348   108.2   -11.77%
               RadixSort1    278557    77.8   -36.55%
  Sorted:      qSortdx       174444    48.7
               MergeSort      74006    20.7   -57.58%
               RadixSort1    1286643  359.4    637.57%
  Reversed:    qSortdx       251661    70.3
               MergeSort      88995    24.9   -64.64%
               RadixSort1    1279538  357.5    408.44%
  Constant:    qSortdx       211822    59.2
               MergeSort      73311    20.5   -65.39%
               RadixSort1    139994    39.1   -33.91%
  Alternating: qSortdx       256916    71.8
               MergeSort     347235    97.0    35.16%
               RadixSort1    1008451  281.7    292.52%
  Cyclic:      qSortdx       174382    48.7
               MergeSort      73411    20.5   -57.90%
               RadixSort1    1286068  359.3    637.50%
  Tapering:    qSortdx       951773   265.9
               MergeSort      86523    24.2   -90.91%
               RadixSort1    1331616  372.0    39.91%

But I'm really confused why it would take over 4 times longer to sort numbers that are already sorted than random numbers.
The routine goes through the exact same steps, moves the exact same number of dwords, does the exact same number of operations, regardless of what the input numbers are. The only difference is exactly where it stores each number on each pass. I can see that cache is probably the culprit (nothing else is left to blame), but why in the world would random numbers work better?

This being the Laboratory, would anyone like to wade in on this???

edit: remove file. later version below.

hutch-- · January 03, 2008, 07:44:11 AM

Jim,

here are the results on my older Northwood PIV.

Code Select


GenuineIntel  Family 15  Model 2  Stepping 5
Intel Brand String: Intel(R) Pentium(R) 4 processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2


 Sort Timing tests, sorting 1048575 integers

 Input data    Routine      Cycles   Millisec

  Random:      qSortdx       460190   128.6
               MergeSort     496356   138.7    7.86%
               RadixSort1    235230    65.7   -48.88%
  Sorted:      qSortdx        87931    24.6
               MergeSort      29402     8.2   -66.56%
               RadixSort1    1059169  295.9    1104.55%
  Reversed:    qSortdx       131154    36.6
               MergeSort      38912    10.9   -70.33%
               RadixSort1    1061986  296.7    709.72%
  Constant:    qSortdx       129773    36.3
               MergeSort      29475     8.2   -77.29%
               RadixSort1    343402    95.9    164.62%
  Alternating: qSortdx       216239    60.4
               MergeSort     539864   150.8    149.66%
               RadixSort1    1008597  281.8    366.43%
  Cyclic:      qSortdx        89778    25.1
               MergeSort      29385     8.2   -67.27%
               RadixSort1    1061655  296.6    1082.53%
  Tapering:    qSortdx       436026   121.8
               MergeSort      82419    23.0   -81.10%
               RadixSort1    925596   258.6    112.28%

Jimg · January 03, 2008, 02:07:30 PM

So your results are about the same, it takes 4.5 times longer to sort an array that is already sorted than it does to sort random numbers. Every other sort generally has the worst time with random numbers, this one does the best on random numbers, this just doesn't make any sense to me at all. Perhaps the very nature of the random number generator produces a pattern that oscillates just right to be efficiently sorted??? It has to be some stupid mistake I'm just not seeing. At the very worst, I expected the radix sort the take just as long to sort any sequence, i.e. slow on sorted data compared to other types of sorts, but not to take longer for sorted vs. random.

hutch-- · January 03, 2008, 10:51:15 PM

Jim,

This is normal with a quick sort that is dependent on data ordering for its timing. The more random the data is the lower the tree it builds at runtime to sort the data therefore it is faster as it has to traverse fewer iterations to get to the end of the tree.

Something I found from practice, beware of designing a sort algo on the basis of its integer comparison performance as there are other factors that come into play like comparison and swap speed of integers that distort the results. I found this when i was working on string sorts that have a string comparison in every iteration.

Jimg · January 04, 2008, 01:23:20 AM

After playing with the radix sort for hours, I found that 90% of the time was taken up by a single instruction, mov [edi],eax.
It seems that storing data randomly throughout memory is very expensive. I precomputed all the addresses for where the items should be copied to thinking that would be the fastest way. It only takes 6 millisecs to scan through the array and compute where everything should go. The next small loop simply picks up a value and looks up where it should go in a table, and then stores it. If I comment out the actual mov to memory, the loop runs in 7 millisecs, if I do the mov to memory, the loop takes 170 millisecs. This is for when the input is presorted. When the input is random data, the loop takes 19 millisecs including the mov to memory. I guess that's just the way it goes.

Code Select

    xor ebx,ebx     ; where the table address will be constructed
    mov esi,Arr     ; first input
    mov ecx,count
    align 4
    .repeat
        mov bl,byte ptr [esi] ; get lowest byte
        mov edi,Table[ebx*4] ; output address for this value byte
        ;movsd works faster on the AMD
        mov eax,[esi]
        add esi,4
        mov [edi],eax     ;   <---  this is the culprit that takes all the time, but why only on presorted data and not random data????
        add edi,4    
        mov Table[ebx*4],edi ; where next one will go
        dec ecx
    .until zero?

hutch-- · January 04, 2008, 04:00:11 AM

Jim,

That makes sense, some time ago I did a very large test piece (1.5 gig of code with a 350 meg result) to test memory page thashing and it sounds like you have that problem depending on the data arrangement. Unfortunate as the algo you have designed is showing legs on many data arangements.

Jimg · January 04, 2008, 03:55:41 PM

Okay, this is my last post about radix sort (he said hopefully), I'm going to use mergesort for my purposes.

As far as already sorted input goes, radix sort can sort 2 million numbers faster than 1,048,575 numbers. And not just a little faster either.

It seems that radix sort has trouble sorting an amount of numbers that is close to a power of 2.

I'm sure this is some kind of clue as to how to fix it, but it's beyond me.

Included is my latest radix sort source-

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


 Sort Timing tests, sorting 1048575 integers

 Input data    Routine      Cycles   Millisec

  Random:      qSortdx       442545   123.6
               MergeSort     390202   109.0   -11.83%
               RadixSort3    274794    76.8   -37.91%

  Sorted:      qSortdx       175156    48.9
               MergeSort      73684    20.6   -57.93%
               RadixSort3    1027596  287.1    486.67%

  Reversed:    qSortdx       251750    70.3
               MergeSort      90730    25.3   -63.96%
               RadixSort3    1019483  284.8    304.96%

  Constant:    qSortdx       212458    59.4
               MergeSort      74361    20.8   -65.00%
               RadixSort3    138483    38.7   -34.82%

  Alternating: qSortdx       257120    71.8
               MergeSort     346754    96.9    34.86%
               RadixSort3    931664   260.3    262.35%

  Cyclic:      qSortdx       174937    48.9
               MergeSort      73712    20.6   -57.86%
               RadixSort3    1018647  284.6    482.29%

  Tapering:    qSortdx       954909   266.8
               MergeSort      86655    24.2   -90.93%
               RadixSort3    1194400  333.7    25.08%

  


 Sort Timing tests, sorting 2000000 integers

 Input data    Routine      Cycles   Millisec

  Random:      qSortdx       894450   249.9
               MergeSort     750559   209.7   -16.09%
               RadixSort3    399618   111.6   -55.32%

  Sorted:      qSortdx       397105   110.9
               MergeSort     134977    37.7   -66.01%
               RadixSort3    491904   137.4    23.87%

  Reversed:    qSortdx       574870   160.6
               MergeSort     167419    46.8   -70.88%
               RadixSort3    510455   142.6   -11.21%

  Constant:    qSortdx       434186   121.3
               MergeSort     135334    37.8   -68.83%
               RadixSort3    265460    74.2   -38.86%

  Alternating: qSortdx       557377   155.7
               MergeSort     658192   183.9    18.09%
               RadixSort3    492800   137.7   -11.59%

  Cyclic:      qSortdx       397377   111.0
               MergeSort     134435    37.6   -66.17%
               RadixSort3    510752   142.7    28.53%

  Tapering:    qSortdx       2112309  590.1
               MergeSort     162962    45.5   -92.29%
               RadixSort3    470553   131.5   -77.72%

p.s. My version of a radix sort is not a true classical version. It is more of a hybrid bucketsort/radixsort.

[attachment deleted by admin]

Jimg · January 16, 2008, 02:38:13 AM

Here's a preliminary look at a more fully developed multi input type merge sort. The routine is so large (over 900 bytes) that I made it general purpose rather than making separate routines for each type of input.

Code Select

MergeSort25  PROC uses esi edi ebx Arr, count , stype
    ;
    ; This routine sorts by merging streams of already sorted numbers.  It
    ;  differs from a classic merge sort in that the streams will vary in size
    ;  depending upon the input data, and a table of stream information is kept
    ;  to aid rapid merging.

    ; Arr is the starting address of the input array.  Arr can contain either an
    ;  array of values, or an array of addresses of the items to sort depending
    ;  upon stype.

    ; count is the number of items to sort
    
    ; stype it the type of input that Arr contains, and how they are to be sorted:

    ;   1 = 32 bit signed integer values to be sorted ascendendingly
    ;   2 = addresses of strings, sort ascending
    ;   3 = addresses of strings, sort decending
    ;   4 = 32 bit signed integers, sort decending
    ;   5 = addresses of 32 bit integers, sort ascending
    ;   6 = addresses of 32 bit integers, sort decending

more types will be added as needed.

It was harder to set up test data than to add the extra code to handle strings. This test data is from two different files. Each file is read over and over until I get about 4,000,000 bytes of string data to sort. One of the file is coded as "\masm32\include\windows.inc" so you will have to change it if your windows.inc file is elsewhere. (I would be interested in testing some more realistic data if you have any.) I used assort from masmlib for the comparison.

The first try results for strings:

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


 Alphabetic Sort Timing test

 Input data    Routine      Cycles   Millisec

  constitution.txt:  4194300 bytes of text, 714069 total strings to sort

               assort        9468964  2645.3
               MergeSort25   1778495  496.8   -81.22%

  masm32includewindows.inc:  4194300 bytes of text, 478153 total strings to sort

               assort        4248938  1187.0
               MergeSort25   847573   236.8   -80.05%

I've also included the interger test to show it's still speedy after being converted to a general purpose sorting routine.

Note: I haven't done exhaustive testing on this yet, so there may still be some bugs :)

Edit: I removed the uploads from here because of some problems downloading them, and re uploaded new versions below.

hutch-- · January 16, 2008, 03:00:19 AM

Jim,

I tried to download the string version b8ut it won't download. is there any chance of you making your algo available without it having embedded test data so I can independently test it in various test beds ?

Jimg · January 16, 2008, 04:59:26 AM

Yes, surely. But I just realized, the test isn't fair. I preallocated the scratch buffer, so ignore the above results until I can do a fairer test. In the meantime, here's the routine itself, but I haven't changed the part to allocate the scratch. The code's in there, just commented out. I'll work on it.
edit: removed old version of file.

Jimg · January 16, 2008, 05:58:05 AM

Sorry Hutch, there's a bug somewhere in the scratch allocator code when doing windows.inc, works ok for constitution.txt, but it's just too late at night to find, I'll get it first thing in the morning.

Jimg · January 16, 2008, 03:14:11 PM

Ok, let's try this again now that I've had some sleep. Fixed simple typo that caused allocation bug. Added Scratch parameter. You can pass the address of a scratch buffer to use, or pass a zero and the routine will allocate scratch for itself. Scratch must be at least "count" dwords in size.

Here are the results of the string sort test where MergeSort allocates it own buffer, just to be fairly compared to assort.

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


 Alphabetic Sort Timing test

 Input data    Routine      Cycles   Millisec

  constitution.txt:  4194300 bytes of text, 713891 total strings to sort

               assort        8902892  2487.2
               MergeSort25   1782771  498.0   -79.98%

  masm32includewindows.inc:  4194300 bytes of text, 478036 total strings to sort

               assort        4246724  1186.4
               MergeSort25   849232   237.2   -80.00%

Since the previous upload seems to be messed up, I've deleted them and included here the updated string sort test, the interger sort test, and copy of the routine by itself. I've also changed the stype values to put the integer sorts together.

Code Select

MergeSort25  PROC uses esi edi ebx Arr, count , stype , Scratch 
    ; MergeSort by Jim Giordano
    ; This routine sorts by merging streams of already sorted numbers.  It
    ;  differs from a classic merge sort in that the streams will vary in size
    ;  depending upon the input data, and a table of stream information is kept
    ;  to aid rapid merging.

    ; Arr is the starting address of the input array.  Arr can contain either an
    ;  array of values, or an array of addresses of the items to sort depending
    ;  upon stype.

    ; count is the number of items to sort
    
    ; stype is the type of input that Arr contains, and how they are to be sorted:

    ;   1 = addresses of strings to be sorted ascendingly
    ;   2 = addresses of strings, sort decending
    ;   3 = 32 bit signed integers, sort ascending
    ;   4 = 32 bit signed integers, sort decending
    ;   5 = addresses of 32 bit integers, sort ascending
    ;   6 = addresses of 32 bit integers, sort decending
    
    ; Scratch is the first address of a scratch buffer of at least count dwords
    ;  in size, or zero if the routine is to allocate the scratch buffer.

More types will be added as needed.
edit: removed old version of mergesort25.zip

[attachment deleted by admin]

hutch-- · January 16, 2008, 11:07:55 PM

Jim,

Hre is a random word generator I used some years back when i was working on string sorts. It is slow as it calls a random generator for each chaacter in each word but it was useful for testing algos that required random data.

[attachment deleted by admin]

Jimg · January 17, 2008, 01:38:25 AM

Thanks, Hutch, I'll give it a go.

Later.....

Using Hutch's program to generate a big enough file to fill the test program, about 465000 strings

Code Select

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


 Alphabetic Sort Timing test

 Input data    Routine      Cycles   Millisec

  constitution.txt:  4194300 bytes of text, 713891 total strings to sort

               assort        9542917  2666.0
               MergeSort25   1781689  497.7   -81.33%

  masm32includewindows.inc:  4194300 bytes of text, 478036 total strings to sort

               assort        4385612  1225.2
               MergeSort25   849880   237.4   -80.62%

  tst.txt:  4194300 bytes of text, 465675 total strings to sort

               assort        1275146  356.2
               MergeSort25   875821   244.7   -31.32%

Not nearly as impressive, but still pretty good. I guess the duplicates were causing quicksort problems. The big reason I pursued mergesort is that it is stable, so I can sort on multiple "columns" properly. And being non-quadratic, as the above shows what can happen, is a big plus.

News:

Merge Sort