News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Merge Sort

Started by Jimg, December 27, 2007, 04:30:40 AM

Previous topic - Next topic

Jimg

I needed a fast sort, but all the fast ones (i.e. quicksort) were not stable, and had some quadratic behavior.  So I wrote a Merge Sort because I knew it would be stable and non-quadratic, and I though it should be pretty fast.  Turns out it's even faster than quicksort!  The only drawback I can think of is that it's rather large, it add about 600 bytes to your exe.

Here's the results on my AMD-
AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      nrQsortAx     609562   170.3
               MergeSort     387694   108.3   -36.40%
  Sorted:      nrQsortAx     261351    73.0
               MergeSort      74257    20.7   -71.59%
  Reversed:    nrQsortAx     263952    73.7
               MergeSort      88983    24.9   -66.29%
  Constant:    nrQsortAx     232466    64.9
               MergeSort      73743    20.6   -68.28%
  Alternating: nrQsortAx     576810   161.1
               MergeSort     347091    97.0   -39.83%
  Cyclic:      nrQsortAx     260739    72.8
               MergeSort      73818    20.6   -71.69%


and the results on my clunky old celeron laptop-
GenuineIntel  Family 15  Model 2  Stepping 7
Intel Brand String: Mobile Intel(R) Celeron(R) processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      nrQsortAx     837075   233.8
               MergeSort     718441   200.7   -14.17%
  Sorted:      nrQsortAx     209885    58.6
               MergeSort      71795    20.1   -65.79%
  Reversed:    nrQsortAx     222656    62.2
               MergeSort      92744    25.9   -58.35%
  Constant:    nrQsortAx     270274    75.5
               MergeSort      71855    20.1   -73.41%
  Alternating: nrQsortAx     540182   150.9
               MergeSort     771398   215.5    42.80%
  Cyclic:      nrQsortAx     209835    58.6
               MergeSort      71616    20.0   -65.87%



Enjoy---



[attachment deleted by admin]

six_L

Quote
GenuineIntel  Family 6  Model 13  Stepping 6
Intel Brand String: Intel(R) Pentium(R) M processor
Features: FPU TSC CX8 CMOV CLFSH FXSR MMX SSE SSE2

Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles      Millisec

Random:       nrQsortAx    779449   217.8
                  MergeSort    772169   215.7    -0.93%
Sorted:        nrQsortAx    263314    73.6
                  MergeSort    85396     23.9    -67.57%
Reversed:     nrQsortAx    281017    78.5
                  MergeSort    103580    28.9    -63.14%
Constant:     nrQsortAx    460908   128.8   
                  MergeSort    76724     21.4    -83.35%
Alternating:  nrQsortAx    575977   160.9
                  MergeSort    653427   182.5   13.45%
Cyclic:         nrQsortAx    263770    73.7
                 MergeSort    85655     23.9     -67.53%

Press ENTER to exit...
regards

hutch--

Jim,

Here is a reasonable Sedgewick/Bentley quick sort. See if this one clocks any better. It is not protected so with unusual data sets in can go quadratic but on random data it should be fast.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

sbsort proc arr:DWORD,lft:DWORD,rgt:DWORD

comment * ---------------------------------
        A quick sort algorithm published by
        Robert Sedgewick and John Bentley
        --------------------------------- *

    mov eax, rgt
    cmp eax, lft
    jg @F
    ret

  @@:
    push ebx
    push esi
    push edi

    mov ebx, arr
    mov edi, lft
    sub edi, 1
    mov esi, rgt
    mov edx, [ebx+esi*4]

  align 4
  entry:
  REPEAT 7
    add edi, 1
    cmp [ebx+edi*4], edx
    jge first
  ENDM

    add edi, 1
    cmp [ebx+edi*4], edx
    jl entry

  align 4
  first:
    sub esi, 1
    cmp edx, [ebx+esi*4]
    jge next
    cmp esi, lft
    je next

    sub esi, 1
    cmp edx, [ebx+esi*4]
    jge next
    cmp esi, lft
    jne entry

  align 4
  next:
    mov ecx, [ebx+edi*4]
    mov eax, [ebx+esi*4]
    cmp edi, esi
    jge @F
    mov [ebx+esi*4], ecx
    mov [ebx+edi*4], eax
    jmp entry

  align 4
  @@:
    mov esi, rgt
    mov ecx, [ebx+edi*4]
    mov eax, [ebx+esi*4]
    mov [ebx+esi*4], ecx
    mov [ebx+edi*4], eax

    sub edi, 1
    invoke sbsort,ebx,lft,edi
    add edi, 2
    invoke sbsort,ebx,edi,esi

    pop edi
    pop esi
    pop ebx

    ret

sbsort endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Hutch,

I wasn't sure what lft and rgt were supposed to be.  I tried lft=0, rgt=randsize-1 and lft=1, rgt=randsize.  Both produced many errors in sorting, and
could not sort (went quadratic and exceeded the stack) for the pre-sorted and cyclic input tests.  Here are the results I could get, even ignoring the errors in sorting-

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      sbsort        435059   121.5
               MergeSort     387685   108.3   -10.89%
  Reversed:    sbsort        323261    90.3
               MergeSort      88975    24.9   -72.48%
  Constant:    sbsort        219061    61.2
               MergeSort      74326    20.8   -66.07%
  Alternating: sbsort        199813    55.8
               MergeSort     347585    97.1    73.96%

hutch--

Jim,

I had a rat around the quick sorts I could find and found the tri-median one that seems to work OK. Its been a few years since I worked on sorts and I have a mountain of this stuff floating around so its not easy to pick the correct ones but this one seems to test OK.

I used it for an unusual purpose, a quick sort fails on certain data orderings so instead of trying to protect it, I set one up with a recursion depth indicator and when it went past that it exited and ran another sort.

I could not work out in a hurry how to use your sort as it had data buried in the data section but this one should be easy enough to benchmark using your own testbed.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Hutch-

Sorry I didn't get back sooner, the motel I was staying at last night had problems with their internet. (Snow storm buried the satellite dish!).  It will be 4 more days before I get home again, so I can only test on my old celeron, but the new quick sort looks very good-
GenuineIntel  Family 15  Model 2  Stepping 7
Intel Brand String: Mobile Intel(R) Celeron(R) processor
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      qSortd        663144   185.3
               MergeSort     700863   195.8    5.69%
  Sorted:      qSortd        160102    44.7
               MergeSort      72547    20.3   -54.69%
  Reversed:    qSortd        238438    66.6
               MergeSort      93528    26.1   -60.77%
  Constant:    qSortd        241700    67.5
               MergeSort      72924    20.4   -69.83%
  Alternating: qSortd        362287   101.2
               MergeSort     824375   230.3    127.55%
  Cyclic:      qSortd        158682    44.3
               MergeSort      72946    20.4   -54.03%
  Tapering:    qSortd        790334   220.8
               MergeSort     135983    38.0   -82.79%


It even did well on the Tapering input I designed especially to trip up quicksort.
The Alternating input I designed for worst case on my mergesort is the only one that shows up badly, so I'm still pretty happy  :bg


[attachment deleted by admin]

Jimg

After playing a bit with the last sort you posted, I can see several places that could be faster, so this was probably not your final version. 
    mov eax, rgt
    add eax, lft
    cdq
    sub eax, edx
    sar eax, 1
    mov esi, eax

I still can't figure out the purpose of the cdq as rgt and lft can never be negative, or they would be referring to locations out of bounds of the array, thus edx can never be anything other than zero, so this could be just    mov esi, rgt
    add esi, lft
    sar esi, 1


can you explain the purpose of the cdq?

This one is probably the fastest qsort I've seen, but a few simple changes makes it 3% faster.

hutch--

Jim,

It had a funy history, I got it as a three function C algo, merged the three in C to make it more compact and not dependent on call/ret syntax, converted it to MASM asm with CL.EXE then manually optimised it to get its speed up. I mainly worked on the critical loop code and in its target usage its reasonably fast. It will fail on known data sets which is characteristic of a quick sort. The tri-median front end isolates it from a few of the known problems but not from deliberate data ordering to make it quadratic.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

Jim,

I thought this insertion sort may be useful to you. In themselves they are too slow on almost all data but they are very fast on near ordered data which makes them useful with hybrid sorts. I have seen them used and used them to trim the end of another sort that is fast on large movements but gets slower near the end.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

iSortb proc arr:DWORD,lft:DWORD,rgt:DWORD

    mov ecx, [esp+4]    ; arr
    mov edx, [esp+8]    ; lft
    mov eax, [esp+12]   ; rgt

    push ebp
    push ebx
    push esi
    push edi

    mov ebp, edx

    add edx, 1
    jmp lbl1

  align 4
  lbl0:
    mov [ecx+esi*4], edi
    add edx, 1

  lbl1:
    cmp edx, eax
    ja quit

  lbl2:
    mov esi, edx
    mov edi, [ecx+edx*4]
    cmp esi, ebp
    jbe lbl0

  inner:
  REPEAT 7
    mov ebx, [ecx+esi*4-4]
    cmp ebx, edi
    jle lbl0
    mov [ecx+esi*4], ebx
    sub esi, 1
    cmp esi, ebp
    jle lbl0
  ENDM

    mov ebx, [ecx+esi*4-4]
    cmp ebx, edi
    jle lbl0
    mov [ecx+esi*4], ebx
    sub esi, 1
    cmp esi, ebp
    jg inner

  lbl3:
    mov [ecx+esi*4], edi
    add edx, 1
    cmp edx, eax
    jle lbl2

  quit:
    pop edi
    pop esi
    pop ebx
    pop ebp

    ret 12

iSortb endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

#9
Thanks Hutch.   When I started out, I was initially doing just a two stream merge, as in the classical definition of a merge sort.  In this case, I found the best results when I first did an insertion sort on every eight values throughout the array before starting the merge process.  Since I have started merging multiple streams of varying size at once, I haven't looked at insertion, but I'll go back and give it a try.

A little later...

Nope, doesn't make any speed increase with current scheme of merging.

Neo

I've never seen this QueryPerformanceCounter function being used before.  It definitely doesn't appear to be returning values in clock cycles, because you've got less than one "cycle" per number in all of the results displayed.  I usually use RDTSC for timing, so it's in the CPU's timing regime.  Does QueryPerformanceCounter use some standardised unit like microseconds?

Is it worth testing against radix sort?  I should dust off my old implementation some time and check.  I seem to recall that I got it down to less than 80 clock cycles per number.  Plus, you can adapt radix sort to do floats almost as fast as integers.  :wink

MichaelW

The Performance Counter runs at the frequency specified by QueryPerformanceFrequency. Within my experience the frequency is 1193180 Hz under Windows 9x and 3579545 Hz under Windows2000/XP, and the effectuve resolution in either case is something like 2 microsecond.
eschew obfuscation

Neo

Quote from: MichaelW on January 01, 2008, 08:06:35 AM
The Performance Counter runs at the frequency specified by QueryPerformanceFrequency. Within my experience the frequency is 1193180 Hz under Windows 9x and 3579545 Hz under Windows2000/XP, and the effectuve resolution in either case is something like 2 microsecond.
Okay, thanks!  I guess then it uses the Programmable Interval Timers (PITs) for that.  Just one of the timers for Windows 9x, and all three timers synchronized just right for Windows 2000/XP.  The APIC timers are tied up with thread scheduling, and the CMOS timer doesn't have high enough resolution.  I would have thought they'd just use the PITs at 1000.15 Hz (1193180/1193) for timer messages etc, but I guess they serve a dual purpose.

After doing the conversion to clock cycles, it looks like radix sort may actually be a contender.  It's got the added advantage that it's almost completely independent of the actual numbers in the array.  I'll try it out tomorrow.

Jimg

Quote from: Neo on January 01, 2008, 07:40:25 AM
I've never seen this QueryPerformanceCounter function being used before.  It definitely doesn't appear to be returning values in clock cycles, because you've got less than one "cycle" per number in all of the results displayed.  I usually use RDTSC for timing, so it's in the CPU's timing regime.  Does QueryPerformanceCounter use some standardised unit like microseconds?

Is it worth testing against radix sort?  I should dust off my old implementation some time and check.  I seem to recall that I got it down to less than 80 clock cycles per number.  Plus, you can adapt radix sort to do floats almost as fast as integers.  :wink
Hi Neo,
I would also use RDTSC for short duration tests, but for tests measured in milliseconds, it's easier to visually compare actual millisecs, so that's what I used.  I look forward to seeing your results with radix sort.  The only problem I see with radix sort is it's slower for already sorted data.  If implemented properly, it should be non-quadratic, and stable, two big problems with quicksort.

So far, the mergesort is faster than any other sort I can find on my AMD cpu.  Even faster than Hutch's tri-median quicksort (about 14% faster on random data), but a little slower on my celeron (about six percent slower).  Otherwise, the tri-median quicksort even if it is the fastest I've found, but it still suffers from the above two problems.  Since I can't think of any more ways to make mergesort faster, the next step is sorting text, which I expect to be slower than quicksort, even on the AMD.  I hope the radix sort turns out to be a competitor also.



Neo

Wow, the results for radix sort were rather different than I expected.  I did a standard base-256, 4-pass radix sort with a pre- and post-adjustment to account for that we're sorting as if the numbers are signed instead of unsigned.


GenuineIntel  Family 6  Model 13  Stepping 8
Intel Brand String: NA, unknown Brand ID
Features: FPU TSC CX8 CMOV CLFSH FXSR MMX SSE SSE2


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      nrQsortAx     716349   200.1
               MergeSort     690272   192.8   -3.64%
               RadixSort     291333    81.4
  Sorted:      nrQsortAx     247970    69.3
               MergeSort      88904    24.8   -64.15%
               RadixSort     809210   226.1
  Reversed:    nrQsortAx     273571    76.4
               MergeSort     122598    34.2   -55.19%
               RadixSort     826531   230.9
  Constant:    nrQsortAx     315613    88.2
               MergeSort      92966    26.0   -70.54%
               RadixSort     286531    80.0
  Alternating: nrQsortAx     558341   156.0
               MergeSort     584224   163.2    4.64%
               RadixSort     684534   191.2
  Cyclic:      nrQsortAx     247528    69.2
               MergeSort      90171    25.2   -63.57%
               RadixSort     819808   229.0


It's much faster than the other two for random numbers, and somehow it's much slower than itself on most of the other inputs.  I didn't expect that the input numbers would affect it so much, because it doesn't do any branching based on the numbers.  I suppose that it would make a difference to the caching, but I'd have guessed that having initially sorted or reverse-sorted numbers would improve the caching, making the alternating one the worst case for it, which isn't at all what happened.  Either way, it might just need larger inputs before the fact that it's linear time instead of nlogn gives it the upper hand, but 1048575 integers is a pretty big input for most stuff.

If someone's up for a challenge, try implementing JSort (see the Wikipedia article).  The insertion sort part of it can be implemented much faster in assembly than in C because you can repeatedly do xchg to move things forward while going forward in memory instead of going backward in memory.

Edit: I felt obliged to try base-65536, 2-pass radix sort too, just for the heck of it, but it turns out to be much slower.  Something in between base-256 and base-65536 might happen to be faster, but without the advantage of being byte-aligned, I doubt it.