News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Merge Sort

Started by Jimg, December 27, 2007, 04:30:40 AM

Previous topic - Next topic

hutch--

Jim,

I have a testbed waiting but I have yet to decypher a separate algo that I can use to test the string sort capacity. Any change of pasting in a complete algo without embedded data in it that I can benchmark against ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

#31
Hi Hutch,

I'm not understanding what you are asking for here.
There is no embedded data (other than a first time flag and a 6 word lookup table).

The call is:  invoke MergeSort25, Arr, count , stype , Scratch

where Arr is the address of an array of addresses of strings
count is the number of strings
stype is 1 to do a string sort ascending
Scratch is the address of scratch space, or zero to have the routine allocate scratch.

Just in case there is something wrong with the previous upload, here is the complete standalone routine again-
edit:  removed file, later version below.

hutch--

Jim,

Sorry to be a bit thick but I was not used to the args not having data types and i had to read the algo for a while just to see how it hung together. I have it in a test piece and its performing well. Here are my results with a test bed that works on optimum case quick sort data with the library version of assort and your new algo. I had a quick play with the quick sort algo again but I bashed it to death about 4 years ago and it will not clock any faster.


qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 843 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms
qssorta timing = 828 ms
MergeSort25 timing = 672 ms

Average lib assort  = 829
Average MergeSort25 = 672
Press any key to continue ...


You need to run ranwords.exe with the name "rand.txt" with a count of 1000000 to have a test file to run.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Thanks for testing Hutch.  Sorry about the dword thing.  Probably bad form, but I got tired of typing it every time.  I don't think I've every written a routine where the input type wasn't dword.  I find it more readable without, but that's just my quirkyness.

Here's a very minor update.  Added error comment if it can't do the VirtualProtect, and switched to HeapAlloc from SysAllocStringByteLen as it's no slower, and I don't have to include oleaut32, which I normally don't do.


[attachment deleted by admin]

Jimg

Would someone test this on Vista for me?  I don't care about the answers, I just want to know if there is any problem with DEP on vista.  Thanks.

[attachment deleted by admin]

lingo

Vista Ultimate with SP1... :bdg
GenuineIntel  Family 15  Model 4  Stepping 1
Intel Brand String: NA
Features: FPU TSC CX8 CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      qSortdx       493993   138.0
               MergeSort26   588661   164.5    19.16%

  Sorted:      qSortdx       102311    28.6
               MergeSort26    43729    12.2   -57.26%

  Reversed:    qSortdx       153538    42.9
               MergeSort26    53946    15.1   -64.86%

  Constant:    qSortdx       128596    35.9
               MergeSort26    45431    12.7   -64.67%

  Alternating: qSortdx       234346    65.5
               MergeSort26   468236   130.8    99.81%

  Cyclic:      qSortdx        88719    24.8
               MergeSort26    44617    12.5   -49.71%

  Tapering:    qSortdx       477491   133.4
               MergeSort26    66823    18.7   -86.01%



  Press ENTER to exit...

Regars,
Lingo

[attachment deleted by admin]

Jimg

Thank you very much, Lingo.  Just as I suspected, virtualprotect still works in Vista :bg

Mark_Larson

Quote from: Jimg on January 04, 2008, 01:23:20 AM
After playing with the radix sort for hours, I found that 90% of the time was taken up by a single instruction, mov [edi],eax.
It seems that storing data randomly throughout memory is very expensive.  I precomputed all the addresses for where the items should be copied to thinking that would be the fastest way.  It only takes 6 millisecs to scan through the array and compute where everything should go.  The next small loop simply picks up a value and looks up where it should go in a table, and then stores it.  If I comment out the actual mov to memory, the loop runs in 7 millisecs, if I do the mov to memory, the loop takes 170 millisecs.  This is for when the input is presorted.  When the input is random data, the loop takes 19 millisecs including the mov to memory.   I guess that's just the way it goes.

    xor ebx,ebx     ; where the table address will be constructed
    mov esi,Arr     ; first input
    mov ecx,count
    align 4
    .repeat
        mov bl,byte ptr [esi] ; get lowest byte
        mov edi,Table[ebx*4] ; output address for this value byte
        ;movsd works faster on the AMD
        mov eax,[esi]
        add esi,4
        mov [edi],eax     ;   <---  this is the culprit that takes all the time, but why only on presorted data and not random data????
        add edi,4   
        mov Table[ebx*4],edi ; where next one will go
        dec ecx
    .until zero?


you can use the PREFETCH instruction to get around this issue.  Use the version of it that only fetches the data into the L1 cache.

If you post your latest code with this test in it, I can take a look at it for you.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Jimg

Hi Mark,

This is the latest code for the radix test.  I quit working on it when I couldn't figure out what was making it slow.
For comparison, here's my latest results:AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE


Sort Timing tests, sorting 1048575 integers

Input data    Routine      Cycles   Millisec

  Random:      qSortdx       424136   118.5
               MergeSort     374340   104.6   -11.74%
               RadixSort3    256124    71.6   -39.61%

  Sorted:      qSortdx       173079    48.4
               MergeSort      70238    19.6   -59.42%
               RadixSort3    1022137  285.5    490.56%

  Reversed:    qSortdx       241878    67.6
               MergeSort      86153    24.1   -64.38%
               RadixSort3    1022298  285.6    322.65%

  Constant:    qSortdx       203579    56.9
               MergeSort      70270    19.6   -65.48%
               RadixSort3    134203    37.5   -34.08%

  Alternating: qSortdx       251230    70.2
               MergeSort     334675    93.5    33.21%
               RadixSort3    894057   249.8    255.87%

  Cyclic:      qSortdx       173621    48.5
               MergeSort      70294    19.6   -59.51%
               RadixSort3    1021670  285.4    488.45%

  Tapering:    qSortdx       924151   258.2
               MergeSort      83612    23.4   -90.95%
               RadixSort3    1169310  326.7    26.53%

  Press ENTER to exit...

I just don't understand why it would take 71 ms. to sort random and 285 ms. to sort already sorted.   A little longer I can see, that much, no.

[attachment deleted by admin]

hutch--

Jim,

That is not that uncommon, it means the algo is starting to become order dependent. A normal recursive quick sort wil go quadratic if that data is not suitable and they are generally at their fastest with random data.

Interestingly enough the string sort version you posted does not have that characteristic. It is faster on sorted data than with random data.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php