Optimizing a Function

Started by aalaardb, May 04, 2006, 05:41:35 AM


aalaardb

I'd like to ask several questions regarding assembly.  I am a C++ student using MSVC++ 2005's inline __asm {} to learn the basics of assembly.
The code I'm trying to optimize:
template<class T>
int CBinaryHeap<T>::moveUp(int index, int newKey)
{
    register int current2 = index >> 1;
    while (index > 1) {                                // while there is a parent to compare
        if (BinaryHeap[current2].key > newKey) {
            BinaryHeap[index] = BinaryHeap[current2];  // move parent down
            index = current2;                          // on the next iteration, compare the parent
            current2 >>= 1;                            // divide by 2: points to the parent of the parent
        }
        else {
            break;                                     // stop comparing; we've found the index to insert at
        }
    }
    return index;
}  // end moveUp


A binary heap is a sorted structure, basically a tree inside an array, whose indices serve as pointers to tree members.  To insert an element onto the heap, it is conceptually placed at the end of the array.  The newly placed element is then iteratively compared with its parent node and filters up, the parent moving down each time.  To avoid temporary objects and extra assignments, the insertion and the moving up have been separated into two functions: the new element is not actually written into the binary heap until its index has been determined and all its parents have moved down.  I have implemented the binary heap with one extra element, so the root of the tree is at Heap[1], its two children are at 2 and 3, etc.; this saves a bit of pointer calculation.  The data type of both data and key is int.
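To make the layout concrete, here is a minimal sketch, with data at offset 0 and key at offset 4 (which is what the disassembly below implies); the member names other than BinaryHeap and key are illustrative guesses:

struct HeapNode
{
    int data;   // addressed below as [base + index*8]
    int key;    // addressed below as [base + index*8 + 4]
};

template<class T>
class CBinaryHeap
{
public:
    int moveUp(int index, int newKey);
    // ...
private:
    HeapNode* BinaryHeap;   // slot 0 unused: root at 1, children of i at 2*i and 2*i+1
    int heapSize;           // index of the last occupied slot
};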

In the moveUp function, index starts as heapSize: the index of the last element, or rather where the new element would be placed if following the standard algorithm.  current2 is its parent, index / 2.

Now, the moveDown function, which is used when removing an element from the heap, is more complex: it needs to compare both children and move down.  Because the children are at index*2 and index*2+1, I wanted to optimize this.  I found that I couldn't in C++; anything I did made it slower.  So I tried rewriting moveDown in assembly, but again things were slower, this time much worse!  So I tried rewriting moveUp in assembly instead, because it was simpler.  This too is slower than if I compile with MSVC++.
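moveDown itself isn't shown here, but a plausible C++ version matching that description (a min-heap sift-down; the exact signature and names are guesses) would be:

template<class T>
int CBinaryHeap<T>::moveDown(int index, int newKey)
{
    int child = index << 1;                         // left child at index*2
    while (child <= heapSize) {                     // while there is a child to compare
        if (child < heapSize &&
            BinaryHeap[child + 1].key < BinaryHeap[child].key)
            ++child;                                // right child (index*2+1) is smaller
        if (BinaryHeap[child].key < newKey) {
            BinaryHeap[index] = BinaryHeap[child];  // move child up
            index = child;
            child <<= 1;
        }
        else {
            break;                                  // found the slot for newKey
        }
    }
    return index;
}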

So I looked at how MSVC++ compiled the above, and I got this:

mov  eax, dword ptr[index]
sar  eax, 1
mov  dword ptr[current2], eax

while_current#1:
cmp  dword ptr[index], 1
jle  end#1;

mov  ecx, dword ptr[this];
mov  edx, dword ptr[ecx+4];
mov  eax, dword ptr[current2];
mov  ecx, dword ptr[edx+eax*8+4];
cmp  ecx, dword ptr[newKey];
jle  end#2;

mov  edx, dword ptr[this];
mov  eax, dword ptr[edx+4];
mov  ecx, dword ptr[current2];
mov  edx, dword ptr[eax+ecx*8];
mov  eax, dword ptr[eax+ecx*8+4];
mov  ecx, dword ptr[this];
mov  ecx, dword ptr[ecx+4];
mov  esi, dword ptr[index];
mov  dword ptr[ecx+esi*8], edx;
mov  dword ptr[ecx+esi*8+4], eax;

mov  edx, dword ptr[current2];
mov  dword ptr[index], edx;
mov  eax, dword ptr[current2];
sar  eax, 1;
mov  dword ptr[current2], eax;
jmp  while_current#2;

end#2:
jmp  end#1;

while_current#2:
jmp while_current#1;

end#1:
mov  eax, dword ptr[index]

I have a few questions as to how / why MSVC++ compiled it this way, but first here is my rewritten assembly, copying and pasting some of this code:

__asm {
mov  ebx, index;
shr  ebx, 1;

while_current:
cmp  index, 1;
jle  end;

mov  ecx, dword ptr [this];
mov  edx, dword ptr [ecx+4];
mov  eax, ebx;
mov  ecx, dword ptr [edx+eax*8+4];
cmp  ecx, dword ptr [newKey];
jle  end;

mov  edx, dword ptr [this];
mov  eax, dword ptr [edx+4];
mov  ecx, ebx;
mov  edx, dword ptr [eax+ecx*8];
mov  eax, dword ptr [eax+ecx*8+4];
mov  ecx, dword ptr [this];
mov  ecx, dword ptr [ecx+4];
mov  esi, dword ptr [index];
mov  [ecx+esi*8], edx;
mov  [ecx+esi*8+4], eax;

mov  index, ebx;
shr  ebx, 1;
jmp  while_current;

end:
}
return index;

So a few things were changed that in my mind should yield extra performance:
1. I removed the double jumps.
2. I replaced current2 with EBX.
3. I cut down some moves when assigning index = current2; and current2 >>= 1;.
4. Other changes; but this is the current state of my code, and it only includes things that really seem like they would work.

However, after compiling my optimized code, it is slower by about 25 - 33%.
Now that you have all the information, here are my questions:

1: Are there any multiplications or additions in statements like this: mov edx, [eax+ecx*8]? I see many moves with pointer arithmetic, but I'm not sure how much such a move costs in ops. Or to put it another way, aren't there really 2 moves and an add in this: mov ecx, dword ptr [this]; mov edx, dword ptr [ecx+4];?
2: Would it be any faster to store eax+ecx*8 in a free register, so that it can be reused on the next move?
3: Why the double jumps? It looks like it's to handle the break, but surely it could / should have been eliminated.
4: Could it possibly be that the code is slower because it doesn't know how to handle the assembly in templates?
5: Could it possibly be that the code is slower because with assembly there aren't any global optimizations being done? (this is where a profiler would help, but my trial edition doesn't seem to have this, I'm just going by time)
6: Do you see any way you can improve on the MSVC++ assembly here, and/or are any of my optimizations improving or hurting?
7: I thought [] means to load from the memory using the address stored inside the [], so why is it doing things like mov edx, dword ptr [this]?
8: Where online can I find resources for AMD and Intel that show cost in cycles for ops, for up to date processors?
9: Would compiling in something like MASM32 be better than inlining?
10: How can I measure in clocks rather than time?

Thanks for your help!

Ossa

Hi,

not got time at the moment to look at the code too closely, but I'll answer the questions that I can quickly,

Quote
8: Where online can I find resources for AMD and Intel that show cost in cycles for ops, for up to date processors?
9: Would compiling in something like MASM32 be better than inlining?
10: How can I measure in clocks rather than time?

8: No, it differs so much from processor to processor that you won't find it... also previous and following instructions can and will affect speed as much as how long an instruction takes.

You should take a look at the Intel Optimisation manual - http://www.intel.com/design/pentium4/manuals/index_new.htm - and the AMD equivalent - http://developer.amd.com/ - can't find it quickly, sorry.

You might also like to grab a copy of the AMD code profiler. - http://developer.amd.com/downloads.aspx

9: Shouldn't make any difference (well... you could change the function prologue/epilogue, but in this case that would only be a small performance gain).

10: The rdtsc instruction, but in a multitasking environment, it can occasionally give you incorrect results.
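For reference, a minimal sketch of reading the time-stamp counter with MSVC inline asm (the helper name is illustrative); call it before and after the code under test and subtract to get a clock count, taking the minimum over many runs since a task switch in between will skew it:

unsigned __int64 readTSC()
{
    unsigned int lo, hi;
    __asm {
        rdtsc               // EDX:EAX = clocks since CPU reset
        mov  lo, eax
        mov  hi, edx
    }
    return ((unsigned __int64)hi << 32) | lo;
}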

Will try to reply to the rest later,
Ossa

[edit] oh, one other thing - please state what your processor is. [/edit]

hutch--

There are too many things involved to give a simple answer. In rough order: memory access reduction yields the best speed gain; jump reduction in high-speed loop code shows lesser gains; code alignment sometimes helps with labels that are being bashed at high speed; but where you can, architecture changes done properly yield the biggest gains of all.

Generally the version of CL in VC2005 works very well if you set the correct optimisation switches on the command line, and it is not a trivial task to improve on it time-wise. The method I use with CL is to put the C code in a separate module, build it, and link it into an app to see how fast it is. I then build it again with all of the optimisation turned off to get much slower, expanded code that has the advantage that it uses fewer registers and does not remove the stack frame, so you have enough room to work on the code.

You then manually optimise the results and keep timing the code to see if you can get it faster than the C code. If you know enough about the algo and how to optimise it in assembler, you will usually get a speed increase, but it's never a simple task.

Try porting the code to MASM, building it as a separate module and linking it into your C app, making sure you use the EXTERN syntax for its prototype. The dual advantage here is that the module will not affect internal optimisation and you can work on it separately.
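A minimal sketch of that arrangement (file names hypothetical; the real routine is a class member, so the heap pointer would also need passing explicitly):

// C++ side (heap.cpp) - prototype for the external module
extern "C" int moveUp(int index, int newKey);

; MASM side (moveup.asm), assembled separately and linked in
.586
.model flat, c
.code
moveUp proc index:DWORD, newKey:DWORD
    mov  eax, index     ; body goes here; the C convention returns in EAX
    ret
moveUp endp
end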

Ian_B

hi aalaardb

I love this sort of thing...

The trick with this sort of optimisation is to minimise the most time-consuming operations, which are generally memory reads. If you can hold values in registers instead of continually pulling them from memory, you are going to have faster code every time. So use as many registers as you can; the extra push/pop is usually worth it if the code accesses the memory location you are keeping in the saved register more than twice in a loop.

The next step is to examine which values don't change throughout the loop. If you pull these in from memory once before the loop instead of inside it, that will speed things up. Then work out what you can keep without writing back to memory, and what you can't, making use of the most useful features of the different registers to prioritise their use - like the direct sub-access to CL/CH in ECX, compared with only being able to sub-access SI in ESI, for instance. I am assuming here (because I've never used C++ inlining) that the compiler handles things like the initial register pushes/pops for you, as you have used EBX and ESI freely in your code without saving and restoring them explicitly.
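Roughly, what those compiler-inserted saves amount to:

push ebx            ; EBX and ESI are callee-saved in the C convention:
push esi            ; preserve them before using them as working registers
; ... loop keeping hot values in EBX/ESI ...
pop  esi            ; restore in reverse order before returning
pop  ebx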

So if you look carefully at the code the compiler produced, apart from the bizarre multiple jumps, you'll see that it continually accesses the same memory locations, mostly to create memory indexing that doesn't change (the base pointer for your BinaryHeap array). Worse than that, it reads data into one register, does something with it, then trashes that register with other data it could have saved and loads the same data immediately into another register.

Start by holding the most important numbers, [index] and [current2], in registers and not trashing them. If you keep [index] in EAX throughout the function, you won't even have to read it in again at the finish!  Then notice that you reuse the [[this]+4] and [newKey] values. If you start by keeping those values in registers too, you'll soon see how this can simplify out, and the loop handling becomes trivial. I'd advise you to go through the process yourself to understand how I reached it, but this is what I ended up with:

mov  ebx, dword ptr [index]
mov  edi, dword ptr [this]
mov  esi, dword ptr [newKey]        ; ESI holds [newKey] value
mov  edi, dword ptr [edi+4]         ; EDI holds [[this]+4] value

while_current#1:
mov  eax, ebx                       ; EAX always holds [index] value
shr  ebx, 1                         ; EBX always holds [current2] value
cmp  eax, 1
jle  end#1

mov  ecx, dword ptr [edi+ebx*8+4]
mov  edx, dword ptr [edi+ebx*8]
cmp  ecx, esi
jle  end#1
mov  dword ptr [edi+eax*8+4], ecx
mov  dword ptr [edi+eax*8], edx
jmp  while_current#1

end#1:                              ; EAX already contains index value


That sure looks simpler to me!  And it's almost identical to your original C code logic, which gives me high confidence that it's correct. Four memory accesses in the loop instead of 20!  Doing that sort of code analysis and reduction will get you better code every time. After that, the intricacies of one op over another, op ordering etc. are a black art, with emphasis on the word "art". Some of the best information for Intel is at Agner Fog's site: www.agner.org/assem

For instance, if speed really were an issue here, I'd want to look at those *8 multiplications in the addressing. By multiplying the index value by 8 once before the loop, comparing EAX with 8 instead of 1, and finishing with a SHR EAX, 3 at the end#1 point to recover the correct index value, you could replace all the complex EBX*8/EAX*8 addressing with plain EBX and EAX, which should certainly be faster. Like pushing/popping extra working registers, always trade a few slower operations outside a loop, to get frequently used values into a more useful range, for many slow ones inside. You'd want to test it empirically to check that it is definitely faster, but this should be even better for many loop iterations:

mov  ebx, dword ptr [index]
mov  edi, dword ptr [this]
mov  esi, dword ptr [newKey]        ; ESI holds [newKey] value
lea  ebx, [ebx*8]
mov  edi, dword ptr [edi+4]         ; EDI holds [[this]+4] value

while_current#1:
mov  eax, ebx                       ; EAX always holds [index] value *8
shr  ebx, 1                         ; halves the scaled value (equals [current2]*8 only when index is even)
cmp  eax, 8
jle  end#1

mov  ecx, dword ptr [edi+ebx+4]
mov  edx, dword ptr [edi+ebx]
cmp  ecx, esi
jle  end#1
mov  dword ptr [edi+eax+4], ecx
mov  dword ptr [edi+eax], edx
jmp  while_current#1

end#1:
shr  eax, 3                        ; EAX contains index value


That should answer most of your original questions, but...
Quote
7: I thought [] means to load from the memory using the address stored inside the [], so why is it doing things like mov edx, dword ptr [this]?
It's a redundant cast of the memory location as a DWORD, assuming "this" and "current2" and "index" are defined as DWORDs. It isn't required in MASM; the assembler "knows" that you are pulling a DWORD because you are loading it into a full 32-bit register, i.e. ECX or EDX etc. It doesn't do any harm, though.
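As for the brackets themselves: in MASM they are optional around a named variable, but they force a dereference when used with a register. A small sketch:

.data
val DWORD 5
.code
mov  eax, val           ; loads 5, the variable's value
mov  eax, [val]         ; identical in MASM - still loads 5
mov  edx, offset val    ; EDX = the variable's address
mov  eax, [edx]         ; dereference: loads the DWORD at that address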

Hope that helps. Seeing the result of the compiler and the difference in optimised code sure reconfirms why I only write in assembler now and wouldn't touch an HLL ever again...

Ian_B

aalaardb

Thanks, this is exactly the kind of help I needed.  Yes, MSVC++ is handling the pushes and pops for me.  It was a lot of work to copy the code as is, because MSVC++ doesn't allow you to copy and paste the disassembly, so I copied it all by hand.

Now, I may be a newb, but I'm not a dummy.  When I looked at the memory accesses, that was the first thing I changed.  But I didn't include it here because I wasn't sure they were actually memory accesses.  Notice the shifts.
Here is the code that I optimized as much as I could:
int test = (int) BinaryHeap;
__asm {
mov  eax, index;
cmp  eax, 1;
jle  end;

mov  ebx, newKey;
mov  ecx, eax;
shr  ecx, 1;//parent = child/2
shl  ecx, 3;//multiply times 8 for size
mov  esi, test;
add  esi, 4;//pointer to BinaryHeap[0].key

while_current:
mov  edx, esi;
add  edx, ecx;
mov  edx, [edx];//parentKey
cmp  ebx, edx;
//break if (newKey >= BinaryHeap[current2].key) - jge rather than jg, so equal keys stop the loop as in the C++
jge  end;

//move parent down
shl  eax, 3;//index*8
mov  dword ptr [esi+eax], edx;//BinaryHeap[index].key = parent key
sub  esi, 4;//esi now points to BinaryHeap[0].data
mov  edx, dword ptr [esi+ecx];//parent data
mov  dword ptr [esi+eax], edx;//BinaryHeap[index].data = parent data
shr  eax, 4;//undoes the *8 and halves: eax = current2, the new index
cmp  eax, 1;
jle  end;
mov  ecx, eax;
shr  ecx, 1;//parent = child/2
shl  ecx, 3;//multiply times 8 for size
add  esi, 4;//back to the key base
jmp  while_current;

end:
mov  index, eax;
}
return index;


my code: 8.3 seconds
MSVC++: 7.75 seconds
So even when I reduced the memory accesses from 20 to 4, it was still slower?

I copied your code, and verified it produces correct results (the one with *8 optimization produced incorrect results).  It took 9 seconds!  So I thought what in the world is MSVC++ doing here?  Missing optimizations?  Reporting unoptimized disassembly?

Then it struck me: I am viewing the disassembly in debug mode, but benchmarking in release mode!  Silly error :/
So I turned on optimizations for debug mode and viewed the disassembly.  Both insertElement and moveUp were inlined.  But it looks like I could still optimize the *8 with shifts.  So it's back to the drawing board, where I will re-merge insertElement and moveUp, then try optimizing that.
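For reference, the re-merged C++ starting point would look something like this (a sketch assuming the members described earlier; where heapSize gets incremented is a guess):

template<class T>
void CBinaryHeap<T>::insertElement(int newData, int newKey)
{
    int index = ++heapSize;                    // new element conceptually at the end
    int current2 = index >> 1;
    while (index > 1 && BinaryHeap[current2].key > newKey) {
        BinaryHeap[index] = BinaryHeap[current2];  // move parent down
        index = current2;
        current2 >>= 1;
    }
    BinaryHeap[index].key = newKey;            // single write once the slot is known
    BinaryHeap[index].data = newData;
}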

Ian_B

Quote from: aalaardb
I copied your code, and verified it produces correct results (the one with *8 optimization produced incorrect results).  It took 9 seconds!

Kinda worried that one produced correct results but the other didn't, since every reference to a *8 in the first code has been replaced by an already-multiplied value in the second. Try debugging it again if you can; I have to suspect you copied something incorrectly. I'm also very bemused that it appears to be slower than the atrocious compiler code. However you are timing it, that HAS to be wrong; it's surely not possible. So many fewer opcodes on that scale cannot be slower.

Quote
So even when I reduced the memory accesses from 20 to 4 it was still slower?
If you have replaced the compiler code with the version you just posted, then that is very possible. The reason is that you are shifting registers all over the place, which is a slow operation. You are also repeatedly doing SHR ECX, 1 immediately followed by SHL ECX, 3, which can be combined into a single SHL ECX, 2, which would take half the time.  Or it could even be replaced by ADD ECX, ECX then ADD ECX, ECX, which is faster still. Shifting like that only improves your speed if you do it, as I showed, outside the loop.

hutch--

aalaardb,

The way you have isolated the code is probably where the problem lies: you have taken a single piece of a working algo and separated it from the internal compiler optimisation, so even if the inline code you write is fast, it will be penalised by not being properly compatible with how the compiler integrates the entire algo.

The trick would be to put the entire algo into a separate module and build it first with all optimisation turned on to see how fast it is. Then rebuild it with optimisation turned off and the option to output assembler code. When you write the separate algo, try to inline the entire algo, as this reduces the stack overhead when you generate an assembler output. Once you have the assembler output, do what you must to get it to build in MASM, and then time it. Unoptimised it will be a lot slower, but you can then manually optimise the code until it gets up to speed.
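With the VC2005 command-line compiler that workflow is roughly (file name hypothetical):

cl /c /O2 heap.cpp      ; optimised build: the timing baseline
cl /c /Od /FA heap.cpp  ; optimisation off; /FA emits an assembler listing (heap.asm)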

You commonly do this with an algo like a quick sort, which is usually written as 3 functions: inline it, get the assembler output, then manually optimise it.

aalaardb

@Ian_B
I am using shifts instead of multiply and divide, since all the multiplications and divisions are by powers of 2.  This is still faster, even when the multiplier is 2, right?  Either way, I'll try it with adds instead.

I can't combine the SHR 1 and SHL 3 into SHL 2 - consider the case of index 3 (a right child), whose parent is 1: (3 >> 1) << 3 = 8, the correct parent offset, but 3 << 2 = 12.

Also, I found why it was slower.  It was the difference between debug and release settings: the debug disassembly was not optimized, but the release assembly was.

@hutch--
Right - I am going to inline the algorithm more or less just as the compiler would.  So instead of breaking up insertElement and moveUp, they are now back to one function (as a binary heap is implemented by default).

I want to emulate inlining by having the insertElement function take its 2 arguments directly in registers.  Is there a way to define my prologue to accomplish this, or will it just mean 2 moves?
Edit: also, how do I access the C++ member variables without passing them as arguments?  Would I need to use the this pointer?