Inlining ASM code

Glenn9999 · December 01, 2010, 12:53:36 PM

In the code optimization project I'm working on, the only remaining place where I see a possible improvement is a section that makes 64 separate procedure calls, all to assembler code. The profiler shows that the act of calling those procedures (not the assembler code being called) takes 6% of the total run, which works out to about 2.5 seconds.

So I'm asking if anyone has any strategies or suggestions on how people handle situations where inlining ASM code (as opposed to procedure calls) would be useful?

dedndave · December 01, 2010, 01:03:22 PM

it would help if we could see the example code, so we know what to suggest :P

Glenn9999 · December 01, 2010, 05:00:28 PM

Quote from: dedndave on December 01, 2010, 01:03:22 PM
it would help if we could see the example code, so we know what to suggest :P

Well I mean generally...I didn't know if there were any tricks or automation or the like to help out in inlining code. I'll provide an example.

In my MD5 code, I optimized the typical shifting of memory that most algorithms do. In doing that, I ended up with 4 procedures called 16 times with various blocks of data and the constants typical with MD5. This provided a substantial speed increase overall, but as I stated the profiler indicates that 6% of the total run is procedure build/teardown of these 4 procedures.

One of the procedures (advice welcome on optimizing it, too, if anyone is willing):

procedure FF(var a: DWORD; b, c, d, x: DWORD; s: BYTE; ac: DWORD); assembler;
// on entry: address of a = EAX, b = EDX, c = ECX
// offsets as used below, after the PUSH EDX:
// d = SS:[ESP+24] x = SS:[ESP+20] s = SS:[ESP+16] ac = SS:[ESP+12]
asm
  AND   ECX, EDX                    // c and b
  PUSH  EDX                         // save b
  NOT   EDX                         // not b
  AND   EDX, SS:[ESP+24]            // (not b) and d
  OR    EDX, ECX                    // ECX or EDX
  ADD   EDX, [EAX]                  // add a
  ADD   EDX, SS:[ESP+20]            // add x
  ADD   EDX, SS:[ESP+12]            // add ac
  MOV   CL, Byte Ptr SS:[ESP+16]    // move s to CL register
  ROL   EDX, CL                     // rot(a, s);
  POP   ECX                         // return b value for later
  ADD   EDX, ECX                    // add b
  MOV   [EAX], EDX                  // return value to a
end;
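
For readers who don't read x86 assembler, the step that FF performs can be sketched in C roughly as follows. This is an illustrative translation, not code from the thread: the names rotl32 and ff_step are hypothetical, and it assumes the usual MD5 round-1 function F(b, c, d) = (b and c) or ((not b) and d), which is what the assembler above appears to compute.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit value left by n bits.
   Optimizing compilers typically turn this idiom into a single ROL. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Hypothetical C counterpart of FF:
   a := b + rotl(a + F(b, c, d) + x + ac, s) */
static inline void ff_step(uint32_t *a, uint32_t b, uint32_t c, uint32_t d,
                           uint32_t x, uint8_t s, uint32_t ac)
{
    uint32_t f = (b & c) | (~b & d);      /* round-1 function F */
    *a = b + rotl32(*a + f + x + ac, s);  /* result is written back to a */
}
```

Because ff_step is declared static inline, a C compiler is free to paste its body at each call site, which is the effect being asked about for the assembler version.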

Now in the place that this procedure gets called, I have:

  FF (a, b, c, d, Block[ 0],  7, $d76aa478);
  FF (d, a, b, c, Block[ 1], 12, $e8c7b756);
  FF (c, d, a, b, Block[ 2], 17, $242070db);
  FF (b, c, d, a, Block[ 3], 22, $c1bdceee);
  FF (a, b, c, d, Block[ 4],  7, $f57c0faf);
  FF (d, a, b, c, Block[ 5], 12, $4787c62a);
  FF (c, d, a, b, Block[ 6], 17, $a8304613);
  FF (b, c, d, a, Block[ 7], 22, $fd469501);
  FF (a, b, c, d, Block[ 8],  7, $698098d8);
  FF (d, a, b, c, Block[ 9], 12, $8b44f7af);
  FF (c, d, a, b, Block[10], 17, $ffff5bb1);
  FF (b, c, d, a, Block[11], 22, $895cd7be);
  FF (a, b, c, d, Block[12],  7, $6b901122);
  FF (d, a, b, c, Block[13], 12, $fd987193);
  FF (c, d, a, b, Block[14], 17, $a679438e);
  FF (b, c, d, a, Block[15], 22, $49b40821);

Now by inlining, what I mean is that instead of the procedure call, you have the code that is contained within the procedure, placed directly at the call site.
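
One way to get this effect in a language with a preprocessor is a macro, which is how the reference C implementation in RFC 1321 writes its FF step. The sketch below is in C rather than Delphi and is only illustrative: the ROTL32 name and the round1_first_four driver are assumptions made for the example, and the constants are the first four from the call list above.

```c
#include <stdint.h>

/* Rotate left; valid for 0 < n < 32, which covers the MD5 shift counts. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* The macro body is pasted at every use, so there is no CALL/RET and
   no parameter passing; a is updated in place. */
#define FF(a, b, c, d, x, s, ac)                                      \
    do {                                                              \
        (a) += (((b) & (c)) | (~(b) & (d))) + (x) + (uint32_t)(ac);   \
        (a) = ROTL32((a), (s));                                       \
        (a) += (b);                                                   \
    } while (0)

/* Hypothetical driver: the first four steps of round 1, expanded inline.
   Note how the roles of a, b, c, d rotate from one line to the next. */
static void round1_first_four(uint32_t st[4], const uint32_t block[16])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];

    FF(a, b, c, d, block[0],  7, 0xd76aa478);
    FF(d, a, b, c, block[1], 12, 0xe8c7b756);
    FF(c, d, a, b, block[2], 17, 0x242070db);
    FF(b, c, d, a, block[3], 22, 0xc1bdceee);

    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}
```

The trade-off is code size: every use of FF repeats the whole body, so 64 steps means 64 copies of the instructions.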

Any constructive thoughts or ideas welcome...

dedndave · December 01, 2010, 05:44:24 PM

well - i don't read C very well :bg
but, it looks to me as though you could create an array of values and loop through the code
it would be considerably faster, as there would only be one call/ret pair

Glenn9999 · December 01, 2010, 06:15:17 PM

Quote from: dedndave on December 01, 2010, 05:44:24 PM
but, it looks to me as though you could create an array of values and loop through the code
it would be considerably faster, as there would only be one call/ret pair

Actually, setting up a loop is what required the memory shifting. Note the pattern of a, b, c, and d on the procedure calls. Removing the need to do something like:

t = a
a = b
b = c
c = FF(bla, bla, ...);
d = t

is the thing that caused the substantial gain (about 5 seconds on my test run, if I recall right).
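
The difference between the two approaches can be sketched in C. This is an assumption-laden illustration, not the poster's code: step, four_steps_loop and four_steps_unrolled are hypothetical names, and the loop uses the conventional MD5 shuffle (a takes d, d takes c, c takes b, b takes the new value) rather than the exact assignments quoted above.

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));   /* valid for 0 < n < 32 */
}

/* One round-1 step, returning the new value instead of storing it. */
static inline uint32_t step(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t x, unsigned s, uint32_t ac)
{
    return b + rotl32(a + ((b & c) | (~b & d)) + x + ac, s);
}

/* Loop form: the same code runs each pass, so the four words have to be
   shuffled through a temporary every iteration. */
static void four_steps_loop(uint32_t st[4], const uint32_t x[4],
                            const unsigned s[4], const uint32_t ac[4])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];
    for (int i = 0; i < 4; i++) {
        uint32_t nb = step(a, b, c, d, x[i], s[i], ac[i]);
        a = d;
        d = c;
        c = b;
        b = nb;
    }
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}

/* Unrolled form: the argument order rotates instead of the data, so no
   shuffling is needed; each word is updated where it already lives. */
static void four_steps_unrolled(uint32_t st[4], const uint32_t x[4],
                                const unsigned s[4], const uint32_t ac[4])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];
    a = step(a, b, c, d, x[0], s[0], ac[0]);
    d = step(d, a, b, c, x[1], s[1], ac[1]);
    c = step(c, d, a, b, x[2], s[2], ac[2]);
    b = step(b, c, d, a, x[3], s[3], ac[3]);
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}
```

Both functions should leave the same four words in st; the unrolled one simply trades the per-iteration moves for more code.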

dedndave · December 01, 2010, 10:55:34 PM

Block[ 0] is an address ???
what is the numerical address difference between Block[ 0] and Block[ 1] ???

Glenn9999 · December 02, 2010, 05:47:41 AM

Quote from: dedndave on December 01, 2010, 10:55:34 PM
Block[ 0] is an address ???
what is the numerical address difference between Block[ 0] and Block[ 1] ???

Block is an array of 16 DWord values accessed randomly (procedure #1 happens to be sequential). The number in the [] is a logical offset. So Block[0] is the base address of the array, and Block[1] is Base+4.

dedndave · December 02, 2010, 10:26:55 AM

randomly - that's the killer
where does the random index value come from ?
can we stick the random function inside the proc ?

if i understood the procedure a little better, i think it could go in a 4-pass loop inside a 4-pass loop
the inner loop could be unrolled for a little extra speed

Glenn9999 · December 02, 2010, 03:06:36 PM

Quote from: dedndave on December 02, 2010, 10:26:55 AM
randomly - that's the killer
where does the random index value come from ?

Actually it's not random at run time: each index is just a fixed constant as specified in the RFC (RFC 1321), and the order only looks random. People have published formulas for procs 2-4 as part of making it into a loop, but it's a lot easier to just specify the constants if the loop is going to cost more.
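
For reference, the commonly published index formulas can be sketched in C as below. The function name md5_block_index is hypothetical; the formulas themselves are the widely circulated ones, which reproduce the Block[] order listed for each round in RFC 1321.

```c
/* Hypothetical helper: which Block[] entry step i (0..15) of a given
   round (1..4) reads. Returns -1 for any other round number. */
static int md5_block_index(int round, int i)
{
    switch (round) {
    case 1:  return i;                 /* 0, 1, 2, ... (sequential) */
    case 2:  return (5 * i + 1) % 16;  /* 1, 6, 11, 0, ... */
    case 3:  return (3 * i + 5) % 16;  /* 5, 8, 11, 14, ... */
    case 4:  return (7 * i) % 16;      /* 0, 7, 14, 5, ... */
    default: return -1;
    }
}
```

In fully unrolled code these values are known at compile time, so writing Block[5] directly costs nothing, whereas a loop has to evaluate the formula (or read a table) on every pass.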

Maybe the better question for this forum, since I'm using an HLL with inline assembler (specifically Turbo Assembler): can I use MASM to make a master proc, callable from an OBJ, that contains all these calls and has the secondary calls inlined?

drizz · December 02, 2010, 03:28:51 PM

So, borland-inprise-codegear-embarcadero-whatever-is-it-called-nowadays delphi compiler still can't inline procedures ::)

Quote from: Glenn9999 on December 02, 2010, 03:06:36 PM
Maybe the better question for this forum, since I'm using an HLL with inline assembler (specifically Turbo Assembler): can I use MASM to make a master proc, callable from an OBJ, that contains all these calls and has the secondary calls inlined?

Yes you can use TASM - no obj conversion necessary, or you can use MASM/JWASM - obj conversion required. Either use omf2d (EliCZ) or objconv (Agner Fog) if you choose masm.

http://www.masm32.com/board/index.php?topic=5610.msg41785#msg41785

Glenn9999 · December 02, 2010, 09:27:53 PM

Quote from: drizz on December 02, 2010, 03:28:51 PM
So, borland-inprise-codegear-embarcadero-whatever-is-it-called-nowadays delphi compiler still can't inline procedures ::)

Actually it can. It just won't inline assembler procedures. I tested the equivalent Delphi code, inlined, against my assembler, and evidently the optimizations made in the assembler (specifically using the ROL instruction instead of emulating it) turned it into a wash, time-wise. It may not know the difference if I try a generated OBJ, but it might be worth finding out.

I do like the capability to write the ASM into the HLL source that the Delphi/C builder set offers, since it's been easy to pick out spots and just write a few instructions to substitute as opposed to developing full blown asm modules to do simple things. And coupling that with a good profiler has made it easy to see where I need to be spending my time. The inlining issue really has been the only problem I've run into.

Unfortunately, though, it'll probably go the same way as I read in other posts on here wrt Visual Studio 2010 and get pulled because the majority of the programming populace now thinks assembler to be a waste of time. Probably why they didn't make the HLL capable of inlining ASM procedures as well as HLL procedures.

Slugsnack · December 02, 2010, 09:33:49 PM

Not sure what compiler you're using but in VC++ you can inline assembler routines with something similar to '__inline __declspec(naked)'. I seem to recall doing it for GCC as well in the past. What compiler are you using ?
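
As a rough illustration of the GCC side of this, extended inline asm can sit inside a static inline function, and the compiler can then inline the function, asm included. The sketch below is an assumption-based example, not code from the thread: rotl32_asm is a hypothetical name, the asm path is only taken for GCC-compatible compilers targeting x86, and a plain C fallback is used everywhere else.

```c
#include <stdint.h>

/* Hypothetical helper: rotate left using the ROL instruction where possible. */
static inline uint32_t rotl32_asm(uint32_t x, uint8_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r": x is read and written in a general register.
       "c":  n is placed in ECX so the count is available in CL.
       "cc": ROL modifies the flags. */
    __asm__("roll %%cl, %0" : "+r"(x) : "c"(n) : "cc");
    return x;
#else
    /* Portable fallback for other compilers and targets. */
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
#endif
}
```

The operand constraints tell the compiler which registers the asm uses, which is what lets it place the instruction in the middle of surrounding code without a call.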

drizz · December 02, 2010, 10:50:26 PM

So your real problem is that delphi does not have ROTL/ROTR Intrinsics. Does CBuilder have them?

There is a third option you can do, and that is just include generated asm.

http://www.paco.net/~tol/hash/x86hotk.html

Glenn9999 · December 03, 2010, 01:31:44 AM

Quote from: drizz on December 02, 2010, 10:50:26 PM
So your real problem is that delphi does not have ROTL/ROTR Intrinsics.

Actually, application of assembler and some redesign of things solved a lot of the performance problems with the code I had. This is the only remaining performance problem I could identify but if it doesn't get solved, I won't be too upset since I still pulled 7.6 seconds off the test run I have set up in my optimization efforts. But I'd still like to solve it.

The "real problem" as indicated is the fact that calling these procedures like this (ROTL intrinsic put aside, the problem would still exist if I had a ROTL intrinsic) is costing about 6% of the total run. I'll try putting them into their own ASM module to see if the run time goes down.

Antariy · December 03, 2010, 02:25:55 AM

Quote from: Glenn9999 on December 03, 2010, 01:31:44 AM
Actually, application of assembler and some redesign of things solved a lot of the performance problems with the code I had. This is the only remaining performance problem I could identify but if it doesn't get solved, I won't be too upset since I still pulled 7.6 seconds off the test run I have set up in my optimization efforts. But I'd still like to solve it.

The "real problem" as indicated is the fact that calling these procedures like this (ROTL intrinsic put aside, the problem would still exist if I had a ROTL intrinsic) is costing about 6% of the total run. I'll try putting them into their own ASM module to see if the run time goes down.

Have you tried changing the calling convention from Borland's fastcall to stdcall? This may look strange, but in one thread I had code that initially used fastcall, and when I replaced it with stdcall it became a bit faster. Maybe at the OoO (out-of-order execution) level that is easier for the CPU than predicting the register contents when you make a call.
