Macros and assembly speed

Started by jj2007, May 11, 2008, 03:16:10 PM

jj2007

I am tinkering with some fairly complex macros and wonder whether there are any optimisation rules and/or experience regarding assembly speed. For example, is it expensive to use a FORC loop?
Benchmarking assembly speed is not that easy - I would need several thousand lines of code to test this, which somewhat contradicts the goal of having only a few lines for testing the macros...
I tried ml's /Sc option, but don't see any effect.
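
For a quick stress test, a large throwaway source can be generated mechanically instead of written by hand. A minimal Python sketch of the idea (the file name, the `MyMacro` invocation, and `masm32rt.inc` are my own placeholders, not something from this thread):

```python
# Generate a large .asm file that invokes a macro many times, so that the
# assembler's macro-expansion cost dominates the timing run.
# "MyMacro" is a placeholder for the macro under test.

LINES = 80_000  # number of macro invocations to emit

header = "include \\masm32\\include\\masm32rt.inc\n.code\nstart:\n"
body = "".join("    MyMacro 1, 2, 3\n" for _ in range(LINES))
footer = "    exit\nend start\n"

with open("tmp_macro_test.asm", "w") as f:
    f.write(header + body + footer)
```

Assembling the generated file once with and once without the macro body present isolates the expansion cost from the fixed startup cost of ml.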

bozo

using macros usually results in faster code because it doesn't have the overhead of branching.
there are a lot of neat tricks you can do in regard to generating the code.
if I were at my home computer right now I'd post some examples... if drizz is reading this, he could post a link to his crypto sources,
which have good examples of how to use macros.

jj2007

Quote from: Kernel_Gaddafi on May 11, 2008, 09:30:01 PM
using macros usually results in faster code because it doesn't have the overhead of branching.

Sure, faster and leaner code is my goal ;-)
But the question was actually how much it would slow down assembly of, say, 80,000 lines of code.

Quote
there are a lot of neat tricks you can do in regard to generating the code.
if I were at my home computer right now I'd post some examples... if drizz is reading this, he could post a link to his crypto sources,
which have good examples of how to use macros.

Looking forward to some crispy examples, thanx.

u

Quote from: jj2007 on May 11, 2008, 09:49:59 PM
But the question was actually how much it would slow down assembly of, say, 80,000 lines of code.
In my experience, by just dozens of milliseconds.
Please use a smaller graphic in your signature.

MichaelW

Judging from my crude test of a macro that performs computations but no "string" operations, the relationship between the number of macro expansions and the assembly time appears to be somewhat non-linear. The assembly times varied from 0.25 ms per iteration for 1000 iterations to 0.46 ms per iteration for 20000 iterations.

    @znew_seed@ = 362436069
    @wnew_seed@ = 521288629

    @rnd MACRO base:REQ
      LOCAL znew, wnew

      @znew_seed@ = 36969 * (@znew_seed@ AND 65535) + (@znew_seed@ SHR 16)
      znew = @znew_seed@ SHL 16

      @wnew_seed@ = 18000 * (@wnew_seed@ AND 65535) + (@wnew_seed@ SHR 16)
      wnew = @wnew_seed@ AND 65535

      EXITM <(znew + wnew) MOD base>
    ENDM
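
The compile-time arithmetic above is Marsaglia's multiply-with-carry recurrence, evaluated entirely by the assembler. A Python sketch of the same recurrence (the names mirror the macro; masking to 32 bits is my approximation of MASM's constant arithmetic):

```python
# Marsaglia-style multiply-with-carry, mirroring MichaelW's @rnd macro.
# Masking with 0xFFFFFFFF models 32-bit wraparound in the assembler's
# constant arithmetic (an assumption on my part).
znew_seed = 362436069
wnew_seed = 521288629

def rnd(base):
    """Return a pseudo-random value in [0, base)."""
    global znew_seed, wnew_seed
    znew_seed = (36969 * (znew_seed & 65535) + (znew_seed >> 16)) & 0xFFFFFFFF
    znew = (znew_seed << 16) & 0xFFFFFFFF
    wnew_seed = (18000 * (wnew_seed & 65535) + (wnew_seed >> 16)) & 0xFFFFFFFF
    wnew = wnew_seed & 65535
    return ((znew + wnew) & 0xFFFFFFFF) % base

values = [rnd(100) for _ in range(1000)]
```

Because the state lives in assembly-time equates, every expansion of `@rnd` advances the generator, which is why MichaelW's test exercises real per-expansion work rather than expanding a constant.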

[attachment deleted by admin]
eschew obfuscation

jj2007

Thanks, very helpful. Did you notice that Masm chokes at repeat counts above 32754?

OK:
    REPEAT 32754
      mov eax, @rnd(100)
    ENDM


Throws various errors:
    REPEAT 32755 ; and higher
      mov eax, @rnd(100)
    ENDM


BogdanOntanu

REPEAT is a special kind of macro because it writes the expansion result into the same macro expansion buffer multiple times.
The same holds for FOR, FORC and IRP kinds of macros.

Macros that define data, generate code, contain #ifdef statements, are recursive or nested, and use many arguments, locals and "&" will be more complex to handle, because they require the macro processor to perform more operations.

Having many MACROs in your project will slow down the compilation a little, because the compiler has to restart internally on each new macro buffer, and that takes time.

In conclusion: YES, many MACROs will slow down the compilation speed in a noticeable way.
At least that is how I see it from the compiler's point of view....
 
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

jj2007

Quote from: BogdanOntanu on May 15, 2008, 09:41:30 PM
REPEAT is a special kind of macro...
In conclusion: YES, many MACROs will slow down the compilation speed in a noticeable way.
That sounds logical. I will try to refine MichaelW's testbed for my purposes.
I will try to avoid recursion and REPEAT, but I cannot avoid either FORC or SUBSTR. Any idea which one is faster at parsing a string?
For example, FORC needs to be terminated with ENDM - does this imply more overhead compared to SUBSTR?

By the way, the REPEAT problem mentioned above seems to be related to the max number of temporary labels the macro compiler can generate - close to 32768.

BogdanOntanu

You should not avoid REPEAT. REPEAT is fast because it does less work ;)
You are right about SUBSTR: it is an internal assembler function and hence it should be much faster than a macro.

FORC is of the same kind as REPEAT and hence does less work than a "standard macro".

A standard macro has to check for an unknown number of parameters and substitute them symbolically in the macro expansion buffer. REPEAT does not do this. FOR and FORC do it, but with only one parameter, hence they should be slightly faster than standard macros.

IMHO the order of parsing speed is:

SUBSTR kind - technically not a macro
REPEAT - fastest
FOR, FORC - medium speed
MACRO - slowest 

Anyway, I have huge ASM projects (SOLAR_OS, SOL_ASM, HE) and the slowdown because of macros is noticeable but not critical. On smaller projects you can safely ignore this slowdown.
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

bozo

I tend to use macros for any CPU-critical code, but I've sometimes noticed a decrease in speed related to relative addresses.
My advice is to keep all offsets below +128 and above -128; this appears (at least to me) to generate much faster code.

Also, if you can break up dependencies, use 2 or more registers at the same time for the same operation.
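
bozo's second point can be illustrated even outside assembly: splitting one long dependency chain into two independent accumulators lets a superscalar CPU retire two additions per cycle. A Python sketch of the transformation (the speed benefit applies to the equivalent register code, not to Python itself; function names are mine):

```python
# One accumulator: each addition depends on the previous one (a serial chain).
def sum_one_chain(data):
    s = 0
    for x in data:
        s += x
    return s

# Two accumulators: the chains are independent, so on a CPU with two ALUs
# the equivalent register code can execute both additions in parallel.
def sum_two_chains(data):
    s0 = s1 = 0
    for i in range(0, len(data) - 1, 2):
        s0 += data[i]
        s1 += data[i + 1]
    if len(data) % 2:          # leftover element for odd-length input
        s0 += data[-1]
    return s0 + s1
```

The two functions compute the same result; only the dependency structure differs, which is exactly the property that matters to the out-of-order core.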

BogdanOntanu

Quote from: Kernel_Gaddafi on May 17, 2008, 05:05:46 AM
I tend to use macros for any CPU-critical code,

MACROs are good. They outperform HLL languages when it comes to flexibility, ease of use, and rapid implementation of concepts and algorithms.

However, the question here was about compilation speed, not execution speed.

And it is true that MACROs do slow down compilation noticeably. Well, it all depends what you mean by "noticeably". I expect my big projects, in the range of 150,000-300,000 lines of "macro style" ASM code (almost 1:1 to C/HLL code), to be fully compiled and linked in under 500ms.

If compilation of a huge project starts to take more than 1s then I will start checking my compiler :D
Fast application development requires a fast "change, compile, test, re-design" iterative loop.

Heavy macro usage can add an extra 100-300ms of compilation time, depending on compiler, macro complexity and CPU.


About execution speed.
=================

Quote
but I've sometimes noticed a decrease in speed related to relative addresses.
my advice is to keep all offsets below +128 and above -128; this appears (at least to me) to generate much faster code.

Be very careful with logical conclusions. Today's CPUs are complex beasts. You can observe a speed increase and attribute it to one cause when in fact some hidden aspect of the CPU architecture is the real and correct explanation for your observations.

In the case of your statements above:

An instruction encoding with a sign-extended +127/-128 offset is SMALL. However, SMALL does not mean FAST. Usually they are opposites.

That said, if your loops are critical then more code will fit in the code cache when the encoding is SMALL. This way you can observe a dramatic speed boost, but the reason is not exactly the sign-extended encoding - it is the cache.

In fact, LARGER encodings are FASTER from the CPU's point of view. This is because internally the CPU has to perform extra work to extend the sign of small operands.

It is not always sensible to generate small encodings at compile time and put more "heat" on the CPU at runtime, because it has to do that extra work each time it decodes such an instruction.

It always depends on your target. There is no universally best choice. If your code has to fit in a FLASH chip then SMALL does bring profit, because chips cost money. If you have enough memory, using LARGER encodings can at times "relax" the CPU's work and speed up your program in mysterious ways. But it can also slow it down if the code falls out of the code cache.

Hence do not base your optimizations on such "elusive" effects. They depend on CPU internals that are not always exposed or clear to you... Do them only in a very clear, specific case and only as the very last step. Try a better algorithm first - ideally an algorithm that is elusive to the world of HLL programmers who do not understand how the CPU operates.
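
The size difference Bogdan describes can be made concrete. For `mov eax, [ebx+disp]` in 32-bit mode, x86 uses a 3-byte form when the displacement fits in a signed byte and a 6-byte form otherwise. A Python sketch assembling both encodings by hand (the opcode and ModRM bytes are standard x86; the helper name is mine):

```python
import struct

def mov_eax_ebx_disp(disp):
    """Encode `mov eax, [ebx+disp]` (32-bit mode), picking disp8 when possible."""
    if -128 <= disp <= 127:
        # opcode 8B, ModRM 0x43: mod=01 (disp8), reg=000 (eax), rm=011 (ebx)
        return bytes([0x8B, 0x43]) + struct.pack("<b", disp)
    # opcode 8B, ModRM 0x83: mod=10 (disp32), same reg/rm, little-endian disp
    return bytes([0x8B, 0x83]) + struct.pack("<i", disp)

short_form = mov_eax_ebx_disp(64)    # 3 bytes: 8B 43 40
long_form = mov_eax_ebx_disp(256)    # 6 bytes: 8B 83 00 01 00 00
```

Keeping offsets within -128..+127 thus halves the size of such memory-operand instructions, which matters for code-cache footprint rather than for decode speed as such.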

Quote
Also, if you can break up dependencies, use 2 or more registers at the same time for the same operation.

Yes, this always helps on CPUs with multiple execution units, but "dependency" is here to stay because it is the expression of our algorithms. If you have no dependencies and no IFs, then most likely you do not have much of an algorithm, and in that case a simple DSP CPU with matrix or streaming vector instructions can outperform anything else.

Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

jj2007

Just found a convenient way to time assembly speed:

echo. >cr4time.tmp
time <cr4time.tmp
\masm32\bin\ml /nologo /c /coff %oDebugA% /Fo "%oBody%" tmp_file.asm >%LogAsm%
time <cr4time.tmp

Resolution is 10ms, fair enough. I get 0.75 seconds for a 100k source with difficult macros.
Typical output:

The current time is: 15:34:19.43
Enter the new time:
*** Assemble using ml /nologo /c /co

The current time is: 15:34:20.21
Enter the new time:
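
The 10 ms resolution of the batch trick can be improved with a tiny driver program around the assembler. A Python sketch of the idea (the ml command line is the one from the batch above; the use of Python and its high-resolution timer is my choice, not something from this thread):

```python
# Time an external build command with a high-resolution wall-clock timer.
import subprocess
import sys
import time

def time_command(argv):
    """Run argv once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=False)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Demo with a trivial command; replace with e.g.
    # [r"\masm32\bin\ml", "/nologo", "/c", "/coff", "tmp_file.asm"]
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed * 1000:.1f} ms")
```

Running the command several times and taking the minimum filters out file-cache warm-up on the first run.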

BogdanOntanu

Oh well... that is a method.

Or you could also write a very small program that uses the GetTickCount API to obtain the start and end times of your compilation process. Insert that program at the start and end of your compile batch. You could improve on that by creating the compilation process under your own program's control.

Or you could also use RadASM.

RadASM has an option to time the compilation process, and it has many other project browsing tools that are helpful for huge ASM projects.
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

PBrennick

I think I got this from Hutch, but it may have been MichaelW:



        .486                            ; Create 32 bit code
        .model  flat, stdcall           ; 32 bit memory model
        option  casemap :none           ; case sensitive

        include \masm32\include\windows.inc
        include \masm32\macros\macros.asm
        include \masm32\include\kernel32.inc
        include \masm32\include\masm32.inc
        include \masm32\include\user32.inc

        includelib  \masm32\lib\kernel32.lib
        includelib  \masm32\lib\masm32.lib
        includelib  \masm32\lib\user32.lib


.const

.data
        hWnd    dd 0                    ; parent window handle (none here)
        rcnt    dd 0                    ; value shown in the MessageBox caption

.data?

.code

Start:  invoke  SetPriorityClass, FUNC(GetCurrentProcess), REALTIME_PRIORITY_CLASS
        invoke  GetTickCount
        push    eax

        ; ---------------------
        ; Run your code here
        ; ---------------------

        invoke  GetTickCount
        pop     ecx
        sub     eax, ecx
        fn      MessageBox, hWnd, str$(eax), str$(rcnt), MB_OK
        invoke  SetPriorityClass, FUNC(GetCurrentProcess), NORMAL_PRIORITY_CLASS
        mov     eax, input(13, 10, "Press enter to exit...")
        exit

        end     Start


Paul
The GeneSys Project is available from:
The Repository or My crappy website

jj2007

Quote from: BogdanOntanu on June 03, 2008, 03:04:31 PM
Oh well... that is a method.

I know it's not the most elegant one, but it's a batch command and thus accessible to everybody including newbies.