Sharing example code, language benchmarks and thoughts

ptoulis · November 01, 2008, 04:44:19 PM

Hi all,

as a newbie I found it a little hard to start with masm and this forum was really helpful.
I thought giving something back, so here is a code that decides which number is prime among 1..100,000
The code might be helpful for beginners like me that find themselves in deep waters, and shows defining and using a procedure,
printing to a console, etc.

Code Select

.486

.model flat,stdcall
option casemap:none

include     \masm32\include\windows.inc
include \masm32\macros\macros.asm       ; MASM support macros

include     \masm32\include\kernel32.inc
include     \masm32\include\masm32.inc

includelib  \masm32\lib\kernel32.lib
includelib  \masm32\lib\masm32.lib


.data
TextNotPrime db "It is not prime",13,10,0
TextIsPrime db "It is prime!",13,10,0
.code

start:

 
   mov eax,1
   push eax
   forLoop:
       pop eax
       inc eax
       cmp eax,100000
       je finish
       push eax
       call isPrime
       jecxz NoPrime
       jmp YesPrime
       
NoPrime:
     invoke  StdOut,addr TextNotPrime
     jmp forLoop

YesPrime:
   invoke StdOut, addr TextIsPrime
   jmp forLoop

finish:
  exit


isPrime proc

   mov ecx,eax
   mov ebx,1 ; this it the register 
   mov eax,1 ;this has the product
   
      loopThis:
        inc ebx ;increase the register
        cmp ecx,ebx
        je defIsPrime
        xor edx,edx
        mul ebx
        xor edx,edx
        div ecx
        mov eax,edx
        cmp edx,0
        je defNotPrime
        jmp loopThis
        
    defNotPrime:
        mov ecx,0
        ret
    defIsPrime:
        inc eax
        cmp eax,ecx
        jne defNotPrime
        mov ecx,1
        ret
        
   isPrime endp   

end start

The algorithm is: For any given n, Compute start computing (n-1)! mod n. At any multiplication we apply a modulus to n, so that we don't get an overflow.
At any step we check if the product becomes a multiple of n, in which case we exit being sure the number is not a prime.
If when we are finished (n-1)! == (n-1) mod n, then we consider it a prime.

Just out of curiosity, I wrote the exact same algorithm in C, Java and Perl. Here are the results that were striking for me:

Lang/Metric	Time (secs)	Memory(KB)

Assembly	30	136
C	32	203
Java	34	28,000
Perl	>120	498

I am running on a Pentium 1.86Ghz, gcc 3.4.5, Java 1.6, Perl 5.10 and of course MASM 8
Java runs lighting fast but the entire JVM is a significant burden for the memory. But did you expect that Java would run close to C?
Also C and Java are both very close to Assembly code speed. Is this expected or have I missed something?

Mark Jones · November 01, 2008, 05:16:05 PM

Quote from: ptoulis on November 01, 2008, 04:44:19 PM
Also C and Java are both very close to Assembly code speed. Is this expected or have I missed something?

Hi ptoulis, thanks for the example. As you know, code execution speed is completely dependent on the efficiency of the instructions used to accomplish the goal. Here, the assembly results closely match the others probably because the instructions and methodologies used are similar or identical in all three. The JVM does have a known tendency to be slow and consume large amounts of memory, but for a simple algorithm such as this, the optimizing stage of its compiler must be producing efficient code. C has been around for a long time, and their optimizing stage can be quite good. (Did you try to compile it with full speed optimizations?)

That said, programming in assembly opens doors which are otherwise unavailable in higher-level languages. Since no compiler is deciding which instructions to use for you, then you are free to code using instructions and methods which may be more efficient. The downside to this is, that you must know quite a bit about the CPU and its instructions in order to write more efficient code. Still, changing a few instructions could have a big impact on the efficiency of the program. (Often, assembly optimization is a trade-off between code length and code speed -- ironically, replacing one small loop which contains one slow instruction with two or four (or 32) loops which contain four faster instructions, results in a faster overall performance.

Mark Larson has put together quite a few optimization tips. Agner Fog has also released some very useful papers and tools which are essential in this persuit. Please take a look at both:

http://www.mark.masmcode.com/
http://www.agner.org/optimize/

jj2007 · November 01, 2008, 10:58:49 PM

Quote from: ptoulis on November 01, 2008, 04:44:19 PM
Just out of curiosity, I wrote the exact same algorithm in C, Java and Perl. Here are the results that were striking for me:

Lang/Metric Time (secs) Memory(KB)
Assembly 30 136
C 32 203
Java 34 28,000
Perl >120 498
...
Also C and Java are both very close to Assembly code speed. Is this expected or have I missed something?

Try eliminating the StdOut. Printing to screen is by far the slowest part of your code, and both assembly and high level languages go through this bottleneck.

ptoulis · November 02, 2008, 12:15:08 AM

Thank you for the great optimization links! Based on what I read my code sucks! The docs say it is better to replace INC with ADD, remove unnecessary CMP's, care for dependence etc. I hope to learn this stuff along the way.

Quote from: jj2007 on November 01, 2008, 10:58:49 PM
Try eliminating the StdOut. Printing to screen is by far the slowest part of your code, and both assembly and high level languages go through this bottleneck.

I did cut the printing function and there was 50% increase in performance! This holds both for the other languages (e.g. Java dropped to 17 secs)

Another weird thing was that when I launched the Java and Assembly program at the same time, Windows seemed to favor the Assembly program,
handing it off almost 50% of the CPU and only ~26% to the Java program. And despite in the order that the programs were loaded.

Is there any reasonable explanation for this? Windows favors small-sized programs in CPU allocation?

hutch-- · November 02, 2008, 12:23:28 AM

The JAVA support files may be changing the processor priority to favour shared processor resource.

ptoulis · November 02, 2008, 12:27:24 AM

Quote from: hutch-- on November 02, 2008, 12:23:28 AM
The JAVA support files may be changing the processor priority to favour shared processor resource.

You mean that Java itself could lower the priority of the running thread?
Process Explorer in Windows reported that both applications had a priority of 'Normal'.

Mark Jones · November 02, 2008, 03:21:17 AM

Quote from: ptoulis on November 02, 2008, 12:15:08 AM
The docs say it is better to replace INC with ADD, remove unnecessary CMP's, care for dependence etc. I hope to learn this stuff along the way.

Indeed, this is all any of us can hope for. :bg

ADD is faster on the pentium, however on the newer AMD chips, INC is significantly faster. If speed is absolutely paramount (and multiple processors are expected), then the best thing to do would be to write customized versions for each processor. Then test for the processor type at the beginning of the code and branch to the matching routine. (Or create self-benchmarking code -- of course the user must be notified of such a benchmark in that case, so they could close their other programs.)

News:

Sharing example code, language benchmarks and thoughts

ptoulis

Mark Jones

jj2007

ptoulis

hutch--

ptoulis

Mark Jones