Hi all,
As a newbie I found it a little hard to get started with MASM, and this forum was really helpful.
I thought I'd give something back, so here is some code that tests every number up to 100,000 for primality.
The code might be helpful for beginners like me who find themselves in deep waters; it shows how to define and use a procedure,
print to the console, etc.
.486
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\macros\macros.asm ; MASM support macros
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib
.data
TextNotPrime db "It is not prime",13,10,0
TextIsPrime db "It is prime!",13,10,0
.code
start:
    mov eax,1
    push eax
forLoop:
    pop eax
    inc eax                         ; next candidate n
    cmp eax,100000
    je finish
    push eax                        ; save n for the next pass
    call isPrime                    ; returns ecx = 1 if prime, 0 if not
    jecxz NoPrime
    jmp YesPrime
NoPrime:
    invoke StdOut,addr TextNotPrime
    jmp forLoop
YesPrime:
    invoke StdOut,addr TextIsPrime
    jmp forLoop
finish:
    exit
isPrime proc
    ; in:  eax = n    out: ecx = 1 if n is prime, else 0
    ; computes (n-1)! mod n, reducing after every multiplication
    mov ecx,eax                     ; ecx = n
    mov ebx,1                       ; ebx = current factor
    mov eax,1                       ; eax = running product mod n
loopThis:
    inc ebx                         ; next factor
    cmp ecx,ebx
    je defIsPrime                   ; all factors up to n-1 are done
    mul ebx                         ; edx:eax = product (can exceed 32 bits)
    div ecx                         ; edx = product mod n (quotient < n, so no overflow)
    mov eax,edx
    test edx,edx
    je defNotPrime                  ; product is a multiple of n -> composite
    jmp loopThis
defNotPrime:
    mov ecx,0
    ret
defIsPrime:
    inc eax                         ; prime iff (n-1)! mod n == n-1
    cmp eax,ecx
    jne defNotPrime
    mov ecx,1
    ret
isPrime endp
end start
The algorithm is: for any given n, compute (n-1)! mod n. We reduce modulo n after every multiplication so that the product doesn't overflow.
At each step we check whether the product has become a multiple of n, in which case we exit, certain the number is not prime.
If, when we are finished, (n-1)! mod n == n-1, we consider the number prime (this is Wilson's theorem: (n-1)! ≡ -1 (mod n) exactly when n is prime).
Just out of curiosity, I wrote the exact same algorithm in C, Java and Perl. Here are the results that were striking for me:
Lang/Metric | Time (secs) | Memory (KB)
------------+-------------+------------
Assembly    |      30     |     136
C           |      32     |     203
Java        |      34     |  28,000
Perl        |    >120     |     498
I am running on a 1.86GHz Pentium, with gcc 3.4.5, Java 1.6, Perl 5.10 and of course MASM 8.
Java runs lightning fast, but the entire JVM is a significant burden on memory. Did you expect Java to run that close to C?
Also, C and Java are both very close to the assembly code's speed. Is this expected, or have I missed something?
Quote from: ptoulis on November 01, 2008, 04:44:19 PM
Also, C and Java are both very close to the assembly code's speed. Is this expected, or have I missed something?
Hi ptoulis, thanks for the example. As you know, code execution speed depends entirely on the efficiency of the instructions used to accomplish the goal. Here, the assembly results closely match the others probably because the instructions and methods used are similar or identical in all three. The JVM does have a known tendency to be slow and to consume large amounts of memory, but for a simple algorithm such as this, the optimizing stage of its compiler must be producing efficient code. C compilers have been around for a long time, and their optimizers can be quite good. (Did you try compiling with full speed optimizations?)
That said, programming in assembly opens doors which are otherwise unavailable in higher-level languages. Since no compiler is deciding which instructions to use for you, you are free to code with instructions and methods which may be more efficient. The downside is that you must know quite a bit about the CPU and its instruction set in order to write more efficient code. Still, changing a few instructions can have a big impact on the efficiency of a program. (Often, assembly optimization is a trade-off between code length and code speed -- ironically, replacing one small loop which contains one slow instruction with two or four (or 32) loops which contain four faster instructions results in faster overall performance.)
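To illustrate that trade-off in C rather than assembly, here is a sketch: the same summation written as a plain loop and unrolled by four (the function names are just for illustration). The unrolled body is longer, but its four independent accumulators break the dependency chain, so the CPU can overlap the additions.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain loop: one addition per iteration, each dependent on the last. */
uint32_t sum_simple(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: more code, but four independent additions per
   iteration that the CPU can schedule in parallel. */
uint32_t sum_unrolled(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)               /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```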
Mark Larson has put together quite a few optimization tips. Agner Fog has also released some very useful papers and tools which are essential in this pursuit. Please take a look at both:
http://www.mark.masmcode.com/
http://www.agner.org/optimize/
Quote from: ptoulis on November 01, 2008, 04:44:19 PM
Just out of curiosity, I wrote the exact same algorithm in C, Java and Perl. Here are the results that were striking for me:
Lang/Metric | Time (secs) | Memory (KB)
------------+-------------+------------
Assembly    |      30     |     136
C           |      32     |     203
Java        |      34     |  28,000
Perl        |    >120     |     498
...
Also, C and Java are both very close to the assembly code's speed. Is this expected, or have I missed something?
Try eliminating the StdOut. Printing to screen is by far the slowest part of your code, and both assembly and high level languages go through this bottleneck.
Thank you for the great optimization links! Based on what I read, my code sucks! The docs say it is better to replace INC with ADD, remove unnecessary CMPs, watch out for dependencies, etc. I hope to learn this stuff along the way.
Quote from: jj2007 on November 01, 2008, 10:58:49 PM
Try eliminating the StdOut. Printing to screen is by far the slowest part of your code, and both assembly and high level languages go through this bottleneck.
I did cut the printing function and got a 50% increase in performance! The same holds for the other languages (e.g. Java dropped to 17 secs).
Another weird thing: when I launched the Java and assembly programs at the same time, Windows seemed to favor the assembly program,
handing it almost 50% of the CPU and only ~26% to the Java program, regardless of the order in which the programs were loaded.
Is there any reasonable explanation for this? Does Windows favor small programs in CPU allocation?
The JAVA support files may be changing the processor priority to favour shared processor resource.
Quote from: hutch-- on November 02, 2008, 12:23:28 AM
The JAVA support files may be changing the processor priority to favour shared processor resource.
You mean that Java itself could lower the priority of the running thread?
Process Explorer in Windows reported that both applications had a priority of 'Normal'.
Quote from: ptoulis on November 02, 2008, 12:15:08 AM
The docs say it is better to replace INC with ADD, remove unnecessary CMPs, watch out for dependencies, etc. I hope to learn this stuff along the way.
Indeed, this is all any of us can hope for. :bg
ADD is faster on the Pentium; however, on the newer AMD chips, INC is significantly faster. If speed is absolutely paramount (and multiple processor families are expected), the best thing to do is write a customized version for each processor, test for the processor type at the beginning of the code, and branch to the matching routine. (Or create self-benchmarking code -- of course the user must be notified of such a benchmark in that case, so they can close their other programs.)