64 bit code performance

Started by OssaLtu, January 17, 2007, 12:47:17 PM

OssaLtu

Recently I was experimenting with 64-bit assembly code and encountered an unexpected behaviour.
The 64 bit code:

mov ecx, 0
ciklas1:
mov rax, 0
cmp ecx, 10
je end0
ciklas2:
cmp eax, 00000000ffffffffh
je end1
add rax, 0fh
jmp ciklas2
end1:
inc ecx
jmp ciklas1
end0:

performs faster than this 32-bit code:

mov ecx, 0
ciklas1:
mov eax, 0
cmp ecx, 10
je end
ciklas2:
cmp eax, 0xffffffff
je end1
add eax, 0xf
jmp ciklas2
end1:
inc ecx
jmp ciklas1
end:

The difference is about 30%. Both fragments do the same work and execute the same number of loop iterations, but the 64-bit code runs faster despite the bigger registers. Does anyone know why this happens?

Merrick

That's a good question. I haven't done any 64 bit programming yet, so take this for what it's worth...

First of all, I wonder why you wrote the two versions differently. The differences all seem trivial, and some shouldn't survive assembly at all, but just to mention the ones I've noticed:

Why do you add 0fh in the 64-bit code but 0xf in the 32-bit code? (They should be identical at run time.)
Why do you add to RAX in the 64-bit code but cmp against EAX instead of RAX, and does cmp'ing a 32-bit register against a 64-bit literal cause any overhead? (Does a 64-bit literal with 32 bits' worth of leading zeros become a 32-bit literal in the code?)
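
For what it's worth, a 64-bit literal whose upper 32 bits are all zero still fits in a 32-bit immediate, and cmp has no 64-bit immediate form anyway, so I'd expect the assembler to emit the same encoding either way. A sketch of what a disassembler should show, assuming MASM picks the short accumulator form (bytes read off the opcode tables, not checked against your actual build):

cmp eax, 0ffffffffh          ; 3D FF FF FF FF - cmp eax, imm32
cmp eax, 00000000ffffffffh   ; same five bytes - the leading zeros vanish at assembly time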

I won't bother to mention the few speed improvements the code could benefit from. You can figure those out if it matters.

I don't suspect any of these makes the final difference. How are you timing the code? Is there a GetTickCount call immediately before and after your code fragment? If, however, the way you're getting your timing also includes load time, then the additional overhead of loading under WOW64 may be what you're seeing.
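
Something like this minimal GetTickCount bracket is what I'd expect around the fragment (MASM32 style, assuming the usual windows.inc/kernel32 includes; print and str$ are the stock MASM32 macros, and millisecond resolution is plenty for runs this long):

invoke GetTickCount          ; milliseconds since boot, returned in eax
mov esi, eax                 ; save the start time
; ... the loop under test goes here ...
invoke GetTickCount
sub eax, esi                 ; eax = elapsed milliseconds
print str$(eax), 13, 10      ; display the elapsed time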

Anybody else have any ideas?

EduardoS

Can you disassemble both? Maybe there is a speed difference due to code alignment.

Merrick

And the answer is... the operating system. Go figure.

I made changes to your code as follows to get it to assemble and/or run correctly (totally ignoring the obvious optimization issues):

64-bit code

mov ecx, 0
ciklas1:
mov rax, 0
cmp ecx, 1000
je end0
ciklas2:
cmp eax, 00000000fffffff0h
je end1
add rax, 0fh
jmp ciklas2
end1:
inc ecx
jmp ciklas1
end0:


32-bit code:

mov ecx, 0
ciklas1:
mov eax, 0
cmp ecx, 1000
je end0
ciklas2:
cmp eax, 0fffffff0h
je end1
add eax, 0fh
jmp ciklas2
end1:
inc ecx
jmp ciklas1
end0:


The results with my dual-boot, Core Duo laptop:

XP64 - 5:45 for both
XP - 6:45 for 32-bit

So the two fragments do in fact run exactly the same way under the same OS, as expected; the gap appears only when the 32-bit build runs under 32-bit XP.

Moral of the story: don't compare apples and oranges.

VLaaD

A few things I recently learned going down the harder lane...

1. Alignment is important, sometimes worth 20% of the speed of your code... Just analyze several cases and you will see for yourself (a small ALIGN sketch follows this list).

2. Instruction pairing has become extremely important; you can at least double the speed of unpaired code. Visit Agner Fog's topics on that - the URL is http://www.agner.org/optimize/#manuals... The man is a fanatic, sincere respect to his powers.

3. Choose, when possible, instructions in this order of preference: DirectPath (Single) / DirectPath (Double) / VectorPath (if no other choice is available). Here be dragons. Grab a copy of the "Software Optimization Guide for AMD64 Processors" from the official site to see the other challenges... Jesus, I didn't think I would have to predict, while writing code, what that silicon fool will perform. That's the reason I prefer the Motorola 68K / IA-64 way of dealing with it - if you write anything larger than a single octet to an odd memory address, you find yourself inside a trap (exception) caused by the unaligned memory access. BTW, finally, someone understood why Motorola had PC-based indexing (the PC being Motorola's EIP equivalent) twenty years ago, so now we can write relocatable code with less tension :U
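
To illustrate point 1, a minimal sketch in MASM syntax (assuming the code segment's alignment permits ALIGN 16; the directive pads with filler bytes so the jump target starts on a 16-byte boundary, and the payoff depends on the particular CPU's fetch and decode behaviour):

align 16                     ; pad so the loop entry sits on a 16-byte boundary
ciklas2:
cmp eax, 0fffffff0h
je end1
add eax, 0fh
jmp ciklas2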

CPU features and designs differ so much that the only thing you can do is actually write even more code (branching on the features returned from cpuid) and buy either several different hardware platforms or time on them in order to understand what is going on. But, *STOP*, there is an easier way... Download the "AMD CodeAnalyst" tool from http://developer.amd.com/ and (if still available for free, or at least as a decent trial version) Intel's VTune. I'm using CodeAnalyst, and after comparing the measured results against the source code, several things simply explain themselves. It can work with assembly-written programs, too. We're finally getting Borland's Turbo Profiler on our platforms; only the name differs, along with the expanded architectural complexity.
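
On the cpuid side, a minimal feature probe in MASM might look like this (leaf 1 returns the feature flags; EDX bit 26 is the documented SSE2 flag, and the remaining bits are listed in the Intel and AMD manuals):

mov eax, 1                   ; cpuid leaf 1: processor info and feature bits
cpuid                        ; clobbers eax, ebx, ecx, edx
test edx, 1 shl 26           ; EDX bit 26 = SSE2 supported
jz no_sse2                   ; take the plain-integer fallback path
; ... SSE2-optimized path ...
no_sse2: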

Personally, I'm going to stick to the delegated-function implementation model: after an initial benchmark on the end-user's machine, a proper configuration is formed (data alignment in memory, whether to perform non-buffered & overlapped ReadFile() access per volume, resolving the overheads of complex partitioning schemes, and so on). Different CPUs with different numbers of cores will behave differently... I can see only this as a good compromise.
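
As a sketch of that delegated-function model in MASM32 terms (all routine names here are hypothetical): run the probe or benchmark once at start-up, store the winner in a function pointer, and call through it from then on:

.data
pRoutine dd routine_plain          ; default implementation

.code
init_dispatch:
    call check_sse2                ; hypothetical probe or timed benchmark; eax = 1 if the fast path wins
    test eax, eax
    jz @F
    mov pRoutine, offset routine_sse2
@@: ret

; callers then simply do:
;     call pRoutine                ; indirect call through the chosen implementation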