Faster instruction timing?

Started by indiocolifa, September 21, 2005, 06:28:48 AM

indiocolifa

I want to increase CX by one.

Which is faster on modern CPUs (Pentium and later)?

INC (CX);

or

ADD (1,cx);

Thank you very much.

Sevag.K

A while back I ran some timing tests: ADD proved faster on modern Pentiums, while INC was apparently faster on older Pentiums and pre-Pentium CPUs (I don't know the cut-off point).

However, my tests used 32-bit INC/ADD. Word-sized (16-bit) operations, though smaller, might be slower still.

The trade-off is that ADD (imm, reg) has a larger encoding than INC (reg).
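
To put numbers on the size trade-off, here is a small C sketch that lists the standard IA-32 machine-code bytes for both forms in 32-bit protected mode. The array and function names are made up for illustration, and the ADD bytes assume the assembler picks the short sign-extended imm8 form, which is the usual choice.

```c
#include <assert.h>
#include <stddef.h>

/* Machine-code bytes in 32-bit protected mode.
 * Assumption: the assembler emits the short "83 /0 ib" form of ADD. */
static const unsigned char inc_ecx[]   = { 0x41 };                   /* inc ecx    */
static const unsigned char add_ecx_1[] = { 0x83, 0xC1, 0x01 };       /* add ecx, 1 */
static const unsigned char inc_cx[]    = { 0x66, 0x41 };             /* inc cx     */
static const unsigned char add_cx_1[]  = { 0x66, 0x83, 0xC1, 0x01 }; /* add cx, 1  */

/* Extra bytes ADD costs over INC for the 32-bit register. */
size_t add_penalty_32(void) { return sizeof add_ecx_1 - sizeof inc_ecx; }

/* Extra bytes ADD costs over INC for the 16-bit register
 * (both forms also carry the 0x66 operand-size prefix). */
size_t add_penalty_16(void) { return sizeof add_cx_1 - sizeof inc_cx; }
```

So ADD costs two extra bytes either way, and the 16-bit forms pay one more byte each for the operand-size prefix.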


Randall Hyde

The Pentium 4 (PIV) manual recommends the use of ADD.
Cheers,
Randy Hyde

indiocolifa

Tell me if I'm right about the following:

I'm using the following piece of code to get the correct offset for a field (it's an array of records) in EBX, where EAX is the index:

INTMUL (@size(recordType), EAX, EBX);

My record size is 9 bytes.

Since this multiply instruction takes many cycles, maybe I should do the following for faster operation (since 9*EAX = (8*EAX)+EAX):

SHL (3,eax);           // eax * 8
ADD (eax,eax);       // ( eax * 8 ) + eax
MOV (eax,ebx);      // eax to ebx


Is this correct?

Second approach (if I pad the RECORD size to 16 bytes):

SHL (4,eax); // eax * 16
MOV (eax,ebx);       

Maybe the last method is better.




V Coder

Of course you realise that adding eax to itself doubles eax, so you don't get (eax*8)+eax... you get (eax*8)+(eax*8), which is eax*16.
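
To see this concretely, here is a minimal C model of the posted sequence. It is only a sketch: the function name is hypothetical, and the local variables stand in for the registers (HLA operand order is source first, destination second).

```c
#include <assert.h>
#include <stdint.h>

/* Models: shl(3, eax); add(eax, eax); mov(eax, ebx);
 * Local variables stand in for the registers. */
uint32_t offset_as_posted(uint32_t eax)
{
    uint32_t ebx;
    eax <<= 3;   /* shl(3, eax):   eax = index * 8                    */
    eax += eax;  /* add(eax, eax): eax = index * 16, the original
                    index is already gone, so this cannot give * 9    */
    ebx = eax;   /* mov(eax, ebx)                                      */
    return ebx;
}
```

For index 5 this returns 80 rather than the wanted 45.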

If you pad the record, lay out its fields largest first. For speed, you should really align all data to its size in bytes: 8-byte data on 8-byte boundaries, 4-byte data on 4-byte boundaries.
So if your record consists of 4 bytes, followed by 3 bytes, followed by 2 bytes, it may be better to have a 12-byte record: 4 bytes, 3 bytes (+1 byte padding), 2 bytes (+2 bytes padding). This ensures that an access to the 4-byte value never straddles a 4-byte boundary, and so on, which speeds up memory accesses.
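
As a rough illustration of that 12-byte layout, here is a C sketch with the padding written out explicitly. The struct and field names are hypothetical, and it assumes the usual 4-byte alignment for uint32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte record: 4-, 3- and 2-byte fields, padded by hand. */
struct padded_record {
    uint32_t a;         /* 4 bytes at offset 0                 */
    uint8_t  b[3];      /* 3 bytes at offset 4                 */
    uint8_t  pad_b;     /* 1 padding byte                      */
    uint16_t c;         /* 2 bytes at offset 8                 */
    uint8_t  pad_c[2];  /* 2 padding bytes, total size is 12   */
};

/* Byte offset of element `index` in an array of these records. */
size_t record_offset(size_t index)
{
    return index * sizeof(struct padded_record);
}
```

Every field starts on a multiple of its own alignment, and because the record size is a multiple of 4, that stays true for every element of an array.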

If you have 6-byte data followed by 3-byte data, you need to align the 6-byte data to an 8-byte boundary and the 3-byte data to a 4-byte boundary: 6 bytes (+2 bytes padding), 3 bytes (+5 bytes padding). I expanded the record to 16 bytes to ensure that the 6-byte data is always aligned to an 8-byte boundary. All records should probably be 16-byte aligned if you have the space.

With 16-byte records, you can simply use shl (4, eax); to turn the index into an offset.

On the other hand, if you must conserve space, this multiplies by nine correctly (the result ends up in ebx):
mov (eax, ebx);
shl (3, eax);
add (eax, ebx);

or, to make sure you don't change the value of eax, you can use
lea (ebx, [eax + eax*8]);

If speed is what you need, three adds are probably faster than the shift, at least on the Pentium 4:
mov (eax, ebx);
add (eax, eax);
add (eax, eax);
add (eax, eax);
add (eax, ebx);

or

mov (eax, ebx);
add (eax, eax);
add (eax, eax);
add (eax, ebx);
add (eax, ebx);
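
All four multiply-by-nine sequences above can be checked with a small C model. This is only a sketch: the function names are made up, local variables stand in for the registers, and it says nothing about timing, only that each sequence leaves index*9 in ebx.

```c
#include <assert.h>
#include <stdint.h>

/* mov(eax, ebx); shl(3, eax); add(eax, ebx); */
uint32_t times9_shift(uint32_t eax)
{
    uint32_t ebx = eax;  /* ebx = x  */
    eax <<= 3;           /* eax = 8x */
    ebx += eax;          /* ebx = 9x */
    return ebx;
}

/* lea form: ebx = eax + eax*8, eax itself is left unchanged. */
uint32_t times9_lea(uint32_t eax)
{
    return eax + eax * 8u;
}

/* mov(eax, ebx); add(eax, eax) three times; add(eax, ebx); */
uint32_t times9_adds_a(uint32_t eax)
{
    uint32_t ebx = eax;  /* ebx = x  */
    eax += eax;          /* eax = 2x */
    eax += eax;          /* eax = 4x */
    eax += eax;          /* eax = 8x */
    ebx += eax;          /* ebx = 9x */
    return ebx;
}

/* mov(eax, ebx); add(eax, eax) twice; add(eax, ebx) twice; */
uint32_t times9_adds_b(uint32_t eax)
{
    uint32_t ebx = eax;  /* ebx = x  */
    eax += eax;          /* eax = 2x */
    eax += eax;          /* eax = 4x */
    ebx += eax;          /* ebx = 5x */
    ebx += eax;          /* ebx = 9x */
    return ebx;
}
```

Note that only the lea version preserves eax. The others leave eax holding 8x or 4x.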

indiocolifa