AMD vs. Intel

Started by shaldan, November 30, 2005, 12:05:25 PM

V Coder

I think it is reasonable to cater for fairly recent systems (2000 onward), specifically: Pentium 4, Pentium M, Pentium III, Athlon XP, Athlon 64. Pentium/MMX U/V pipe pairing is long outdated; I no longer use my Pentium MMX and K6-2 systems (time to sell), nor do I optimize compute-bound code for either, though I did test one optimization on them, which made no difference.

The AMD and Intel optimization manuals can be downloaded in PDF format from the respective sites. We need one for the Athlon XP, one for the Athlon 64, one for the Pentium III, and one for the Pentium 4/M. Agner Fog's manual (pentopt.pdf) on optimizing for Pentium processors, updated to include information on the Pentium 4, is very good for comparing how optimization differs between processors. Mark Larson's tutorial also gives excellent tips.

The P6 is weaker than the K7 because of the 4-1-1 decoder limitation and the fact that similar instructions that use the same execution port cannot issue in the same cycle. There may be some programs that the two execute at similar speeds, but the K7 should generally beat the P6 at the same clock speed.
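To make the 4-1-1 limitation concrete, here is a rough C sketch of a simplified decoder model. It assumes only what the rule's name implies: per clock, the first decoder accepts an instruction of up to 4 micro-ops, and the other two accept one single-micro-op instruction each, in program order. The function name and the micro-op counts are hypothetical, and real P6 decoding has further restrictions (instruction length, fetch alignment) that this ignores.

```c
#include <stddef.h>

/* Simplified model of P6 4-1-1 decoding (an illustration, not a simulator).
 * uops[i] is the micro-op count of instruction i, assumed to be 1..4.
 * Returns the number of decode clocks the sequence needs under the model. */
size_t decode_clocks_411(const int *uops, size_t n)
{
    size_t clocks = 0;
    size_t i = 0;
    while (i < n) {
        clocks++;
        i++;  /* decoder 0 takes the next instruction, whatever its size */
        /* decoders 1 and 2 each take one more instruction, but only
         * if it is a single micro-op; otherwise the group ends here */
        for (int slot = 0; slot < 2 && i < n && uops[i] == 1; slot++) {
            i++;
        }
    }
    return clocks;
}
```

Under this model a stream ordered as 4,1,1,4,1,1 decodes in 2 clocks, while the same six instructions ordered 1,4,1,4,1,4 need 4, which is why instruction ordering matters on the P6 and not on the K7.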

Optimizing for the Pentium III (P6) may be similar to optimizing for the Pentium M. The latter also includes SSE2 instructions, which may allow more compact code and the opportunity to run Pentium 4 code. Optimizing for the Pentium 4 (P7), however, is different from optimizing for the Pentium III/M. On the other hand, optimizing for the Athlon 64 is similar to optimizing for the Athlon XP. (Some people might want to stress the differences between two core revisions of the Pentium 4, e.g. Prescott and Northwood, or two core revisions of the Athlon 64, e.g. Venice and Winchester, but those differences may not affect optimization significantly except for tightly tuned integer routines.)

I have cited the instruction latency of particular integer instructions used by my program. A more complete list would be:
Integer: mov, add, adc, lea, bswap
MMX: movd, movq, paddb, psubb, pcmpgtb, psrlq, psllq, punpckldq
SSE: pminub
SSE2: paddq

What are the strengths and weaknesses of each processor with the instructions needed? On the Intel side, the Pentium III/M can execute 3 integer instructions per clock cycle provided they meet the 4-1-1 criteria, but IIRC once MMX, SSE (or SSE2) instructions come into play this effectively drops to about 1 per clock. The Pentium 4 also executes a maximum of 1 MMX, SSE or SSE2 instruction per clock. The aim here should be to reduce the number of instructions. MMX often does the same work in fewer instructions than general-purpose-register integer code. Do the new MMX instructions introduced with the Pentium III let you use fewer instructions? Use them (test for SSE). Do the new MMX instructions introduced with the Pentium 4 let you use fewer instructions? Use them (test for SSE2). I have not been interested in FP instructions so I really cannot comment, but the principle remains: reduce instruction count. A Pentium M will run Pentium III code but will likely run shorter Pentium 4 code faster.

What are the weaknesses? adc takes 2 cycles on the Pentium III and 6 on the Pentium 4. bswap takes multiple cycles on both. MMX instructions take 2 cycles for the results to be available on the Pentium 4. Avoid serial dependencies.
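As an illustration of avoiding serial dependencies, the C sketch below sums an array twice: once through a single accumulator, where every add waits on the previous one, and once through two independent accumulators that a processor can overlap when results take more than one cycle to appear. The function names are hypothetical, and a compiler may reorder either loop, so treat this as a picture of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: each add needs the result of the one before it. */
uint32_t sum_serial(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}

/* Two independent chains: adds into a and b do not depend on each other,
 * so a 2-cycle result latency can be hidden by alternating between them. */
uint32_t sum_two_chains(const uint32_t *v, size_t n)
{
    uint32_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n) {
        a += v[i];  /* odd element left over */
    }
    return a + b;   /* unsigned wrap-around keeps both versions equal */
}
```

The same interleaving applies to MMX code: alternate instructions that work on different registers instead of feeding each result straight into the next instruction.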

The Athlon XP can execute 3 integer instructions per cycle, and there is no 4-1-1 rule to meet. adc takes 1 cycle. MMX instructions can be executed 2 at a time, but the results take 2 cycles to appear. Avoid serial dependencies in MMX instructions. For my purposes, SSE2 instructions did not help on the Athlon 64, so I fell back on the Athlon XP optimization.

We need to read the optimization manuals, which are available online, to determine the strengths and weaknesses of each processor with respect to the instructions needed, and from that decide whether separate routines are required for each processor.

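In my program, CPUID availability is tested using:
pushfd                  ; copy EFLAGS into eax
pop eax
mov ebx, eax            ; keep the original value for comparison
xor eax, 00200000h      ; flip the ID flag (bit 21)
push eax
popfd                   ; try to write the modified value back
pushfd
pop eax                 ; re-read EFLAGS
cmp eax, ebx            ; equal: ID flag could not be toggled, so no CPUID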

Then the processor type is determined using:
mov (0, eax);
cpuid; // leaf 0: vendor string in ebx, edx, ecx
cmp (ecx, $444d4163); // "cAMD", the last four bytes of "AuthenticAMD"
sete (al);
and (1, eax);
shl (2, eax);
add (eax, prcssr); // AMD = 4+, Intel = 0+

mov (1, eax);
cpuid; // leaf 1: feature flags in edx
test ($00800000, edx); // bit 23 = MMX
setne (al);
and (1, eax);
add (eax, prcssr); // MMX = +1

test ($02000000, edx); // bit 25 = SSE
setne (al);
and (1, eax);
add (eax, prcssr); // SSE = +1 (running total 2)

test ($04000000, edx); // bit 26 = SSE2
setne (al);
and (1, eax);
add (eax, prcssr); // SSE2 = +1 (running total 3)
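For readers who do not use HLA, the scoring above can be sketched as a plain C function. It takes the ECX value from CPUID leaf 0 and the EDX value from leaf 1 as arguments instead of executing CPUID itself, so it runs anywhere. The function and macro names are hypothetical; the constants are the ones used in the code above.

```c
#include <stdint.h>

#define VENDOR_ECX_AMD 0x444d4163u  /* "cAMD", end of "AuthenticAMD" */
#define EDX_MMX        0x00800000u  /* leaf 1 EDX bit 23 */
#define EDX_SSE        0x02000000u  /* leaf 1 EDX bit 25 */
#define EDX_SSE2       0x04000000u  /* leaf 1 EDX bit 26 */

/* Returns the same score the post keeps in prcssr:
 * 4 for an AMD vendor string (0 otherwise), plus 1 for each of
 * MMX, SSE and SSE2 that is reported. */
unsigned classify_cpu(uint32_t vendor_ecx, uint32_t feature_edx)
{
    unsigned score = (vendor_ecx == VENDOR_ECX_AMD) ? 4u : 0u;
    score += (feature_edx & EDX_MMX)  != 0;
    score += (feature_edx & EDX_SSE)  != 0;
    score += (feature_edx & EDX_SSE2) != 0;
    return score;
}
```

With this encoding a Pentium III (MMX and SSE) scores 2, a Pentium 4 scores 3, and an Athlon XP (AMD, MMX and SSE) scores 6. The score only counts feature bits, so it relies on each extension implying the earlier ones.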

The program exits if MMX is not available; otherwise it launches the routine for the appropriate processor using:
if (prcssr=1) then // Intel, MMX only
w.CreateThread(NULL,NULL,&PSearch_PMMX,NULL,w.CREATE_SUSPENDED,&ThreadID);
elseif (prcssr=2) then // Intel, MMX + SSE (Pentium III)
w.CreateThread(NULL,NULL,&PSearch_P3,NULL,w.CREATE_SUSPENDED,&ThreadID);
elseif (prcssr=3) then // Intel, MMX + SSE + SSE2 (Pentium 4/M)
w.CreateThread(NULL,NULL,&PSearch_P4,NULL,w.CREATE_SUSPENDED,&ThreadID);
elseif (prcssr=5) then // AMD, MMX only (K6-2)
w.CreateThread(NULL,NULL,&PSearch_PMMX,NULL,w.CREATE_SUSPENDED,&ThreadID);
// w.CreateThread(NULL,NULL,&PSearch_k62,NULL,w.CREATE_SUSPENDED,&ThreadID);
else
cmp (prcssr, 5);
jb no_cpuid; // 0 or 4: no MMX, so exit
w.CreateThread(NULL,NULL,&PSearch_AXP,NULL,w.CREATE_SUSPENDED,&ThreadID); // 6 or 7: Athlon XP / Athlon 64
endif;
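The selection logic can also be written as a small C function, which makes the mapping from score to routine easy to check. This is only a sketch: it returns the routine's name as a string rather than creating a thread, the function name is hypothetical, and the routine names are the ones from the post.

```c
#include <stddef.h>

/* Maps the prcssr score to the name of the search routine to start.
 * Returns NULL when MMX is missing (scores 0 and 4), where the
 * original code jumps to no_cpuid and exits. */
const char *select_routine(unsigned prcssr)
{
    switch (prcssr) {
    case 1:  return "PSearch_PMMX";  /* Intel, MMX only */
    case 2:  return "PSearch_P3";    /* Intel, MMX + SSE */
    case 3:  return "PSearch_P4";    /* Intel, MMX + SSE + SSE2 */
    case 5:  return "PSearch_PMMX";  /* AMD, MMX only */
    default:
        /* mirrors the final else branch: below 5 means no MMX */
        return (prcssr < 5) ? NULL : "PSearch_AXP";
    }
}
```

Scores 6 and 7 both select PSearch_AXP, which matches the earlier remark that SSE2 did not help on the Athlon 64.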