AMD vs. Intel

Started by shaldan, November 30, 2005, 12:05:25 PM

shaldan

Sorry for such a question, but what are the major differences between coding applications for AMD and for Intel? Are there instructions that do not work on AMD or on Intel? And is it usual for programmers to check the processor type and split the code into AMD and Intel paths for optimization?

thanks.

hutch--

The question is so general that it's hard to give an accurate answer but on the late 32 bit hardware I have available you get this rough distinction. Intel hardware prefers very dense code using the complex addressing modes where AMD hardware tends to work better with more expanded RISC type coding.

With the preliminary 64 bit stuff around at the moment, the AMD hardware seems to be faster by a reasonable amount but the general characteristics will change quickly in the next year or so.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

EduardoS

Hi hutch,
I don't have very good AMD manuals (and I have an Athlon), but I have some of Intel's.
According to the manuals, the Pentium 4 seems to have great difficulty doing anything other than "work with more expanded RISC type coding".
That doesn't happen with my XP.
I'd prefer to test both processors and read good manuals for both before saying where one is better than the other. Maybe someone here can help.

hutch--

EduardoS,

The reason for my comments is I own both AMD and Intel hardware of similar periods and the distinction I have roughly drawn between the two comes from coding and timing algorithms over a reasonably long time on both types of boxes.

Intel hardware is faster with very dense complex addressing mode code where an AMD of similar period is faster using more instructions of the basic RISC format. Manuals are useful but nothing beats benchmarking.

EduardoS

Complex mode is something like teste[ebx][ecx*4], right? So on AMD this:

mov eax, teste

should be faster than:

mov eax,teste[ebx][edx*4]

Right?
So something is wrong: on my Athlon XP they take the same time. If I replace the mov with add, the first one takes 1-2% less time.
Please explain further; I don't understand where my AMD is bad with this addressing mode (and I don't have a P4 to test).

Human

In my experience as a coder (Intel 286, AMD 486DX 50MHz, MMX 233, Athlon XP 1700+), I can say AMD is now better for everything.
The differences have grown with newer CPUs. On the Pentium, with its U and V pipelines, optimization was about pairing and the coding was RISC-like: just simple instructions. Since the P4 there are many differences: not just 2 instructions at a time but 3, branch prediction, and a totally screwed-up inc instruction. I don't know why an instruction that takes 1 tick, the simplest one even on RISC, became 3 cycles. The next thing is the lame amount of cache. The Athlon wins on every point here: the P4 has 16KB for data, 8KB for code and 64KB of L2 cache, while the Athlon has 64KB + 64KB plus 256KB or even 512KB of L2, so compression and working on big amounts of data give the advantage to AMD. AMD also runs all simple instructions in 1 tick because it is built on RISC technology, whereas Intel is CISC. That doesn't count complex instructions with 4*reg+addr, where we waste 1 tick. Another thing: the new CPUs, being much newer than the Pentium, don't care so much about pairing (it still matters, but code and data alignment is more important). Unaligned data can waste even 300% of the time; code alignment is a lottery, because aligning by hand is hell. What about the new CPUs?
Well, AMD64 wins again; Intel screwed up everything in its 64-bit CPU. First, AMD has 32 new general-purpose 64-bit regs and Intel 16, and all of us who code in asm know what that means. Another thing: Intel loses in all games and multimedia encoding, by even over 40%. Why?
Well, they are faster at FPU and office software. But who in God's name uses the FPU today when we can directly access 128-bit floats from SSE3 instead of playing with the lousy 80-bit stack-based FPU? That is all I can say about optimization, based on my experience and data from the net.

EduardoS

Human, your text is a little confusing. Here I have some Athlon XPs, a PIII, a PII and a Duron, and according to the manuals:
The K7 has 3 decoders, 3 integer pipelines and 3 floating-point pipelines. The decoders crack instructions into uops (simple instructions) and then send them to the pipelines. An instruction which generates no more than 2 uops is executed in 1 clock by a single pipeline; instructions which generate more than 2 uops are decoded by ROM, and all 3 decoders are used in that process.

The P6 (PPro, PII and PIII) has 3 decoders (D0, D1, D2): D0 can handle any instruction, while D1 and D2 can only handle instructions which generate no more than 1 uop. It has 5 ports (pipelines), p0, p1, p2, p3 and p4, used by both the integer and floating-point units: p0 is used by the ALU and other general instructions, p1 by the ALU and jumps, p2 loads data, p3 does address generation and p4 stores data.

The K7 and P6 seem to be very different according to the manuals, but here the only difference I see is the clock and cache size. Now the P7 (PIV):
The P7 is very close to the P6 and has the same 3 decoders, which work in the same way, but after decoding an instruction it puts the uops in the code cache instead of sending them to the execution ports, and the execution ports get instructions from the code cache. Caching uops rather than opcodes enables the P4 to use RISC technology on a CISC instruction set. The ports are a little different too: there are 4 ports instead of 5, each port has execution units which can run at different clocks, and each execution unit has some subunits.

In Agner's Pentium optimization manual you have some info and very useful tables.

I don't have a PIV with me to see the difference between it and the Athlon, and I don't have (yet) an A64 to see how different the K8 is from the K7; the manuals say they are almost the same in 32-bit mode.

hutch--

My younger brother has 2 AMD boxes, an athlon of about 18 months ago and a 64 bit athlon of about 6 months ago and while he is not using it for technical work any longer he said that the 64 bit version is faster with 32 bit code than the last 32 bit version he still has.

My own last 2 machines are a 2.8 Prescott PIV and a Sempron 2.4 and with some code they are very similar but in most things the PIV is a lot faster. To be fair the PIV is a lot more expensive box where the Sempron was a low cost later version.

The oldest machine running if I bother to plug it in is a 600 PIII and I used to be able to cross compare code with a now dead AMD K6-2 and the comparisons are very similar. Intel hardware handles the complex addressing mode style of code better than AMD where AMD works better with RISC style larger instruction counts and it is one of the irritants that no code works at its best on both.

Under current 32 bit OS versions there are far more Intel machines around than AMD, especially in the commercial world so if you want general purpose code that works reasonably everywhere, you accommodate Intel hardware with priority over AMD.

EduardoS

hutch, please explain what "complex addressing mode" means, maybe with some examples. I don't see my AMDs being slow on what I think is "complex addressing mode".

hutch--

EduardoS,

I don't claim to be in a position to write a treatise on the subject but the Intel complex addressing modes are mentioned in their manuals in detail and have been with us since 8088s. It's the standard stuff, base address, index, data size specifier or multiplier and displacement. On earlier Intel hardware there was a direct performance advantage in writing code as compact as possible as it reduced the instruction count but with the PIV it did not work exactly the same way as more complex addressing modes took longer to execute than less complex ones.
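
As a rough illustration of the components listed above (base, index, multiplier and displacement), the address the CPU forms can be modelled as base + index*scale + displacement. The sketch below is in Python rather than assembly, purely to show the arithmetic; the function name and the register and label values are made up for the example. In 32-bit x86 addressing the scale factor is limited to 1, 2, 4 or 8.

```python
def effective_address(base, index, scale, displacement):
    """Model a 32-bit x86 effective address: base + index*scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    # Addresses wrap around at 32 bits.
    return (base + index * scale + displacement) & 0xFFFFFFFF


if __name__ == "__main__":
    TESTE = 0x00403000  # hypothetical address of the label "teste"
    EBX = 0x10          # hypothetical base register value
    EDX = 3             # hypothetical index register value
    # Models the operand of: mov eax, teste[ebx][edx*4]
    print(hex(effective_address(EBX, EDX, 4, TESTE)))
```

For mov eax,teste[ebx][edx*4] the displacement is the address of teste, ebx is the base and edx is the index scaled by 4. The "expanded" style presumably computes the same address with separate shift and add instructions before a simple load.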

Intel hardware is still properly CISC but internally it tends to have less complex microcode that other instructions are made up from and in the microcode range they publish a preferred instruction set that schedules through multiple pipelines better than others. This means it tends to be RISC style code that has scheduling advantage. When you benchmark code on both Intel and AMD hardware you start to learn what works better on each and this is where my comments come from.

Intel hardware tends to be faster with very dense complex addressing mode code where AMD hardware tends to be faster with more instructions of a simpler type. That says the instruction set for the AMD is closer to RISC than CISC where the Intel seems to be the other way around. Now of course the only way you determine these things is to benchmark code on both hardware types to see what works on both, then you modify the code design to try and get better averages across both but by doing so you tend to miss the peak performance of either.

Human

Well, for AMD there is CodeAnalyst, if I remember correctly, but it's not for asm projects :(
and it shows only pairing, stalls and other things as colourful squares, and you have to look at the legend.

Intel has VTune, which sucks as hell, at least the new one. I downloaded the VTune 5.0 trial from Intel, and I use it when I can't find anything new to optimize. Why? I optimize everything like in the Pentium days with U/V pairing, and most of my code is around 90% paired. VTune 5.0 can show an exe as disassembled code, and we can choose MMX, then mark part of the code and see how many ticks it takes. We can also use the code coach, which tries to reorder code for better pairing, AGIs, cache misses etc. Sometimes it helps, but most of the time I have to do the whole job myself. If we choose a P2, P3 or P4 CPU, VTune just shows instruction ticks and the grouping of code into 3 pipelines, but it will not show the ticks of the whole code, maybe because since the P2 it's unpredictable. If someone has an Intel CPU, they can also run the app and VTune will analyze everything with the MSR regs, and will know better where the spot is that takes the most execution time.

That's all I can say. Reordering code for the U/V pipes still works, but sometimes when I remove 1 instruction the code is slower, and I have to find another place to optimize so that the alignment of the code after the modified instruction changes again to something that improves speed.
If we now have 3 pipelines, how do we code or eax,eax / je some_jump? On MMX it pairs 100%; on the P4 and Athlon they don't fit into a 3-instruction group.

V Coder

Quote from: shaldan on November 30, 2005, 12:05:25 PM
Sorry for such a question, but what are the major differences between coding applications for AMD and for Intel? Are there instructions that do not work on AMD or on Intel? And is it usual for programmers to check the processor type and split the code into AMD and Intel paths for optimization?

thanks.
Is it usual for programmers to check the processor type? Perhaps only if a compute bound process will be done. In an extremely compute bound program I have optimized for Pentium III, Pentium 4, AMD Athlon, I had the program test for MMX, SSE, SSE2 capability. It would not run without at least MMX. If only MMX, then use Pentium MMX code. If SSE/SSE2 then if AMD use Athlon code, else use Pentium III/4 code.
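
The dispatch rule described above can be sketched as a small decision function. This is only an illustration of the logic, written in Python for brevity rather than as the poster's actual code: the function name and the returned path names are hypothetical, and splitting the Intel case into separate Pentium III and Pentium 4 paths by SSE2 support is an assumption. The feature bit positions (MMX 23, SSE 25, SSE2 26 in EDX of CPUID leaf 1) and the vendor string "AuthenticAMD" are the documented CPUID values.

```python
# Feature bits reported in EDX by CPUID leaf 1.
MMX_BIT, SSE_BIT, SSE2_BIT = 23, 25, 26


def choose_code_path(vendor, edx_features):
    """Pick a code path from the CPUID vendor string and leaf-1 EDX flags.

    Returns None when the CPU lacks MMX (the program would refuse to run).
    The path names are hypothetical labels for this sketch.
    """
    def has(bit):
        return bool((edx_features >> bit) & 1)

    if not has(MMX_BIT):
        return None
    if not (has(SSE_BIT) or has(SSE2_BIT)):
        return "pentium_mmx"
    if vendor == "AuthenticAMD":
        return "athlon"
    # Assumption: CPUs with SSE2 get the Pentium 4 routine,
    # SSE-only ones get the Pentium III routine.
    return "pentium4" if has(SSE2_BIT) else "pentium3"


if __name__ == "__main__":
    sse_flags = (1 << MMX_BIT) | (1 << SSE_BIT)
    print(choose_code_path("AuthenticAMD", sse_flags))
    print(choose_code_path("GenuineIntel", sse_flags | (1 << SSE2_BIT)))
```

In a real program the vendor string and feature flags would come from the CPUID instruction, and each returned label would select one of the hand-tuned routines.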

Differences: Several and various. In my benchmarking I noticed that at the same clock speed, the Pentium 4 is outclassed by the AMD Athlon 64, AMD Athlon, Pentium M, Pentium III, AMD K6-2 and Pentium MMX. However, the Pentium 4's advantage/saving grace, is that it does NOT run at the same clock speed as these processors - it runs faster. In addition, the Pentium 4 has the much needed paddq instruction allowing us to process 8 bytes fully parallel using fewer instructions.

The Pentium III/Pentium M execute MMX instructions with 1 cycle latency, but they cannot often execute more than one at a time. The Pentium 4 executes one MMX at a time, with a latency of 2 cycles. The Athlon can execute up to two MMX instructions at a time, with a latency of 2 cycles. However the Athlon (XP/64) more likely executes 3 integer instructions per cycle than the Pentium III, and it apparently reorders instructions aggressively. Furthermore the Athlon executes adc in 1 cycle whereas the Pentium III takes 2 cycles. This is why the Athlon is stronger with my Athlon (integer laden) code than my Pentium III (MMX laden) code, and the Athlon 64 stronger with the longer Athlon code than the shorter Pentium 4 (MMX & SSE2) code.

Results: Your results WILL differ. My program does only SIMD integer processing. NO FPU. My Pentium 4 code uses 46 instructions in a loop which is repeated over and over after loading with different data. My Pentium III code uses 56 instructions and my Athlon code uses 63 instructions. The Pentium III code is shorter than the Athlon code because it uses only MMX instructions. It is longer than the Pentium 4 routine because it needs to propagate carries from dword to dword within each MMX register.

The Athlon XP runs faster than the Pentium III. The Pentium III code uses 56 instructions whereas the Athlon code uses 63 instructions to do the same. At the same clock speed, the Athlon would execute those 56 instructions in 20% more time, however, it executes its own 63 instructions in 10% less time than the Pentium III executes its 56 instructions.

The Pentium M is almost identical to the Pentium III, except that it can also run Pentium 4 code (SSE2 instructions). My Pentium 4 code uses 46 instructions. At the same clock speed, the Pentium M executes the 46 instructions in 83% of the time the Pentium III takes to execute its 56 instructions (almost linear ratio). This is 94% of the time the Athlon takes to execute 63 instructions, ie, the Athlon strictly also executes instructions faster than the Pentium M. However, since the Pentium M uses the shorter code whereas the Athlon uses a longer instruction sequence, the Pentium M is slightly faster than the Athlon at the same clock speed.
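
The percentages in the last two paragraphs can be turned into a rough per-instruction comparison. The sketch below (Python, used only for the arithmetic; the function name is made up) divides instruction count by relative run time, taking the Pentium III's 56-instruction loop as the baseline at the same clock speed. The inputs are the rounded figures quoted above, so the results are approximate.

```python
def relative_instruction_rate(instructions, relative_time,
                              base_instructions=56, base_time=1.0):
    """Instructions per unit time, relative to the Pentium III baseline."""
    return (instructions / relative_time) / (base_instructions / base_time)


if __name__ == "__main__":
    cases = [
        ("Athlon, own 63-instruction code (10% less time)", 63, 0.90),
        ("Athlon, 56-instruction PIII code (20% more time)", 56, 1.20),
        ("Pentium M, 46-instruction P4 code (83% of the time)", 46, 0.83),
    ]
    for label, instructions, relative_time in cases:
        rate = relative_instruction_rate(instructions, relative_time)
        print(f"{label}: {rate:.2f}x")
```

On these figures the Athlon gets through about 1.25 times as many instructions per clock as the Pentium III when running its own routine, but only about 0.83 times as many on the Pentium III routine. The Pentium M comes out at roughly 0.99, which fits the "almost linear ratio" remark.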

Time to complete task:
Pentium 4 HT 2400MHz - 261 seconds
Pentium III 1066MHz - 312 seconds. A 2000MHz Pentium 4 competes with this.
Athlon XP 1666MHz - 179 seconds. You would need a 3500MHz Pentium 4 Desktop to compete with this desktop.
Pentium M 1600MHz - 169 seconds. You would need a 3700MHz Pentium 4 Desktop to compete with this notebook.

Dealing with processors at the top of their respective series:
2200MHz Pentium M [notebook - old notebook processor design] would match a 5100MHz Pentium 4.
2200MHz Athlon XP [desktop - out of date processor] would match a 4600MHz Pentium 4
2800MHz Athlon 64 [desktop] (running Athlon XP code not Pentium 4 SSE2 code) would match a 5900MHz Pentium 4.

That is, all things being equal. IIRC Prescott eliminated some of the advantages of the double pumped ALU of the Pentium 4, negatively affecting some integer but not so much MMX code.
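
The "equivalent Pentium 4" clock speeds quoted above appear to follow from simple scaling, which can be checked with the sketch below. It is written in Python just for the arithmetic, the function names are invented, and it assumes run time is inversely proportional to clock speed within a processor family (ignoring memory and bus speed). No measured time is given for the Athlon 64, so that line cannot be reproduced.

```python
# Measured baseline from the post: Pentium 4 HT at 2400 MHz took 261 seconds.
P4_MHZ = 2400
P4_SECONDS = 261


def p4_equivalent_mhz(seconds):
    """Pentium 4 clock that would finish the task in `seconds`,
    assuming run time scales as 1/clock."""
    return P4_MHZ * P4_SECONDS / seconds


def scaled_seconds(measured_mhz, measured_seconds, target_mhz):
    """Projected run time of the same CPU design at another clock speed."""
    return measured_seconds * measured_mhz / target_mhz


if __name__ == "__main__":
    measured = [
        ("Pentium III", 1066, 312),
        ("Athlon XP", 1666, 179),
        ("Pentium M", 1600, 169),
    ]
    for name, mhz, seconds in measured:
        print(f"{name} {mhz}MHz ~ Pentium 4 {p4_equivalent_mhz(seconds):.0f}MHz")

    # Scale the two designs that top out at 2200 MHz.
    for name, mhz, seconds in measured[1:]:
        top = scaled_seconds(mhz, seconds, 2200)
        print(f"{name} 2200MHz ~ Pentium 4 {p4_equivalent_mhz(top):.0f}MHz")
```

This gives roughly 2008, 3499 and 3707 MHz for the measured machines, and about 4621 and 5096 MHz for a 2200MHz Athlon XP and Pentium M, in line with the rounded 2000, 3500, 3700, 4600 and 5100 MHz figures in the post.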

V Coder

Quote from: Human on December 11, 2005, 05:25:51 AM
Well, for AMD there is CodeAnalyst, if I remember correctly, but it's not for asm projects :(
and it shows only pairing, stalls and other things as colourful squares, and you have to look at the legend.
I have been wanting to use CodeAnalyst to simulate the Athlon pipeline running a program, but it requires .pdb (program database files), which apparently needed to be created by Visual C or whatever, ie not for assembly. However, I have now realized that the Linker creates .pdb files if the /DEBUG option is set. I have now set it in my environment variables, and will give it a try. Apart from that CodeAnalyst does identify processor counters and can thus be used for checking processor stalls, etc in assembly projects. See http://developer.amd.com/downloads.aspx

V Coder

The Athlon 64 also runs 32 bit code faster than an Athlon XP. You need to profile, and test your speed and optimize on all processors if you want the best speed from each.

See also:
http://www.azillionmonkeys.com/qed/cpujihad.shtml
http://www.aceshardware.com/Spades/read.php?article_id=90
http://chip-architect.com/mw.pdf

and many other references.

EduardoS

Quote from: V Coder on March 01, 2006, 03:39:33 AM
Dealing with processors at the top of their respective series:
2200MHz Pentium M [notebook - old notebook processor design] would match a 5100MHz Pentium 4.
2200MHz Athlon XP [desktop - out of date processor] would match a 4600MHz Pentium 4
2800MHz Athlon 64 [desktop] (running Athlon XP code not Pentium 4 SSE2 code) would match a 5900MHz Pentium 4.
V Coder, one note here: you tested it with one algo, and there are some algos where a Pentium 4 is faster than any other at the same clock speed. This kind of comparison won't help us; a manual on how to optimize for each processor will.