Assembly instructions and Nehalem architecture CPUs

Started by dsouza123, June 29, 2008, 05:46:46 PM

dsouza123

Assembly language instruction improvements in the upcoming (Q4 2008) Intel Nehalem architecture CPUs.

http://realworldtech.com/page.cfm?ArticleID=RWT040208182719

Nehalem also includes instruction set extensions, which span the entire core pipeline since they impact microcode.
Nehalem includes the SSE4.2 instructions, which include several instructions for string manipulations, a CRC instruction
and a popcount. The string instructions are all microcoded and will only show a small performance gain. The CRC instruction
is used for calculating checksums which is useful for storage and networking and provides fairly substantial benefits
in the range of 6-18X for the code snippets that Intel demonstrated. Of course, the overall speedup will be much smaller,
since Intel’s examples just deal with the tightest inner loop.
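
For reference, here is a minimal software sketch in C of what the two non-string instructions compute. It assumes the usual CRC-32C convention (Castagnoli polynomial, reflected form 0x82F63B78, with the initial and final bit inversion applied by the caller rather than by the instruction). The function names crc32c_sw and popcount32_sw are made up for this illustration. This is the slow bit-at-a-time form that the hardware instruction is meant to replace, not Intel's demonstration code.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
   Hypothetical helper: processes one bit per inner-loop iteration. */
uint32_t crc32c_sw(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;              /* conventional initial value */

    while (len--) {
        crc ^= *p++;                         /* fold the next byte into the low bits */
        for (int k = 0; k < 8; k++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;                /* conventional final inversion */
}

/* Count set bits by clearing the lowest set bit until none remain.
   Hypothetical helper standing in for a single popcount instruction. */
unsigned popcount32_sw(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;                          /* clears the lowest set bit */
        n++;
    }
    return n;
}
```

Calling crc32c_sw("123456789", 9) should give the standard CRC-32C check value 0xE3069283, which is a handy sanity test when comparing against a hardware version.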

Nehalem refines and improves the macro-op fusion already found in the previous generation. In 32 bit mode,
the Core 2 could decode comparisons (CMP) or tests (TEST) and conditional branches (Jcc) into a single uop, CMP+JCC.
This increased the decode bandwidth of the Core 2 and reduced the uop count, making the machine effectively wider.
Macro-op fusion in Nehalem works with a wider variety of branch conditions, including JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE,
so any of those, in addition to the previously handled cases, will decode into a single CMP+JCC uop.
Best of all, Nehalem’s macro-op fusion operates in both 32 bit and 64 bit mode.
...
In addition to fusing x86 macro-instructions, the decoding logic can also fuse uops,
a technique first demonstrated with the Pentium M.

Nehalem also lowers the latency for synchronization primitives such as LOCK, XCHG and CMPXCHG,
which are necessary for multi-threaded programming. Intel claims that the latency for LOCK CMPXCHG instructions
(which serializes the whole pipeline) is 20% of what it was for the P4 (which was absolutely horrible)
and about 60% of what it was for the Core 2. While the latency is lower, the behavior is still similar to prior generations:
LOCK instructions are not pipelined, although younger operations can execute ahead of a LOCK instruction.
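
To show where LOCK CMPXCHG typically turns up, here is a minimal C11 sketch of a compare-and-swap retry loop. The function name cas_fetch_add is hypothetical. On x86, compilers generally lower the compare-exchange call to a LOCK CMPXCHG, but that is a code generation detail and not something the C standard guarantees.

```c
#include <stdatomic.h>

/* Atomically add delta to *counter using a compare-and-swap retry loop.
   Returns the value the counter held just before the update.
   Hypothetical helper for illustration only. */
int cas_fetch_add(atomic_int *counter, int delta)
{
    int seen = atomic_load(counter);

    /* If another thread changed the counter in the meantime, the call fails,
       writes the current value back into seen, and the loop retries. */
    while (!atomic_compare_exchange_weak(counter, &seen, seen + delta)) {
        /* nothing to do here: seen has already been refreshed */
    }
    return seen;
}
```

Each pass through the loop costs one locked operation, so the lower latency described above applies to every attempt, including the failed ones under contention.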

hutch--

Thanks, this is good stuff. I particularly like the fusion concept and the removal of the stall in CMP/Jxx operations.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

evlncrn8

crc thing seems quite interesting too, wonder how long it'll take for amd to put it in theirs...

hutch--

I just read the entire article and it does have some interesting stuff in it, the effective redundancy of aligned SSE instructions being one of them.