rotates on P4 and Core quad.

Started by hutch--, September 07, 2009, 01:02:42 PM

hutch--

This result was an eye opener: the PIV kicks ass here on ROL and ROR, and by more than the clock speed difference.


Core 3.0 gig quad.
-------------

563
578
562
563
562
578
563
562
Press any key to continue ...

PIV 3.8 gig
-------

359
359
360
343
360
359
360
359
Press any key to continue ...





; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

  REPEAT 8

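    invoke GetTickCount         ; start time in milliseconds
    push eax                    ; keep it on the stack across the loop

    mov esi, 100000000          ; loop counter: 100 million iterations

  @@:
    mov eax, 12345678           ; reload the test value each iteration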

    rol eax, 1
    rol eax, 1
    rol eax, 1
    rol eax, 1
    rol eax, 1
    rol eax, 1
    rol eax, 1
    rol eax, 1

    ror eax, 1
    ror eax, 1
    ror eax, 1
    ror eax, 1
    ror eax, 1
    ror eax, 1
    ror eax, 1
    ror eax, 1

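    sub esi, 1
    jnz @B                      ; loop until the counter reaches zero


    invoke GetTickCount         ; end time in milliseconds
    pop ecx                     ; recover the start time
    sub eax, ecx                ; elapsed milliseconds

    print str$(eax),13,10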

  ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

NightWare

Not exactly. This speed test makes intensive use of the trace cache, and the trace cache no longer exists in the quad core. In NORMAL use (no speed test), rol/ror are considerably slower on the P4 than on p3/core2/...

hutch--

Interestingly enough, I have one test primarily for read-after-write testing, and putting one rotate in the sequence slowed the Core badly against the PIV. Note though that the PIV I am testing with runs at 3.8 gig where the Core quad is 3 gig.
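
To factor out that clock difference, the tick counts from the first post can be converted to approximate cycles per rotate. This is only a sketch, written in C rather than MASM so the arithmetic is easy to check. It assumes the GetTickCount deltas are milliseconds, that both CPUs ran at their nominal clocks (3.0 and 3.8 gig), and it ignores the mov/sub/jnz overhead in each iteration. The function names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: convert a GetTickCount delta (milliseconds) into
   approximate clock cycles per rotate for the posted test loop.
   Assumes the CPU ran at its nominal clock and ignores the mov/sub/jnz
   overhead in each iteration. */
double cycles_per_rotate(double elapsed_ms, double clock_ghz)
{
    const double iterations = 100000000.0; /* mov esi, 100000000 */
    const double rotates_per_iter = 16.0;  /* 8 x rol + 8 x ror  */
    double total_cycles = elapsed_ms * 1.0e-3 * clock_ghz * 1.0e9;
    return total_cycles / (iterations * rotates_per_iter);
}

/* Prints the estimate for a typical result from each machine. */
void print_rotate_estimates(void)
{
    printf("Core quad: %.2f cycles per rotate\n", cycles_per_rotate(563.0, 3.0));
    printf("PIV:       %.2f cycles per rotate\n", cycles_per_rotate(359.0, 3.8));
}
```

With 563 ms at 3.0 gig and 359 ms at 3.8 gig this gives roughly 1.06 and 0.85 cycles per rotate, so on these assumptions the PIV is ahead per clock as well as in wall time. GetTickCount is a coarse timer, so the last digit should not be taken too seriously.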

Mirno

I know that the PIV had some caches of the translated micro-ops. Hitting these tended to be productive, as it dropped parts of the pipeline. It also made repetitive testing look better than it actually would be.
It's similar to the timings you see when comparing cmovs to their cmp/jmp equivalents, as the repetition seeds the branch predictors, and so hides the real cost of them.
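
As a rough illustration of the cmov comparison (in C rather than assembly, and with no guarantee about what a given compiler emits), the two hypothetical functions below compute the same result. The first is written with a branch, which a compiler may turn into a cmp plus conditional jump. The second builds a mask so that no branch is needed, which is the kind of selection cmov does in one instruction.

```c
#include <stdint.h>

/* Branchy form: a compiler may emit cmp + conditional jump here
   (or may choose cmov on its own; that is up to the compiler). */
uint32_t max_branch(uint32_t a, uint32_t b)
{
    if (a > b) {
        return a;
    }
    return b;
}

/* Branch-free form: build an all-ones or all-zeros mask from the
   comparison and use it to select a or b, similar in spirit to cmov. */
uint32_t max_branchless(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b); /* 0xFFFFFFFF if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}
```

In a tight timing loop fed with the same inputs over and over, the branch in the first form is predicted almost perfectly, so it can look cheaper than it would on unpredictable data.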

The PIV was known (by Intel engineers) to be very fast or very slow on certain bits of code (sometimes the same bit of code, but after execution of something else), and they couldn't work out why! The levels of caching, converted u-ops, and the various execution engines all meant that it was incredibly complicated and not even Intel fully understood it - at least when it came out.

Mirno

hutch--

Yes, that makes sense. I vaguely remember Intel started to cut corners on the middle range of PIVs because it had to compete with the then-current Athlon range from AMD, so the middle and later PIVs did not have the grunt they should have had.

I knew from the PIVs I owned that the Northwood series were faster than the later Prescott series clock for clock, mainly due to the shorter pipeline of the earlier core, but the Northwood core ran out of puff at about 3 gig due to heat while the Prescott was wound up to 3.8 in production.

Looks like the rotate pair have ended up in microcode for the Core series which is unfortunate as they are useful instructions in certain cases.

Ghandi

Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz, 800 MHz FSB)
** a baby Dual Core CPU, not to be confused with a Core2Duo  :wink **


625
640
625
625
641
625
641
625
Press any key to continue ...

hutch--

Sounds like an interesting processor. Is it in a desktop?

The quad I have just built is a Q9650 at 3 gig but this one sounds later.

Ghandi


Q9650 Specs:

sSpec Number: SLB8W
CPU Speed: 3 GHz
PCG: 05A
Bus Speed: 1333 MHz
Bus/Core Ratio: 9.0
L2 Cache Size: 12 MB
L2 Cache Speed: 3 GHz
Package Type: LGA775
Manufacturing Technology: 45 nm
Core Stepping: E0
CPUID String: 1067Ah
Thermal Design Power: 95W
Thermal Specification: 71.4°C
VID Voltage Range: 0.85V – 1.3625V


Your CPU sounds nice Hutch, did you want to swap? :wink

Yes, I got it about mid March for AU$160. The supplier I got it from kept increasing the price on me: I was quoted $120 so I ordered it, when I called to check on delivery it became $140, and when I picked it up it was $160. But I didn't care, as I was using an LGA-775 PIV 3.06 GHz with a 533 MHz FSB, and I think it had 512 KB or 1 MB of L2 cache.

My machine isn't big or fast by any means. I'm still using 667 MHz RAM, 2 gig only. My video card is a slightly older ATI Radeon HD and my HDD is still IDE, but I like my rig and it does what I want it to. :wink

Although my current CPU was only released in December 2008, it is already a discontinued model, and to be honest I'll be looking at getting a newer one as soon as funds allow. I'm thinking something with a 1333 MHz FSB, 6 MB L2 cache and hopefully 4 cores running around the 3 GHz mark.

I have to say though, going to a multi-core processor and coding to utilize all the cores, the difference is remarkable. A friend asked me to look at a custom MD5 bruteforcer they had. The full iteration is from 00000000 to FFFFFFFF (it uses a DWORD to 'seed' the MD5), and with the original code it was taking 35 mins to complete on a single core. I managed to optimize the code so that it completed in 15 mins, still single threaded. Then I made it multithreaded (for multi-core) and got that down to just under the 7 minute mark.
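
The post does not show how the seed range was divided between threads. One simple way, given here only as an assumed sketch in C, is to hand each worker a contiguous slice of the 32-bit seed space. The struct and function names are hypothetical, and 64-bit arithmetic is used so that the end of the last slice (one past FFFFFFFF) does not overflow.

```c
#include <stdint.h>

/* Hypothetical half-open slice [first, end) of the 32-bit seed space.
   64-bit fields are used because end can be 0x100000000. */
typedef struct {
    uint64_t first;
    uint64_t end;
} seed_slice;

/* Split the 2^32 seeds into worker_count contiguous slices and return
   the slice for worker_index (0-based). Any remainder is spread over
   the first few workers so slice sizes differ by at most one.
   Assumes 0 < worker_count and worker_index < worker_count. */
seed_slice slice_for_worker(uint32_t worker_index, uint32_t worker_count)
{
    const uint64_t total = (uint64_t)1 << 32;
    uint64_t base = total / worker_count;
    uint64_t extra = total % worker_count;
    seed_slice s;

    s.first = worker_index * base + (worker_index < extra ? worker_index : extra);
    s.end = s.first + base + (worker_index < extra ? 1 : 0);
    return s;
}
```

With two workers, as on a dual core, worker 0 would scan 00000000 to 7FFFFFFF and worker 1 would scan 80000000 to FFFFFFFF. Each thread then runs the same inner loop over its own slice with no shared state apart from the result.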

Finally I took that to SSE2 and made a bruter that does 4 MD5 calculations simultaneously per thread (2 threads max), and now it can cycle through the whole 00000000-FFFFFFFF range in less than 2 minutes for up to 10 values at once. Because adding more values to check lengthens the checking loop (I can't unroll it because the number of values isn't static), the speed changes with how many are being bruted at once.

This is compounded by the sheer number of iterations performed, 4,294,967,296 to be precise, so it's easy to see where the speed and optimizations came into play. Remove 1 instruction and you have 4,294,967,296 fewer executions of that instruction.
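
To put the per-iteration cost in perspective, the run times quoted above can be divided by the number of seeds. A small sketch in C, assuming each run covers every seed from 00000000 through FFFFFFFF and taking the quoted times as round figures (the function names are made up):

```c
#include <stdio.h>

/* Hypothetical helper: average wall-clock nanoseconds per seed for a
   full pass over the 32-bit seed space that took run_minutes in total. */
double ns_per_seed(double run_minutes)
{
    const double seeds = 4294967296.0; /* 00000000 through FFFFFFFF */
    return run_minutes * 60.0 * 1.0e9 / seeds;
}

/* Prints the average for each run time quoted in the post. */
void print_ns_per_seed(void)
{
    const double minutes[] = { 35.0, 15.0, 7.0, 2.0 };
    for (int i = 0; i < 4; i++) {
        printf("%4.0f min -> %.1f ns per seed\n", minutes[i], ns_per_seed(minutes[i]));
    }
}
```

At 35 minutes this works out to roughly 489 ns per seed, and at 2 minutes to roughly 28 ns, so by that point even a single nanosecond shaved from the inner loop is a few percent of the total run.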



HR,
Ghandi

hutch--

It looks like a very useful processor and if you are happy with it, it's worth holding out until the Nehalem core stuff drops in price. That is what I was originally going to do, but I had 1 box fail and the second was getting glitchy, so I built a couple of new ones, a fast PIV and the new quad. The Nehalem has a better instruction set than the Core series and it has better facilities for additional multithreading support. I think from memory they have re-introduced the Hyperthreading from the PIV era as well.

I envy you on the price. I was slugged about $450 AU for the quad, which was about market price recently. Everyone wants to sell you an i7 at the moment but I would rather pick up the end development of a core, not the start.

Astro

Quote: "I think from memory they have re-introduced the Hyperthreading from the PIV era as well."
They have, yes.

Best regards,
Astro.