How many registers do I have?

Started by frktons, July 15, 2010, 04:44:02 PM


oex

Oh OK so how many conveyor belts?.... I take it you mean something like....

mov eax, 7
add eax, 3

mov ebx, 5
add ebx, 5

add eax, ebx


Pipeline 1                       Pipeline 2 - Processed on completion P1

mov eax, 7
add eax, 3
                                    add eax, ebx
mov ebx, 5
add ebx, 5

????

If so, how many 'component' conveyor belts are there in each pipeline (i.e. 2 in this example in P1 and 1 in P2)?

Maybe I'm way off again.... I guess this looks more like FPGA but it's been a long night....

Maybe someone could point me in the direction of a diagram and/or code example explanation?

jj2007

Quote from: hutch-- on July 19, 2010, 08:45:54 AM
Frank and oex, forget old timing manuals in cycles on anything later than a 386

Indeed.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
5434    cycles for 100*div
470     cycles for 100*mul
174     cycles for 100*imul
177     cycles for 100*shl

frktons

On my pc I get different results:

Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
175     cycles for 100*mul
95      cycles for 100*imul
62      cycles for 100*shl

--- ok ---


shl looks faster.
Probably because you only shifted 1 position:

mov eax, 2
shl eax, 1
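
A single shl only stands in for a multiplication by a power of two, and the count is just an immediate operand, so larger counts are easy to try; a minimal sketch (not taken from the test program):

mov eax, 7
shl eax, 3        ; eax = 7*8 = 56; a shift replaces a multiply only for powers of two
shl eax, 15       ; a much larger count, for comparison against shl eax, 1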


oex

Hmmmm, that just creates even more questions for me :lol.... Hutch just said to forget the timings of individual instructions, so what do those timings tell us *out of context*?

From what I can see here now no code can be judged by any means other than testing it?

Quote from: hutch
on some of the later processors the throughput of any single instruction without a stall may be 40 to 50 cycles from entry to retirement

jj2007

Quote from: frktons on July 19, 2010, 09:08:50 AM
shl looks faster.
Probably because you only shifted 1 position:

That "shl reg, 1 is faster than shl reg, 15" might be valid for very old CPUs... see updated attachment above.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4728    cycles for 100*div
468     cycles for 100*mul
172     cycles for 100*imul
183     cycles for 100*shl 1
181     cycles for 100*shl 2

frktons

Not many changes, actually:

Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
1221    cycles for 100*div
179     cycles for 100*mul
95      cycles for 100*imul
63      cycles for 100*shl 1
63      cycles for 100*shl 2

1226    cycles for 100*div
179     cycles for 100*mul
95      cycles for 100*imul
62      cycles for 100*shl 1
63      cycles for 100*shl 2


--- ok ---


shl still looks 50% faster than imul  ::)

Well, I'm working on Win XP Pro/32-bit with a Core 2 Duo; I don't
know if that matters, but it seems to make a difference.

jj2007

Quote from: frktons on July 19, 2010, 09:21:02 AM

shl still looks 50% faster than imul  ::)

Well, I'm working on Win XP Pro/32-bit with a Core 2 Duo; I don't
know if that matters, but it seems to make a difference.

We are talking 0.95 cycles instead of 0.63 cycles per multiplication. And you still have not explained how you want to replace imul  eax, 6554 with some intelligent shift, add etc operations that perform in less than 0.95 cycles...
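
For a constant with only a couple of set bits the classic shift/add decomposition works; a minimal sketch, using 10 purely for illustration (it is not the constant under discussion):

lea eax, [eax+eax*4]   ; eax = eax*5
add eax, eax           ; eax = eax*10, no imul needed

6554, on the other hand, is 1100110011010 in binary (seven set bits), so the equivalent chain of shifts and adds would be long and dependent, which is why a single imul is hard to beat here.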

frktons

Quote from: jj2007 on July 19, 2010, 09:25:49 AM
We are talking 0.95 cycles instead of 0.63 cycles per multiplication. And you still have not explained how you want to replace imul  eax, 6554 with some intelligent shift, add etc operations that perform in less than 0.95 cycles...

Sorry JJ, I was just showing you what I get. As for replacing
the  imul  eax, 6554  with something smarter, I have no clue
for the time being; I'll have to think about that for a while. That is
a magic number and I don't know how to deal with those
without offending them  :lol

Could you suggest something?

By the way, have you any idea why the imul and the shift
show different performance on your machine?  ::) Your machine should
be faster than mine, according to what is displayed.


Rockoon

Let's get some AMD representation:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
1899    cycles for 100*div
193     cycles for 100*mul
96      cycles for 100*imul
61      cycles for 100*shl 1
61      cycles for 100*shl 2

1896    cycles for 100*div
193     cycles for 100*mul
96      cycles for 100*imul
61      cycles for 100*shl 1
61      cycles for 100*shl 2


But honestly, the timing of individual instructions is useless. When choosing between SHL and IMUL (and hey, why wasn't LEA represented here?) the other instructions in the pipeline mean everything. IMUL occupies a different execution unit than SHL does on the latest from both Intel and AMD.

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.
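
Since LEA keeps coming up: it handles small constant multiplications with addressing arithmetic rather than the multiplier, and it does not touch the flags. A minimal sketch of the usual forms (scale factors are limited to 1, 2, 4 and 8):

lea eax, [eax*2]          ; eax*2, same result as shl eax, 1
lea eax, [eax+eax*2]      ; eax*3
lea eax, [eax+eax*4]      ; eax*5
lea ecx, [eax+ebx+100]    ; ecx = eax+ebx+100 in one instruction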

hutch--

oex,

The concept of one or more pipelines is not something you can control very well independently; it comes more from understanding how they work. You have two basic classes of instructions: the RISC-preferred set and the old junk, which is mainly implemented in microcode, and what recent processors do is present an interface to the x86 instruction set. From a variety of sources you get a reasonably good idea of what the preferred instruction set is, and it's usually the simpler instructions: MOV, ADD, SUB, TEST, CMP. Then you have more complex instructions that get slower, and this varies from one processor to another; shifts and rotates are usually off the pace on late hardware, XCHG is a lemon, and string instructions without REP are worth avoiding, but there is special-case circuitry that cuts in when they are used with REP on blocks of roughly 500 bytes or more. On older hardware IMUL and MUL were very slow, and they still are in comparison to preferred instructions, but later hardware is getting faster at multiplication as it has additional execution units to do stuff like this.

You get the fastest code for the data size by using preferred instructions and avoiding stalls from a variety of situations, dependency being one of the bad ones: it will stop a pipeline until the result it depends on is available. Earlier processors had problems with alignment, and some had problems with data sizes other than the native unit size (32-bit, or 64-bit on later hardware).
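
A minimal sketch of the dependency point, with val1..val4 as hypothetical DWORD variables:

; one long chain: every add has to wait for the previous result
mov eax, val1
add eax, val2
add eax, val3
add eax, val4

; the same sum split into two independent chains that can run in parallel
mov eax, val1
mov ecx, val3
add eax, val2
add ecx, val4
add eax, ecx    ; the only dependency is at the final join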

LEA was fast on everything from the 486 up to the early PIVs, where it was off the pace and could be replaced by a number of ADDs in some contexts; on the Core series and later, LEA is fast again.
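
For the early-PIV case the replacement would look something like this (a sketch, not measured code):

lea ecx, [eax+ebx+8]    ; one LEA...

mov ecx, eax            ; ...or the same arithmetic done with the
add ecx, ebx            ; preferred MOV/ADD instructions
add ecx, 8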

frktons

Quote from: Rockoon on July 19, 2010, 11:35:11 AM
Let's get some AMD representation:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
1899    cycles for 100*div
193     cycles for 100*mul
96      cycles for 100*imul
61      cycles for 100*shl 1
61      cycles for 100*shl 2

1896    cycles for 100*div
193     cycles for 100*mul
96      cycles for 100*imul
61      cycles for 100*shl 1
61      cycles for 100*shl 2


But honestly, the timing of individual instructions is useless. When choosing between SHL and IMUL (and hey, why wasn't LEA represented here?) the other instructions in the pipeline mean everything. IMUL occupies a different execution unit than SHL does on the latest from both Intel and AMD.



What about lea, where is she?

Post some representative code for lea please, let's have
a taste of her  :lol

frktons

Quote from: hutch-- on July 19, 2010, 11:58:32 AM
oex,

The concept of one or more pipelines is not something you can control very well independently; it comes more from understanding how they work. You have two basic classes of instructions: the RISC-preferred set and the old junk, which is mainly implemented in microcode, and what recent processors do is present an interface to the x86 instruction set. From a variety of sources you get a reasonably good idea of what the preferred instruction set is, and it's usually the simpler instructions: MOV, ADD, SUB, TEST, CMP. Then you have more complex instructions that get slower, and this varies from one processor to another; shifts and rotates are usually off the pace on late hardware, XCHG is a lemon, and string instructions without REP are worth avoiding, but there is special-case circuitry that cuts in when they are used with REP on blocks of roughly 500 bytes or more. On older hardware IMUL and MUL were very slow, and they still are in comparison to preferred instructions, but later hardware is getting faster at multiplication as it has additional execution units to do stuff like this.

You get the fastest code for the data size by using preferred instructions and avoiding stalls from a variety of situations, dependency being one of the bad ones: it will stop a pipeline until the result it depends on is available. Earlier processors had problems with alignment, and some had problems with data sizes other than the native unit size (32-bit, or 64-bit on later hardware).

LEA was fast on everything from the 486 up to the early PIVs, where it was off the pace and could be replaced by a number of ADDs in some contexts; on the Core series and later, LEA is fast again.

It looks like you never get a rest with CPU modifications and upgrades.
Probably you have to stick with whatever is best for a given timeframe
and be ready to change as needed. ::)

dedndave

Quote
From what I can see here now no code can be judged by any means other than testing it?

even testing it is only valid if you test it on a variety of CPUs
P4 cores are quickly becoming obsolete, in spite of the fact that they may not be all that old

here is the method i use...


jj2007

Quote from: Rockoon on July 19, 2010, 11:35:11 AM
and hey, why wasn't LEA represented here?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
5440    cycles for 100*div
471     cycles for 100*mul
173     cycles for 100*imul
426     cycles for 100*lea, 2*eax
277     cycles for 100*lea, 2*eax+eax
426     cycles for 100*lea, 2*eax+eax+99
177     cycles for 100*shl 1
177     cycles for 100*shl 2
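
The test source is not shown here, but the labels presumably correspond to forms like:

lea eax, [eax*2]           ; "lea, 2*eax"
lea eax, [eax*2+eax]       ; "lea, 2*eax+eax"      = eax*3
lea eax, [eax*2+eax+99]    ; "lea, 2*eax+eax+99"   = eax*3+99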

sinsi

FWIW,

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
1219    cycles for 100*div
178     cycles for 100*mul
94      cycles for 100*imul
94      cycles for 100*lea, 2*eax
94      cycles for 100*lea, 2*eax+eax
94      cycles for 100*lea, 2*eax+eax+99
62      cycles for 100*shl 1
62      cycles for 100*shl 2

1217    cycles for 100*div
178     cycles for 100*mul
94      cycles for 100*imul
94      cycles for 100*lea, 2*eax
94      cycles for 100*lea, 2*eax+eax
94      cycles for 100*lea, 2*eax+eax+99
62      cycles for 100*shl 1
62      cycles for 100*shl 2

No different to the earlier test (I do keep an eye on you mr jj :bg)
Tests should operate on the same data  :naughty: