How many registers do I have?

Started by frktons, July 15, 2010, 04:44:02 PM

GregL


oex

If you have a dual core (or better) CPU you still technically have exactly the same registers, though they are silently duplicated, one full set per core.
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

dedndave

i was waiting for someone to bring up HTT   :P
i have thought some of playing around with "hogging both threads" of my prescott - lol
i don't think it is a great idea in practice, though
seems like you tie up the machine by doing that - better to let the OS manage threads
but - it could give you a whole extra set of registers to play with
not that you could exchange from one set to another efficiently, but you might be able to find some advantage in there

frktons

Thanks everybody for your suggestions.  :U

Actually my CPU doesn't support Hyper-Threading Technology, so
HTT is not an issue for the time being, and it is probably too advanced
a subject for n00bs of my level.  :P

Algorithms in Assembly, well, that's the subject I'd really like
to grasp a little. I have seen a lot of good books on algorithms in
C/C++/Java and the like. The C/C++ category is probably the
closest to the machine.
From C, which I'm currently learning, I'll try to pick up some
algorithmic thinking, so to speak, and then go all the way to translating
or adapting the algorithms in MASM/GoAsm or whatever.

It's quite a long way though, and the sources are overwhelming  :eek
A step, slow one, at a time, no other choice.  :lol
Mind is like a parachute. You know what to do in order to use it :-)

frktons

One of the things I'd like to test is the use of 64 bit registers
to perform division, which is quite resource consuming, as
many of you have explained to me.

This short mixed code, which I use for dividing a number by ten,
is an example I'd like to improve a little with a better algorithm,
maybe a divide by multiply and shift, and/or with the use of
some 64 bit Assembly trick I'm not aware of:


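    long div_result = 0;
    long remain = 0;
    const long ten = 10;
    num2 = rand() % 10000;      /* 0..9999, never negative */
    __asm{
    xor   edx, edx              ; clear high half of dividend (ok: num2 >= 0; use cdq if it can be negative)
    mov   eax, num2             ; dividend is edx:eax
    mov   ecx, ten              ; divisor
    idiv  ecx                   ; eax = quotient, edx = remainder
    mov   div_result, eax
    mov   remain, edx
    }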


MMX registers are probably not well suited for this purpose,
or are slower than the GPRs, I actually don't know. What I do know
is that if I use the following code, which is plain C:

      num2 = rand() % 10000;
      div_result = (num2 * 6554UL) >> 16;
      remain = num2 - div_result * ten;   


I get better performance because the algorithm is smarter:
it doesn't use division. It multiplies the dividend by a magic
number and then shifts the product right by a fixed number
of positions.

Of course 6554 works for numbers no bigger than 9999,
and I'd have to calculate the magic number depending on the range
I'm going to use.
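
The range claim above can be checked by brute force. The following is only a sketch: `first_mismatch` is a made-up helper name, not something from the thread, and it simply compares the multiply-and-shift result against ordinary division for every value up to a chosen limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): returns the smallest n in
   [0, limit] for which (n * magic) >> shift differs from n / divisor,
   or limit + 1 if the two agree over the whole range. */
uint64_t first_mismatch(uint32_t magic, unsigned shift,
                        uint32_t divisor, uint32_t limit)
{
    assert(divisor != 0 && shift < 64);
    for (uint64_t n = 0; n <= limit; n++) {
        /* n and magic both fit in 32 bits, so the 64-bit product cannot overflow */
        uint64_t approx = (n * magic) >> shift;
        if (approx != n / divisor)
            return n;
    }
    return (uint64_t)limit + 1;
}
```

Called as `first_mismatch(6554, 16, 10, 100000)` this returns 16389, so the 6554 / shift-by-16 pair (6554 being 65536 / 10 rounded up) is exact for 0 through 16388, which comfortably covers the 0..9999 range used here.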

So I was wondering what performance we could get using methods
like this with 64 bit registers and a full set of magic
numbers.  ::)



frktons

I translated the C code for divide by multiply and shift:


      div_result = (num2 * 6554UL) >> 16;
      remain = num2 - div_result * ten;   


in Assembly this way:


      mov  eax, num2
      imul  eax, 6554
      shr    eax, 16
      mov  div_result, eax
      mov  ecx, num2
      imul  eax, ten
      sub   ecx, eax
      mov  remain, ecx


But the performance is about the same, and I don't
know whether that is because the compiler is so good at
translating the code, or because I am so bad at it.  :P

Any suggestion to improve the above code?

KeepingRealBusy

Magic numbers are good for dividing by multiplying by a magic number and shifting, but you get no remainder, you need the shift, and they are usually used for dividing by constants, not by variables. For variables, you would need either a table of all possible magic numbers, or a table of number/magic_number pairs that has to be searched for a matching number to find the magic number to use. The full table would exceed allowable memory (especially for 64 bit), and the search would take more time than you would save with the magic number multiply.

Until you start using 64 bit processing, you do not have 64 bit GP registers (rax, rdx). I do not see any MMX 64 bit register instructions that do divides. Some MMX 64 bit register packed multiplies exist, but nothing that you cannot do with a multiply of eax and edx. Note: to save a register, put one value in eax and the other in edx, then mul edx; the 64 bit result is in edx:eax (high 32 bits in edx, low 32 bits in eax).

Dave.
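
As a rough illustration of the 32 x 32 -> 64 bit multiply Dave describes, here is a C sketch of a divide by ten that uses the full 64 bit product. The names `divmod10` and `divmod10_t` are invented for this example. The constant 0xCCCCCCCD is 2^35 / 10 rounded up, the usual choice for unsigned 32 bit division by ten; it is a swapped-in constant, not one anybody in the thread proposed. The remainder still has to be computed separately, as noted above.

```c
#include <stdint.h>

/* Hypothetical result type: quotient and remainder of n / 10. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} divmod10_t;

/* Divide an unsigned 32-bit value by 10 without a div instruction. */
divmod10_t divmod10(uint32_t n)
{
    /* 0xCCCCCCCD = ceil(2^35 / 10). The product needs 64 bits:
       edx:eax after a 32-bit mul, or a single 64-bit register. */
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;
    divmod10_t r;
    r.quot = (uint32_t)(product >> 35);
    r.rem  = n - r.quot * 10u;   /* the magic multiply gives no remainder by itself */
    return r;
}
```

Unlike the 6554 / shift-by-16 pair, this constant is exact for every 32 bit unsigned input, so no table of per-range magic numbers is needed as long as the divisor is the constant ten.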

frktons

Thanks Dave.

I was making naive assumptions, typical beginner stuff  :P

By the way, is the code I used to translate the C code good enough,
or could I do better in some ways?

oex

Quote from: frktons on July 19, 2010, 02:20:14 AM
Any suggestion to improve the above code?

You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:

mov  div_result, eax

you could also:

mov  eax, num2
mov  ecx, eax

frktons

Quote from: oex on July 19, 2010, 03:23:00 AM
You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:

mov  div_result, eax

well I need the div_result variable to use in the C code.

Quote
you could also:

mov  eax, num2
mov  ecx, eax

Well, this is good  :U I can save some cycles this way. Thanks:



      mov  eax, num2
      mov  ecx, eax
      imul  eax, 6554
      shr    eax, 16
      mov  div_result, eax
      imul  eax, ten
      sub   ecx, eax
      mov  remain, ecx


Nevertheless I'm not able to beat the Pelles C compiler.
The C code is as fast as the Assembly.  :eek

oex

Most of the time is taken up in the imuls.... If you can find a way to remove or combine them you should be in luck but it's too late for me to do that math :lol
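
One way to drop the second imul, sketched in C: multiplying the quotient by ten can be done as (q + q*4) * 2, which in assembly would roughly correspond to `lea eax, [eax+eax*4]` followed by `add eax, eax`. The helper names `times10` and `rem10` are made up for this example, and whether this is actually faster than `imul` on a given CPU is an assumption that would need measuring.

```c
#include <stdint.h>

/* q * 10 computed as (q + q*4) * 2: a shift-and-add, then a doubling. */
uint32_t times10(uint32_t q)
{
    uint32_t q5 = q + (q << 2);   /* q * 5 */
    return q5 << 1;               /* q * 10 */
}

/* Remainder of n / 10, valid for the thread's range 0 <= n <= 9999. */
uint32_t rem10(uint32_t n)
{
    uint32_t q = (n * 6554u) >> 16;   /* quotient via magic multiply and shift */
    return n - times10(q);            /* remainder without a second multiply */
}
```

The first multiply by 6554 stays; only the `imul eax, ten` step is replaced.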

jj2007

Quote from: oex on July 19, 2010, 04:56:23 AM
Most of the time is taken up in the imuls...

imuls are actually pretty fast, much faster than normal muls, so don't waste too much effort on finding a workaround.

oex

I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?

mov, sub and shr are down as 1-3 clocks....

I don't know for sure and it's been a VERY long night but shr eax, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?
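
To make the "maybe the other way round" doubt concrete, here is a small C sketch of what each instruction leaves behind. The function names are invented for illustration; the comments describe the usual x86 semantics of the two instructions.

```c
#include <stdint.h>

/* What "shr eax, 16" leaves in eax: the upper 16 bits, moved down. */
uint32_t high_half(uint32_t x)
{
    return x >> 16;
}

/* What "movzx ebx, ax" leaves in ebx: the lower 16 bits, zero-extended. */
uint32_t low_half(uint32_t x)
{
    return x & 0xFFFFu;
}
```

For the product 9999 * 6554 = 65533446, `high_half` gives 999 (the quotient wanted here) while `low_half` gives 62982, so `movzx ebx, ax` alone is not a substitute for the shift in this routine.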

frktons

Quote from: oex on July 19, 2010, 07:09:36 AM
I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?

mov, sub and shr are down as 1-3 clocks....

I don't know for sure and it's been a VERY long night but shr eax, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?

Thanks oex, this is another option to try:

movzx ebx, ax


or whatever code actually works for it, I still don't know.  ::)

Back home, on my pc, I'll try it and see if it performs any better.  :P

hutch--

Frank and oex, forget the old per-instruction cycle counts from timing manuals on anything later than a 386. Later processors have pipelines that "SCHEDULE" instructions, and on some of the more recent ones a single instruction, even without a stall, may take 40 to 50 cycles from entry to retirement.

Think of one or more pipelines as instruction assembly production lines, like in a factory: performance is measured by the output, not by the individual component.