How many registers do I have?

GregL · July 17, 2010, 02:44:58 AM

oex · July 17, 2010, 02:47:39 AM

If you have a dual core (or more) CPU you still technically have exactly the same registers, though they are silently doubled (or more): each core has its own set.

dedndave · July 17, 2010, 02:57:30 AM

i was waiting for someone to bring up HTT :P
i have thought some of playing around with "hogging both threads" of my prescott - lol
i don't think it is a great idea in practice, though
seems like you tie up the machine by doing that - better to let the OS manage threads
but - it could give you a whole extra set of registers to play with
not that you could exchange from one set to another efficiently, but you might be able to find some advantage in there

frktons · July 17, 2010, 08:18:46 AM

Thanks everybody for your suggestions. :U

Actually my CPU doesn't support Hyper-Threading Technology, so
HTT is not an issue for the time being, and it is probably too advanced
a subject for n00bs of my level. :P

Algorithms in Assembly, well, that's the subject I'd really like
to grasp a little. I have seen a lot of good books on algorithms in
C/C++/Java and the like. Probably the C/C++ category is the
closest to the machine.
While learning C, I'll take the chance to pick up some
algorithmic thinking, so to speak, and then go on to translate
or adapt the algorithms in MASM/GoAsm or whatever.

It's quite a long way though, and the sources are overwhelming :eek
One slow step at a time, no other choice. :lol

frktons · July 19, 2010, 12:20:39 AM

One of the things I'd like to test is the use of 64 bit registers
to perform division, which is quite resource-consuming, as
many of you have explained to me.

This short mixed code that I use to divide a number by ten
is an example I'd like to improve a little with a better algorithm,
maybe a divide by multiply and shift, and/or with the use of
some 64 bit Assembly trick I'm not aware of:

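    long div_result = 0;
    long remain = 0;
    const long ten = 10;
    num2 = rand() % 10000;
    /* idiv divides edx:eax by ecx; zeroing edx is only valid because
       num2 is never negative here (cdq would sign-extend in general) */
    __asm{
        xor   edx, edx
        mov   eax, num2
        mov   ecx, ten
        idiv  ecx
        mov   div_result, eax
        mov   remain, edx
    }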

Probably MMX registers are not well suited for this purpose,
or are slower than the GPRs; I actually don't know. Certainly if I use
the following code, which is plain C:

      num2 = rand() % 10000;
      div_result = (num2 * 6554UL) >> 16;
      remain = num2 - div_result * ten;

I get better performance because the algorithm is smarter
and doesn't use division: it multiplies the dividend by a magic
number and then shifts the product right by a fixed number
of bit positions.

Of course 6554 works for numbers not bigger than 9999,
and I'd have to calculate the magic number depending on the range
I'm going to use.

So I was wondering what performance we could get using methods
like this with 64 bit registers and a full set of magic numbers
to use. ::)
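
One way to see where a magic number like 6554 stops being valid is to compare it against ordinary division over a whole range. The sketch below is a minimal check in C, assuming 32-bit unsigned arithmetic; the names `div10_magic` and `first_mismatch` are made up for illustration and do not appear in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Divide by ten with a multiply and a shift.
   6554 is 65536 / 10 rounded up, so (n * 6554) >> 16 approximates n / 10.
   Assumption: the caller keeps n within 0..9999, the range used in the post. */
static uint32_t div10_magic(uint32_t n)
{
    assert(n <= 9999u);
    return (n * 6554u) >> 16;
}

/* Hypothetical helper: returns the first n in [0, limit) where the
   multiply-and-shift result differs from n / 10, or limit if none does.
   Keep limit at or below 65536 so that n * 6554 cannot overflow 32 bits. */
static uint32_t first_mismatch(uint32_t limit)
{
    for (uint32_t n = 0; n < limit; ++n) {
        if (((n * 6554u) >> 16) != n / 10u)
            return n;
    }
    return limit;
}
```

If `first_mismatch(10000)` returns 10000, the constant is safe for the whole 0..9999 range; calling it with a larger limit shows where a bigger magic number and shift become necessary.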

frktons · July 19, 2010, 02:20:14 AM

I translated the C code for divide by multiply and shift:

      div_result = (num2 * 6554UL) >> 16;
      remain = num2 - div_result * ten;

in Assembly this way:

      mov   eax, num2
      imul  eax, 6554
      shr   eax, 16
      mov   div_result, eax
      mov   ecx, num2
      imul  eax, ten
      sub   ecx, eax
      mov   remain, ecx

But the performance is about the same, and I don't
know whether that depends on how good the compiler is at
translating the code, or how bad I am at doing the same. :P

Any suggestion to improve the above code?

KeepingRealBusy · July 19, 2010, 03:12:14 AM

Magic numbers are good for dividing with a multiply and a shift, but the multiply gives you no remainder directly and you still need the shift. They are usually used for dividing by constants, not by variables. For variables, you would need a table of all possible magic numbers, or a table of number/magic_number pairs that had to be searched for a matching number to get the magic number to use. The full table would exceed allowable memory (especially for 64 bit), and the search would take more time than you would save with the magic number multiply.
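
For a divisor that is only known at run time, one alternative to a table is to compute the magic number once and reuse it for many divisions. The sketch below illustrates that idea and is not something proposed in the thread; it assumes dividends below 65536 and divisors from 1 to 65535, and the names `magic16_t`, `magic16_init` and `magic16_div` are hypothetical. Computing the constant costs one real division, so it can only pay off when the same divisor is reused.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed multiplier for one divisor d: m = floor(2^32 / d) + 1. */
typedef struct {
    uint64_t m;
} magic16_t;

/* Assumption: 1 <= d <= 65535. This step still uses one real division. */
static magic16_t magic16_init(uint32_t d)
{
    assert(d >= 1u && d <= 65535u);
    magic16_t mg;
    mg.m = (UINT64_C(1) << 32) / d + 1u;
    return mg;
}

/* Assumption: n <= 65535. With that bound the rounding error in m is too
   small to change the integer part, so the result equals n / d. */
static uint32_t magic16_div(magic16_t mg, uint32_t n)
{
    assert(n <= 65535u);
    return (uint32_t)((n * mg.m) >> 32);
}
```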

Until you start using 64 bit processing, you do not have 64 bit GP registers (rax, rdx). I do not see any MMX 64 bit register instructions that do divides. Some MMX packed multiplies exist, but nothing that you cannot do with a multiply of eax and edx. Note: to save a register, put one value in eax and the other in edx, then mul edx; the 64 bit result is in edx:eax (high 32 bits:low 32 bits).

Dave.
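
The widening multiply described above (one `mul` producing a 64-bit product across two 32-bit registers) can be written in portable C with a 64-bit intermediate. The sketch below uses the constant 0xCCCCCCCD with a shift of 35, a commonly used pair for unsigned 32-bit division by ten; neither the constant nor the function names come from the thread. The remainder is recovered afterwards with one extra multiply and a subtract.

```c
#include <stdint.h>

/* 32 x 32 -> 64 bit unsigned multiply, the C counterpart of a single mul. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Unsigned divide by ten for any 32-bit n.
   0xCCCCCCCD is 2^35 / 10 rounded up, so the upper bits of the product
   hold the quotient after shifting right by 35. */
static uint32_t div10_u32(uint32_t n)
{
    return (uint32_t)(mul32x32(n, 0xCCCCCCCDu) >> 35);
}

/* Remainder derived from the quotient: n - (n / 10) * 10. */
static uint32_t mod10_u32(uint32_t n)
{
    return n - div10_u32(n) * 10u;
}
```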

frktons · July 19, 2010, 03:18:02 AM

Thanks Dave.

I was making naive assumptions, typical beginner stuff :P

By the way, is the code I used to translate the C code good enough,
or could I do better in some way?

oex · July 19, 2010, 03:23:00 AM

Quote from: frktons on July 19, 2010, 02:20:14 AM
Any suggestion to improve the above code?

You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:

mov div_result, eax

you could also:

mov eax, num2
mov ecx, eax

frktons · July 19, 2010, 03:48:38 AM

Quote from: oex on July 19, 2010, 03:23:00 AM
You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:

mov div_result, eax

well I need the div_result variable to use in the C code.

Quote
you could also:

mov eax, num2
mov ecx, eax

Well, this is good :U I can spare some cycles this way. Thanks:

      mov   eax, num2
      mov   ecx, eax
      imul  eax, 6554
      shr   eax, 16
      mov   div_result, eax
      imul  eax, ten
      sub   ecx, eax
      mov   remain, ecx

Nevertheless I'm not able to beat the Pelles C compiler.
The C code is as fast as the Assembly. :eek

oex · July 19, 2010, 04:56:23 AM

Most of the time is taken up in the imuls.... If you can find a way to remove or combine them you should be in luck but it's too late for me to do that math :lol

jj2007 · July 19, 2010, 06:43:14 AM

Quote from: oex on July 19, 2010, 04:56:23 AM
Most of the time is taken up in the imuls...

imuls are actually pretty fast, much faster than normal muls, so don't waste too much effort looking for a workaround.

oex · July 19, 2010, 07:09:36 AM

I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?

mov, sub and shr are down as 1-3 clocks....

I dont know for sure and it's been a VERY long night but shr, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?
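
Whether `movzx ebx, ax` can stand in for `shr eax, 16` can be checked by modelling both in C. In this sketch (the function names are made up for illustration), the shift keeps the upper 16 bits of the product while the zero-extension keeps the lower 16 bits, so the two generally give different results; the quotient of the multiply-and-shift method sits in the upper half.

```c
#include <stdint.h>

/* Models shr eax, 16: keeps the upper 16 bits of a 32-bit value. */
static uint32_t upper16(uint32_t x)
{
    return x >> 16;
}

/* Models movzx ebx, ax: keeps the lower 16 bits, zero-extended. */
static uint32_t lower16(uint32_t x)
{
    return (uint32_t)(uint16_t)x;
}
```

For example, with num2 = 1234 the product 1234 * 6554 gives the expected quotient only through `upper16`; `lower16` returns the discarded fractional part instead.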

frktons · July 19, 2010, 08:38:16 AM

Quote from: oex on July 19, 2010, 07:09:36 AM
I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?

mov, sub and shr are down as 1-3 clocks....

I dont know for sure and it's been a VERY long night but shr, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?

Thanks oex, this is another option to try:

movzx ebx, ax

or the code that works for it, I still don't know. ::)

Back home, on my pc, I'll try it and see if it performs any better. :P

hutch-- · July 19, 2010, 08:45:54 AM

Frank and oex, forget old timing manuals that count cycles on anything later than a 386: newer processors have pipelines that "SCHEDULE" instructions, and on some of the later processors any single instruction, even without a stall, may take 40 to 50 cycles from entry to retirement.

Think of one or more pipelines as instruction assembly production lines like in a factory: performance is measured by the output, not the individual component.
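
In that spirit, a rough way to compare two approaches is to time a whole loop rather than to count cycles per instruction. The sketch below assumes the standard C `clock()` timer and uses made-up function names; it is a crude measurement, and an optimizing compiler may rewrite either loop, so the numbers are only meaningful for one particular build on one particular machine.

```c
#include <stdint.h>
#include <time.h>

/* Sum of n / 10 for n = 0..9999, repeated reps times, using plain division. */
static uint32_t sum_div(uint32_t reps)
{
    uint32_t sum = 0;
    for (uint32_t r = 0; r < reps; ++r)
        for (uint32_t n = 0; n < 10000u; ++n)
            sum += n / 10u;
    return sum;
}

/* Same sum using the multiply-and-shift form from the thread. */
static uint32_t sum_magic(uint32_t reps)
{
    uint32_t sum = 0;
    for (uint32_t r = 0; r < reps; ++r)
        for (uint32_t n = 0; n < 10000u; ++n)
            sum += (n * 6554u) >> 16;
    return sum;
}

/* Returns the CPU seconds spent in fn(reps). The checksum is written to
   *out so the work cannot be discarded and the two loops can be compared. */
static double time_loop(uint32_t (*fn)(uint32_t), uint32_t reps, uint32_t *out)
{
    clock_t start = clock();
    *out = fn(reps);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Timing `sum_div` and `sum_magic` with the same, fairly large `reps` and checking that both checksums match gives a throughput comparison of the kind described above.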
