only innerloop that fits in cache+recursion faster than innerloop+several outerL

Started by daydreamer, March 16, 2006, 08:05:14 AM


daydreamer

???
I have an algo whose code is almost the same for the inner loop and the 2 outer loops.
If I make it fit in 32 bytes and align it, can it be faster than code that is 3 times bigger?
I am reusing the same regs + push/pop for the outer loops anyway.
Also, how much do I lose on the P4 (PIV), which has slow shifts,
if I have to choose between the partial-register penalty of mov dl,ah and sar eax,8 + add edx,eax?
The code does full 32-bit register operations earlier in the loop.
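
A rough C model of the two sequences being compared, only to show what each one computes (it says nothing about timing on any CPU). The function names are made up for illustration, and the second model assumes eax holds a non-negative value so that sar behaves like a plain shift:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `mov dl,ah`: replace the low byte of edx
   with bits 8..15 of eax, leaving the upper 24 bits of edx alone. */
static uint32_t model_mov_dl_ah(uint32_t edx, uint32_t eax)
{
    return (edx & 0xFFFFFF00u) | ((eax >> 8) & 0xFFu);
}

/* Hypothetical model of `sar eax,8` followed by `add edx,eax`.
   Assumption: eax is non-negative, so the arithmetic shift equals
   a logical shift. The whole shifted value is added to edx. */
static uint32_t model_sar_add(uint32_t edx, uint32_t eax)
{
    assert((eax & 0x80000000u) == 0);
    return edx + (eax >> 8);
}
```

In this model the two only agree when dl is zero and eax fits in 16 bits. With edx = 0x12345600 and eax = 0x0000AB12 both give 0x123456AB; with eax = 0x0001AB12 the shift-and-add version carries into the next byte and gives 0x123457AB.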



Ratch

 !Czealot,

Quote???

      Question marks usually go after a sentence, not before.  For example, what is your question?  Ratch

daydreamer

Quote from: Ratch on March 16, 2006, 09:57:16 PM
!Czealot,

Quote???

      Question marks usually go after a sentence, not before.  For example, what is your question?  Ratch

only innerloop that fits in cache+recursion faster than innerloop+several outerLoops ??? (it doesn't fit into the title, it gets too long)

Tedd

only innerloop that fits in cache+recursion faster than innerloop+several outerLoops ???

Translation:
    which of the following would be faster:
     - an inner-loop that fits into cache, called recursively;
     - or a (possibly too large to fit in cache) inner-loop, with several outer-loops?


My guess would be the first. But it's just a guess :lol
No snowflake in an avalanche feels responsible.

Mincho Georgiev

Tedd, recursion is not necessarily the way to solve the problem. Just remember Fibonacci for a moment: recursion is a poor choice for calculating Fibonacci numbers (except maybe via de Moivre's formula).
Depending on the exact algorithm, an iterative method may be better, but that DEPENDS on the algo.
Maybe posting a piece of your code would be a good idea, !Czealot.
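
To make the Fibonacci point concrete, here is a minimal C sketch of both approaches. The function names are illustrative only and are not taken from anyone's code in this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Naive recursion: the same subproblems are recomputed over and over,
   so the number of calls grows exponentially with n. */
static uint64_t fib_recursive(unsigned n)
{
    return n < 2 ? n : fib_recursive(n - 1) + fib_recursive(n - 2);
}

/* Iteration: n passes through a small loop, two live values. */
static uint64_t fib_iterative(unsigned n)
{
    assert(n <= 93);            /* F(93) is the largest value that fits in 64 bits */
    uint64_t a = 0, b = 1;      /* a = F(i), b = F(i+1) */
    while (n--) {
        uint64_t t = a + b;     /* unsigned wrap on the final step is harmless */
        a = b;
        b = t;
    }
    return a;
}
```

Both return the same values (F(20) = 6765, for instance), but the recursive version is already slow around n = 40 while the loop stays cheap. That gap comes from the algorithm, not from where the code sits in cache.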



Tedd

Shaka: you're right, a faster algorithm will usually beat any type of optimization.
But, assuming the algorithm stays (almost) the same, keeping code in cache should cause it to be more efficient than constantly swapping.

P1

I believe you have run into cache issues with a uP. Code that fits in 32 bytes (the size of a cache line, which depends on the uP) and starts on an alignment boundary is quicker to execute.

Regards,  P1  :8)
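
A small C sketch of the alignment arithmetic behind that remark. The helper name and the addresses are hypothetical, and the 32-byte line size is just the figure quoted above (it varies by CPU):

```c
#include <assert.h>
#include <stdint.h>

/* Number of cache lines touched by `size` bytes of code starting at `addr`.
   `line` is the cache-line size in bytes. */
static unsigned lines_touched(uintptr_t addr, unsigned size, unsigned line)
{
    assert(size > 0 && line > 0);
    uintptr_t first = addr / line;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line;  /* line holding the last byte */
    return (unsigned)(last - first) + 1;
}
```

A 32-byte loop starting at 0x1000 touches one line, the same loop starting at 0x1010 straddles two, and a loop three times as large (96 bytes) needs at least three lines even when aligned.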