News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster array access for small arrays

Started by LithiumDex, January 30, 2007, 04:07:07 AM

Previous topic - Next topic

LithiumDex

Is there any way to have small (<1kb) frequently used arrays stored in the stack?

Or what about the heap, could someone explain what this is and how much faster/slower it is than Mallocated memory?

hutch--

Putting it on the stack is no advantage apart from perhaps being simpler to allocate. Allocate fixed memory and align the start address with something like the "memalign" macro and you will be doing it about as fast as it can be done.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Tedd

malloc does take from the heap :wink -- it's just a C name for a generalised function which resolves to the OS memory allocation function, depending which OS you compile for at the time. The heap is the name given to the 'heap' of memory available to your process while it's running, which you can allocate chunks from.

When it comes down to it memory is just memory, once the cpu loads it into cache that's as good as it gets. However, since the stack is used regularly, it's more likely to already be in cache. And allocating on the stack is much faster: one instruction, rather than a whole function.

Array on the stack example (LOCAL allocates variables on the stack for you, and the stack is cleaned up on return.)
test proc
    LOCAL myArray[512]:BYTE
    lea eax,[myArray]
    mov BYTE PTR [eax],1
    ret
test endp
No snowflake in an avalanche feels responsible.

Mark_Larson



  I'd recommend putting it on the stack.  Here is why.  Usually you won't get the initial cache miss, which would require it to load it into the cache, because usually the stack is in cache from other memory accesses.  Keep in mind that usually cache lines are 64 bytes, so if your array is too big, it will still get cache misses on the loads, just not the first load, which gives a slight speed advantage.  I generally recommend local variables instead of global ones for procedures for that reason.  And hutch was spot on with the memory alignment.  Non-aligned accesses can cause a big degradation in performance.  So always align!  The rule of thumb is you align by the memory size you will be accessing.  So dwords need 4 byte alignment.  Qword 8 byte, etc.  If you are using SSE/SSE2 most of the instructions require a 16-byte alignment or you will get an exception.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

LithiumDex

hmm.. well my array is to big to fit on the stack, unless a coppied portions of it that were going to be used directly before I accessed them, which could prove to be slower... also I'm using bytes, so I'm guessing memalign isn't going to do much for me, eh?

dsouza123

If the arrays are multidimensional then the order of the parameters can make a difference in speed,
also if possible uses powers of two for all but the first parameter, (that one also if close to a power of two).
Issues: loading a cache line can prefetch multiple sequential values and stride length.

If the access is random then it doesn't really help much.
If sequential you can get four at a time loading it in as a dword.

If the arrays can be grouped together in 4K pieces they will fit in a cache page.

If the array(s) need to be accessed by multiple routines
then putting them as global variables allows them to be universally usable.

Arrays less than 1kb should fit on the stack, stack size can be adjusted.

dsouza123

If they are accessed sequentially then SSE2 may help.
With SSE2, 16 bytes can be loaded together and the same operations
done on all simultaneously.

LithiumDex

allright... so I've adapted it to use only dword read writes, my arrays are allocated in exe space (static), and I've used memalign

so...

1st question, If I'm using dwords and my array was defined in the .data section, I just say,

memalign myArray, 4

right? is there any perticular place this should be? right now I have it before the arrays are filled with the lut values...

so... right now my arrays are setup like this:

x = colour range, y = depth range

DLUT size = x*y*3 (dwords)
0        -> x*y-1     : red
x*y     -> 2*x*y-1  : green
2*x*y -> 3*x*y-1  : blue

ok now, within each one of those arrays, to pull a value given the input colour and input depth I would say:

red: [DLUT + clr + depth*x]
green: [DLUT + clr + depth*x + (x*y)]
blue: [DLUT + clr + depth*x + 2*(x*y)]

now, my idea for an optimization goes like this... since for each slice I draw in sequence, the depth is the same, what if I stored the arrays such that the red, green, and blue would be all together for each depth? how much faster would this make things (i.e. there are 63 dwords * 3 for each depth)

the funny thing is though, making my array static and using memalign didn't improve my speed, it was switching from bytes to dwords after the fact that gave me a 30% increase.. that's still not good enough though.