Faster array access for small arrays

LithiumDex · January 30, 2007, 04:07:07 AM

Is there any way to have small (<1kb) frequently used arrays stored in the stack?

Or what about the heap, could someone explain what this is and how much faster/slower it is than Mallocated memory?

hutch-- · January 30, 2007, 04:43:28 AM

Putting it on the stack is no advantage apart from perhaps being simpler to allocate. Allocate fixed memory and align the start address with something like the "memalign" macro and you will be doing it about as fast as it can be done.

Tedd · January 30, 2007, 12:31:17 PM

malloc does take from the heap :wink -- it's just a C name for a generalised function which resolves to the OS memory allocation function, depending which OS you compile for at the time. The heap is the name given to the 'heap' of memory available to your process while it's running, which you can allocate chunks from.

When it comes down to it memory is just memory, once the cpu loads it into cache that's as good as it gets. However, since the stack is used regularly, it's more likely to already be in cache. And allocating on the stack is much faster: one instruction, rather than a whole function.

Array on the stack example (LOCAL allocates variables on the stack for you, and the stack is cleaned up on return.)

Code Select

test proc
    LOCAL myArray[512]:BYTE
    lea eax,[myArray]
    mov BYTE PTR [eax],1
    ret
test endp

Mark_Larson · January 30, 2007, 07:32:14 PM

I'd recommend putting it on the stack. Here is why. Usually you won't get the initial cache miss, which would require it to load it into the cache, because usually the stack is in cache from other memory accesses. Keep in mind that usually cache lines are 64 bytes, so if your array is too big, it will still get cache misses on the loads, just not the first load, which gives a slight speed advantage. I generally recommend local variables instead of global ones for procedures for that reason. And hutch was spot on with the memory alignment. Non-aligned accesses can cause a big degradation in performance. So always align! The rule of thumb is you align by the memory size you will be accessing. So dwords need 4 byte alignment. Qword 8 byte, etc. If you are using SSE/SSE2 most of the instructions require a 16-byte alignment or you will get an exception.

LithiumDex · January 30, 2007, 10:11:21 PM

hmm.. well my array is to big to fit on the stack, unless a coppied portions of it that were going to be used directly before I accessed them, which could prove to be slower... also I'm using bytes, so I'm guessing memalign isn't going to do much for me, eh?

dsouza123 · January 30, 2007, 10:28:18 PM

If the arrays are multidimensional then the order of the parameters can make a difference in speed,
also if possible uses powers of two for all but the first parameter, (that one also if close to a power of two).
Issues: loading a cache line can prefetch multiple sequential values and stride length.

If the access is random then it doesn't really help much.
If sequential you can get four at a time loading it in as a dword.

If the arrays can be grouped together in 4K pieces they will fit in a cache page.

If the array(s) need to be accessed by multiple routines
then putting them as global variables allows them to be universally usable.

Arrays less than 1kb should fit on the stack, stack size can be adjusted.

dsouza123 · January 30, 2007, 10:33:31 PM

If they are accessed sequentially then SSE2 may help.
With SSE2, 16 bytes can be loaded together and the same operations
done on all simultaneously.

LithiumDex · January 31, 2007, 01:09:31 AM

allright... so I've adapted it to use only dword read writes, my arrays are allocated in exe space (static), and I've used memalign

so...

1st question, If I'm using dwords and my array was defined in the .data section, I just say,

memalign myArray, 4

right? is there any perticular place this should be? right now I have it before the arrays are filled with the lut values...

so... right now my arrays are setup like this:

x = colour range, y = depth range

DLUT size = x*y*3 (dwords)
0 -> x*y-1 : red
x*y -> 2*x*y-1 : green
2*x*y -> 3*x*y-1 : blue

ok now, within each one of those arrays, to pull a value given the input colour and input depth I would say:

red: [DLUT + clr + depth*x]
green: [DLUT + clr + depth*x + (x*y)]
blue: [DLUT + clr + depth*x + 2*(x*y)]

now, my idea for an optimization goes like this... since for each slice I draw in sequence, the depth is the same, what if I stored the arrays such that the red, green, and blue would be all together for each depth? how much faster would this make things (i.e. there are 63 dwords * 3 for each depth)

the funny thing is though, making my array static and using memalign didn't improve my speed, it was switching from bytes to dwords after the fact that gave me a 30% increase.. that's still not good enough though.

News:

Faster array access for small arrays

LithiumDex

hutch--

Tedd

Mark_Larson

LithiumDex

dsouza123

dsouza123

LithiumDex