Memory Speed, QPI and Multicore

Started by johnsa, August 24, 2011, 10:32:47 AM

johnsa

OK, I tried something quite different. At the moment my source data and destination data are two totally separate large buffers:

i.e. [ESI] => 200 MB of input data
[EAX] => separate 200 MB of data

If I change the store from going to [EAX] to storing back to the original vertex, the code speeds up by roughly 1.6x. VTune is still telling me that there are massive delays, but the overall loop goes from 16ms to 10ms. Obviously this isn't ideal because I don't want to overwrite my original vertex, but what it does indicate is that perhaps I should change the data structure so the original and transformed versions are adjacent rather than in two TOTALLY separate buffers.

So overall the performance is much better (I guess due to locality of the data being read and written). However, having this executed by multiple threads still doesn't add any performance, and with all the issues VTune reports, I'm sure this code is still far from optimal.
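
A minimal C sketch of the adjacent layout described above, to make the idea concrete. Everything here is an assumption for illustration: the names Vec4, VertexSlot and transform_all are hypothetical, the vertex is taken to be four floats (16 bytes), and the transform is assumed to be a row-major 4x4 matrix multiply. It is not the actual assembly loop.

```c
#include <stddef.h>

/* Hypothetical 16-byte vertex: four packed floats. */
typedef struct { float x, y, z, w; } Vec4;

/* Original and transformed copies sit side by side, so each
   iteration reads and writes inside the same 32-byte slot. */
typedef struct {
    Vec4 original;     /* input, never overwritten */
    Vec4 transformed;  /* output, stored next to its input */
} VertexSlot;

/* Transform every vertex by a row-major 4x4 matrix m (assumed). */
void transform_all(VertexSlot *slots, size_t count, const float m[16])
{
    for (size_t i = 0; i < count; ++i) {
        const Vec4 v = slots[i].original;
        Vec4 r;
        r.x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
        r.y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
        r.z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
        r.w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
        slots[i].transformed = r;  /* lands 16 bytes after the source */
    }
}
```

With this layout the read and the write of one iteration fall within a single 32-byte slot instead of two buffers hundreds of megabytes apart, which is the locality effect guessed at above.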

hutch--

John,

With data of that size you must be getting some memory page thrashing and this may be the limiting factor or at least one of them.

johnsa

Reorganized my vertices according to the above post so I can see the result, and the improvement holds: 16ms down to 10ms. I'm sure there's a way to get more from multiple cores on this, although it is now hitting 10.4Gb/s worth of data processed, at 32 bytes per iteration (a 16-byte vertex read in and written out again).
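
For reference, the throughput arithmetic can be sketched in C as below. The function name throughput_gb_per_s is hypothetical, and it assumes the figure is meant in gigabytes per second with 1 GB = 1e9 bytes.

```c
#include <stddef.h>

/* Bytes moved per second, expressed in GB/s (1 GB = 1e9 bytes, assumed). */
double throughput_gb_per_s(size_t iterations, size_t bytes_per_iter,
                           double seconds)
{
    return (double)iterations * (double)bytes_per_iter / seconds / 1e9;
}
```

As an illustration only, 3,250,000 iterations at 32 bytes each in 10 ms works out to 10.4 GB/s. That iteration count is back-computed from the quoted rate and is not a number given in the thread.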