why is multicore such a load of junk?

Started by johnsa, October 02, 2008, 10:20:42 AM


bozo

IMHO, it all depends on your algorithm.
If you can split the work between 2 cores, you'll see twice the speed; split it between 50 computers and it'll be roughly 50 times faster (or more, depending on specs).

The problem most people face is how to divide the work between those CPUs.

The colleges are only starting to teach this..
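The "twice the speed on 2 cores" figure is the ideal case; the textbook way to hedge it is Amdahl's law, which caps the speed-up by the serial fraction of the work. A minimal sketch (Python here purely for illustration; the fractions are made-up examples):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speed-up when `parallel_fraction` of the run time
    can be split across `cores` cores and the rest stays serial
    (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Fully parallel work really does double on 2 cores...
print(amdahl_speedup(1.0, 2))            # 2.0
# ...but with 10% serial work, even a million cores top out near 10x.
print(amdahl_speedup(0.9, 1_000_000))
```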

johnsa

I'm still not convinced.. I understand that for purely CPU-bound code, more cores (assuming you can parallelize the algorithm) will give more performance.

But take ray-tracing as the example...

1. Calculation of camera rays to project into your scene - these could be calculated faster with more cores, as the data necessary is small and remains relatively constant.
2. Rotations/transformations of scene data into world/camera space - this is 50/50 to my mind; doing a matrix multiply or vector calc is 50% CPU and 50% memory, perhaps even more memory. From testing I can state that having 2 cores perform matrix multiplies is only about 20% faster than 1.
3. Testing for intersections with objects/polys in the scene - what's the most intensive part of this? A simple intersection algo (assuming line/plane), or the fact that the scene data could be several hundred meg? To me there is more load on memory for this operation, even with BSP trees etc., than there is on the CPU. Cache will be an issue here too.
4. Performing the materials/texture/properties lookup for that intersection point - almost entirely memory driven.
etc.. etc..

So certain parts will benefit, but overall the net result will be about 20% for the second core, and less again for the third, and so on...

Mark_Larson

  you are over-thinking it.

  raytracing scales linearly and thus it makes good sense to use multi-core.

  not ALL algorithms scale linearly. 

  In raytracing, if you render HALF the screen area, the FPS doubles. 

  so let's assume the FULL frame size is 640x480 and HALF is 320x480. Clear so far?



  So say at HALF resolution we are getting 60 FPS, and at FULL resolution 30 FPS.  What would happen if you rendered the full frame, but did one half on one core and the other half on the other core?  You'd have 60 FPS again, but at 640x480.

make better sense?
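That arithmetic, under the stated assumptions (cost proportional to pixel count, perfect scaling across cores), can be sketched like this (illustrative only):

```python
def fps(pixels, pixel_rate, cores=1):
    """Frames per second if one core pushes `pixel_rate` pixels/sec
    and the work scales perfectly across `cores` cores."""
    return pixel_rate * cores / pixels

rate = 30 * 640 * 480              # pixel rate implied by 30 FPS at 640x480
print(fps(640 * 480, rate))        # 30.0  (full frame, one core)
print(fps(320 * 480, rate))        # 60.0  (half frame, one core)
print(fps(640 * 480, rate, 2))     # 60.0  (full frame, two cores)
```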
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

johnsa


Hehe.. I get what you're saying, and in a perfect world I would agree 100%.

If we had 4 completely independent computers/cores, each with its own memory and a full copy of the scene data in that memory, then yes.. we could subdivide the screen into quarters and get 4x the framerate, and so on. This method works well for rendering non-interactive stuff, like the work done by Pixar etc. They can farm out the rendering either sub-frame or have different machines rendering frames concurrently.

However, in the average PC this isn't going to work because of the previously stated reasons.

I'll bet 20 dollars to anyone who can write an RT ray-tracer and prove beyond a reasonable doubt that dividing the area to be rendered between 1/2/4 cores shows the sort of performance increase that would make it worth the extra effort. IE:
1 core (640x480) at 10fps
2 cores (640x480 halves) at 20fps
4 cores (640x480 quarters) at 40fps

My feeling is that the results would look more like:
1 core (640x480) at 10fps
2 cores (640x480 halves) at 15fps
4 cores (640x480 quarters) at 17fps

The bottom line for me (and this is how I think about each algo):
If I have a machine with a memory bandwidth of 1 gig/sec, can this loop iterate/process through 1 gig/sec of data on a single core? If it can, then adding cores will yield nothing.
If the algorithm can only process 800 MB/sec, then I'll see about a 20% increase from the 2nd core (assuming overheads are taken into account), and so on..

In my experience VERY few algorithms end up in a situation where memory usage is low (say 200 MB/sec or less) while the CPU is taking strain. Don't get me wrong, there are such cases, and for those multiple cores are brilliant. But in general it's not all that helpful.
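This rule of thumb amounts to a simple roofline-style model: total throughput is capped by memory bandwidth, however many cores you add. A hypothetical sketch (all MB/s figures below are made up for illustration):

```python
def throughput_mb_s(per_core_mb_s, cores, mem_bw_mb_s):
    """Attainable throughput when each core can process
    `per_core_mb_s` MB/s of data but all cores share a single
    `mem_bw_mb_s` memory pipe."""
    return min(per_core_mb_s * cores, mem_bw_mb_s)

# One core already near the 1000 MB/s pipe: a second core adds ~25%.
print(throughput_mb_s(800, 1, 1000))   # 800
print(throughput_mb_s(800, 2, 1000))   # 1000
# A compute-bound loop (200 MB/s per core) keeps scaling much longer.
print(throughput_mb_s(200, 4, 1000))   # 800
```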

As an example.. one could try something like the following:
Take a grid of 1000x1000 points between (-1,-1,-1) and (1,1,-1), calculate the vector from (0,0,0) to each point, normalize it, and maybe additionally calculate the dot product with the vector (0,0,-1).
I'd be surprised if that got any faster with more cores.. and that's a pretty real-world example.
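The per-point work in that test can be written down directly; a plain-Python sketch (the loop over the grid is trivially parallel, but every point touches fresh memory, which is the point being argued):

```python
import math

def ray_dir_and_dot(x, y):
    """For grid point (x, y, -1), normalize the vector from the
    origin to it, then dot the result with (0, 0, -1)."""
    length = math.sqrt(x * x + y * y + 1.0)
    nx, ny, nz = x / length, y / length, -1.0 / length
    dot = -nz                      # dot((nx,ny,nz), (0,0,-1)) = -nz
    return (nx, ny, nz), dot

# The centre ray points straight down -Z and dots to exactly 1.0.
print(ray_dir_and_dot(0.0, 0.0))   # ((0.0, 0.0, -1.0), 1.0)
```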

Note: One of the little gems for improving cache coherency when ray tracing (and this applies to many algos) is to render in small blocks, to ensure that the rays which are cast (mostly) end up hitting geometry and data with good locality to each other, which should then be cached. Just that one simple trick can yield substantial improvements, which would indicate that memory and caching are of far more importance to overall RT performance.
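That trick is just a change of pixel traversal order: walk the image in small BxB tiles instead of full scanlines, so consecutive rays stay close together in the scene. A sketch (the tile size is an arbitrary choice here):

```python
def tiled_order(width, height, tile):
    """Yield (x, y) pixel coordinates tile by tile instead of
    scanline by scanline, keeping nearby rays nearby in time."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield x, y

pixels = list(tiled_order(8, 4, 2))
print(pixels[:4])   # [(0, 0), (1, 0), (0, 1), (1, 1)] -- first 2x2 tile
```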

Mark_Larson

http://ompf.org/forum/
they have forums

they have plenty of raytracers that go 2x on dual core.

You are way off base with your assumptions about raytracers again.  Please read up more about them on that website before you post any more way-off data on raytracers.

Mark_Larson

this should make it easier for you to understand

pdf
http://www.devmaster.net/articles/raytracing_series/State-of-the-Art%20in%20interactive%20ray%20tracing.pdf

it's written by several guys with doctorates who do real-time raytracing research.

there is a page on the benefit of multi-core.

it is figure 14.  Go and look at the figure since it has a graph that helps explain.

up to 7 processors they are getting about a 7x speed up.

here is the quote beneath the figure.

Quote
Figure 14: Our implementation shows almost perfect scalability from 1 to 7 dual CPU PCs if the caches are already filled. With empty caches we see some network contention effects with 4 clients but scalability is still very good. Beyond 6 or 7 clients we start saturating the network link to the model server.

BogdanOntanu

Quote
With empty caches we see some network contention effects
with 4 clients but scalability is still very good. Beyond 6
or 7 clients we start saturating the network link to the model
server.

Apparently they are using a client-server network model. This suggests multiple computers, each with its own CPU, cache, memory, hard disk and network link.

Yes, in this case the only problem is the network, and IF you divide the job into a small part for each CPU and copy the whole scene model onto each computer, THEN you will see an almost linear speed-up with each CPU/computer added, until the time needed to copy the scene over and send the rendering results back to the central server grows too big.

I have seen this done with 4-5 computers even back in 199x, when rendering 3D films in DOS with 3D Studio MAX, and it worked OK. 3D Studio MAX for DOS used to have a queue manager that would distribute the scene and frames to each computer via the network and collect the rendering results. It works, but you need separate computers.

I kind of agree with what johnsa presents here; it is logical, and I remember a university lecture on multiple CPUs (before the multicore hype) that demonstrated exactly this: after a few CPUs are added, the benefits become minimal and are lost in inter-CPU communication and waiting. The benefits remain high only for a limited set of algorithms, and those come with costly preparation phases.

In a multi-core system too many things are "shared": the cache, the RAM, some busses, the disk drive, sometimes the network, and all of those become new bottlenecks. It does not matter that the second CPU could in principle perform some useful work IF, in real-life situations, it has to WAIT for a shared resource to be freed by the "other" CPU. Of course, the greatest penalty comes from the shared cache, RAM and busses rather than from the HDD or network, but it is still better to have those devices separated.

I have written a minimal realtime raytracer in ASM... I will convert it to multicore and see the results... if any ;)

-----

As a side note, I would never ever trust a guy with a doctorate.

I proved them wrong again and again in university, once I started doing my own research and experiments; when I confronted them with my results they started to move from corner to corner, showing that they had no real knowledge but only pretended to, copy-pasting concepts and research from other "non-doctorands".

In the end they resorted to threats as the "last logical argument", and as a final-year student I had to "bend" IF I wanted to get my diploma... Eventually I realized that I was "bending" the truth just in order to get a better life and a doctorate diploma, and so I quit university in my final year for this reason... a diploma is valuable in our society, but in the face of the truth it is an abomination.

Later, in my work experience, I have met a lot of western "doctors" of science, technology and IT with impressive diplomas and "research", but when I started to talk and discuss with them I realized they had no clue whatsoever, despite their social status and references... of course I had to be polite, but still. 

Real education has never truly started on this planet.

I am sure that at my physical death I will regret the 5 years I lost in university study...
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

johnsa

Bogdan, I concur 100%.

The articles I've read on the topic of RT ray-tracing, like the one above, gain the needed performance through multiple fully independent computers, each with its own RAM, cache, CPUs etc.

I did in fact mention that in my previous post as a perfectly valid way to get almost linear scaling of performance with cores (assuming each core comes with its own cache and memory) :)

Once again, I would challenge anyone to write some code (either a full RT ray-tracer, or perhaps just a block of sample code doing some scene or vector work as I mentioned before) and get it to run at 400% (even 300%) of original performance when comparing 1 to 4 cores; obviously in the same machine, with no cheating, and not using one of the limited number of algorithms which DO genuinely benefit (scale linearly) with core count... It has to be a real-world example.

I personally think that multi-core is largely marketing hype to cover up the fact that chip makers have hit stumbling blocks in terms of single-core performance gains.
In the long run it's still good that we have gone through the process of multi-core: it teaches one a lot, and given the right peripheral architecture in the machine to support it (think Cell processors) it could be viable for task separation within one application, or at the very least make general multi-tasking a bit more slick.

On a side note.. I suspect that if they could increase the size of the independent cache per core, and the OS ensured that thread affinities were preserved and that cache wasn't thrashed, then you'd see more and more benefit from the cores. I wouldn't say multi-core is a bad idea; I just think that in its current implementation within consumer PCs it's not very effective.


johnsa




BNT MACRO ; emit 2Eh prefix: branch-not-taken hint
db 2eh
ENDM

BTK MACRO ; emit 3Eh prefix: branch-taken hint
db 3eh
ENDM

include \masm32\include\masm32rt.inc
include c:\dataengine\timers.asm

.686p
.mmx
.k3d
.xmm
option casemap:none

Vector3D_Normalize     PROTO ptrVR:DWORD, ptrV1:DWORD
Vector3D_Normalize_FPU PROTO ptrVR:DWORD, ptrV1:DWORD

test_thread1 PROTO :DWORD
test_thread2 PROTO :DWORD

Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

;###############################################################################################################
; DATA SECTION
;###############################################################################################################
.const

.data?

align 4
thread1  dd ?
thread2  dd ?
var      dd ?
var2 dd ?

.data

VECTOR_COUNT equ (20000000) ; Number of vectors in the list.

align 16
TestVector REAL4 2.4,4.3,1.9,1.0 ; A test vector to fill the structure with.
VectorListPtr dd 0 ; A pointer to a list of VECTOR_COUNT vectors (X,Y,Z,W) AOS format.

objcnt2  dd 0,0 ; Event Handles.

;###############################################################################################################
; CODE SECTION
;###############################################################################################################
.code

start:

; Allocate the vector list memory.
invoke GetProcessHeap
invoke HeapAlloc,eax,HEAP_ZERO_MEMORY,(16*VECTOR_COUNT)
mov VectorListPtr,eax

; Fill the VectorList with some data.
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
movaps xmm0,TestVector
@@:
movaps [edi],xmm0
add edi,16
dec ecx
jnz short @B

; Time how long it takes to normalize this list with 1 core.
timer_begin 1, HIGH_PRIORITY_CLASS

mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
align 16
@@:
invoke Vector3D_Normalize,edi,edi
add edi,16
dec ecx
BTK
jnz short @B

timer_end
print ustr$(eax)
    print chr$(" ms",13,10)

; Fill the VectorList with some data again.
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
movaps xmm0,TestVector
@@:
movaps [edi],xmm0
add edi,16
dec ecx
jnz short @B

; Time how long it takes to normalize this list with 2 cores.
    mov esi,offset objcnt2
    invoke CreateEvent,0,FALSE,FALSE,0
    mov [esi],eax
    invoke CreateEvent,0,FALSE,FALSE,0
    mov [esi+4],eax
    mov thread1,rv(CreateThread,NULL,NULL,ADDR test_thread1,[esi],CREATE_SUSPENDED,ADDR var)
    mov thread2,rv(CreateThread,NULL,NULL,ADDR test_thread2,[esi+4],CREATE_SUSPENDED,ADDR var2)

timer_begin 1, HIGH_PRIORITY_CLASS

invoke ResumeThread,thread1
invoke ResumeThread,thread2

    invoke WaitForMultipleObjects,2,OFFSET objcnt2,TRUE,INFINITE

timer_end
print ustr$(eax)
    print chr$(" ms",13,10)
   
mov esi,offset objcnt2
    invoke CloseHandle,[esi]
    invoke CloseHandle,[esi+4]
    invoke CloseHandle,thread1
    invoke CloseHandle,thread2
   
; Free Vector List Memory.
invoke GetProcessHeap
invoke HeapFree,eax,0,VectorListPtr ; no HEAP_NO_SERIALIZE: the block was allocated from the (serialized) process heap
   
invoke ExitProcess,0

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; Normalize Vector V1 into VR (AOS Format).
; VR = 1/||V1|| * V1.
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
Vector3D_Normalize PROC ptrVR:DWORD, ptrV1:DWORD
push esi
push edi
mov esi,ptrV1
mov edi,ptrVR
movaps xmm0,[esi] ; xmm0 [ w | z | y | x ]
movaps xmm3,xmm0 ; xmm3 [ w | z | y | x ]
mulps xmm0,xmm0 ; xmm0 [ w^2 | z^2 | y^2 | x^2 ]
pshufd xmm1,xmm0,00000001b ; xmm1 [ x^2 | x^2 | x^2 | y^2 ]
pshufd xmm2,xmm0,00000010b          ; xmm2 [ x^2 | x^2 | x^2 | z^2 ]
addss xmm0,xmm1 ; xmm0 [                 | x^2 + y^2 ]
addss xmm0,xmm2                  ; xmm0 [                 | x^2 + y^2 + z^2 ]
rsqrtss xmm1,xmm0                ; xmm1 [ 1 | 1 | 1 | 1/|v| ]
pshufd xmm1,xmm1,00000000b      ; xmm1 [ 1/|v| | 1/|v| | 1/|v| | 1/|v| ]
mulps xmm3,xmm1                  ; xmm3 [ w*1/|v| | z*1/|v| | y*1/|v| | x*1/|v| ]
movaps [edi],xmm3
pop edi
pop esi
ret
Vector3D_Normalize ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; Normalize Vector (AOS) using FPU.
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD
push esi
push edi
mov esi,ptrV1
mov edi,ptrVR
fld dword ptr (Vector3D PTR [esi]).x ;st0=x
fmul st,st(0) ;st0=x^2
fld dword ptr (Vector3D PTR [esi]).y ;st0=y | st1=x^2
fmul st,st(0) ;st0=y^2 | st1=x^2
faddp st(1),st ;st0=x^2 + y^2
fld dword ptr (Vector3D PTR [esi]).z ;st0=z | st1 = x^2+y^2
fmul st,st(0) ;st0=z^2 | st1 = x^2+y^2
faddp st(1),st ;st0=z^2+y^2+x^2
fsqrt                        ;st0=len
fld1 ;st0=1.0 | st1=len
fdivr ;st0=1.0/len
fld dword ptr (Vector3D PTR [esi]).x ;st0=x | st1=1/len
fmul st,st(1) ;st0=x*1/len | st1=1/len
fstp dword ptr (Vector3D PTR [edi]).x ;st0=1/len
fld dword ptr (Vector3D PTR [esi]).y ;st0=y | st1=1/len
fmul st,st(1) ;st0=y*1/len | st1=1/len
fstp dword ptr (Vector3D PTR [edi]).y ;st0=1/len
fmul dword ptr (Vector3D PTR [esi]).z ;st0=z*1/len
fstp dword ptr (Vector3D PTR [edi]).z ;--
pop edi
pop esi
ret
Vector3D_Normalize_FPU ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; THREAD 1 (does first half of the list)
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
test_thread1 PROC nThread:DWORD
push edi
push ecx
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT/2
align 16
NormVectors1:
invoke Vector3D_Normalize,edi,edi ; or use FPU version.. same result
add edi,16
dec ecx
BTK
jnz short NormVectors1
invoke SetEvent,nThread
pop ecx
pop edi
ret
test_thread1 ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; THREAD 2 (does second half of the list)
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
test_thread2 PROC nThread:DWORD
push edi
push ecx
mov edi,VectorListPtr
add edi,(VECTOR_COUNT/2)*16
mov ecx,VECTOR_COUNT/2
align 16
NormVectors2:
invoke Vector3D_Normalize,edi,edi ; or use FPU version same result
add edi,16
dec ecx
BTK
jnz short NormVectors2
invoke SetEvent,nThread
pop ecx
pop edi
ret
test_thread2 ENDP

end start



So there is an example.. normalizing a list of 20 million vectors (X,Y,Z,W) on 1 core, and then using 2 cores via threads... in this case the threads using 2 cores are actually slower. It's a bit of a simplified example and the list of vectors is just divided into 2 halves, but if something like this gets no benefit from 2+ cores, then it follows that 90% of the code you're going to be using/writing, especially the time-critical parts, won't either.
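A back-of-envelope check of why the second thread can't win here: each pass reads and writes 20 million 16-byte vectors, and that traffic goes through the one shared memory pipe whichever core issues it. (The 1 GB/s bandwidth figure below is purely illustrative, not a measurement.)

```python
def min_pass_time_ms(vectors, bytes_each, mem_bw_bytes_s):
    """Lower bound on one normalize pass: every vector is read
    once and written once, through a single shared memory bus."""
    traffic = vectors * bytes_each * 2      # read + write
    return traffic * 1000.0 / mem_bw_bytes_s

# 20M vectors * 16 bytes * (read+write) = 640 MB of traffic,
# so ~640 ms at 1 GB/s no matter how many cores share the bus.
print(min_pass_time_ms(20_000_000, 16, 1_000_000_000))   # 640.0
```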

johnsa

Quote
Parallel Scalability: Ray tracing is known for being "trivially parallel" as long as a high enough bandwidth to the scene data is provided. Given the exponential growth of available hardware resources, ray tracing should be better able to utilize it than rasterization, which has been difficult to scale efficiently [8]. However, the initial resources required for a hardware ray tracing engine are higher than those for a rasterization engine.

Page 2 of the PDF Mark posted. Pay close attention to the part "as long as a high enough bandwidth to the scene data is provided". This implies the memory-intensive nature I was referring to.

johnsa

Quote
Contrary to general opinion a ray tracer is not bound by
CPU speed, but is usually bandwidth-limited by access to
main memory. Especially shooting rays incoherently, as done
in many global illumination algorithms, results in almost
random memory accesses and bad cache performance. On
current PC systems, bandwidth to main memory is typically
up to 8-10 times less than to primary caches. Even more importantly,
memory latency increases by similar factors as we
go down the memory hierarchy.

From Page 7 of that same PDF...
I think my case is clear.

Mirno

The top 3 places on the top 500 supercomputers disagree with you.
Roadrunner (dual-core Opterons), Jaguar (quad-core Opterons), and Pleiades (quad-core Xeons) all seem to run pretty well, and all use multicore processors.

Parallel algorithms may well be less efficient than monolithic algorithms, but if you want ridiculous speeds, the history of supercomputing shows it's the way to go.

You've got a choice: develop something faster than silicon (and cheaper than gallium arsenide), or go parallel. Economics points us to multi-core. Them's the breaks.

johnsa

Quote from: Mirno on November 21, 2008, 01:02:09 PM
The top 3 places on the top 500 supercomputers disagree with you.
Roadrunner (dual-core Opterons), Jaguar (quad-core Opterons), and Pleiades (quad-core Xeons) all seem to run pretty well, and all use multicore processors.

Parallel algorithms may well be less efficient than monolithic algorithms, but if you want ridiculous speeds, the history of supercomputing shows it's the way to go.

You've got a choice: develop something faster than silicon (and cheaper than gallium arsenide), or go parallel. Economics points us to multi-core. Them's the breaks.


100%.. but there is a HUGE difference between a shared-memory or distributed grid model and a single PC with multiple cores using limited cache and shared memory bandwidth.. that is my point. We're talking consumer-level, general-purpose desktops/laptops here (the sort of thing you're going to run games or business apps on), and this is where multi-core is not effective. Not even in mid-range servers for enterprise environments (as long as they're based on the same general PC architecture, albeit with swappable power and RAID).

If we had a PC with memory divided into, say, 4 blocks of 4 gig each, or 1 block per core where each block can be any size (64-bit computing+), plus some new instructions to manage them, so that you could do something like:
special_mov [block0_edi],[block1_edi]   .. to move data from block to block via a separate shared memory controller ... and each core had full exclusive access to its own block with a full-bandwidth pipe .. then and only then would parallel / multi-core become truly effective.

Yes, synchronization might still be necessary, and data would still need to be moved from block to block, but in the critical sections where it counts, each core could sub-divide / scatter-gather its processing internally.

The key is to have MEMORY_BANDWIDTH x N_CORES... every core must have its own full pipe to memory.

Mirno

All 3 of those machines are using "consumer grade" processors (albeit validated for use in server-level environments; there is no fundamental difference between Opterons and Athlons, or between Xeons and Core processors).
I agree that memory bandwidth does become a bottleneck, which is why Intel moved away from the FSB with the latest Nehalem processors, and AMD has used HyperTransport to avoid (memory) bandwidth contention with inter-processor communication.

Multi-socket AMD systems have supported NUMA for a long time because of this (and I guess Intel will do the same), increasing bandwidth as they go. Any processor is hobbled by a lack of memory bandwidth, and perhaps the current generation of multi-core processors is underfed; however, that isn't an artifact of multi-core design, and future generations may well be designed with wider memory interfaces because of it.

Essentially I'm saying that multi-core isn't a bad idea (as the thread title suggests), although current implementations may not be as effective as we would like. In the top500 (where the cost of high-performance memory subsystems is no obstacle), multi-core designs offer a good way of upping CPU power without a big increase in space/electrical power.

I use some fairly high-powered server-grade hardware (quad socket, dual core, 16GB of RAM), and it performs as well as required (IO over the network to the databases seems to be our bottleneck). Of course, if your problem was already IO-bound, then more CPU power will never help you, but that's true of single-core machines too.

Mirno

Mark_Larson

Quote from: johnsa on November 21, 2008, 09:13:56 AM



(vector-normalization benchmark code snipped; see johnsa's post above)

does it hit 100% on one processor with one thread? or even close? that is what raytracers do. Simply doing a normalization isn't a good enough example. And yes, ray-tracers are a SPECIAL case; not all code parallelizes as easily.