
Multicore processor code.

Started by hutch--, May 16, 2008, 03:53:01 AM


hutch--

I have started this as a new topic to get an interesting subject back up and going, but with a difference. The Lab is what its description says it is: a place to try out ideas, improve code and look for better ways to do things. It is not a place for polemic or dogma, and any reversion to that type of nonsense will see the topic removed again.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

Looking at the Intel manuals, it seems that you could run one OS per core: one 'supervisor' on core0, with some number of other OSes each running on one of the other cores.
There is also hardware virtualization, which seems to tie in with this but I'm not sure... :'(
Light travels faster than sound, that's why some people seem bright until you hear them.

c0d1f1ed

I'd love to see if anyone here has written a highly optimized queue-based spin lock. There's lots of pseudo-code but nothing x86-specific.

I'd also love to know to what extent macro assemblers already support declarative programming, and how this is expected to evolve in the future when trying to program for processors with lots of cores.

I would also like to share some of my own experience. The Intel manuals suggest writing a test-and-test-and-set spin lock using the x86 pause instruction. However, I found that on a multi-core CPU it is actually faster to use a number of nop instructions. It's also better because AMD doesn't support the pause instruction (it has the same delay as a nop instruction).
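For readers who have not seen the pattern, here is a minimal sketch of a test-and-test-and-set lock in C11 atomics rather than assembly. The names ttas_lock, ttas_try_acquire and cpu_relax are made up for illustration, and the inline-asm spelling of pause assumes GCC or Clang on x86. Since pause is encoded as rep nop, swapping the body of cpu_relax for a few nop instructions is a one-line change if you want to compare the two.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-wait hint. PAUSE is encoded as REP NOP; on other targets this
   expands to nothing. (Assumes GCC/Clang inline-asm syntax.) */
#if defined(__i386__) || defined(__x86_64__)
#define cpu_relax() __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax() ((void)0)
#endif

typedef struct { atomic_bool locked; } ttas_lock;   /* hypothetical name */

static inline void ttas_init(ttas_lock *l)
{
    atomic_init(&l->locked, false);
}

/* One attempt; returns true if the lock was taken. */
static inline bool ttas_try_acquire(ttas_lock *l)
{
    /* "test": a plain load, so waiting on a held lock only reads the
       cache line instead of repeatedly writing to it */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed))
        return false;
    /* "test-and-set": atomic exchange, true only if we flipped it */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static inline void ttas_acquire(ttas_lock *l)
{
    while (!ttas_try_acquire(l))
        cpu_relax();
}

static inline void ttas_release(ttas_lock *l)
{
    /* releasing a lock that is not held is a caller bug */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Every waiter spins on the same variable, so each release invalidates that cache line in all waiting cores; that is the scaling problem queue-based locks try to avoid.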

hutch--

Spinlock design is in itself interesting stuff although I normally associate it with core OS design for tasks like thread timing and similar OS defined wait states. In ring0 you actually have enough control without OS yield to do something useful in this area.

If I have the mechanism right you have a system idle loop at the lowest possible priority and any outstanding task has higher priority so that you only waste processor time when the OS has nothing left to do. In the context of an idle loop a well designed spinlock starts to become useful in scheduling events on a time basis. I would like to see such a spinlock designed well enough to be able to handle alternate tasks instead of just wasting cycles.
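As a rough illustration of a spin loop that handles alternate tasks instead of just burning cycles, the sketch below runs a caller-supplied callback after every failed acquisition attempt. It is written in C11 atomics, not ring0 code, and every name in it (work_lock, side_task, acquire_or_work, the demo_ helpers) is invented for this example. The demo stands in for a second thread by letting the callback itself release the lock after a few calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_flag held; } work_lock;   /* hypothetical name */
typedef void (*side_task)(void *arg);             /* hypothetical callback type */

/* Spin until the lock is taken, running task(arg) after each failed
   attempt. Returns how many times the task ran. */
static inline unsigned acquire_or_work(work_lock *l, side_task task, void *arg)
{
    unsigned runs = 0;
    assert(task != NULL);
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        task(arg);      /* useful work instead of an empty spin */
        ++runs;
    }
    return runs;
}

static inline void work_lock_release(work_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Demo only: pretends to be the current owner and releases the lock
   once it has been called `remaining` times. */
struct demo_owner { work_lock *lock; int remaining; };

static void demo_task(void *arg)
{
    struct demo_owner *o = arg;
    if (--o->remaining == 0)
        work_lock_release(o->lock);
}

/* Takes the lock, then waits for it again while demo_task runs n times. */
static inline unsigned demo_run(int n)
{
    work_lock l = { ATOMIC_FLAG_INIT };
    struct demo_owner owner = { &l, n };
    assert(n > 0);
    atomic_flag_test_and_set(&l.held);              /* lock starts out held */
    return acquire_or_work(&l, demo_task, &owner);  /* runs the task n times */
}
```

The obvious cost is latency: the lock cannot be noticed as free until the current task returns, so the tasks have to be short.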

The problem I have found in Windows is primarily resolution. Even with a multimedia timer you will not get better than 1 ms resolution when you yield processor time back to the OS. Below that you have to start using a high performance counter, and you pay the price with much higher processor usage. Code that uses methods like this is starting to become a crude spinlock, but without the ring0 access to do enough useful things fast enough.
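The trade-off can be sketched as a hybrid wait: sleep through the coarse part of the interval, then poll a high-resolution clock for the remainder. This version uses POSIX clock_gettime and nanosleep as stand-ins for the Windows timer and performance-counter calls, so it is an illustration of the idea and not Windows code. The names now_ns and hybrid_wait, and the spin_ns parameter, are invented here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime and nanosleep */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds (stand-in for a performance counter). */
static inline int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Wait for wait_ns nanoseconds. All but the last spin_ns are spent in a
   yielding sleep; the rest is a busy poll of the clock, which is where
   the extra processor usage comes from. Returns the time actually waited. */
static inline int64_t hybrid_wait(int64_t wait_ns, int64_t spin_ns)
{
    assert(wait_ns >= 0 && spin_ns >= 0);
    const int64_t start = now_ns();
    const int64_t deadline = start + wait_ns;

    if (wait_ns > spin_ns) {
        const int64_t coarse = wait_ns - spin_ns;
        struct timespec req = {
            (time_t)(coarse / 1000000000LL),
            (long)(coarse % 1000000000LL)
        };
        /* may return early on a signal; the poll below covers the rest */
        nanosleep(&req, NULL);
    }
    while (now_ns() < deadline) {
        /* busy poll: fine resolution, full use of one core */
    }
    return now_ns() - start;
}
```

A call such as hybrid_wait(2000000, 500000) sleeps for roughly 1.5 ms and spins for the remainder. If the sleep overshoots the deadline, the poll loop exits immediately and the wait is simply late, which is the resolution problem in miniature.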

c0d1f1ed

I'm not talking about spin locks or idle loops in ring0. With high performance multi-core programming you frequently need access to non-trivial shared data structures and critical sections. If not contended, the lock should be acquired in just a few clock cycles, while in the contended case the most important characteristic is scalability. Queue-based spin locks have some excellent properties, but there's very little public material, and even less for x86.
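To make the idea concrete, here is a sketch of one well-known queue-based design, the MCS lock, in C11 atomics. It is offered as an example of the family and is not tuned x86 code; the names mcs_lock, mcs_node, mcs_acquire and mcs_release are chosen for this example. Each thread brings its own queue node and spins only on a flag inside that node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool waiting;               /* true while this thread must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

static inline void mcs_init(mcs_lock *l)
{
    atomic_init(&l->tail, NULL);
}

/* `me` is the caller's own node; it must stay valid until mcs_release. */
static inline void mcs_acquire(mcs_lock *l, mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->waiting, true, memory_order_relaxed);

    /* a single atomic exchange appends us to the queue */
    mcs_node *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                        /* queue was empty: uncontended path */

    /* link behind the predecessor, then spin on our own flag only */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->waiting, memory_order_acquire))
        ;
}

static inline void mcs_release(mcs_lock *l, mcs_node *me)
{
    /* while the lock is held the queue is never empty */
    assert(atomic_load_explicit(&l->tail, memory_order_relaxed) != NULL);

    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* nobody visible behind us: try to swing the tail back to empty */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor has swapped the tail but not linked itself yet */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    /* hand the lock directly to the next thread in line */
    atomic_store_explicit(&succ->waiting, false, memory_order_release);
}
```

In the uncontended case, acquire costs one atomic exchange and release one compare-and-swap. Under contention, each waiter spins on its own node, so a release touches only the successor's cache line, and the lock is handed over in FIFO order.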

hutch--

Bus lock exclusions to data will only be done well at the kernel level with an OS scheduler in control. What may be worth looking at is the source for an x86 version of Linux as I remember some years ago having a reasonable look at the idle loop / spinlock involved and while the code looked like it could have been gutted and manually optimised to get some more grunt out of it, the concept did make sense at the time.

With the current privilege level system on x86 hardware, this type of stuff must be done in ring zero or it will be just too slow to be useful.

Queue-based scheduling may be orthodox, but a circular buffer built this way tends to waste processor time. I am inclined to think that a smarter method needs to be developed, something like a circular buffer with cross-buffer shortcuts to reduce the time the processor(s) waste waiting. Now this is where some dedicated and very fast hardware could come in handy rather than putting a task of this type into an orthodox core design.

c0d1f1ed

Quote from: hutch-- on May 16, 2008, 02:31:11 PM
Bus lock exclusions to data will only be done well at the kernel level with an OS scheduler in control.

It's perfectly possible to have fast locks at the application level. Besides, it's not like application programmers have a choice.

Quote
With the current privilege level system on x86 hardware, this type of stuff must be done in ring zero or it will be just too slow to be useful.

Please do read that paper I linked to. Queue-based locks are extremely fast in the non-contended case and still have very good scaling behavior in the contended case. No O.S. level synchronization even comes close.

bozo

If hutch is OK with it, I don't mind posting some code of my own which shows a multi-threaded password cracker at work, but it might be too close to hackerish stuff.

hutch--

Kernel,

Thanks for the offer, but it might be a bit too close to the stuff we cannot allow, which is a shame, as a password cracker can be useful to people who have lost a password.

daydreamer

I have some code whose idea is kind of a spin loop, but it is more a master thread / slave thread approach: the master thread's outer-loop variable controls how a slave thread loops. My idea is to deliberately put a few more clock cycles of work in the master thread so that it runs a bit slower. That way the slave threads can run ahead and already be waiting on the spin-loop variable before the master thread increments it.
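One possible reading of that scheme, sketched in C11 atomics with POSIX threads instead of assembly: the master publishes its outer-loop counter, and the worker spins until the counter reaches the step it wants to process. All names here (master_step, worker, run_master_worker, STEPS) are invented for the sketch, and the worker's "work" is just a running sum so that the result can be checked.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { STEPS = 1000 };           /* arbitrary loop count for the demo */

static atomic_int master_step;   /* outer-loop variable published by the master */
static long worker_sum;          /* written by the worker, read after the join */

static void *worker(void *arg)
{
    (void)arg;
    for (int step = 1; step <= STEPS; ++step) {
        /* spin until the master has incremented its counter to this step */
        while (atomic_load_explicit(&master_step, memory_order_acquire) < step)
            ;
        worker_sum += step;      /* the worker's share of the work */
    }
    return NULL;
}

/* Runs one master/worker pass and returns the worker's result. */
static inline long run_master_worker(void)
{
    pthread_t t;
    atomic_store(&master_step, 0);
    worker_sum = 0;

    int rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;

    for (int step = 1; step <= STEPS; ++step) {
        /* the master's own work for this step would go here; making it
           slightly longer lets the worker reach its spin first */
        atomic_store_explicit(&master_step, step, memory_order_release);
    }
    pthread_join(t, NULL);       /* after this, worker_sum is safe to read */
    return worker_sum;
}
```

The result is the same whichever thread runs ahead. What changes is who waits: if the master is the slower side, the worker is already spinning when the counter is incremented and picks up the new step with minimal delay.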