Hi everyone! :bg It's been a while, and I'll be posting much more soon, since progress on Inventor IDE (http://www.codecortex.com/ide/) has been going well. However, I've got a question mostly (I think) to do with respect to memory synchronization between CPU caches. Sorry that it's pretty specific stuff this time.
I've been implementing a 64-bit OS, (and finally making progress on it thanks to my IDE; the next release will be able to assemble/link 16-bit, 32-bit, and 64-bit all together), and I've got most of an implementation of a binary lock that keeps track of the owning thread. I think I've found a way for an app to stay in privilege level 3 when calling either:
- GetLock if no other thread currently has access to the lock and no thread is waiting on the lock, or
- ReleaseLock if no thread is waiting on the lock
:U
The main idea is using a bus-locked cmpxchg16b, but details get a bit more complicated to handle the cases that require switching to privilege level 0.
The big question is: does using "lock cmpxchg16b [rsi]" correctly update/invalidate any read-cache copies of the 16-byte operand on other CPU cores? Conversely, will "lock cmpxchg16b [rsi]" correctly force any other CPU cores to write out their write caches of those 16 bytes before the instruction executes? The code may or may not require the latter, but it'd be useful information to know.
I'm also curious as to whether I've used sfence correctly to make sure updates have finished before letting others update the lock, and how pause should be used in the WaitForAccess spinloops (see below), since I can't find any examples with reputable sources of info.Here's the code (not for use outside of this forum thread, please, as specified in the forum rules) for the lock. You can assume that pCurrentThread is calculated somewhere (I'm still undecided on the best way to track it), that SleepForLock will put the thread pointed to by rbx to sleep indefinitely, and that WakeUpThreadInRdi will cause the thread pointed to by rdi to be scheduled to run when time is available. AllocateSystemMemory and FreeSystemMemory allocate or free memory on the system heap. In the final version, there will be copies of GetLock and ReleaseLock that will make a system call to get to GetLockExtended or ReleaseLockExtended from privilege level 3, but the idea is the same.
.nolist
;********************************************************************************************************************************
;* *
;* File: Lock.inc *
;* *
;* This file contains the structures used for basic access-locking mechanisms. *
;* *
;* Author: *
;* - Neil G. Dickson *
;* *
;********************************************************************************************************************************
;********************************************************************************************************************************
;* *
;* Structure: LockStruct *
;* *
;* This stores the owning thread and waiting threads for a lock. A LockStruct can be initialized to all zero (no owner and no *
;* waiting threads), but one may want an initial owner or initial waiting threads. *
;* *
;* Members: *
;* pOwner - The thread that currently has the access controlled by this lock, or NULL if none *
;* pList - Address of the list of threads waiting for the access controlled by this lock, or NULL if none, or -1 if the *
;* LockStruct is being modified non-trivially *
;* *
;********************************************************************************************************************************
LockStruct struct
pOwner PTR Thread ?
pList PTR LockWaitingList ?
LockStruct ends
;********************************************************************************************************************************
;* *
;* Structure: LockWaitingList *
;* *
;* This stores an array of threads waiting for the access controlled by a LockStruct. *
;* *
;* Members: *
;* NextElementOffset - Offset from the start of the LockWaitingList to where the next element would be put if the capacity *
;* is large enough *
;* Capacity - Offset from the start of the LockWaitingList that marks the end of available space in the array *
;* ThreadList - The array containing pointers to waiting threads; the minimum array length is 5, but the maximum is *
;* over 400 million (limited by Capacity being a DWORD) *
;* *
;********************************************************************************************************************************
LockWaitingList struct
NextElementOffset DWORD ?
Capacity DWORD ?
ThreadList PTR Thread 5 dup (?)
LockWaitingList ends
.list
;********************************************************************************************************************************
;* *
;* File: Lock.asm *
;* *
;* This file provides basic access-locking mechanisms for threads. *
;* *
;* Author: *
;* - Neil G. Dickson *
;* *
;********************************************************************************************************************************
.code
;********************************************************************************************************************************
;* *
;* Function: GetLock *
;* *
;* Parameters: *
;* pLock - *
;* *
;********************************************************************************************************************************
GetLock proc pLock:PTR LockStruct
xor eax,eax
xor edx,edx
xor ecx,ecx
mov rbx,pCurrentThread
mov rsi,pLock
ASSUME RSI:PTR LockStruct
lock cmpxchg16b [rsi]
jz HaveLock ;If LockStruct was completely zero before, this thread now owns the lock.
cmp rbx,rax ;Check if this thread already owns this lock
je HaveLock ;
call GetLockExtended ;If not, we probably need to wait, so do things the long way
HaveLock:
ret
GetLock endp
;********************************************************************************************************************************
;* *
;* Function: GetLockExtended *
;* *
;********************************************************************************************************************************
GetLockExtended proc
ASSUME RSI:PTR LockStruct
or rax,-1
WaitForAccess:
lock xchg [rsi].pList,rax
cmp rax,-1
je WaitForAccess
cmp [rsi].pOwner,0
je HaveItNow
test rax,rax
jnz AlreadyHaveList
;NOTE: AllocateSystemMemory here won't cause a deadlock because the system heap doesn't use a full lock, it just uses a spinlock.
invoke AllocateSystemMemory,sizeof LockWaitingList
ASSUME RAX:PTR LockWaitingList
mov [rax].NextElementOffset,offset LockWaitingList.ThreadList ;length = 1; save (length+1)*8=10h
mov [rax].Capacity,sizeof LockWaitingList ;capacity = 5; save (cap+1)*8=30h
mov [rax].ThreadList[0],rbx ;NOTE: Assuming rbx unchanged
sfence
mov [rsi].pList,rax ;NOTE: Assuming rsi unchanged
SleepForLock
ret
AlreadyHaveList:
mov ecx,[rax].NextElementOffset
cmp ecx,[rax].Capacity
je NeedNewList
mov qword ptr [rax+rcx],rbx
add [rax].NextElementOffset,8
sfence
mov [rsi].pList,rax
SleepForLock
ret
NeedNewList:
lea rbp,[rcx*2]
mov rdi,rax
ASSUME RDI:PTR LockWaitingList
invoke AllocateSystemMemory,rbp ;(see note about AllocateSystemMemory above)
mov ecx,[rdi].NextElementOffset
mov [rax].NextElementOffset,ecx
mov [rax].Capacity,ebp
mov rdx,8
CopyNextElement:
mov r8,qword ptr [rdi+rdx]
mov qword ptr [rax+rdx],r8
add rdx,8
cmp rdx,rcx
jne CopyNextElement
mov qword ptr [rax+rdx],rbx
xchg rax,rdi
invoke FreeSystemMemory,rax ;(see note about AllocateSystemMemory above)
sfence
mov [rsi].pList,rdi
SleepForLock
ret
HaveItNow:
mov [rsi].pOwner,rbx
sfence
mov [rsi].pList,rax
ret
GetLockExtended endp
;********************************************************************************************************************************
;* *
;* Function: ReleaseLock *
;* *
;* Parameters: *
;* pLock - *
;* *
;********************************************************************************************************************************
ReleaseLock proc pLock:PTR LockStruct
mov rax,pCurrentThread
xor edx,edx
xor ebx,ebx
xor ecx,ecx
mov rsi,pLock
ASSUME RSI:PTR LockStruct
lock cmpxchg16b [rsi]
jz LockReleased
call ReleaseLockExtended
LockReleased:
ret
ReleaseLock endp
;********************************************************************************************************************************
;* *
;* Function: ReleaseLockExtended *
;* *
;********************************************************************************************************************************
ReleaseLockExtended proc
ASSUME RSI:PTR LockStruct
or rax,-1
WaitForAccess:
lock xchg [rsi].pList,rax
cmp rax,-1
je WaitForAccess
test rax,rax
jz NobodyToWake
ASSUME RAX:PTR LockWaitingList
mov rdi,[rax].ThreadList[0]
cmp [rax].NextElementOffset,10h
jne NotEmptyList
;NOTE: FreeSystemMemory here won't cause a deadlock because the system heap doesn't use a full lock, it just uses a spinlock.
invoke FreeSystemMemory,rax
xor eax,eax
mov [rsi].pOwner,rdi
sfence
mov [rsi].pList,rax
WakeUpThreadInRdi
ret
NotEmptyList:
mov ecx,[rax].NextElementOffset
sub rcx,offset LockWaitingList.ThreadList + 8
shr rcx,3
lea rdi,[rax].ThreadList[0]
lea rsi,[rax].ThreadList[8]
rep movsq
sub [rax].NextElementOffset,8
mov [rsi].pOwner,rdi
sfence
mov [rsi].pList,rax
WakeUpThreadInRdi
ret
NobodyToWake:
mov [rsi].pOwner,rax
sfence
mov [rsi].pList,rax
ret
ReleaseLockExtended endp
Inventor IDE?
like RadASM?
also masm6148444 open source?
:eek
###
oh my fo! it's a windows java virtual machine client program!
About the caches, look at "Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1"
Quote from: 7.1.4 Effects of a LOCK Operation on Internal Processor CachesBecause frequently used memory locations are often
cached in a processor's L1 or L2 caches, atomic operations can often be carried out
inside a processor's caches without asserting the bus lock. Here the processor's
cache coherency protocols insure that other processors that are caching the same
memory locations are managed properly while atomic operations are performed on
cached memory locations.
Quote from: sinsi on May 23, 2009, 08:32:02 AM
About the caches, look at "Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1"
Quote from: 7.1.4 Effects of a LOCK Operation on Internal Processor CachesBecause frequently used memory locations are often
cached in a processor's L1 or L2 caches, atomic operations can often be carried out
inside a processor's caches without asserting the bus lock. Here the processor's
cache coherency protocols insure that other processors that are caching the same
memory locations are managed properly while atomic operations are performed on
cached memory locations.
Thanks! How did I miss that? I could've sworn I've read that chapter. :wink
Quote from: UtillMasm on May 23, 2009, 08:20:15 AM
Inventor IDE?
like RadASM?
also masm6148444 open source?
:eek
###
oh my fo! it's a windows java virtual machine client program!
Hehe, it's actually not much like RadASM if you know about the best features, but I'll take that as a compliment, since I've used RadASM for years. It has assembling and linking built in (the code for it is across several files in here (http://code.google.com/p/pwnide/source/browse/#svn/trunk/Base/lang) and here (http://code.google.com/p/pwnide/source/browse/#svn/trunk/Base/lang/asm), but the bulk is in ASM.xml, in the download, not the repository), and it has error detection (though not warnings yet), and as you rename things, all of their references get updated automatically. However, it is still an alpha, so it's missing major features like search. I'll be posting a thread about it shortly (when I get Alpha 5a out, since it'll be
much more stable than Alpha 5).
To be clear, there's no intended association with MASM. The code is mostly compatible with MASM, so I had named some things like that earlier-on, but I've tried to remove references like that, since I want to maintain half-decent relations with Microsoft. It should work on linux/Mac too, but it doesn't yet create linux/Mac executables, so it'll probably crash if you hit Run. Also, it won't be in Java forever, since the memory it takes up is ghastly; a friend of mine is starting the long task of porting to C/C++.