Accessing C __declspec(thread) from x64 MASM (ml64.exe)

Started by adisak, April 26, 2012, 04:47:30 PM



The following C function attempts to prevent recursion in multicore code in a thread-safe manner using a thread local storage variable. However, for reasons that are somewhat complicated, I NEED to write this function in X64 assembler (Intel X86 / AMD 64-bit) and assemble it with ml64.exe from VC2010. I know how to do this if I'm using global variables but I'm not sure how to do it properly with a TLS variable that has __declspec(thread).

__declspec(thread) int tls_VAR = 0;
void norecurse(  )

Note: This is what VC2010 kicks out for the function if I request a listing file:

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01



PUBLIC  norecurse
EXTRN   _tls_index:DWORD
pdata   SEGMENT
$pdata$norecurse DD imagerel $LN4
    DD  imagerel $LN4+70
    DD  imagerel $unwind$norecurse
pdata   ENDS
xdata   SEGMENT
$unwind$norecurse DD 040a01H
    DD  06340aH
    DD  07006320aH
; Function compile flags: /Ogtpy
xdata   ENDS
norecurse PROC
; File p:\hackytests\64bittest2010\64bittest\64bittest.cpp
; Line 19
    mov QWORD PTR [rsp+8], rbx
    push    rdi
    sub rsp, 32                 ; 00000020H
; Line 20
    mov ecx, DWORD PTR _tls_index
    mov rax, QWORD PTR gs:88
    mov edi, OFFSET FLAT:tls_VAR
    mov rbx, QWORD PTR [rax+rcx*8]
    cmp DWORD PTR [rbx+rdi], 0
    jne SHORT $LN1@norecurse
; Line 22
    mov DWORD PTR [rbx+rdi], 1
; Line 23
    call    DoWork
; Line 24
    mov DWORD PTR [rbx+rdi], 0
$LN1@norecurse:
; Line 26
    mov rbx, QWORD PTR [rsp+48]
    add rsp, 32                 ; 00000020H
    pop rdi
    ret 0
norecurse ENDP

I was able to hack around the issue. My assembler implementation is less efficient than the compiler-generated code, though, because I could not figure out how to use the following two addressing modes:

     mov rax, QWORD PTR gs:88        ; (1)
     mov edi, OFFSET FLAT:tls_VAR    ; (2)

For (1), I had to load 88 into rax and use gs:[rax] to access the TLS-base for the thread.

For (2), the lack of OFFSET FLAT in MASM (ml64.exe) meant that I had to be more clever. I subtracted the address of _tls_start from the thread's TLS base to get an adjusted base. Adding the address of any TLS variable to that adjusted base then gives the address of that variable's thread-local copy.

So this is my hack implementation that I would like to improve / do correctly.

PUBLIC  norecurse
EXTRN   _tls_index:DWORD
EXTRN   _tls_start:DWORD
EXTRN   tls_VAR:DWORD
EXTRN   DoWork:PROC


norecurse           PROC
    ; non-volatile
    push            rbx
    sub             rsp,32

    ; The gs segment register refers to the base address of the TEB on x64.
    ; 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64
    mov             rax,88
    mov             edx, DWORD PTR _tls_index
    mov             rax, gs:[rax]
    mov             r11, QWORD PTR [rax+rdx*8]
    lea             r10, _tls_start
    ; r11 will be the thread's TLS base minus the address of _tls_start
    sub             r11, r10

    ; rbx will be the thread-local address of tls_VAR
    lea             rdx, tls_VAR
    lea             rbx,[r11+rdx]

    cmp             DWORD PTR [rbx], 0
    jne             @F

    mov             DWORD PTR [rbx], 1

    call            DoWork

    mov             DWORD PTR [rbx], 0

@@:
    add             rsp,32
    pop             rbx
    ret

norecurse       ENDP



I'd love to see a more efficient method, or pointers on how to actually use the two addressing modes I couldn't figure out with MASM (ml64.exe).


If you want to do it correctly, use the corresponding API functions:
TlsAlloc(), TlsSet/GetValue() and TlsFree().

Also, you can use japheth's SDK translation: WinInc.
Using JWasm would also be an improvement.
Furthermore, there is no need for full segment declarations:

option casemap:none

_WIN64 EQU 1
include \WinInc\Include\
includelib \xyz\lib64\kernel32.lib
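.code
main proc
    and rsp,-16
    add rsp,-4*8                ; shadow space for the calls below
    call TlsAlloc               ; returns the new TLS index as a DWORD in eax
    mov rbx,rax
    cmp eax,TLS_OUT_OF_INDEXES  ; compare 32 bits: the return value is a DWORD
    je @err
    mov rdx,what ever           ; placeholder: the value to store in the slot
    mov ecx,ebx                 ; first argument (the TLS index) goes in rcx
    call TlsSetValue

@err:
    xor rcx,rcx
    call ExitProcess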

main endp
end main