Chapter 8 Thread-Local Storage
The compilation environment supports the declaration of thread-local data. This data is sometimes referred to as thread-specific or thread-private data, but more typically by the acronym TLS. When variables are declared thread-local, the compiler automatically arranges for them to be allocated on a per-thread basis. The built-in support for this feature serves three purposes.
C/C++ Programming Interface
Variables are declared thread-local using the __thread keyword, as in the following examples.
__thread int i;
__thread char *p;
__thread struct state s;
During loop optimizations, the compiler can choose to create thread-local temporaries as needed.
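The per-thread behavior of such a declaration can be observed with a small program. The following is a minimal sketch, assuming a compiler that accepts the __thread keyword and a POSIX threads library. The names counter, bump, worker_total, and caller_counter are hypothetical and chosen only for illustration.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own copy, initialized to 0 when the thread starts. */
static __thread int counter;

/* Thread body: increment this thread's private counter *arg times,
 * then store the thread's final value back through arg. */
static void *bump(void *arg)
{
    int *n = arg;
    for (int i = 0; i < *n; i++)
        counter++;
    *n = counter;
    return NULL;
}

/* Run one new thread that increments its copy n times.
 * Returns that thread's final counter value, or -1 on failure. */
int worker_total(int n)
{
    pthread_t tid;
    int value = n;
    if (pthread_create(&tid, NULL, bump, &value) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return value;
}

/* Value of the calling thread's own copy, which the workers never touch. */
int caller_counter(void)
{
    return counter;
}
```

Because every new thread starts from the initial value, successive calls to worker_total() do not accumulate, and the caller's copy stays at zero.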
Thread-Local Storage Section
Separate copies of thread-local data that have been allocated at compile time must be associated with individual threads of execution. To provide this data, TLS sections are used to specify the size and initial contents. The compilation environment allocates TLS in sections that are identified with the SHF_TLS flag. These sections provide initialized TLS and uninitialized TLS based on how the storage is declared.
The uninitialized section is allocated immediately following any initialized sections, subject to padding for proper alignment. Together, the combined sections form a TLS template that is used to allocate TLS whenever a new thread is created. The initialized portion of this template is called the TLS initialization image. All relocations that are generated as a result of initialized thread-local variables are applied to this template. The relocated values are used when a new thread requires the initial values.
TLS symbols have the symbol type STT_TLS. These symbols are assigned offsets relative to the beginning of the TLS template. The actual virtual address that is associated with these symbols is irrelevant. The address refers only to the template, and not to the per-thread copy of each data item. In dynamic executables and shared objects, the st_value field of an STT_TLS symbol contains the assigned TLS offset for defined symbols. This field contains zero for undefined symbols.
Several relocations are defined to support access to TLS. See SPARC: Thread-Local Storage Relocation Types, 32-bit x86: Thread-Local Storage Relocation Types and x64: Thread-Local Storage Relocation Types. TLS relocations typically reference symbols of type STT_TLS. TLS relocations can also reference local section symbols in association with a GOT entry. In this case, the assigned TLS offset is stored in the associated GOT entry.
In dynamic executables and shared objects, a PT_TLS program entry describes a TLS template. This template has the following members.
Table 8–1 ELF PT_TLS Program Header Entry
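The PT_TLS entries of a running process can be inspected programmatically. The following is a minimal sketch, assuming a Linux system with glibc, where dl_iterate_phdr() from <link.h> walks the program headers of every loaded object. The names tls_probe, tls_summary, visit, and summarize_tls are hypothetical and are not part of the interfaces described in this chapter. In the ELF specification, the p_filesz member of a PT_TLS entry gives the size of the TLS initialization image, and p_memsz gives the total size of the TLS template.

```c
#define _GNU_SOURCE          /* glibc: expose dl_iterate_phdr() */
#include <link.h>            /* dl_iterate_phdr, struct dl_phdr_info, ElfW */
#include <elf.h>             /* PT_TLS */
#include <stddef.h>

/* An initialized thread-local variable, so that this program
 * itself carries a TLS template with a non-empty image. */
__thread int tls_probe = 42;

struct tls_summary {
    int    templates;        /* number of PT_TLS entries seen */
    size_t image_bytes;      /* sum of p_filesz: initialization images */
    size_t template_bytes;   /* sum of p_memsz: full templates */
};

/* Callback invoked once per loaded object. */
static int visit(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_summary *s = data;
    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            s->templates++;
            s->image_bytes += ph->p_filesz;
            s->template_bytes += ph->p_memsz;
        }
    }
    return 0;                /* 0 means: continue with the next object */
}

/* Collect totals over all objects currently loaded in the process. */
struct tls_summary summarize_tls(void)
{
    struct tls_summary s = { 0, 0, 0 };
    dl_iterate_phdr(visit, &s);
    return s;
}
```

The difference between template_bytes and image_bytes corresponds to the uninitialized portion of the templates, including any alignment padding.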
Runtime Allocation of Thread-Local Storage
TLS is created on three occasions during the lifetime of a program.
Thread-local data storage is laid out at runtime as illustrated in Figure 8–1. Figure 8–1 Runtime Storage Layout of Thread-Local Storage
Program Startup
At program startup, the runtime system creates TLS for the main thread.
First, the runtime linker logically combines the TLS templates for all loaded dynamic objects, including the dynamic executable, into a single static template. Each dynamic object's TLS template is assigned an offset within the combined template, tlsoffset_m, as follows.
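A minimal sketch of this offset assignment in C follows. It assumes the recurrence tlsoffset_1 = round(tlssize_1, align_1) and tlsoffset_{m+1} = round(tlsoffset_m + tlssize_{m+1}, align_{m+1}), which is consistent with the size, alignment, and round() terms used in this section but is an assumption rather than a quotation. The names round_up and assign_tls_offsets are hypothetical, and the arrays are indexed from 0 rather than 1.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align. */
static size_t round_up(size_t offset, size_t align)
{
    assert(align != 0);
    return (offset + align - 1) / align * align;
}

/* Assign an offset to each of `count` templates, in load order.
 * tlssize[i] and align[i] describe template i; tlsoffset[i] receives
 * its offset. Returns the offset of the last template (0 if count is 0). */
size_t assign_tls_offsets(const size_t *tlssize, const size_t *align,
                          size_t count, size_t *tlsoffset)
{
    size_t prev = 0;  /* starting from 0 makes the first step round(size, align) */
    for (size_t i = 0; i < count; i++) {
        prev = round_up(prev + tlssize[i], align[i]);
        tlsoffset[i] = prev;
    }
    return prev;
}
```

For example, three templates with sizes 10, 4, and 24 bytes and alignments 8, 4, and 16 are assigned offsets 16, 20, and 48 under this recurrence.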
tlssize_m and align_m are the size and alignment, respectively, of the allocation template for dynamic object m, where 1 <= m <= M and M is the total number of loaded dynamic objects. The round(offset, align) function returns an offset rounded up to the next multiple of align.
Next, the runtime linker computes the allocation size that is required for the startup TLS, tlssize_S. This size is equal to tlsoffset_M, plus an additional 512 bytes. This addition provides a backup reservation for static TLS references. Shared objects that make static TLS references, and are loaded after process initialization, are assigned to this backup reservation. However, this reservation is a fixed, limited size. In addition, this reservation is only capable of providing storage for uninitialized TLS data items. For maximum flexibility, shared objects should reference thread-local variables using a dynamic TLS model.
The static TLS arena associated with the calculated TLS size tlssize_S is placed immediately preceding the thread pointer tp_t. Accesses to this TLS data are based on subtractions from tp_t.
The static TLS arena is associated with a linked list of initialization records. Each record in this list describes the TLS initialization image for one loaded dynamic object. Each record contains the following fields.
The thread library uses this information to allocate storage for the initial thread. This storage is initialized, and a dynamic TLS vector for the initial thread is created.
Thread Creation
For the initial thread, and for each new thread created, the thread library allocates a new TLS block for each loaded dynamic object. Blocks can be allocated separately, or as a single contiguous block.
Each thread t has an associated thread pointer tp_t, which points to the thread control block, TCB. The thread pointer, tp, always contains the value of tp_t for the current running thread.
The thread library then creates a vector of pointers, dtv_t, for the current thread t. The first element of each vector contains a generation number gen_t, which is used to determine when the vector needs to be extended. See Deferred Allocation of Thread-Local Storage Blocks.
Each remaining element in the vector, dtv_{t,m}, is a pointer to the block that is reserved for the TLS belonging to the dynamic object m.
For dynamically loaded, post-startup objects, the thread library defers the allocation of TLS blocks. Allocation occurs when the first reference is made to a TLS variable within the loaded object. For blocks whose allocation has been deferred, the pointer dtv_{t,m} is set to an implementation-defined special value.
Note – The runtime linker can group TLS templates for all startup objects so as to share a single element in the vector, dtv_{t,1}. This grouping does not affect the offset calculations described previously or the creation of the list of initialization records. For the following sections, however, the value of M, the total number of objects, starts with the value of 1.
The thread library then copies the initialization images to the corresponding locations within the new block of storage.
Post-Startup Dynamic Loading
A shared object containing only dynamic TLS can be loaded following process startup without limitations.
The runtime linker extends the list of initialization records to include the initialization template of the new object. The new object is given an index of m = M + 1. The counter M is incremented by 1. However, the allocation of new TLS blocks is deferred until the blocks are actually referenced.
When a shared object that contains only dynamic TLS is unloaded, the TLS blocks used by that shared object are freed.
A shared object containing static TLS can be loaded following process startup with limitations. Static TLS references can only be satisfied from any remaining backup TLS reservation. See Program Startup. This reservation is limited in size. In addition, this reservation can only provide storage for uninitialized TLS data items.
A shared object that contains static TLS is never unloaded. The shared object is tagged as non-deletable as a consequence of processing the static TLS.
Deferred Allocation of Thread-Local Storage Blocks
In a dynamic TLS model, when a thread t needs to access a TLS block for object m, the code updates the dtv_t and performs the initial allocation of the TLS block. The thread library provides the following interface for dynamic TLS allocation.
typedef struct {
    unsigned long ti_moduleid;
    unsigned long ti_tlsoffset;
} TLS_index;
extern void * __tls_get_addr(TLS_index * ti);    /* SPARC and x64 */
extern void * ___tls_get_addr(TLS_index * ti);   /* 32-bit x86 */
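The deferred allocation behind this interface can be modeled in portable C. The following is a simplified sketch and not the thread library's implementation. It assumes a fixed-size table of initialization records, uses NULL as the special value for a deferred block, and represents each thread by an explicit vector structure instead of a thread pointer. The names init_record, sim_dtv, sim_load_object, and sim_tls_get_addr are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long ti_moduleid;
    unsigned long ti_tlsoffset;
} TLS_index;

/* Hypothetical initialization record, one per loaded object. */
struct init_record {
    const void *image;       /* TLS initialization image */
    size_t      image_size;  /* bytes of initialized data */
    size_t      block_size;  /* full block size, zero-filled past the image */
};

#define MAX_MODULES 8
static struct init_record records[MAX_MODULES + 1];  /* slot 0 unused */
static unsigned long module_count;  /* M: number of objects registered */
static unsigned long global_gen;    /* bumped whenever an object is added */

/* Per-thread vector: a generation number plus one block pointer per object. */
struct sim_dtv {
    unsigned long gen;
    unsigned long len;    /* number of slots in block[], slot 0 unused */
    void        **block;  /* NULL marks a block whose allocation is deferred */
};

/* Register a new object's template; returns its module index m = M + 1. */
unsigned long sim_load_object(const void *image, size_t image_size,
                              size_t block_size)
{
    assert(module_count < MAX_MODULES);
    assert(block_size > 0 && image_size <= block_size);
    module_count++;
    records[module_count] =
        (struct init_record){ image, image_size, block_size };
    global_gen++;
    return module_count;
}

/* Model of the lookup: refresh the vector if stale, allocate the block on
 * first reference, then return the address at the requested offset. */
void *sim_tls_get_addr(struct sim_dtv *dtv, const TLS_index *ti)
{
    unsigned long m = ti->ti_moduleid;

    if (dtv->gen != global_gen) {            /* vector is out of date */
        void **grown = realloc(dtv->block,
                               (module_count + 1) * sizeof *grown);
        if (grown == NULL)
            return NULL;
        for (unsigned long i = dtv->len; i <= module_count; i++)
            grown[i] = NULL;                 /* new slots start deferred */
        dtv->block = grown;
        dtv->len = module_count + 1;
        dtv->gen = global_gen;
    }
    if (m == 0 || m >= dtv->len)
        return NULL;                         /* unknown module index */
    if (dtv->block[m] == NULL) {             /* first reference by this thread */
        const struct init_record *r = &records[m];
        char *b = calloc(1, r->block_size);
        if (b == NULL)
            return NULL;
        if (r->image_size > 0)
            memcpy(b, r->image, r->image_size);
        dtv->block[m] = b;
    }
    return (char *)dtv->block[m] + ti->ti_tlsoffset;
}
```

In this model, two zero-initialized sim_dtv structures stand in for two threads. Each receives its own freshly initialized block the first time it references the object, and a write through one vector is not visible through the other.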
Note – The SPARC and 64-bit x86 definitions of this function have the same function signature. However, the 32-bit x86 version does not use the default calling convention of passing arguments on the stack. Instead, the 32-bit x86 version passes its argument in the %eax register, which is more efficient. To denote that this alternate calling method is used, the 32-bit x86 function name has three leading underscores.
Both versions of tls_get_addr() check the per-thread generation counter, gen_t, to determine whether the vector needs to be updated. If the vector dtv_t is out of date, the routine updates the vector, possibly reallocating the vector to make room for more entries. The routine then checks to see if the TLS block corresponding to dtv_{t,m} has been allocated. If the block has not been allocated, the routine allocates and initializes the block. The routine uses the information in the list of initialization records provided by the runtime linker. The pointer dtv_{t,m} is set to point to the allocated block. The routine returns a pointer to the given offset within the block.
Thread-Local Storage Access Models
Each TLS reference follows one of the following access models. These models are listed from the most general, but least optimized, to the fastest, but most restrictive.
The link-editor can transition code from the more general access models to the more optimized models, if the transition is determined appropriate. This transitioning is achievable through the use of unique TLS relocations. These relocations not only request that updates be performed, but also identify which TLS access model is being used.
Knowledge of the TLS access model, together with the type of object being created, allows the link-editor to perform translations. For example, a relocatable object using the GD access model might be linked into a dynamic executable. In this case, the link-editor can transition the references to the IE or LE access models, as appropriate. The relocations that are required for the model are then performed.
The following diagram illustrates the different access models, together with the transition of one model to another model.
Figure 8–2 Thread-Local Storage Access Models and Transitions
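At the source level, some compilers let the programmer select an access model for an individual variable. The following is a minimal sketch, assuming GCC or Clang, which accept a tls_model variable attribute with the values global-dynamic, local-dynamic, initial-exec, and local-exec, and a -ftls-model= command-line option. This attribute is a compiler extension and is not described in this chapter. The variable and function names are hypothetical. The sketch is intended to be linked into an executable, because the local-exec model is not valid in a shared object.

```c
/* One variable per access model, from most general to most restrictive. */
__thread int gd_var __attribute__((tls_model("global-dynamic"))) = 1;
static __thread int ld_var __attribute__((tls_model("local-dynamic"))) = 2;
__thread int ie_var __attribute__((tls_model("initial-exec"))) = 3;
static __thread int le_var __attribute__((tls_model("local-exec"))) = 4;

/* Read the calling thread's copy of each variable. The value read does not
 * depend on the model; only the generated access sequence differs. */
int sum_tls_models(void)
{
    return gd_var + ld_var + ie_var + le_var;
}
```

When such an object is linked into an executable, the link-editor is free to transition the more general references to a more optimized model, so the final code sequences can differ from those the compiler emitted.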
SPARC: Thread-Local Variable Access
On SPARC, the following code sequence models are available for accessing thread-local variables.
SPARC: General Dynamic (GD)
This code sequence implements the GD model described in Thread-Local Storage Access Models.
Table 8–2 SPARC: General Dynamic Thread-Local Variable Access Codes
The sethi and add instructions generate R_SPARC_TLS_GD_HI22 and R_SPARC_TLS_GD_LO10 relocations respectively. These relocations instruct the link-editor to allocate space in the GOT to hold a TLS_index structure for variable x. The link-editor processes this relocation by substituting the GOT-relative offset for the new GOT entry.
The load object index and TLS block index for x are not known until runtime. Therefore, the link-editor places the R_SPARC_TLS_DTPMOD32 and R_SPARC_TLS_DTPOFF32 relocations against the GOT for processing by the runtime linker.
The second add instruction causes the generation of the R_SPARC_TLS_GD_ADD relocation. This relocation is used only if the GD code sequence is changed to another sequence by the link-editor.
The call instruction uses the special syntax, x@TLSPLT. This call references the TLS variable and generates the R_SPARC_TLS_GD_CALL relocation. This relocation instructs the link-editor to bind the call to the __tls_get_addr() function, and associates the call instruction with the GD code sequence.
Note – The add instruction must appear before the call instruction. The add instruction cannot be placed into the delay slot for the call. This requirement is necessary because the code transformations that can occur later require a known order.
The register used as the GOT-pointer for the add instruction tagged by the R_SPARC_TLS_GD_ADD relocation must be the first register in the add instruction. This requirement permits the link-editor to identify the GOT-pointer register during a code transformation.
SPARC: Local Dynamic (LD)
This code sequence implements the LD model described in Thread-Local Storage Access Models.
Table 8–3 SPARC: Local Dynamic Thread-Local Variable Access Codes
The first sethi instruction and add instruction generate R_SPARC_TLS_LDM_HI22 and R_SPARC_TLS_LDM_LO10 relocations respectively. These relocations instruct the link-editor to allocate space in the GOT to hold a TLS_index structure for the current object. The link-editor processes this relocation by substituting the GOT-relative offset for the new GOT entry.
The load object index is not known until runtime. Therefore, an R_SPARC_TLS_DTPMOD32 relocation is created, and the ti_tlsoffset field of the TLS_index structure is zero filled.
The second add and the call instruction are tagged with the R_SPARC_TLS_LDM_ADD and R_SPARC_TLS_LDM_CALL relocations respectively.
The following sethi instruction and xor instruction generate the R_SPARC_TLS_LDO_HIX22 and R_SPARC_TLS_LDO_LOX10 relocations, respectively. The TLS offset for each local symbol is known at link-edit time, so these values are filled in directly. The add instruction is tagged with the R_SPARC_TLS_LDO_ADD relocation.
When a procedure references more than one local symbol, the compiler generates code to obtain the base address of the TLS block once. This base address is then used to calculate the address of each symbol without a separate library call.
Note – The register containing the TLS object address in the add instruction tagged by the R_SPARC_TLS_LDO_ADD must be the first register in the instruction sequence. This requirement permits the link-editor to identify the register during a code transformation.
32-bit SPARC: Initial Executable (IE)
This code sequence implements the IE model described in Thread-Local Storage Access Models.
Table 8–4 32-bit SPARC: Initial Executable Thread-Local Variable Access Codes
The sethi instruction and or instruction generate R_SPARC_TLS_IE_HI22 and R_SPARC_TLS_IE_LO10 relocations, respectively. These relocations instruct the link-editor to create space in the GOT to store the static TLS offset for symbol x. An R_SPARC_TLS_TPOFF32 relocation is left outstanding against the GOT for the runtime linker to fill in with the negative static TLS offset for symbol x. The ld and the add instructions are tagged with the R_SPARC_TLS_IE_LD and R_SPARC_TLS_IE_ADD relocations respectively.
Note – The register used as the GOT-pointer for the add instruction tagged by the R_SPARC_TLS_IE_ADD relocation must be the first register in the instruction. This requirement permits the link-editor to identify the GOT-pointer register during a code transformation.
64-bit SPARC: Initial Executable (IE)
This code sequence implements the IE model described in Thread-Local Storage Access Models.
Table 8–5 64-bit SPARC: Initial Executable Thread-Local Variable Access Codes
SPARC: Local Executable (LE)
This code sequence implements the LE model described in Thread-Local Storage Access Models.
Table 8–6 SPARC: Local Executable Thread-Local Variable Access Codes
The sethi and xor instructions generate R_SPARC_TLS_LE_HIX22 and R_SPARC_TLS_LE_LOX10 relocations respectively. The link-editor binds these relocations directly to the static TLS offset for the symbol defined in the executable. No relocation processing is required at runtime.
SPARC: Thread-Local Storage Relocation Types
The TLS relocations that are listed in the following table are defined for SPARC. Descriptions in the table use the following notation.
Some relocation types have semantics beyond simple calculations.
32-bit x86: Thread-Local Variable Access
On x86, the following code sequence models are available for accessing TLS.
32-bit x86: General Dynamic (GD)
This code sequence implements the GD model described in Thread-Local Storage Access Models.
Table 8–8 32-bit x86: General Dynamic Thread-Local Variable Access Codes
The leal instruction generates an R_386_TLS_GD relocation, which instructs the link-editor to allocate space in the GOT to hold a TLS_index structure for variable x. The link-editor processes this relocation by substituting the GOT-relative offset for the new GOT entry.
Because the load object index and TLS block index for x are not known until runtime, the link-editor places the R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32 relocations against the GOT for processing by the runtime linker. The address of the generated GOT entry is loaded into register %eax for the call to ___tls_get_addr().
The call instruction causes the generation of the R_386_TLS_GD_PLT relocation. This relocation instructs the link-editor to bind the call to the ___tls_get_addr() function and associates the call instruction with the GD code sequence.
The call instruction must immediately follow the leal instruction. This requirement is necessary to permit the code transformations.
32-bit x86: Local Dynamic (LD)
This code sequence implements the LD model described in Thread-Local Storage Access Models.
Table 8–9 32-bit x86: Local Dynamic Thread-Local Variable Access Codes
The first leal instruction generates an R_386_TLS_LDM relocation. This relocation instructs the link-editor to allocate space in the GOT to hold a TLS_index structure for the current object. The link-editor processes this relocation by substituting the GOT-relative offset for the new linkage table entry.
The load object index is not known until runtime. Therefore, an R_386_TLS_DTPMOD32 relocation is created, and the ti_tlsoffset field of the structure is zero filled. The call instruction is tagged with the R_386_TLS_LDM_PLT relocation.
The TLS offset for each local symbol is known at link-edit time, so the link-editor fills these values in directly.
When a procedure references more than one local symbol, the compiler generates code to obtain the base address of the TLS block once. This base address is then used to calculate the address of each symbol without a separate library call.
32-bit x86: Initial Executable (IE)
This code sequence implements the IE model described in Thread-Local Storage Access Models.
Two code sequences for the IE model exist. One sequence is for position independent code, which uses a GOT-pointer. The other sequence is for position dependent code, which does not use a GOT-pointer.
Table 8–10 32-bit x86: Initial Executable, Position Independent, Thread-Local Variable Access Codes
The addl instruction generates an R_386_TLS_GOTIE relocation. This relocation instructs the link-editor to create space in the GOT to store the static TLS offset for symbol x. An R_386_TLS_TPOFF relocation is left outstanding against the GOT for the runtime linker to fill in with the static TLS offset for symbol x.
Table 8–11 32-bit x86: Initial Executable, Position Dependent, Thread-Local Variable Access Codes
The addl instruction generates an R_386_TLS_IE relocation. This relocation instructs the link-editor to create space in the GOT to store the static TLS offset for symbol x. The main difference between this sequence and the position independent form is that the instruction is bound directly to the GOT entry created, instead of using an offset from the GOT-pointer register. An R_386_TLS_TPOFF relocation is left outstanding against the GOT for the runtime linker to fill in with the static TLS offset for symbol x.
The contents of variable x, rather than the address, can be loaded by embedding the offset directly into the memory reference as shown in the next two sequences.
Table 8–12 32-bit x86: Initial Executable, Position Independent, Dynamic Thread-Local Variable Access Codes
Table 8–13 32-bit x86: Initial Executable, Position Independent, Thread-Local Variable Access Codes
In the last sequence, if the %eax register is used instead of the %ecx register, the first instruction can be either 5 or 6 bytes long.
32-bit x86: Local Executable (LE)
This code sequence implements the LE model described in Thread-Local Storage Access Models.
Table 8–14 32-bit x86: Local Executable Thread-Local Variable Access Codes
The movl instruction generates an R_386_TLS_LE_32 relocation. The link-editor binds this relocation directly to the static TLS offset for the symbol defined in the executable. No processing is required at runtime.
The contents of variable x, rather than the address, can be accessed with the same relocation by using the following instruction sequence.
Table 8–15 32-bit x86: Local Executable Thread-Local Variable Access Codes
Rather than computing the address of the variable, a load from the variable or a store to the variable can be accomplished using the following sequence. Note that the x@ntpoff expression is not used as an immediate value, but as an absolute address.
Table 8–16 32-bit x86: Local Executable Thread-Local Variable Access Codes
32-bit x86: Thread-Local Storage Relocation Types
The TLS relocations that are listed in the following table are defined for x86. Descriptions in the table use the following notation.
x64: Thread-Local Variable Access
On x64, the following code sequence models are available for accessing TLS.
x64: General Dynamic (GD)
This code sequence implements the GD model described in Thread-Local Storage Access Models.
Table 8–18 x64: General Dynamic Thread-Local Variable Access Codes
The __tls_get_addr() function takes a single parameter, the address of the tls_index structure. The R_AMD64_TLSGD relocation that is associated with the x@tlsgd(%rip) expression instructs the link-editor to allocate a tls_index structure within the GOT. The two elements required for the tls_index structure are maintained in consecutive GOT entries, GOT[n] and GOT[n+1]. These GOT entries are associated with the R_AMD64_DTPMOD64 and R_AMD64_DTPOFF64 relocations.
The instruction at address 0x00 computes the address of the first GOT entry. This computation adds the PC-relative address of the beginning of the GOT, which is known at link-edit time, to the current instruction pointer. The result is passed using the %rdi register to the __tls_get_addr() function.
Note – The leaq instruction computes the address of the first GOT entry. This computation is carried out by adding the PC-relative address of the GOT, which was determined at link-edit time, to the current instruction pointer. The .byte, .word, and .rex64 prefixes ensure that the whole instruction sequence occupies 16 bytes. Prefixes are employed because prefixes have no negative impact on the code.
x64: Local Dynamic (LD)
This code sequence implements the LD model described in Thread-Local Storage Access Models.
Table 8–19 x64: Local Dynamic Thread-Local Variable Access Codes
The first two instructions are equivalent to the code sequence used for the general dynamic model, although without any padding. The two instructions must be consecutive. The x1@tlsld(%rip) sequence generates the tls_index entry for symbol x1. This index refers to the current module that contains x1 with an offset of zero. The link-editor creates one relocation for the object, R_AMD64_DTPMOD64.
The R_AMD64_DTPOFF64 relocation is unnecessary, because offsets are loaded separately. The x1@dtpoff expression is used to access the offset of the symbol x1. Using the instruction at address 0x10, the complete offset is loaded and added to the result of the __tls_get_addr() call in %rax to produce the result in %rcx. The x1@dtpoff expression creates the R_AMD64_DTPOFF32 relocation.
Instead of computing the address of the variable, the value of the variable can be loaded using the following instruction. This instruction creates the same relocation as the original leaq instruction.
Provided the base address of a TLS block is maintained within a register, loading, storing or computing the address of a protected thread-local variable requires one instruction.
Benefits exist in using the local dynamic model over the general dynamic model. Every additional thread-local variable access only requires three new instructions. In addition, no additional GOT entries or runtime relocations are required.
x64: Initial Executable (IE)
This code sequence implements the IE model described in Thread-Local Storage Access Models.
Table 8–20 x64: Initial Executable, Thread-Local Variable Access Codes
The R_AMD64_GOTTPOFF relocation for the symbol x requests the link-editor to generate a GOT entry and an associated R_AMD64_TPOFF64 relocation. The offset of the GOT entry relative to the end of the x@gottpoff(%rip) instruction is then used by the instruction. The R_AMD64_TPOFF64 relocation uses the value of the symbol x that is determined from the presently loaded modules. The offset is written in the GOT entry and later loaded by the addq instruction.
To load the contents of x, rather than the address of x, the following sequence is available.
Table 8–21 x64: Initial Executable, Thread-Local Variable Access Codes II
x64: Local Executable (LE)
This code sequence implements the LE model described in Thread-Local Storage Access Models.
Table 8–22 x64: Local Executable Thread-Local Variable Access Codes
To load the contents of a TLS variable instead of the address of a TLS variable, the following sequence can be used. Table 8–23 x64: Local Executable Thread-Local Variable Access Codes II
The following sequence is even shorter. Table 8–24 x64: Local Executable Thread-Local Variable Access Codes III
x64: Thread-Local Storage Relocation Types
The TLS relocations that are listed in the following table are defined for x64. Descriptions in the table use the following notation.