Chapter 2 Solaris x86 Assembly Language Syntax
This chapter documents the syntax of the Solaris x86 assembly language.
Lexical Conventions
This section discusses the lexical conventions of the Solaris x86 assembly
language.
Statements
An x86 assembly language program consists of one or more files containing statements.
A statement consists of tokens separated by whitespace and terminated by either a
newline character (ASCII 0x0A) or a semicolon (;) (ASCII 0x3B). Whitespace consists of
spaces (ASCII 0x20), tabs (ASCII 0x09), and formfeeds (ASCII 0x0C) that are not contained
in a string or comment. More than one statement can be placed on a single input line
provided that each statement is terminated by a semicolon. A statement can consist of
a comment. Empty statements, consisting only of whitespace, are allowed.
Comments
A comment can be appended to a statement. The comment consists of the slash
character (/) (ASCII 0x2F) followed by the text of the comment. The comment is terminated
by the newline that terminates the statement.
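As a brief illustration of the statement and comment rules above, the following sketch
places two statements on one input line and shows both forms of comment. The instructions
and register values are arbitrary and were chosen only for illustration.

```asm
        movl    $1, %eax; movl $2, %ebx   / two statements on one line
        / a statement can consist of a comment alone
        addl    %ebx, %eax                / the comment runs to the end of the line
```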
Labels
A label can be placed at the beginning of a statement. During assembly, the
label is assigned the current value of the active location counter and serves as an
instruction operand. There are two types of labels: symbolic and numeric.
Symbolic Labels
A symbolic label consists of an identifier (or symbol) followed by a colon (:) (ASCII 0x3A). Symbolic labels must be defined
only once. Symbolic labels have global scope and appear in the
object file's symbol table.
Symbolic labels with identifiers beginning with a period (.) (ASCII 0x2E) are
considered to have local scope and are not included in the object
file's symbol table.
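The two kinds of symbolic label can be sketched as follows. The label names copy_loop
and .Ldone are hypothetical. The scope noted in each comment restates the rules above
and is not the result of inspecting an actual object file.

```asm
        .text
copy_loop:                      / symbolic label, entered in the symbol table
        decl    %ecx
        jnz     copy_loop       / the label serves as an instruction operand
.Ldone:                         / begins with a period, so it has local scope
        ret
```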
Numeric Labels
A numeric label consists of a single digit in the range zero (0) through
nine (9) followed by a colon (:). Numeric labels are used only for local reference
and are not included in the object file's symbol table. Numeric labels have limited
scope and can be redefined repeatedly.
When a numeric label is used as a reference (as an instruction operand, for
example), the suffix b (“backward”) or f (“forward”) should be appended to the numeric
label. For numeric label N, the reference Nb refers to the nearest label N defined
before the reference, and the reference Nf refers to the nearest label N defined after
the reference. The following example illustrates the use of numeric labels:
1:                      / define numeric label "1"
one:                    / define symbolic label "one"

/ ... assembler code ...

        jmp   1f        / jump to first numeric label "1" defined
                        / after this instruction
                        / (this reference is equivalent to label "two")

        jmp   1b        / jump to last numeric label "1" defined
                        / before this instruction
                        / (this reference is equivalent to label "one")

1:                      / redefine label "1"
two:                    / define symbolic label "two"

        jmp   1b        / jump to last numeric label "1" defined
                        / before this instruction
                        / (this reference is equivalent to label "two")
Tokens
There are five classes of tokens:
-
Identifiers (symbols)
-
Keywords
-
Numerical constants
-
String Constants
-
Operators
Identifiers
An identifier is an arbitrarily long sequence of letters and digits. The first
character must be a letter; the underscore (_) (ASCII 0x5F) and the period (.) (ASCII
0x2E) are considered to be letters. Case is significant: uppercase and lowercase letters
are different.
Keywords
Keywords such as x86 instruction mnemonics (“opcodes”) and
assembler directives are reserved for the assembler and should not be used as identifiers.
See Chapter 3, Instruction Set Mapping for a list of the Solaris x86 mnemonics.
See Assembler Directives for the list of directives supported by the as assembler.
Numerical Constants
Numbers in the x86 architecture can be integers or floating point. Integers can be
signed or unsigned, with signed integers represented in two's complement representation.
Floating-point numbers can be single-precision, double-precision, or double-extended
precision.
Integer Constants
Integers can be expressed
in several bases:
-
Decimal. Decimal integers begin
with a non-zero digit followed by zero or more decimal digits (0–9).
-
Binary. Binary integers begin with “0b”
or “0B” followed by zero or more binary digits (0, 1).
-
Octal. Octal integers begin with
zero (0) followed by zero or more octal digits (0–7).
-
Hexadecimal. Hexadecimal integers
begin with “0x” or “0X” followed by one or more hexadecimal
digits (0–9, A–F). Hexadecimal digits can be either uppercase or lowercase.
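As an illustration of the four bases, each statement in the following sketch loads the
same value, 26, into %eax. The choice of instruction and register is arbitrary.

```asm
        movl    $26, %eax        / decimal
        movl    $0b11010, %eax   / binary
        movl    $032, %eax       / octal (leading zero)
        movl    $0x1A, %eax      / hexadecimal (0x1a is equivalent)
```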
Floating Point Constants
Floating point constants have the
following format:
-
Sign (optional) – either
plus (+) or minus (-)
-
Integer (optional) – zero
or more decimal digits (0–9)
-
Fraction (optional) – decimal
point (.) followed by zero or more decimal digits
-
Exponent (optional) – the
letter “e” or “E”, followed by an optional sign (plus or minus),
followed by one or more decimal digits (0–9)
A valid floating point constant must have either an integer part or a fractional
part.
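The following sketch builds several constants from the parts listed above, using the
.float and .double directives described later in this chapter. The symbol names are
hypothetical, and each form is derived only from the stated format rules; it has not
been checked against a particular release of the assembler.

```asm
        .data
pi:     .double 3.14159          / integer part and fraction
small:  .double -1.5e-3          / sign, integer part, fraction, signed exponent
half:   .float  .5               / fraction only, no integer part
big:    .float  6E23             / integer part and exponent, no fraction
```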
String Constants
A string constant consists of a sequence of characters enclosed in double
quotes (") (ASCII 0x22). To include a double-quote character ("), single-quote
character ('), or backslash character (\) within a string, precede the character with a
backslash (\) (ASCII 0x5C). A character can be expressed in a string as its ASCII value
in octal preceded by a backslash (for example, the letter “J” could be
expressed as “\112”). The assembler accepts the following escape sequences
in strings:
| Escape Sequence | Character Name | ASCII Value (hex) |
|---|---|---|
| \n | newline | 0A |
| \r | carriage return | 0D |
| \b | backspace | 08 |
| \t | horizontal tab | 09 |
| \f | form feed | 0C |
| \v | vertical tab | 0B |
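The escape sequences can be sketched in use with the .string and .ascii directives
described later in this chapter. The symbol names and string contents are hypothetical.

```asm
        .data
greet:  .string "Hello,\tworld\n"   / horizontal tab (09) and newline (0A)
quote:  .string "say \"hi\""        / escaped double quotes inside the string
path:   .string "C:\\tmp"           / escaped backslash
letter: .ascii  "\112"              / octal 112 is the letter J
```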
Operators
The assembler supports the following operators for use in expressions. Operators
have no assigned precedence. Expressions can be grouped in square brackets ([]) to establish precedence.
-
+
-
Addition
-
-
-
Subtraction
-
\*
-
Multiplication
-
\/
-
Division
-
&
-
Bitwise logical AND
-
|
-
Bitwise logical OR
-
>>
-
Shift right
-
<<
-
Shift left
-
\%
-
Remainder
-
!
-
Bitwise logical AND NOT
-
^
-
Bitwise logical XOR
Note –
The asterisk (*), slash (/), and
percent sign (%) characters are overloaded. When used as operators
in an expression, these characters must be preceded by the backslash character (\).
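A minimal sketch of expressions that follow these rules is shown below. The symbols HDR
and TOTAL are hypothetical, and the use of a bracketed expression as an immediate
operand is an assumption based on the grouping rule above rather than a form shown
elsewhere in this chapter.

```asm
        .set    HDR, 8                    / hypothetical symbol
        .set    TOTAL, [4 \* 16] + HDR    / bracketed group first: 64 + 8 = 72
        movl    $[0xff00 >> 8], %eax      / shift right: 0xff
        movl    $[TOTAL \% 10], %ebx      / remainder of 72 divided by 10: 2
        movl    $[0x0f & 0x3c], %ecx      / bitwise AND: 0x0c
```

Because operators have no assigned precedence, bracketing every subexpression is the
safest way to make the intended evaluation order explicit.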
Instructions, Operands, and Addressing
Instructions are operations performed by the CPU. Operands are entities operated upon by the instruction. Addresses are the locations in memory of specified data.
Instructions
An instruction is a statement that is executed at runtime. An x86 instruction
statement can consist of four parts:
-
Label (optional)
-
Instruction (required)
-
Operands (instruction specific)
-
Comment (optional)
See Statements for the description of labels
and comments.
The terms instruction and mnemonic are
used interchangeably in this document to refer to the names of x86 instructions.
Although the term opcode is sometimes used as a synonym for instruction, this document reserves the term opcode for
the hexadecimal representation of the instruction value.
For most instructions, the Solaris x86 assembler mnemonics are the same as the Intel
or AMD mnemonics. However, the Solaris x86 mnemonics might appear to be different
because the Solaris mnemonics are suffixed with a one-character modifier that specifies
the size of the instruction operands. That is, the Solaris assembler derives its operand
type information from the instruction name and the suffix. If a mnemonic is specified
with no type suffix, the operand type defaults to long. Possible
operand types and their instruction suffixes are:
-
b
-
Byte (8-bit)
-
w
-
Word (16-bit)
-
l
-
Long (32-bit) (default)
-
q
-
Quadword (64-bit)
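The effect of each suffix can be sketched with the mov instruction. The immediate value
and the registers are arbitrary.

```asm
        movb    $1, %al          / byte (8-bit) operand
        movw    $1, %ax          / word (16-bit) operand
        movl    $1, %eax         / long (32-bit) operand
        movq    $1, %rax         / quadword (64-bit) operand, 64-bit mode only
```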
The assembler recognizes the following suffixes for x87 floating-point instructions:
-
[no suffix]
-
Instruction operands are registers only
-
l (“long”)
-
Instruction operands are 64-bit
-
s (“short”)
-
Instruction operands are 32-bit
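The three forms can be sketched as follows. The symbols f32 and f64 are hypothetical
memory operands defined only for this example.

```asm
        .data
f32:    .float  1.5              / hypothetical single-precision value
f64:    .double 2.5              / hypothetical double-precision value
        .text
        flds    f32              / s suffix: push a 32-bit memory operand
        fldl    f64              / l suffix: push a 64-bit memory operand
        fadd    %st(1), %st      / no suffix: registers only, %st = 2.5 + 1.5
```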
See Chapter 3, Instruction Set Mapping for a mapping between Solaris x86 assembly
language mnemonics and the equivalent Intel or AMD mnemonics.
Operands
An x86 instruction can have zero to three operands. Operands are separated by commas (,)
(ASCII 0x2C). For instructions with two operands, the first (lefthand) operand is
the source operand, and the second (righthand) operand is the destination operand (that is, source->destination).
Note –
The Intel assembler uses the opposite order (destination<-source) for operands.
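The operand order can be sketched with a few two-operand instructions. The Intel-order
equivalents appear only in the comments, for comparison.

```asm
        movl    $5, %eax         / source $5, destination %eax
        movl    %eax, %ebx       / copy %eax into %ebx  (Intel order: mov ebx, eax)
        subl    %ebx, %eax       / %eax = %eax - %ebx   (Intel order: sub eax, ebx)
```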
Operands can be immediate (that is, constant expressions
that evaluate to an inline value), register (a value in the processor
number registers), or memory (a value stored in memory). An indirect operand contains the address of the actual operand value. Indirect
operands are specified by prefixing the operand with an asterisk (*) (ASCII 0x2A).
Only jump and call instructions can use indirect operands.
-
Immediate operands are prefixed with a dollar sign
($) (ASCII 0x24)
-
Register names are prefixed with a percent sign (%) (ASCII 0x25)
-
Memory operands are specified
either by the name of a variable or by a register that contains the address of a variable.
A variable name implies the address of a variable and instructs the computer to reference
the contents of memory at that address. Memory references have the following syntax: segment:offset(base, index, scale).
-
Segment is any of the x86 architecture
segment registers. Segment is optional: if specified, it
must be separated from offset by a colon (:).
If segment is omitted, the value of %ds (the
default segment register) is assumed.
-
Offset is the displacement from segment of the desired memory value. Offset is
optional.
-
Base and index can
be any of the general 32-bit number registers.
-
Scale is a factor by which index is to be multiplied before being added to base to
specify the address of the operand. Scale can have the
value of 1, 2, 4, or 8. If scale is not specified, the
default value is 1.
Some examples of memory addresses are:
-
movl var, %eax
-
Move the contents of memory location var into number
register %eax.
-
movl %cs:var, %eax
-
Move the contents of memory location var in the
code segment (register %cs) into number register %eax.
-
movl $var, %eax
-
Move the address of var into number register %eax.
-
movl array_base(%esi), %eax
-
Add the address of memory location array_base to
the contents of number register %esi to determine an address in
memory. Move the contents of this address into number register %eax.
-
movl (%ebx, %esi, 4), %eax
-
Multiply the contents of number register %esi by
4 and add the result to the contents of number register %ebx to
produce a memory reference. Move the contents of this memory location into number
register %eax.
-
movl struct_base(%ebx, %esi, 4), %eax
-
Multiply the contents of number register %esi by
4, add the result to the contents of number register %ebx, and
add the result to the address of struct_base to produce an address.
Move the contents of this address into number register %eax.
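The address arithmetic in these examples can be made concrete with the lea instruction,
which computes the address of its memory operand without reading memory. The register
values and the displacement of 8 are arbitrary values chosen for this sketch.

```asm
        movl    $0x1000, %ebx            / base
        movl    $3, %esi                 / index
        leal    8(%ebx, %esi, 4), %eax   / %eax = 8 + 0x1000 + 3 * 4 = 0x1014
        leal    (%ebx, %esi), %ecx       / scale defaults to 1: %ecx = 0x1003
```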
Assembler Directives
Directives are commands that are part of the assembler syntax but are not
related to the x86 processor instruction set. All assembler directives begin with
a period (.) (ASCII 0x2E).
-
.align integer, pad
-
The .align directive causes the next data generated
to be aligned modulo integer bytes. Integer must be a positive integer expression and must be a power of 2. If
specified, pad is an integer byte value used for padding.
The default value of pad for the text section
is 0x90 (nop); for other sections, the default value of pad is
zero (0).
-
.ascii "string"
-
The .ascii directive places the characters in string into the object module at the current location but does not terminate the string with a null byte (\0). String must
be enclosed in double quotes (") (ASCII 0x22). The .ascii directive is not valid for the .bss section.
-
.bcd integer
-
The .bcd directive generates a packed decimal (80-bit)
value into the current section. The .bcd directive is not valid
for the .bss section.
-
.bss
-
The .bss directive changes the current section
to .bss.
-
.bss symbol, integer
-
Define symbol in the .bss section
and add integer bytes to the value of the location counter
for .bss. When issued with arguments, the .bss directive
does not change the current section to .bss. Integer must be positive.
-
.byte byte1,byte2,...,byteN
-
The .byte directive generates initialized bytes
into the current section. The .byte directive is not valid for
the .bss section. Each byte must be
an 8-bit value.
-
.2byte expression1, expression2, ..., expressionN
-
Refer to the description of the .value directive.
-
.4byte expression1, expression2, ..., expressionN
-
Refer to the description of the .long directive.
-
.8byte expression1, expression2, ..., expressionN
-
Refer to the description of the .quad directive.
-
.comm name, size, alignment
-
The .comm directive allocates storage in the data section. The storage is referenced by the identifier name. Size is measured in bytes and must
be a positive integer. Name cannot be predefined. Alignment is optional. If alignment is specified,
the address of name is aligned to a multiple of alignment.
-
.data
-
The .data directive changes the current section
to .data.
-
.double float
-
The .double directive generates a double-precision
floating-point constant into the current section. The .double directive
is not valid for the .bss section.
-
.even
-
The .even directive aligns the current program
counter (.) to an even boundary.
-
.ext expression1, expression2, ..., expressionN
-
The .ext directive generates an 80387 80-bit
floating point constant for each expression into the current
section. The .ext directive is not valid for the .bss section.
-
.file "string"
-
The .file directive creates a symbol table entry
where string is the symbol name and STT_FILE is the symbol table type. String specifies
the name of the source file associated with the object file.
-
.float float
-
The .float directive generates a single-precision
floating-point constant into the current section. The .float directive
is not valid in the .bss section.
-
.globl symbol1, symbol2, ..., symbolN
-
The .globl directive declares each symbol in the list to be global. Each symbol is
either defined externally or defined in the input file and accessible in other files.
Default bindings for the symbol are overridden. A global symbol definition in one
file satisfies an undefined reference to the same global symbol in another file. Multiple
definitions of a defined global symbol are not allowed. If a defined global symbol
has more than one definition, an error occurs. The .globl directive
only declares the symbol to be global in scope; it does not define the symbol.
-
.group group, section, #comdat
-
The .group directive adds section to a COMDAT group. Refer to COMDAT Section in Linker and Libraries Guide for additional information
about COMDAT.
-
.hidden symbol1, symbol2, ..., symbolN
-
The .hidden directive declares each symbol in the list to have hidden linker scoping.
All references to symbol within a dynamic module bind to
the definition within that module. Symbol is not visible
outside of the module.
-
.ident "string"
-
The .ident directive creates an entry in the .comment section containing string. String is any sequence of characters, not including the double quote
("). To include the double quote character within a string, precede
the double quote character with a backslash (\) (ASCII 0x5C).
-
.lcomm name, size, alignment
-
The .lcomm directive allocates storage in the .bss section. The storage is referenced by the symbol name, and has a size of size bytes. Name cannot be predefined, and size must
be a positive integer. If alignment is specified, the address
of name is aligned to a multiple of alignment bytes. If alignment is not specified, the
default alignment is 4 bytes.
-
.local symbol1, symbol2, ..., symbolN
-
The .local directive declares each symbol in the list to be local. Each symbol is
defined in the input file and not accessible to other files. Default bindings for
the symbols are overridden. Symbols declared with the .local directive
take precedence over weak and global symbols.
(See Symbol Table Section in Linker and Libraries Guide for a description of global and weak symbols.) Because local
symbols are not accessible to other files, local symbols of the same name may exist
in multiple files. The .local directive only declares the symbol
to be local in scope; it does not define the symbol.
-
.long expression1, expression2, ..., expressionN
-
The .long directive generates a long integer (32-bit,
two's complement value) for each expression into the current
section. Each expression must be a 32-bit value and
must evaluate to an integer value. The .long directive is not valid
for the .bss section.
-
.popsection
-
The .popsection directive pops the top of the section
stack and continues processing of the popped section.
-
.previous
-
The .previous directive continues processing of
the previous section.
-
.pushsection section
-
The .pushsection directive pushes the specified
section onto the section stack and switches to another section.
-
.quad expression1, expression2, ..., expressionN
-
The .quad directive generates an initialized quadword
(64-bit, two's complement value) for each expression into
the current section. Each expression must be a 64-bit value,
and must evaluate to an integer value. The .quad directive is not
valid for the .bss section.
-
.rel symbol@ type
-
The .rel directive generates the specified relocation
entry type for the specified symbol.
The .rel directive supports TLS (thread-local storage). Refer to Chapter 8, Thread-Local Storage, in Linker and Libraries Guide for
additional information about TLS.
-
.section section, attributes
-
The .section directive makes section the current section. If section does not
exist, a new section with the specified name and attributes is created. If section is a non-reserved section, attributes must
be included the first time section is specified by the .section directive.
-
.set symbol, expression
-
The .set directive assigns the value of expression to symbol. Expression can be any legal expression that evaluates to a numerical value.
-
.skip integer, value
-
While generating values for any data section, the .skip directive
causes integer bytes to be skipped over, or, optionally,
filled with the specified value.
-
.sleb128 expression
-
The .sleb128 directive generates a signed, little-endian,
base 128 number from expression.
-
.string "string"
-
The .string directive places the characters in string into the object module at the current location and terminates
the string with a null byte (\0). String must be enclosed
in double quotes (") (ASCII 0x22). The .string directive
is not valid for the .bss section.
-
.symbolic symbol1, symbol2, ..., symbolN
-
The .symbolic directive declares each symbol in the list to have symbolic linker scoping.
All references to symbol within a dynamic module bind to
the definition within that module. Outside of the module, symbol is
treated as global.
-
.tbss
-
The .tbss directive changes the current section
to .tbss. The .tbss section contains uninitialized
TLS data objects that will be initialized to zero by the runtime linker.
-
.tcomm
-
The .tcomm directive defines a TLS common block.
-
.tdata
-
The .tdata directive changes the current section
to .tdata. The .tdata section contains the initialization
image for initialized TLS data objects.
-
.text
-
The .text directive changes the current section
to .text.
-
.uleb128 expression
-
The .uleb128 directive generates an unsigned, little-endian,
base 128 number from expression.
-
.value expression1, expression2, ..., expressionN
-
The .value directive generates an initialized word
(16-bit, two's complement value) for each expression into
the current section. Each expression must be a 16-bit integer
value. The .value directive is not valid for the .bss section.
-
.weak symbol1, symbol2, ..., symbolN
-
The .weak directive declares each symbol in the argument list to be defined either externally or in the
input file and accessible to other files. Default bindings of the symbol are overridden
by the .weak directive. A weak symbol definition
in one file satisfies an undefined reference to a global symbol of the same name in
another file. Unresolved weak symbols have a default value of
zero. The link editor does not resolve these symbols. If a weak symbol
has the same name as a defined global symbol, the weak symbol
is ignored and no error results. The .weak directive does not define
the symbol.
-
.zero expression
-
While filling a data section, the .zero directive
fills the number of bytes specified by expression with
zero (0).
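Several of the directives above can be sketched together in one small source file. All
symbol names are hypothetical, and the byte counts and encodings in the comments follow
from the descriptions above and from the standard LEB128 encoding rather than from an
inspected object file.

```asm
        .set    COUNT, 4                 / assign a value to a symbol

        .data
flag:   .byte   1                        / one initialized byte
        .align  4                        / pad with zeros to a 4-byte boundary
table:  .long   0x12345678               / 32-bit value, low byte first on x86
word16: .value  0x1234                   / 16-bit value (.2byte is equivalent)
raw:    .ascii  "abc"                    / 3 bytes, no terminating null
cstr:   .string "abc"                    / 4 bytes, including the null
uval:   .uleb128 624485                  / encodes as bytes E5 8E 26
sval:   .sleb128 -123456                 / encodes as bytes C0 BB 78
pad:    .zero   16                       / 16 zero bytes
        .skip   8, 0xff                  / 8 bytes filled with 0xff

        .lcomm  scratch, 64, 8           / 64 bytes in .bss, aligned to 8 bytes
        .comm   shared, 256, 8           / 256 bytes of common storage

        .text
        .globl  get_count                / declare the symbol global
get_count:                               / the label defines the symbol
        movl    $COUNT, %eax
        ret
```

The .globl line only changes the binding of get_count. The symbol is defined by the
label that follows it, which is why both appear in the sketch.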