## GVPHarmonyOS / OpenArkCompilerC++MulanPSL-1.0

MapleIRDesign.md
 2019-10-31 [TOC]# IntroductionMAPLE IR is an internal program representation to support program compilation and execution.  Because any information in the source program may be useful for program analysis and optimization, MAPLE IR aims to provide information about the source program that is as complete as possible.Program information consists of two parts: declaration part for defining the program constructs and the executable program code.  The former is commonly referred to as the *symbol table*.MAPLE IR is target-independent.  It is not pre-disposed towards any specific target processor or processor characteristic.MAPLE IR can be regarded as the ISA of a virtual machine (VM).  The MAPLE VM can be regarded as a general purpose processor that takes MAPLE IR as input and directly executes the program portion of the MAPLE IR.  MAPLE VM can be regarded as the first consumer of MAPLE IR.  A program compiled into MAPLE IR can be executed on MAPLE VM without the need to finish its compilation to the machine instructions of any target processor.MAPLE IR is the common representation for programs compiled from different programming languages, which include general purpose languages like C, C++ and Java.  MAPLE IR is extensible.  As additional languages, including domain-specific languages, are compiled into MAPLE IR, more constructs will beadded to represent constructs unique to each language.MAPLE IR is capable of supporting all known program analyses and optimizations, owing to its flexibility of being able to represent program code at different semantic levels.   MAPLE IR's program representation at the highest level exhibits the following characteristics:- Many language constructs- Short code sequences- Constructs are hierarchical- No loss of program informationLanguage-specific analyses and optimizations are best performed at the high level.  As compilation proceeds, MAPLE IR is gradually lowered so that the granularity of its operations corresponds more closely to the instructions of a general purpose processor.  At the lower levels, general purpose optimizations are performed.  In particular, at the lowest level, MAPLE IR instructions map one-to-one to machine instructions most of the time, for the mainstream processor ISAs.  This is important to maximize the effects of optimizations at the IR level, as each eliminated operation will have the corresponding effect on the target machine.  At the lowest level, all operations, including type conversion operations, are explicitly expressed at the IR level. MAPLE IR represents program code intrinsically in the form of trees.  At the highest level, it honors the hierarchical form of the program as it exists at the source level via the tree representation.  It also honors the abstract operations defined by the language.  As compilation proceeds, the abstract operations are lowered into general-purpose operations that require longer code sequences.  The program structure also becomes more flat, as general-purpose processors work by executing lists of instructions sequentially.Since MAPLE IR is one IR that can exist at multiple levels of semantics, the level of a MAPLE IR program is dictated by the constraints that it adheres to. These constraints are of the following two types:- Opcodes allowed - The higher the level, the more types of opcodes allowed, including opcodes generated only from specific languages.  At the lowest level, only opcodes that correspond one-to-one to operations in a general purpose processor are allowed.- Code structure - The program structure is hierarchical at the higher levels.  The hierarchical constructs become less and less as lowering proceeds.  At the lowest level, the program structure is flat, consisting of sequences of primitive instructions consumed by the general purpose processor.Though MAPLE IR is target-independent at the highest level, the lowering process will make it become target-dependent.  # Program RepresentationThere are additional design criteria at the implementation level.  To be friendly to compiler developers, MAPLE IR exists in both binary and ASCII forms.  Conversion between the two forms can be viewed as assembly and dis-assembly.  Thus, the ASCII form can be viewed as the assembler language of the MAPLE VM instructions.  The ASCII form is editable, which implies that it is possible to program in MAPLE IR directly.  Thus, the ASCII form of MAPLE IR is modeled after a typical C program, which is made up of:- declaration statements - these represent the symbol table information- executable statements - these represent the executable program codeBut an ASCII MAPLE IR is *not* intended to obey C programming syntax.The language front-end compiles a source file into a MAPLE IR file.  Thus, each MAPLE IR file corresponds to a CU (compilation unit).  A CU is made up of declarations at the global scope.  Among the declarations are functions, also known as PUs (program units).  Inside PUs are declarations at the local scope followed by the executable code of the function.There are three kinds of executable nodes in MAPLE IR:- Leaf nodes - Also called terminal nodes, these nodes denote a value at execution time, which may be a constant or the value of a storage unit.- Expression nodes - An expression node performs an operation on its operands to compute a result. Its result is a function of the values of its operands and nothing else. Each operand can be either a leaf node or another expression node. Expression nodes are the internal nodes of expression trees. The type field in the expression node gives the type associated with the result of the operation.- Statement nodes - These represent the flow of control. Execution starts at the entry of the function and continues sequentially statement by statement until a control flow statement is executed. Apart from modifying control flow, statements can also modify data storage in the program. A statement nodes has operands that can be leaf, expression or statement nodes. In all the executable nodes, the *opcode* field specifies the operation of the node, followed by additional field specification relevant to the opcode. The operands for the node are specified inside *parentheses* separated *commas*.  The general form is:*opcode fields (opnd0, opnd1, opnd2)*For example, the C statement "a = b" is specified using the direct assignment opcode **dassign** that assigns the rhs operand *b*  to *a* .dassign $a (dread i32$b)To enable easy visualization in the ASCII IR, whenever an operand is *not* a leaf node, we require the start of a new line indented to the right by two spaces relative to the last line.  Thus, the statement "a = b + c" is:dassign $a ( add i32(dread i32$b, dread i32 $c))and the statement "a = b + c - d" is:dassign$a ( sub i32( add i32(dread i32 $b, dread i32$c), dread i32 $d))The general rules regarding line breaks are as follows:- Each expression or statement node must occupy its own line, and each line cannot contain more than one expression or statement node.- When there is at least one operand that is not a leaf node, then all the operands of the current expression or statement node must be specified in separate new lines, including operands that are leaf nodes.- Comments can be specified via the character '\#', which can be regarded as the end of line character by the IR parser.For human-edited MAPLE IR files, the line breaks are not enforced for expressions, as they do not affect the correctness of the program, since the end of operand specification is indicated by the closing parenthesis. But there must not be more than one statement node per line, because we do not use the ';'character to delimit statement boundary.## Symbol TablesProgram information that is of declarative nature is best stored in the symbol table portion of the IR. Having the executable instructions refer to the symbol tables reduces the amount of information that needs to be stored in the executable instructions. For each declaration scope, there is a main table called the Symbol Table that manages all the user-defined names in that scope. This implies one global Symbol Table and a local Symbol Table for each function declared in the file. The various types of symbol entries correspond to the various types of declarations supported in typical programming languages:1. Storage variables2. Types3. PUs (program units, functions or their prototypes)The PU table only exists at the global scope.In ASCII IR, the IR instructions refer to the various symbols by their names. In the binary representation, only the appropriate scope+index needs to be encoded in the instruction.## Primitive TypesPrimitive types can be regarded as pre-defined types supported by the execution engine such that they can be directly operated on. They also play a part in conveying the semantics of operations, as addresses are distinct from unsigned integers. The number in the primitive type name indicates the storage size in bits.The primitive types are:- no type - void- signed integers - i8, i16, i32, i64- unsigned integers - u8, u16, u32, u64- booleans- u1- addresses - ptr, ref, I'a32, a64- floating point numbers - f32, f64- complex numbers - c64, c128- javascript types: - dynany - dynu32 - dyni32 - dynundef - dynnull - dynhole - dynbool - dynptr - dynf64 - dynf32 - dynstr - dynobj- SIMD types - (to be defined)- unknownAn instruction that produces or operates on values must specify the primitive type in the instruction, as the type is not necessarily implied by the opcode. There is the distinction between *result type* and *operand type*. Result type can be regarded as the type of the value as it resides in a machine register, because arithmetic operations in the mainstream processor architectures are mostly register-based. When an instruction only specifies a single type, the type specified applies to both the operands and the result. In the case of instructions where the operand and result types may differ, the type specified is the result type, and a second field specifies the operand type.Some opcodes are applicable to non-primitive (or derived) types, as in an aggregate assignment. When the type is derived, the designation *agg* can be used. In such cases, the data size can be looked up from the type of the symbol.The primitive types *ptr* and *ref* are the target-independent types for addresses. *ref* conveys the additional semantics that the address is a reference to a run-time managed block of memory or object in the heap. Uses of *ptr* or *ref* instead of *a32* or *a64* allow the IR to be independent of the target machine by not manifesting the size of addresses until the later target-dependent compilation phases.The primitive type *unknown* is used by the language front-end when the type of a field in an object has not been fully resolved because the full definition resides in a different compilation unit.## ConstantsConstants in MAPLE IR are always of one of the primitive types.Integer and address (pointer) types can be specified in decimal or in hexadecimal using the 0x prefix.Floating point types can be specified in hexadecimal or as floating point literal as in standard C.Single characters enclosed in single quotes can be used for i8 and u8 constants.String literals are enclosed in double quotes.For the complex and SIMD types, the group of values are enclosed in square brackets separated by commas.## IdentifiersIn ASCII MAPLE IR, standalone identifier names are regarded as keywords of the IR language. To refer to entries in the symbol tables, identifier names must be prefixed. Identifiers prefixed with '$' are global variables and will be looked up in the global Symbol Table.  Identifiers prefixed with '%' are local variables and will be looked up in the local Symbol Table.  Identifiers prefixed with '&' are function or method names and will be looked up in the Functions Table.  The major purpose of these prefixes is to avoid name clash with the keywords (opcode names, etc.) in the IR. ## Pseudo-registersPseudo-registers can be regarded as local variables of a primitive type whose addresses are never taken.  They can be declared implicitly by their appearances.  The primitive type associated with a pseudo-register is sticky. With the exception that integer and address types of the same size can be associated with a pseudo-register.Because pseudo-registers can only be created to store primitive types, the use of field IDs does not apply to them. Pseudo-registers are referred to by the '%' prefix followed by a number. This distinguishes them from other local variables that are not pseudo-registers, as their names cannot start with a number.The compiler will promote variables to pseudo-registers.  To avoid the loss of high level type information when a variable is promoted to pseudo-registers, reg declarations are used to provide the type information associated with the pseudo-registers.## Special RegistersSpecial registers are registers with special meaning.  They are all specified using %% as prefix.  %%SP is the stack pointer and %%FP the frame pointer in referencing the stack frame of the current function.  %%GP is the global pointer used for addressing global variables.Special registers %%retval0, %%retval1, %%retval2, etc. are used for fetching the multiple values returned by a call.  They are overwritten by each call, and should only be read *at most once* after each call.  They can assume whatever is the type of the return value. ## Statement LabelsLabel names are prefixed with '@' which serves to identify them.  Any statement beginning with a label name defines that label as referring to that text position.  Labels are only referred to locally by goto and branch statements.## Storage AccessesSince MAPLE IR is target-independent, it does not exhibit any pre-disposition as to how storage are allocated for the program variables.  It only applies rules defined by the language regarding storage.In general, there are two distinct kinds of storage accesses: direct and indirect.  Direct accesses do not require any run-time computation to determine the exact address location.  Indirect accesses require address arithmetic before the location can be determined.  Indirect accesses are associated with pointer dereferences and arrays.  Direct accesses are associated with scalar variables and fixed fields inside structures.Direct accesses can be mapped to pseudo-register if the variable or field has no alias.  Indirect accesses cannot be mapped to pseudo-registers unless the computed address does not change.  But since indirect accesses may represent array and matrix elements, there are many optimizations relevant only to indirect storage accesses.In MAPLE IR, **dassign** and **dread**are the opcodes for direct assignments and direct references; **iassign** and **iread** are the opcodes for indirect assignments and indirect references.## AggregatesAggregates (or composites) are either structures or arrays.  They both designate a grouping of storage elements.  In structures, the storage elements are designated by field names and can be of different types and sizes. In this document, the structure designation includes classes and objects in object-oriented programming languages.  In arrays, the same storage element is repeated a number of times and the elements are accessed via index (or subscript).**Arrays**Array subscripting designate address computation.  Since making the subscripts stand out facilitate data dependency analysis and other loop nest optimizations, MAPLE IR represents array subscripting using the special **array** opcode, which returns the address resulting from the subscripting operation.  For example, "a[i] = i" is:iassign<*i32>( array a32 <* [10] i32> (addrof a32 $a, dread i32$i),              # <* [10] i32> indicates pointer to an array of 10 ints dread i32 $i)and "x = a[i,j]" is:dassign$x ( iread i32 <* i32>( array a32 <* [10] [10] i32> (addrof a32 $a, dread i32$i, dread i32 \$j))) # <* [10] [10] i32 indicates pointer to a 10x10 matrix of ints**Structures**Fields in a structure can be accessed directly, but use of **dassign** or **dread** on a structure would refer to the entire structure as an aggregate.  Thus, we extend **dassign**, **dread**, **iassign** and **iread** to take an additional parameter called field-ID. In general, for a top level structure, unique field-IDs can be assigned to all the fields contained inside the structure.  Field-ID 0 is assigned to the top level structure. (Field-ID is also 0 if it is not a structure.) As each field is visited, the field-ID is incremented by 1.  If a field is a structure, that structure is assigned a unique field-ID, and then field-ID assignments continue with the fields inside the nested structure. If a field is an array, the array is assigned only one field-ID.Note that if a structure exists both standalone and nested inside another structure, the same field inside the structure will be assigned different field-IDs because field-ID assignment always starts from the top level structure.Three kinds of structures are supported: **struct, class** and**interface**.A **struct** corresponds to the struct type in C, and is specified by the **struct** keyword followed by a list of field declarations enclosed by braces, as in:struct{        @f1 i32,        @f2 }  # structz is the type name of another structA **class** corresponds to the class type in Java, to provide single inheritances.  The syntax is the same as **struct** except for an additional type name specified after the **class** keyword that specifies the class it inherits from.  Fields in the parent class are also referred to via field-IDs, as if the first field of the derived class is the parent class.  In other words, the parent class is treated like a sub-structure.class {                      # classz is the parent of this class being defined        @f1 i32,        @f2 f32 }Unrelated to storage, structures can contain member function prototypes.  The list of member function prototypes must appear after all the fields have been specified.  Each member function name starts with &, which indicates that it is a function prototype.  The prototype specification follows the same syntax as ordinary function prototypes.An **interface**corresponds to the interface type in Java, and has the same form as **class**, except that it cannot be instantiated via a var declaration, and fields declared inside it are always statically allocated.  More details are provided later in this document.# Instruction Specification In ASCII MAPLE IR, we use parentheses and braces to distinguish operands from the other fields of an instruction, to facilitate visualization of the nested structure of MAPLE IR.  The expression operands of each instruction are always enclosed by parentheses, using commas to separate the operands.  Statement operands are enclosed by braces, with each statement starting on a new line.  MAPLE IR does not allow the use of semicolon to indicate the end of each statement.  Each statement must start on a new line.## Storage Access InstructionsA memory access instruction either loads a memory location to a register for further processing, or store a value from register to memory.  For load instructions, the result type given in the instruction is the type of the loaded value residing in register.  If the memory location is of size smaller than the register size, the value being loaded must be of integer type, and there will be implicit zero- or sign-extension depending on the signedness of the result type. **dassign**syntax: dassign ()\ is computed to return a value, which is then assigned to variable \.  If \ is not 0, then the variable must be a structure, and the assignment only applies to the specified field.  If \ is of type *agg*, then the size of the structure must match.  If \ is a primitive integer type, the assigned variable may be smaller, resulting in a truncation.  If\ is not specified, it is assumed to be 0.**dread**syntax: dread Variable \ is read from its storage location.  If the variable is a structure, then \ should specify *agg*.  If \ is not 0, then the variable must be a structure, and instead of reading the entire variable, only the specified field is read.  If the field itself is also a structure, then \ should also specify *agg*.  If \ is not specified, it is assumed to be 0.**iassign**syntax: iassign  (, )\ is computed to return an address.  \ gives the high level type of \ and must be a pointer type.  \ is computed to return a value, which is then assigned to the location specified by \.  If \ is not 0, then the computed address must correspond to a structure, and the assignment only applies to the specified field.  If \ is of type *agg*, then the size of the structure must match. The size of the location affected by the assignment is determined by what \ points to.  If \ is a primitive integer type, the assigned location (according to what \ points to) may be smaller, resulting in a truncation.  If \ is not specified, it is assumed to be 0.**iread**syntax: iread  ()The content of the location specified by the address computed from the address expression \ is read (dereferenced) as the given primitive type. \ gives the high level type of \ and must be a pointer type. If the content dereferenced is a structure (as given by what \ points to), then \ should specify *agg*.   If \ is not 0, then \ must specify pointer to a structure, and instead of reading the entire structure, only the specified field is read.  If the field itself is also a structure, then \ should also specify *agg*. If \ is not specified, it is assumed to be 0. **iassignoff**syntax: iassignoff   (, )\ is computed to return a scalar value, which is then assigned to the memory location formed by the addition of \ in bytes to \.  \ gives the type of the stored-to location, which also specifies the size of the affected memory location.**iassignfpoff**syntax: iassignfpoff   ()\ is computed to return a scalar value, which is then assigned to thememory location formed by the addition of \ in bytesto %%FP.  \ gives the type of the stored-to location, whichalso specifies the size of the affected memory location. This is the sameas **iassignoff**where its \ is %%FP.**ireadoff**syntax: ireadoff  ()\ must be of scalar type.  \ in bytes is added to \ to form the address of the memory location to be read as the specified scalar type. **ireadfpoff**syntax: ireadfpoff \ must be of scalar type.  \ in bytes is added to %%FP to form the address of the memory location to be read as the specified scalar type. This is the same as **ireadoff** where its \ is %%FP.**regassign**syntax: regassign   ()\ is computed to return a scalar value, which is then assigned to the pseudo or special register given by \. \ gives the type of the register, which also specifies the size of the value being assigned.**regread**syntax: regread The given pseudo or special register is read in the scalar type specified by \.  ## Leaf Opcodes**dread** and **regread** are leaf opcodes for reading the contents of variables.  The following are additional leaf opcodes:**addrof**syntax: addrof The address of the variable \ is returned.  \ must be either *ptr, a32* or *a64*.  If \ is not 0, then the variable must be a structure, and the address of the specified field is returned instead.**addroflabel**syntax: addroflabel
C++
1
https://gitee.com/harmonyos/OpenArkCompiler.git
git@gitee.com:harmonyos/OpenArkCompiler.git
harmonyos
OpenArkCompiler
OpenArkCompiler
master