GCC and How It Works

John Cook
5 min readFeb 7, 2019

--

What is GCC, and how does it work? GCC is a compiler that was first released in 1987. Originally it was called the GNU C Compiler, as it only worked for the c programming language. Since the modern GCC handles more then just the C language it is now called the GNU Compiler Collection. It is important to note that GCC itself will only compile the C language, you need to use the proper compiler in the collection for the proper type of code. For example if you wanted to compile C++ you would use G++ and not GCC itself. So GCC is a collection of compilers, but the GCC we are interested in is the program GCC itself.

hello world program.
Output of hello world.

We can use the GCC to take source code, and turn it into an executable file. So we can take the language a programmer uses and turn it into something executable like a word processor or a video game. The first step is to create a .c file, with code in it, a simple example is the hello world program all beginner c programmers make. If you look at the provided picture it shows a very simple hello world program. Let me break down the parts of this program for you. First we include the library standard input and output (#include <stdio.h>). A library is pre-written code that contains tools that have been developed by others. Next we declare the function main where all the code goes (int main(void)), we declare it an int cause it will return the integer 0 that tells the computer the program exited properly and did not crash. We do this with the return 0 at the end. We put void in the parenthesis to tell the program that main does not need any input. What about that printf? It tells the program to use the print function form the standard input output library and display the text on the screen. The \n is a new line character it prints a new line. The semicolons tell the program that that line of input is done. Once we have some code we can compile it with “gcc main.c” if our program was called main, all c source code ends with .c. The command “gcc main.c” tells the compiler that we want to compile the source code “main.c” This is handled in 3 main parts: The Front End, the Middle End, and the Back End.

The first step of the Front end is to parse the code, that means go through it line by line and check for errors (like a missing semicolon), and make sure everything checks out. This step is different for every language, but the result is an Abstract Syntax Tree. Think of the AST as a road with different paths, every time a decision is reached in the code a new path forms, paths can converge and diverge depending on the decisions made, but ultimately every decision and variable leads to a path end, or a path loop. Once this is achieved the code is now in a Generic form and can move to the Middle End and GIMPLE.

Once the Generic is passed to GIMPLE it is translated into tuples a tuple has two parts, a header describing the instructions and location, and a body containing the operations to be performed on the data. Tuples contain no more then three operands. Tuples have a hierarchy with 3 main types or classes. The most basic or root tuple is called a “gimple statement base” or gsbase for short. The gsbase contains the basic instructions for the data. The second part of the tuple is “gimple statement with ops” while technically two tuples (“gimple statements with ops” and “gimple statements with ops base”) most people consider this one section. This section deals primarily with memory allocation for the data operations. The third tuple also deals with memory allocation but it expands upon the “gimple statements with ops” and provides additional memory about locations in the memory of the data. It is called “gimple statement with memory ops”.

Once the tuples containing no more then 3 operands are created, gimple then creates temporary storage locations, to hold data to be used later, we call these Temporaries. Gimple will take this data and will convert it to a map, with goto statements, conditional statements like “if blue then sad” and variables. Once all this is done we move onto the next step Static Single Assignment.

Static Single Assignment, or SSA is where the values get calculated, it is common practice in programming to assign a variable a different value multiple times, for example, x = 1 and then x = 3 + y, SSA uses complex algorithms to go through every equation and function and make sure the values are consistent. So basicly we take every version of x and give it a new name, x1,x2,x3 and so forth because there could be a branch in the code where x takes one value if something and another value if something else. This step compensates for that and makes sure everything adds up, it calculates every option and makes sure every option and variable appears exactly once so there is no confusion. SSA also optimizes the code and makes things more efficient if possible. Once this is all done its back to gimple.

Gimple then passes it to RTL which is Register Transfer Level. RTL simulates the hardware that the code will be running on. It detects the logic circuits of the hardware and writes the code accordingly. This step is very hardware dependent, but most chip architectures are supported.

The final step is the Back End, or the machine code. Machine code is the language of the cpu, few people can code in machine code nowadays, as it is common to use a compiler or interpreter to write the code. Also machine code is different from cpu to cpu, meaning that if one cpu is made by Intel and one is made by AMD they will each have different machine code, even an I3 cpu and an I7 cpu have different machine code. Machine code is not binary but it is very close, binary is actually a number base, machine code is coded in binary.

So to compile main.c we run “gcc main.c” which sends it to the Front End which parses the code, sends it to AST, then to Generic, then to the Middle End, Gimple then SSA, then back to gimple then RTL then to the Back End machine code. I hope my simple explanation has made this complex topic more approachable and easier to understand.

I would recommend reading:

https://en.wikibooks.org/wiki/GNU_C_Compiler_Internals/GNU_C_Compiler_Architecture

--

--