CIS 2400 (Fall 2022) HW 10 & 11: J compiler

Students will gain substantial experience writing C code and translating code to assembly.

Introduction

In this assignment you are going to write a small compiler that will transform code written in a new stack-oriented language, called J, into LC4 assembly code much the way that the lcc compiler converter converts C code into assembly.

You are NOT supposed to simulate the LC4 output. The compiler should only generate the corresponding LC4 assembly into a file. Your compiler will be designed to use the same calling convention that lcc does that way your J programs can call functions compiled by lcc and vice versa.

The name of the executable you should generate should be called jc, which is short for J Compiler. This program will act much like lcc does, when provided with a properly formatted source file in the J language it will produce the corresponding assembly code. That is, if we were to run

$ ./jc source.j

in the terminal, then this will produce a new file called source.asm if source.j contains an acceptable j program, otherwise some helpful error messages should be printed instead.

This assignment will be autograded using gradescope

Collaboration

For assignments in CS 2400, you will complete each of them on your own or solo. However, you may discuss high-level ideas with other students, but any viewing, sharing, copying, or dictating of code is forbidden. If you are worried about whether something violates academic integrity, please post on Ed or contact the instructor.

Setup

If you haven’t already, you need to follow the VMWare setup. We recommend you try and figure this out ASAP.

Once you have the VM setup, you should boot it up, and download the following .h file. To download it, you should open the terminal (which should be an application on the right hand side of the desktop) and use the provided command:

Note that this assignment only provides a single .h file and no other files. You will have to modify this file and create other files (including a makefile to finish this assignment). More details on how to structure your code is in the relevant section below.

There are also separate files that can be used for testing your program. These are mentioned in the testing section below.

The J Language Overview

The J language is a stack-oriented language loosely inspired by Forth. You may want to look at the following Wikipedia article for a quick description of stack oriented languages, but this is optional: http://en.wikipedia.org/wiki/Stack-oriented_programming_language.

In a stack oriented language all operands are passed on the stack and most operations end up manipulating the stack. Here are a few quick examples of J code.

Example Program 1

7 5 4 + *

This program starts by pushing the values 7, 5 and then 4 onto the stack (stack state: 7 5 4 – note that the rightmost element in this list, 4, is viewed as being at the top of the stack). Then the program pops the top two elements (5 and 4) adds them and pushes it onto the top of the stack (stack state: 7 9). The last thing this program does is pop the top two values of the top of the stack, multiplies them and pushes the result onto the top of the stack (stack state: 63). This type of operational syntax is sometimes referred to as Reverse Polish Notation (RPN) and you see it used on many revered HP calculators. You may be familiar with this format from the previous homework where we asked you to implement an rpn calculator.

Example Program 2

This next example introduces comparison operators, if statements, and comments. This example is a bit more complex, so if it doesn’t make sense right now, that is fine. Once you have read the full description of the J language though, you should be sure you understand what is going on in this example.

5 7 gt if 12 else 25 endif    ; this is a comment

This program starts by pushing the values 5 and then 7 onto the stack (stack state: 5 7).

The next token in the program gt will compare the two topmost elements on the stack with the top most element being the first operand and the second top most element being the second operand, resulting in doing 7 > 5 in this example. As part of doing this operation gt will pop the two elements being compared off of the stack and then push the result of the comparison onto the top of the stack, with the value 1 being used to represent true and 0 for false, so after gt in our example program the stack will just contain 1 (stack state: 1).

The next token in the program is an if token, indicating the start of an if statement like seen in other programming languages. The if statement will start by popping the top element of the stack and checking if that value is zero or non-zero. If it is non-zero, the code in-between the if and its corresponding else will be executed. In this case, the top value is non-zero so the code pushes 12 onto the top of the stack (stack state: 12). After pushing 12, we will ignore the else branch and start executing code after the endif which marks the end of the if statement. The only thing after the endif though is a ; which marks the start of a comment. Once a ; has been read, the rest of the line can be interpreted as a comment and won’t be used for executing the program. If the top value of the stack had instead been zero when the if token was reached, the code between the else and the endif would have been executed and 25 would have been pushed instead of 12.

Token Structure

As you can see from the examples, a J source file can be thought of as a sequence of tokens separated by whitespace characters (spaces, tabs, newlines, etc.). Tokens can denote literal values or operations. There are no parentheses or square brackets or curly braces to denote program structure like there is in C, Java or other programming languages. There are just tokens and comments in J. You will want to write your program such that it reads tokens sequentially from a source file and generate the corresponding assembly for each token as you read it. More details on parsing the file in the Code Structure Section below.

Operands

For simplicity, all operands in the J language will be 16-bits. There are no other data types.

Literals

Tokens consisting entirely of digits optionally preceded by a minus sign are interpreted as decimal numbers and your compiler should output LC4 assembly code to push these values onto the stack in the order they are encountered. Additionally, you should implement support for hexadecimal literals which will begin with 0x followed by a sequence of hexadecimal characters that could be either upper case or lower case (e.g. 0xD6DE or 0xd6de). You should read these hex values and then push them on the stack just as you would a decimal literal. You should assume that the decimal number can be any number that can be represented in 16-bit 2C and that the hexadecimal numbers are at most 4 hexadecimal digits (e.g. at most 16-bits). Hint: You will have to be careful about how you do this with the LC4 code your compiler generates since there are limitations on the size of integer immediates in LC4.

Arithmetic Operations

The J language has tokens that denote the basic arithmetic operations provided by the LC4 assembly language.

J Operator Corresponding LC4 Operation
+ `ADD`
- `SUB`
* `MUL`
/ `DIV`
% `MOD`

In each of these cases, the top value of the stack is popped off and used as the first operand (stored in Rs in LC4) and the next element on the stack is popped off and used as the second operand (stored in Rt in LC4). Those two values are then used to perform the specified operation and the result is pushed onto the stack. Please be mindful of the ordering of operands since this can affect the output of some operators (e.g. /, -, and %).

Comparison Operators

Let a denote the element at the top of the stack and b denote the next element. There are 5 comparison operators in the J language and they are listed below.

J Operator Meaning
lt `a < b`
le `a <= b`
eq `a == b`
ge `a >= b`
gt `a > b`

For these operators, the top two values of the stack are popped off the stack and the result of the comparison is pushed onto the top of the stack. Note that the result of this comparison should either be a 1 indicating the result is true or a 0 indicating the result is false. Also note that the values a and b are viewed as 2C values for the purposes of these comparison.

Logical Operations

The J language supports the 3 basic binary Boolean operations through the tokens and, or and not.

The and and the or operations pop the two top values on the top of the stack, compute their respective operations in a bitwise manner using the corresponding LC4 instructions and then place the result back on the top of the stack. The not operation does much the same thing, but it only consumes the top element of the stack.

Stack Operations

Stack oriented languages always have operations to manipulate the state of the stack. The J language provides many of the basic stack operations offered by Forth and these are listed below.

We will also provide the result of running each operator on a stack that starts in the state 5 7 -6 3

J Operator Meaning Result of Running on Sample Stack
`drop` Drops the top element of the stack `5 7 -6`
`dup` duplicates the top element of the stack `5 7 -6 3 3`
`swap` swaps the top two entries of the stack `5 7 3 -6`
`rot` rotates the top 3 elements of the stack `5 -6 3 7`

Lastly, there is stack operation argN associated with getting the arguments to a function. Functions will be setup in the same way they are done in LC4 with lcc as covered in lecture. More on functions in the functions section below.

argN copies an entry from the corresponding argument location in memory and pushes that argument’s value onto the top of the stack. For example, arg1 copies the first argument, the one right below the return value (RV) of the current stack frame, and pushes it to the top of the stack. arg2 will copy the argument that is one memory location below where the first argument is stored on the stack. This pattern repeats for arg3 and so on. The value N can range from 1 to 20 ie arg1, arg2, arg3, … arg19, arg20 are all legal commands.

If Statements

The J language supports if .. else .. endif constructs as mentioned earlier. As in other languages the else clause is optional. The syntax is as follows:

if <block A> else <block B> endif

When the if statement is executed, the top element of the stack should be popped off and checked to see if it is zero or non-zero. If the top element was non-zero, i.e. true, then the statements in block A are executed. If the top element was zero, i.e. false, then the statements in block B are executed.

If you take a look at how the LC4 assembly is generated for this by lcc when compiling an if statement in C, you will see that the if statement is compiled using some variant of the CMP and BR instructions to do the comparison and conditionally branch to the appropriate unique labels.

In your J compiler, you probably want to follow a similar structure to the code shown in lecture and generated by lcc. Also with J, you need to keep in mind that there are multiple if statements and your compiler must be sure to generate unique labels in the assembly code for each one. One way to keep unique labels is to keep a count of the number of if tokens that you have encountered as you parse the file from start to finish.

Your compiler will also need to be able to handle nested if statements. This is a rather complex task to manage, so here are hints for two possible ways to use this:

While Loops

The J language supports while loops. To support the while loop, there are two tokens: WHILE and ENDWHILE. The syntax is as follows:

while <block> endwhile

When the while statement is first executed, the top element of the stack should be popped off and checked to see if it is zero or non-zero. If the top element was non-zero, i.e. true, then the statements in the block between WHILE and ENDWHILE are executed. Once the block has finished executing and ENDWHILE is reached, the code should jump back to the start of the loop and execute the code for the WHILE token again (e.g. pop the top value of the stack again and check to see if it is zero or non-zero). If the top element was zero, i.e. false, then the program should start executing the next token after ENDWHILE.

Similarly to the If statements mentioned in the previous section, your compiler will need to generate unique labels for each while loop encountered in the program and need to handle nested structures. It is possible for if statements to show up inside a while loop and while loops to show up in an if statement. It is also possible for a while loop to have while loops inside of it. The tactics used for handling unique labels and nested control structures mentioned in the If Statments section above will likely also be useful for your code to generate while loops.

Comments

As we mentioned earlier, we will use comments in the J code to document the design. If you encounter a semi-colon in the file all content between that point and the end of the line is regarded as a comment and is ignored by the compiler. For example:

1 2 3 4 + - ; This is a comment - everything up to the end of the line

Functions

Like C, J is intended to operate as a procedural language which means that all of the code needs to be packaged in functions which will be turned into blocks of code just the way lcc does. Function definitions are a bit simpler in J since we can assume that the calling context has already pushed any required arguments onto the stack before calling the function, hence there is no need for an argument list.

Function definitions are started with the defun token as illustrated in the following example:

defun square
arg1
dup *
return

In the above example, the first line has the defun token which indicates the start of a function and then an identifier token to name the newly defined function, which in this case, will result in the function being called square. The next line of the function just has arg1 which will fetch the first (and only) input argument to the function from below where the frame pointer, return address and return value (FP, RA and RV) are stored in the current stack frame and then push the value of that argument onto the stack. The third line consists of two tokens, dup and * which will duplicate the argument fetched in the previous line and then multiply that argument with its duplicate, pushing the result onto the stack. The last line consists of the return token which will copy the top value of the stack (which is the result of the multiplication operation) and use that as the return value of the function and return from the function. If it helps to understand what is going on in this function, you can almost think of the code snippet above as being similar to the following C code:

int square(int arg1) {
  int n = arg1;
  int res = n * n;
  return res;
}

Once the function has been defined, we can invoke it by just using a token that is the name of the function. For example:

4 square

The code above pushes the value 4 onto the stack and that value will implicitly be used as the argument to the square function. Then, the function sqaure is called and returns the value 16, which should be used as the new top of the stack. Assuming that the stack was empty before the above lines of code were executed, the stack should contain the value 4 at the botton and 16 at the top after the above code is executed.

Function names must start with an alphabetical character and can consist of alphabetical characters, underscores ‘_’ and digits only. These identifiers are case sensitive. Function names can also not used reserved strings like gt or defun which already serve other purposes in the J language.

The return token is used to copy the value at the top of the stack into the return value slot of the current stack frame. It also invokes the function epilogue so that the stack and PC are restored so that the calling function can resume execution. This epilogue should follow the same conventions as the one used by lcc. Note that every function should end with a return statement at some point.

Functions are invoked with a token that is the name of the function to invoke as shown above. Once again you are expected to use exactly the same calling convention that the lcc compiler does where R5 acts as the frame pointer and R6 acts as the stack pointer. The only small difference is that when you generate code to call a subroutine you should ensure that once the function has returned the return value of that function is included at the top of the stack instead of being just beyond it as is the case with lcc. This is easily accomplished by modifying the stack pointer after the subroutine has returned. Remember that it is the calling functions job to clean up arguments and return values as it sees fit once the subroutine has returned. In the J language we will need to handle this explicitly in the code – that is the jc compiler should not automatically generate code to pop the return value or the function arguments.

Note that since we are adopting the lcc calling convention we should be able to call functions written in C and compiled with the lcc compiler from within our J programs and vice versa. This will be a useful way to add functionality to our system

Instructions

For this assignment, you will be writing a compiler that will compile (i.e. translate) a J source file into an LC4 file. This means that the code you write is sort of “writing” LC4 assembly. Note that your code should not actually execute any of the LC4 assembly, simply output the LC4 to an output file that PennSim will assembly into an object file and run.

This project is technically split into two assignments HW10 and HW11, this is because this project can take a lot of time and you can think of HW10 as being a “checkpoint” for HW11. In other words, both HW10 and HW11 use the same autograder, and HW10 and it’s deadline is to give you some guidance on where you should start and how to pace yourself.

For HW10, you will have to implement:

For HW11, you will have to finish the code by implementing ASM generation for:

This includes handling nested control structures and unique labels for any jumps/branches

Suggested Ordering

In order to give you a couple of ideas about how to get started in thinking about and implementing this program we have provided you with a header file called token.h. This file defines a type called token which is intended to represent the kinds of tokens you will encounter in a J program. It also declares a function called next_token which reads the next available token from the file. It should return true if it was successful and false if it encountered an error. More details on the token struct and the next_token function can be found in token.h.

You should probably start with implementing the next_token and any helper functions necessary for it in a new file token.c. After you do, you should create a jc.c file with a main function that can be compiled into an executable jc and a makefile that can compile jc.c into jc and token.c into token.o. Once you have done this, you should be able to upload all 4 files to the gradescope autograder to test your next_token implementation. Optionally, you can implement the optional function specified in token.h called print_token to be used for debugging your code locally and have your main function in jc.c open a file, read it with next_token and then print out the token with print_token.

Once you have next_token working, you will have to start implementing ASM generation based on the tokens that are read. Note that once you get next_token working and start working on other aspects of the project you will probably have to modify your makefile and jc.c as you add more functionality.

You should be able to compile your code into the executable called jc and run it like shown below:

$ ./jc source.j

Doing this should produce a new file called source.asm if source.j contains an acceptable j program, otherwise some helpful error messages should be printed instead. You should modify jc.c’s main function to start supporting this.

The way the compiler proceeds is that it repeatedly reads tokens from the file and for each token it generates the requisite assembly. You can look at what the LCC compiler does with your C files for inspiration. Here is the pseudo code for body of the compiler – note how short it is, at least conceptually.

while next_token():
  generate_asm()

The hard part is that you have to design your own version of the function that generates assembly, including the parameters it takes, and how it handles all of the different tokens. Note that this pseudo code also left out handling the command line arguments to the jc program, handling potential errors, and other details your code will need to handle.

When starting to write code that generates assembly based on tokens, we suggest you start with the LITERAL tokens, the arithmetic tokens, and the bitwise logical operation tokens. Once you those implemented try to run your program on ASM files that only have those token types and see if your ASM output is correct. You should be able to see if your code passes a few tests on gradescope.

Once you have LITERAL tokens, arithmetic tokens, and bitwise operator tokens working, you should incrementally add support for more tokens until you have supported everything.

Error Handling

Your program should be able to handle any well-formed J program so that’s what you should focus on. We will not be doing a lot of testing with malformed J programs so we are not expecting extensive error handling to catch every possible problem. That being said you probably want to do some basic error handling like checking for illegal tokens that you can’t process. Similarly, you may want to check for missing return statements and if statements that aren’t properly delimited with endifs. These error checks aren’t required but they have greatly helped in the past with debugging code, so we encourage you to add more error checks if you can do so easily.

Code Structure

Parsing

As part of implementing next_token, you will have to parse the tokens that are contained in the J file, which is stored as a text file. It may help to go to the testing section and download the sample J files to better understand the layout of a J source file. Your next_token function will want to read the next token available in the file (recall that the FILE* maintains an internal position to the file, and reading from the file advances that location by the amount read). For the sake of simplicity, you can assume that no token will contain no more than 250 characters. Remember that you need to handle comments as specified in the comments section above. Your next_token code will have to handle the fact that any number of whitespace character can separate tokens. Note that whitespace can include more than the space character, it can include new line characters, tab characters, and various others that are used in the sample J programs to format the code to be more readable for humans. You may fund various functions in <string.h> and <ctype.h> useful. Some of these functions are listed below.

Generating the ASM

As a hint to your ASM generation, recall that J is a stack-based language and involves a lot of manipulating of things on this stack. The LC4 asm that your compiler outputs should probably manipulate the stack portion of memory like is done in lecture with the topic of “register spilling” and “temporary data” to simulate what is going on in J. You will likely want to use fprintf() to print the LC4 code to the output file.

File Structure

As part of this assignment, we are requiring you to split your code up into multiple files so that you master the process of building programs and writing Makefiles. At minimum your code should use:

You should at least use one header file token.h. You can use more files if you want to, but you cannot use fewer. We highly suggest that you create another c file called asm_gen.c, compiler.c or some other similar name that will contain all the functions to generate LC4 assembly based on J tokens. Most students end up creating an additional .c and .h file as it helps a lot with organizing their code. Your final executable should make use of some of the functions placed in token.c and any other .c files you create .

Makefile

You must also write and include a Makefile named makefile that builds your program from the source components. Failure to include a working Makefile will result in all tests failing.

The executable that your Makefile produces must be named jc so typing make jc at the command line should make the final executable. Your Makefile should build intermediate object files for each .c file instead of just building the program all at once and rebuild targets accordingly when their source .h and .c files are updated. Your Makefile should also contain the phony target clean so that when you type in make clean it removes all object files and the jc executable (and nothing else, be sure to not accidentally delete your .c or .h files).

Your Makefile should also compile using the gcc compiler and use the -Wall flag at each step to enable all warnings. If you want to use gdb or valgrind to test your code, you should also compile with the -g flag so that debugging information that is used by these programs are stored in the compilation outputs.

The autograder will be testing to make sure that your makefile builds the program as described above.

Extra Credit Challenge

prime.j

An additional J program prime.j is provided and is a rather complicated J program. Due to the length of the body of the else branch and one of the while loops, it is common for the resulting LC4 assembly code that is generated by jc to not function. This is due to the constraint on how big an offset the branch and jump instructions can support. Your task for this extra credit is to make sure that your compiler outputs concise enough instructions for each token so that prime.j can be compiled, assembled, and run in PennSim. This task will be autograded on gradescope.

Towers of Hanoi

You can score up to 20 extra credit points by writing a program that, when run on PennSim produces a graphical animation of a solution to the Towers of Hanoi problem using at least 5 discs. i.e. When your program runs it should produce a nice animation on the PennSim graphics display of discs moving about in an orderly manner to solve this problem. The constraints are as follows:

On the bright side the libc and os code that was provided contains implementations of useful functions like lc4_get_event, lc4_draw_rect and lc4_reset_vmem.

Compiling and Testing

To compile your code for this assignment, you will have to create your own Makefile (as described above). We suggest looking at the makefile you did in HW08 or shown in lecture slides for a starting point on how you should create your own.

Testing

We provide some .j files and their corresponding script files to compile it and run it in PennSim that you can use to test your program as a whole. To utilizes these test cases, you should download the zip in the terminal with:

$ wget https://www.seas.upenn.edu/~cis2400/22fa/projects/code/j_tests.zip

and then unzip the download with:

$ unzip j_tests.zip

This wil give you various .j and ‘.txt’ PennSim script files (e.g. simple1.j and simple_script.txt). The .j files are example inputs into your program and the corresponding .txt are used to load your output assembly file into PennSim.

For the testing your token parsing, we highly recommend that you implement the print_token function described in token.h and then write a small program that will read a j file and print the tokens to an output file. After running the program, check that it works for all of the supplied J programs

To test your LC4 assembly generation, you should check that running the code output by your jc compiler on PennSim has the expected behaviour based on the initial .j file.

Compiler Warnings

Sometimes when you compile a C program it will issue warnings. These are the compilers way of telling you that your code is not completely clear, in order to compile your program, the compiler had to make some guesses about what you intended which may or may not have been correct. A lot of people figure that if they don’t see an error everything must be fine but that is not a good way to program.

For this assignment you are required to compile all of your code with the –Wall option which turns on all warnings. Furthermore, if we run your makefile and we see any compiler warnings we will be deducting points. It is your job to make sure that all of the warnings and errors are dealt with before you submit your code.

Coding Environment Differences

Note that there are several subtle and annoying differences between C compilers on different machines. For this assignment you are expected to use the gcc compiler in the VM we provided. The TAs cannot and will not be responsible for getting code to run on the wide variety of platforms and compilers in use today. More specifically the TAs will not be responsible for answering questions of the form, “how do I get <fill in the blank> to run on Windows/Mac/etc.”. Because of the differences between compiler implementations and C libraries on different operating systems getting something to compile and run on one system does not necessarily guarantee that it will work on a different machine. You should plan on making absolutely sure that your program will compile and run correctly on VM, which is the same environment we will be testing your code on. The safest way to do that is to develop on that platform.

Valgrind

We will also test your submission on whether there are any memory errors or memory leaks. We will be using valgrind to do this. To run it yourself, you should try running: valgrind --leak-check=full ./jc simple.j. Note that you should try this on other j files as well.

If everything is correct, you should see the following towards the bottom of the output:

==1620== All heap blocks were freed -- no leaks are possible
==1620==
==1620== For counts of detected and suppressed errors, rerun with: -v
==1620== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If you do not see something similar to the above in your output, valgrind will have printed out details about where the errors and memory leaks occurred.

Standard C libraries

In order to program effectively in C you will want to begin to familiarize yourself with the Standard C Libraries. You can find a useful reference to them many places online, though ones that we have liked include:

These utilities are packaged into collections of functions that you can call from your code. In order to avail yourself of these routines you should #include the relevant header files at the beginning of your program like so:

#include <stdio.h>
#include <ctype.h>

Here are some standard C library routines that you may want to look at for this assignment:

The list is only suggestive not comprehensive, and feel free to use other functions that you find in the standard libraries.

Submission

Please submit your completed code files to Gradescope