CIS 2400 (Fall 2022) HW 08: LC4 Dissassembler

This assignment gives students expereince writing C modules, dealing with binary file I/O and bit manipulation.

Introduction

In this assignment, you will be writing a program to read in and decode LC4 assembly files, the ones produced by the as command in PennSim.

This assignment will be autograded using gradescope

This assignment will allow you to exercise a few aspects of the C language including

Collaboration

For assignments in CS 2400, you will complete each of them on your own or solo. However, you may discuss high-level ideas with other students, but any viewing, sharing, copying, or dictating of code is forbidden. If you are worried about whether something violates academic integrity, please post on Ed or contact the instructor.

Setup

If you haven’t already, you need to follow the VMWare setup. We recommend you try and figure this out ASAP.

Once you have the VM setup, you should boot it up, and download the following .h file. To download it, you should open the terminal (which should be an application on the right hand side of the desktop) and use the provided command:

Note that this assignment only provides a single .h file and no other files. You will have to modify this file and create other files (including a makefile to finish this assignment). More details on how to structure your code is in the relevant section below.

There are also separate files that can be used for testing your program. These are mentioned in the testing section below.

Instructions

Your goal in this assignment is to read and disassemble one or more LC4 object files and produce an ASCII text file continaing what the initial values of program and data memory should like like if those object files were loaded as a program to run. In order to do this, you need to understand the format of the object files produced by the PennSim assembler. Basically, each object file indicates how various locations in the 64K LC4 memory should be initialized at the start of a program. This includes specifications for both instructions, data, and other information as detailed in the next section.

Object File Format

It is important to note object files are binary files that cannot be read by a human the same way we can open a code file and read that. To help view the the layout of an object file, the unix utiliy hexdump may be useful. See the hexdump section for more information. The hexdump section also includes an example of what you could see in an obj file if you looked at it with hexdump.

In this section, we are detailing the file format for LC4/PennSim object files. These binary files are section based, and there are five kinds of sections: code, data, symbol, filename, and linenumber. Each section starts with a fixed size header and may have a body trailing it that would be of some variable length. After one section finishes, another could start and there can be multiple instances of a section in an object file

Note that for the descriptions below, a word is 16-bits (2 bytes) and a character is 8-bits (1-byte).

Here are the formats of each of the sections:

Although you need to recognize and parse all sections correctly, only the code and data sections actually carry information that is used to populate LC4 memory. Since the sections can be interleaved, the symbol, filename, and line number sections only need to be read so that the data and code sections can be read and parsed. You do not need to do anything with the Symbol, File name, and Line number sections once you have read them in.

Endianness

One word of warning about LC4/PennSim object files. They are “big-endian”. What does this mean?

Well, the fundamental units of memory and file storage are bytes or char’s (8-bit numbers). Many data types (short’s, int’s) occupy multiple bytes. For instance, you can think of a 2-byte short as byte containing bits 15:8 and another byte containing bits 7:0.

So what is this “big endian” deal? Well, “big endian” just says that multi-byte data-types are represented in files and in memory in mostsignificant-byte to least-significant byte order. So the short value x1234 looks in memory and in a file like x12 followed by x34.

So why does this not go without saying? Because there are some platforms which are “little-endian” and on these platforms, the value x1234 is laid out in memory and in files as x34 x12. And, wouldn’t you know it x86 and ARM (the archiectures of your computers and VMs) are “little-endian”. So when you fread 16 bits from an LC4 object file on an x86 host, you have to swap the bytes to get the value you expect. You can look at how sample obj files are formatted with the hexdump tool.

Also note that there are functions to help with converting the byte ordering of data and some are mentioned in the libraries section below.

Output File Format

Your output file should list the non-zero contents of LC4 memory after all of the object files have been loaded starting from address 0. Restricting our attention to non-zero entries will make the resulting files significantly more readable.

Your output file will consist of a sequence of lines listing the address as a 4 bit hex value followed by the contents of that memory location also as a 4 bit hex value. Below is a portion of a sample output describing data memory, code memory is special and will be discussed in a little bit. Here is a sample output for the data portion of memory:

2020 : FADE
2021 : 7EFA
2022 : ABE0
2031 : FEB0
2032 : 5634

Note that while the address values increase monotonically they need not be sequential since the contents of many memory locations will be zero.

In addition to printing out the contents of memory you are required to pay special attention to memory locations corresponding to code sections. You can consult the course slides for details on which address ranges in the LC4 memory map correspond to the user and operating system code sections – you must handle both. For memory locations in these ranges with non-zero entries you must not only print out the memory contents but also decode the corresponding instruction.

For example, if the memory location at address 0020 contained the hex value 9A0A you would print the following line for this entry:

0020 : 9A0A -> CONST R5, #10

Similarly, if the memory location at address 8210 contained the hex value 09FC youwould print the following line for this entry:

8210 : 09FC -> BRn #-4

For each instruction type you would print out the corresponding mnemonic and format found on the LC4 instruction sheet. Here are some examples of instruction strings:

ADD R5, R3, #-1
NOT R7, R1
ADD R5, R3, R2
BRn #-4

The values in immediate fields in the instruction should be printed out as decimal values preceded by the # symbol. Immediate values that are being sign extended should have the appropriate value and sign. Unsigned values will, of course, be positive or zero. If you feel that the instruction cannot be decoded into a legal LC4 instruction you should print out the string “INVALID INSTRUCTION” after the memory contents for that entry. Please include commas between elements as shown.

In order to decode the instructions, you will want to make use of various C operators for manipulating bit fields. Operators such as &, |, << and >> can be used to slice and dice 16 bit values as necessary. The end result of your parsing should be a text file that reflects the assembly code that was originally compiled into the object files. You can see an example input and output file in the Compiling and Testing section below. Of course, this version will have explicit offsets in the Branches and Jumps instead of labels.

Code Structure

In this assignment, we are not providing as much starter code as we have provided in past assignments. As a result, you will have to split your program into files yourself and decide what goes in them. In this section we will detail the expected behaviour of your program, and various details that the code you submit must follow.

Overall Program Behaviour

Your program will be invoked from the command line as follows:

$ ./disas output_filename first.obj second.obj third.obj

The first command line argument to your program (after the program name iteslf) is the name of the file you should output your results to. The remaining arguments are the names of LC4 object files which you should load and decode. You should load the object files in the order they appear in the command line so if a later file overwrites some of the memory locations specified by a previous file the later files values will be the correct result.

Internal Structure

In order to implement your program, you will want to maintain an array of (2^16) 64K ie 64 x 1024=65536 entries corresponding to the entire contents of LC4 memory. Before any object files are loaded you should clear all of these entries to zero.

We are providing you with a source file called decode.h which contains the definition of a structure you will use to represent instructions. The INSTR type has an enumerated field indicating the type of the instruction, 3 fields corresponding to the Rs, Rt and Rd fields of the instruction along with a field to store the immediate field of the instruction.

In order to complete the assignment you will have to provide a file called decode.c that provides implementations of two functions decode_instr() and print_instr().

The decode_instr() function takes as input a 16 bit value and returns an INSTR structure with the fields filled in appropriately. For example if the 16 bit field correspond to an MUL instruction the INSTR structure return should have the type field set to MUL and the Rd, Rs and Rt fields filled in with the corresponding values from the instruction. Note that the MUL instruction does not have an immediate value and the STR instruction does not have an Rd field. If an instruction does not use a particular field in the INSTR struct you do not have to fill it in. If you feel that the 16 bit input to the function is not a valid LC4 instruction the type field should be set to ILLEGAL_INSTR. Note that when you fill in the immediate field of the INSTR struct you are responsible for sign extending or zero extending the field to get the correct result. This can be done by making appropriate use of bitwise operators like &, and |.

Your print_instr() function should print out a readable version of the instruction in the format given in the previous section to the specified file. You are strongly encouraged to add additional helper functions to your decode.c file to implement decode_instr(). We plan to test the functions that you write by compiling your decode.c file against our own test code so please make sure that all of the code you need for these functions is in this file. As a result, you should not modify the existing code in decode.h, but feel free to add anything if you feel it is necessary. This also means that your main() function can not be in decode.c.

As part of this assignment we are requiring you to split your code up into multiple files so that you master the process of building programs and writing Makefiles. Your code should be split across:

Makefile

You must also write and include a Makefile named makefile that builds your program from the source components. Failure to include a working Makefile will result in all tests failing.

The executable that your Makefile produces must be named disas so typing make disas at the command line should make the final executable. Your Makefile should build intermediate object files for each .c file instead of just building the program all at once and rebuild targets accordingly when their source .h and .c files are updated. Your Makefile should also contain the phony target clean so that when you type in make clean it removes all object files and the disas exectuable (and nothing else, be sure to not accidentally delete your .c or .h files).

Your Makefile should also compile using the gcc compiler and use the -Wall flag at each step to enable all warnings. If you want to use gdb or valgrind to test your code, you should also compile with the -g flag so that debugging information that is used by these programs are stored in the compilation ouptuts.

The autograder will be testing to make sure that your makefile builds the program as described above.

Compiling and Testing

To compile your code for this assignment, you will have to create your own Makefile (as described above). We suggest looking at the makefile provided with HW07 and shown in lecture slides for a starting point on how you should create your own.

Testing

Gradescope will have public test cases for students to test thier decode_instr and print_instr functions. Aside from that, we provide some .obj and their corresponding output files that you can use to tset your program as a whole. To utilizes these test cases, you should download the zip in the terminal with:

$ wget https://www.seas.upenn.edu/~cis2400/22fa/projects/code/hw8_tests.zip

and then unzip the download with:

$ unzip hw8_tests.zip

This wil give you various .obj and .txt files (e.g. divide.obj and divide.txt). The .obj files are example inputs into your program and the corresponding .txt file is the sample output. Note that for the rectest, strtest, public-test_checkers_img, public-test_trap_rti_ldr_str, rubik, user_echo, user_square, and wireframe test cases, your program needs to load in os.obj in addition to the object file corresponding to that test case.

Once you have run your program and gotten output, you can compare it to the provided sample output. You may find it useful to use the diff command to do this. diff will compare the two provided files and print out the differences between them. If no differences were detected, nothing is printed out.

Below is an example of running the diff command in the shell.

$ diff file1.txt file2.txt 

Compiler Warnings

Sometimes when you compile a C program it will issue warnings. These are the compilers way of telling you that your code is not completely clear, in order to compile your program the compiler had to make some guesses about what you intended which may or may not have been correct. A lot of people figure that if they don’t see an error everything must be fine but that is not a good way to program.

For this assignment you are required to compile all of your code with the –Wall option which turns on all warnings. Furthermore if we run your makefile and we see any compiler warnings we will be deducting points. It is your job to make sure that all of the warnings and errors are dealt with before you submit your code.

Coding Environment Differences

Note that there are several subtle and annoying differences between C compilers on different machines. For this assignment you are expected to use the gcc compiler in the VM we provided. The TAs cannot and will not be responsible for getting code to run on the wide variety of platforms and compilers in use today. More specifically the TAs will not be responsible for answering questions of the form, “how do I get <fill in the blank> to run on Windows/Mac/etc.”. Because of the differences between compiler implementations and C libraries on different operating systems getting something to compile and run on one system does not necessarily guarantee that it will work on a different machine. You should plan on making absolutely sure that your program will compile and run correctly on VM, which is the same environment we will be testing your code on. The safest way to do that is to develop on that platform.

Valgrind

We will also test your submission on whether there are any memory errors or memory leaks. We will be using valgrind to do this. To run it yourself, you should try running: valgrind --leak-check=full ./disas.

If everything is correct, you should see the following towards the bottom of the output:

==1620== All heap blocks were freed -- no leaks are possible
==1620==
==1620== For counts of detected and suppressed errors, rerun with: -v
==1620== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If you do not see something similar to the above in your output, valgrind will have printed out details about where the errors and memory leaks occurred.

Other Tools and Hints

Hexdump

hexdump is a unix utility that can be used at the command line to see the contents of a binary file in a more human readable format. To see the utility of this tool, below we have included what it would look like if we tried to open multiply.obj with a text editor:

hw8_binary

You may be able to notice some human readable things in the file like the word “END”, but it is largely unreadable. If we were to instead run the command

$ hexdump -C multiply.obj

we would see the following printed out to the terminal:

hw8_text

From here, the binary file is a lot more readable. Below is an example of someone interpreting the object file “by hand”:

hw8_labeled

If you would like to store the output of the hexdump operation into a file, you can run:

$ hexdump -C multiply.obj > dump.txt

Where dump.txt is the file where the output will be stored. You can also do this for the other .obj files in this assignment

Standard C libraries

In order to program effectively in C you will want to begin to familiarize yourself with the Standard C Libraries. You can find a useful reference to them many places online, though ones that we have liked include:

These utilities are packaged into collections of functions that you can call from your code. In order to avail yourself of these routines you should #include the relevant header files at the beginning of your program like so:

#include <stdio.h>
#include <ctype.h>

Here are some standard C library routines that you may want to look at:

The list is only suggestive not comprehensive, and feel free to use other functions that you find in the standard libraries.

One exception to this is that uint16_t must be used since it is in the provided decode.h file. uint16_t is not talked about in lecture but it is just an unisigned integer data type that is a fixed size (16-bits). There are other similar types like int16_t for the signed counterpart and versions for other bit sizes like uint32_t (unsigned 32-bit integer). You can use all of these as integer types. The reason we use these types for this assignment is that the size of things is very important since various data we read from the object files are of fixed size and normal C types (e.g. int, short) may have different sizes on different machines.

Submission

Please submit your completed code files to Gradescope