Zachary R. McKee

Former Teaching Assistant at Illinois Tech

B.S. Computer Science -- 2020

McCormick School of Engineering

Northwestern University

Resume

Huffman Code Text Encoder/Decoder

Source | Back to Project List

Description

Text compression/decompression algorithm showcasing usage of trees, hash tables, object serialization, and algorithm design with performance in mind.

This application takes a text input consisting of ASCII characters, encodes it into a compressed file using the Huffman coding algorithm, and then immediately takes that binary data and decodes it back into plaintext.

Performance Analysis

For the purposes of the assignment this project was developed for, this program is efficient enough. In order to analyze how well it performs with larger data sets, I went to the first large data set I could think of, which happened to be Chicago's reported crime data set. I took progressively larger segments of this data, starting with relatively small samples (a few months' worth of data) and went up from there.

Graph of a linear big-O complexity algorithm.

Based off what this graph is telling us, we can conclude that the long-term complexity of this algorithm is Θ(n). This is desirable, however, as one can also see, the runtime for large data sets is suboptimal (around 24 minutes for 300MB of data), therefore there is a large constant factor.

Flame graph of Huffman Code Text Encoder/Decoder.
A flame graph of my program, showing the expense of decoding.

The two main portions of the program are called at line 21 of main.py (encoding) and line 26 of main.py (decoding), which can be seen in the graph above, which displays the stack frame of the program through its execution.

Clearly, there is room for improvement. As can be seen above, the majority of the time spent is in the decoding portion of the program. The decoding portion uses a lot of recursive function calls in order to search the Huffman Tree as the program reads the encoded file. You can actually see the general shape of this recursion above, as new calls to traverseTree are pushed and popped to and from the stack.

As function calls are relatively expensive, even in C (because a new item on the runtime stack has to be created, stack pointer updated, etc), a good way to approach optimizing this algorithm would be re-implementing the tree-search portion of the program in an iterative way. That way we're not asking the CPU to do all the miscellaneous tasks that are required to call a new Python function.

Another thing that we want to be able to take advantage of when designing algorithms is the way the CPU is physically implemented. Unless Python is doing something truly amazing behind the scenes that I am not aware of, when we use a recursive implementation of tree searches, it's less likely for the data that we care about to be already stored in the cache when we request it (as the data is stored in a stack that potentially contains upwards of 10-20 seperate function calls, versus an iterative notation where we just have a pointer that gets changed as it goes down the tree).