By Rylan O'Connell | February 23, 2021
A year ago, we released a series of blog posts documenting our research into the world of binary hashing. While we speculated about the efficacy of this technique for binary diffing, our primary goal was to recognize similar code between binaries for the purpose of porting annotations from one analyzed binary to another and many of our design choices reflected this end-goal. Luckily, we’ve been given the opportunity to explore how these hashing techniques could be applied to the world of “bindiffing” through DARPA’s Assured Micropatching (AMP) 1 program.
As part of this ongoing research, we have developed NinjaDiff - an open source binary diffing plugin for BinaryNinja. Throughout this blog post, we will be discussing the underlying algorithms and technical design choices made while designing this tool.
Conventional Diffing
Many of our readers are likely familiar with the GNU diff
utility and how it can be used to quickly identify differences in source code.
diff -y file1.c file2.c
...
if (n == -1) if (n == -1)
{ {
perror("recv() failed"); perror("recv() failed");
return EXIT_FAILURE; return EXIT_FAILURE;
} }
> else if (n == 0)
> {
> printf("SERVER: Rcvd 0 from recv(); closing socket...\n
> }
else /* n > 0 */ else /* n > 0 */
...
However, not all may be familiar with specifically how this algorithm functions. At it’s core, this diffing algorithm is effectively just an extension of the relatively well known longest common subsequence algorithm. Although we will not be diving into the inner workings of this algorithm, the important thing to understand is that this algorithm will take n lists as input, and will attempt to find the maximal ordered subset of elements present in both input sets.
For example, given the following inputs:
source: 1, 2, 3, 4, 5, 7, 9
destination: 1, 2, 4, 5, 6, 8, 9
We should get the following output:
common: 1, 2, 4, 5, 9
From this point, it’s fairly trivial to loop through both inputs and compare each element to the “common” subsequence. If an element is present in the source list, but not in the common list, it must have been removed. By the same logic, if an element is present in the destination list but not the common one, it must have been added.
Binary Diffing
While this algorithm is all well and good for diffing text inputs, constructing such an algorithm for binary inputs is much more difficult. A naive approach may be to treat each byte as a token, and construct our inputs by generating a list of each byte in the respective binary. However, such an approach would likely be incredibly prone to error due to the frequency of repeated bytes and the variability in instruction length.
This brings us to an important consideration when diffing binary files - there’s often a great deal of structure that we can glean from a binary such as instruction encoding, basic block structure, or even function call hierarchy. Furthermore, due to certain compiler behavior (such as function reordering within a binary), the individual bytes within a binary lose much of their meaning and we are actually forced to rely on many of these higher level heuristics in order to get the most meaningful results.
However, this gives rise to another surprisingly difficult problem - if we’re forced to rely on high level structures like functions, how do we know which functions to compare? While we developed the tooling last year to detect similar functions through a fuzzy hash, the nature of binary diffing means that we are often looking to compare functions that have had much more substantial changes applied to them than our fuzzy hashing techniques could reliably handle.
Our Approach
Function Alignment
Since our existing function hashing methods were unable to reliably align functions with significant differences, we realized that we would have to develop a more flexible method for function comparison. Instead of simply comparing hashes to determine if two functions were “close enough”, we opted to construct directed graphs mirroring the control flow graph of each function where each node is represented by the fuzzy hash generated by hashashin. This simplifies the CFG considerably since it abstracts away the contents of each basic block, but will still allow us to maintain some information about the “structure” of the function while making our comparisons.
Unfortunately, figuring out how to make these comparisons was not nearly as straight forward. The canonical way to measure the similarity between two graphs is through the graph edit distance, which will try to measure the minimum number of changes need to transform one graph into the other. However, this algorithm has been shown to be NP-complete, meaning that such a distance metric quickly becomes infeasible as the functions being compared grow more complex.
While several approximation algorithms do exist, many of them are still somewhat expensive and often require us to make certain assumptions about the graphs we are dealing with. Although it may not be ideal, we found that a simple weighted sum of common nodes and edges gave relatively good results in linear time. As it turns out, this efficiency proved to be key, since this “distance” based measurement forces us to compare every function to every other function in order to find the pairing with the minimal difference.
Another optimization we chose to take was to limit the minimum size of a function to process. This decision helped us in two ways:
- It saves time diffing thunks/imports/etc.
- It cuts down on false matches between such functions, since most imports have a similar structure where the only way to differentiate between them is through examining addresses (which may not line up between binaries anyway).
Function Diffing
Once such a function mapping has been constructed, our work is still not done though. While we experimented with adding a second layer of fuzzy hashing to “align” basic blocks which may have been re-arranged by the compiler, we found that doing so would add considerable processing time with little to no real world improvement on the accuracy of the tool. Instead, after some experimentation we opted to simply let the Binary Ninja High Level IL handle most of these minor rearrangements so that we could simply loop through (theoretically) pre-aligned instructions one by one.
Unfortunately, due to the way these instructions are implemented in binary ninja, there is no catch all solution to compare instructions. This is because the AST for many instructions includes binary specific information such as true/false jump addresses for conditionals, or variable addresses in memory. To make things worse, many of these more complicated instructions can be nested in the AST, meaning that we can’t simply compare the outermost instruction type. To get around this issue, we had to write a custom instruction comparator to handle the most commonly found “complex” instructions and recursively parse their ASTs to check for non-trivial differences. While our implementation is by no means perfect, we were able to run it over large binaries and firmware blobs without issues.
Known Issues/Future Optimizations
While we are extremely pleased with how our tool has turned out, it still leaves room for improvement. Perhaps the greatest issue we currently face is the mapping of HLIL instructions in Binary Ninja to addresses. Because this mapping is not one-to-one, we often find ourselves in situations where several instructions may be mapped to a single address leading us to “overwrite” the highlight color, or worse, highlight and/or tag extra instructions.
Another possible optimization we would like to accomplish would be to parallelize as much of this code as possible.
At the moment, much of the computation time is spent constructing and comparing graphs, both of which should be fairly trivial to parallelize.
However, Python’s multiprocessing
library uses pickle
under the hood for object serialization, and pickle
is unable to serialize the
ctypes
datatypes which Binary Ninja relies on.
As such we would have to re-load the Binary Ninja database in each process, which would likely undo much of the performance gains we would get from parallelizing these operations in the first place. While we would love to experiment more with this in the future, the performance hit of reloading the database has made it a low priority for us.
Disclaimer
This research was partially developed with funding from the Defense Advanced Research Projects Agency (DARPA).
The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
-
Assured Micropatching (AMP). Dr. Sergey Bratus. Retrieved from https://www.darpa.mil/program/assured-micropatching. ↩︎