Note: I struggle with academic language so I might have misinterpreted parts of the thesis, so good luck, friend.

1.0 Intro

The thesis is broken down into 5 major sections, blah blah blah.

2.0 Joern + Code Property Graphs

  • Need an expressive way to represent source code and work on it using ML.
  • Code Property Graph – a joint representation of the program’s syntax, control flow, and data flow. For the examples, the thesis uses a C function named foo, which reads attacker-controlled data into a variable x and calls sink only when the check x < MAX holds.


  • Abstract Syntax Trees (ASTs) are used to represent the program’s syntax


  • A Control Flow Graph (CFG) is used for representing the program’s control flow, i.e. the possible order in which statements may be executed and the conditions that must be met for this to happen.


  • A CFG can be generated directly from an AST, given knowledge of the language’s keywords
  • A Program Dependence Graph (PDG) is used to track how attacker-controlled data propagates within the program. Instead of simply modelling control flowing from one node to the next, there are two types of dependencies that are modelled:
    • Data dependencies: These indicate where values defined by a statement are used. A data dependency exists if the destination statement uses a variable defined by the source statement, and a path exists in the control flow graph from the source to the destination on which none of the nodes redefine the propagated variable.
    • Control dependencies: The execution of a statement may depend on the value of a predicate. For example, the call to sink in our example is only performed if x < MAX is true. Control dependencies indicate these kinds of dependencies.


  • Finally, a Code Property Graph! The AST, CFG, and PDG merged into one graph over shared statement nodes.


  • Using a graph query language, such as Gremlin, we are able to walk the CPG and extract interesting information from it. This is called a traversal.
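A toy sketch of what a traversal does, with plain Python dicts standing in for the real CPG and Gremlin (the node ids, labels, and edge names here are all made up):

```python
# Toy CPG: node ids -> property dicts, plus labelled edge lists.
# A real Joern CPG is queried in Gremlin; this just mimics the idea.
nodes = {
    1: {"type": "Function", "name": "foo"},
    2: {"type": "Call", "name": "source"},
    3: {"type": "Condition", "code": "x < MAX"},
    4: {"type": "Call", "name": "sink"},
}
edges = {  # edge label -> (src, dst) pairs
    "AST": [(1, 2), (1, 3), (1, 4)],
    "CFG": [(2, 3), (3, 4)],
}

def traverse_calls(name):
    """Walk the node set and keep calls to a given callee."""
    return [nid for nid, props in nodes.items()
            if props["type"] == "Call" and props.get("name") == name]

print(traverse_calls("sink"))  # -> [4]
```

A real traversal would then hop along the AST/CFG edges from those seed nodes; the edges dict above is where that would happen.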

  • Comment - This doesn’t seem to handle the notion of classes at all, making it a poor fit for C++/object-oriented languages.

  • Embedding of source code in vector spaces

    • Interesting because it allows other algorithms to work on code as ‘data’, for example machine learning algorithms. This is then extended to work on CPGs.
    • Probably the most important thing in this thesis

3.0 Feature Spaces for Vulnerability Discovery

  • To be able to leverage machine learning techniques, which work on numerical vectors, we need a way to represent our data in this form – as features.
  • This section talks about some of the embedding techniques used throughout the thesis, but it’s worth stating that these are general statements about their shape – the actual use case shapes what you put inside them. For example, if you were looking for use-after-frees (UAFs) then you wouldn’t be putting in data on strcpy’s.

  • The bag of words embedding technique is used heavily throughout the rest of the thesis
    • Originally used for text documents
    • Basically you count up the number of times each word occurs in a particular text, optionally with weightings on the words (e.g. drop the boring “the”s or give words we care about more value), and represent that data as a vector with each word being a unique dimension.
    • Can create very sparse vectors – following the English example, we’d get a vector with every single possible word as a dimension and tick off the words seen in a text, leaving only a handful of 1’s and a lot of 0’s. To fix this, the data is represented as a hashtable instead.
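A minimal sketch of the bag-of-words idea (the stop-word list and example sentence are my own):

```python
from collections import Counter

def bag_of_words(text, stop_words=frozenset({"the", "a", "an"})):
    """Count word occurrences, skipping boring stop words.

    The result is a sparse vector stored as a dict (word -> count),
    which is the hashtable trick mentioned above for avoiding
    mostly-zero vectors.
    """
    return dict(Counter(w for w in text.lower().split()
                        if w not in stop_words))

print(bag_of_words("the cat sat on the mat"))
# {'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```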


  • Token-based feature maps, shown above, are the naïve first implementation of mapping source code to vector space. The source code is tokenised and embedded using the bag-of-words technique.
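Sketch of a token-based map with a deliberately dumb lexer (the regex only grabs identifier-like tokens; a real tokeniser would keep operators and literals too):

```python
import re
from collections import Counter

def token_feature_map(code):
    """Naive token-based feature map: lex the source into
    identifier-like tokens, then bag-of-words them."""
    return dict(Counter(re.findall(r"[A-Za-z_]\w*", code)))

print(token_feature_map("int x = source(); if (x < MAX) sink(x);"))
# {'int': 1, 'x': 3, 'source': 1, 'if': 1, 'MAX': 1, 'sink': 1}
```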


  • Symbol-based feature maps are an extension of token based feature maps
    • The first-level parser extracts target objects of interest, e.g. all functions, everything from a namespace or class, etc.
    • The second-level parser extracts features from these objects to uniquely characterise them for our uses.
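A rough two-level sketch (the brace-matching regex is far too naive for real C, and the example functions are invented):

```python
import re
from collections import Counter

def extract_functions(source):
    """First-level parser: pull out name(args) { body } chunks.
    (No nested braces - toy code only.)"""
    return dict(re.findall(r"(\w+)\s*\([^)]*\)\s*\{([^}]*)\}", source))

def symbol_features(body):
    """Second-level parser: characterise a function by the API
    symbols (identifiers followed by '(') it calls."""
    return dict(Counter(re.findall(r"(\w+)\s*\(", body)))

src = "void a() { malloc(n); memcpy(d, s, n); } void b() { strcpy(d, s); }"
funcs = extract_functions(src)
print({name: symbol_features(body) for name, body in funcs.items()})
# {'a': {'malloc': 1, 'memcpy': 1}, 'b': {'strcpy': 1}}
```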


  • Tree-based feature maps improve on symbol-based by enabling us to represent code by high level programming constructs – subtrees in the AST.
  • The second level parser now extracts more complex program constructs, e.g. entire statements or compound statements, as opposed to trees representing types, callees, or identifiers.
  • We now represent objects by the subtrees contained in their program constructs
  • Tree based feature maps are used in chapter 4.
  • Well suited to comparing graphs where the ordering of edges is relevant, but treats expressions like A + B as different from B + A, even when the operator is commutative.
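The A + B vs B + A point, with ASTs faked as nested tuples:

```python
def subtrees(t):
    """Enumerate every subtree of a nested-tuple AST
    (first element = node label, rest = children)."""
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from subtrees(child)

# Tree-based maps keep edge order, so the commutative expressions
# A + B and B + A land in different feature dimensions:
print(set(subtrees(('+', 'A', 'B'))) == set(subtrees(('+', 'B', 'A'))))  # False
```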


  • Graph-based feature maps differ from tree-based in that the graph substructures are hashed and this hash is used as a dimension of the feature space. This makes the representation of trees/graphs more robust; for example, the previous A + B vs B + A case is handled by this.
  • Fabian uses neighbourhood hashing to turn these graphs into hashes.
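A much-simplified stand-in for neighbourhood hashing (the real algorithm in the thesis is fancier; this XOR toy only shows why neighbour order stops mattering):

```python
import hashlib

def label_hash(label):
    """Stable 64-bit hash of a node label."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

def neighbourhood_hash(node, neighbours):
    """XOR-combine a node's hash with its neighbours' hashes.
    XOR is commutative, so neighbour order doesn't matter - which is
    why A + B and B + A now map to the same dimension."""
    h = label_hash(node)
    for n in neighbours:
        h ^= label_hash(n)
    return h

print(neighbourhood_hash('+', ['A', 'B']) ==
      neighbourhood_hash('+', ['B', 'A']))  # True
```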


  • Multi-stage feature maps are used to group similar substructures and map them onto the same dimension, e.g. clustering malloc and calloc together.
  • Instead of directly representing objects by their substructures, Fabian instead embeds substructures first and clusters them to obtain groups of similar substructures. Objects are then represented by the groups their substructures are contained in.
  • This type of map is used in chapter 6 to represent program slices by sets of strings that describe the initialisation of variables.
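Sketch of the mapping stage only – in the thesis the groups come out of embedding and clustering the substructures automatically, so the hard-coded cluster table here is a stand-in:

```python
from collections import Counter

# Pretend clustering already grouped these API symbols.
clusters = {"malloc": "alloc", "calloc": "alloc",
            "strcpy": "copy", "memcpy": "copy"}

def multi_stage_features(symbols):
    """Represent an object by the clusters its substructures
    fall into, not the raw substructures themselves."""
    return dict(Counter(clusters.get(s, s) for s in symbols))

print(multi_stage_features(["malloc", "calloc", "memcpy"]))
# {'alloc': 2, 'copy': 1}
```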

  • Finally, feature maps on Code Property Graphs


  • Three-step technique for bag-of-words-style embedding of a CPG
    • Object extraction – First, find an input space (I guess part of a CPG? lol) containing the objects we wish to compare, and extract them by selecting a seed node in the CPG for each object, then expanding it to obtain the object’s graph representation.
    • Substructure enumeration – tl;dr, break down our chosen extracted subgraph into smaller substructures/subgraphs. A single node might belong to many sub-subgraphs.
    • Mapping – Map these sub-subgraphs to vectors using hashing
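The three steps on a throwaway edge-list graph (the labels, expansion depth, and 8-dimension hash space are all arbitrary choices of mine):

```python
# Toy graph: edge list plus node labels.
edges = [(1, 2), (2, 3), (2, 4)]
labels = {1: "foo", 2: "sink", 3: "x", 4: "y"}

def extract(seed, depth=1):
    """Step 1: grow a seed node into its neighbourhood."""
    seen = {seed}
    for _ in range(depth):
        seen |= {d for s, d in edges if s in seen}
        seen |= {s for s, d in edges if d in seen}
    return seen

def substructures(nodes):
    """Step 2: enumerate substructures - here just the single
    edges inside the extracted subgraph."""
    return [(labels[s], labels[d]) for s, d in edges
            if s in nodes and d in nodes]

def embed(subs, dims=8):
    """Step 3: hash each substructure into a fixed-size vector."""
    vec = [0] * dims
    for sub in subs:
        vec[hash(sub) % dims] += 1
    return vec

obj = extract(2)            # seed on the sink node
vec = embed(substructures(obj))
print(sum(vec))             # 3 substructures landed in the vector
```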

4.0 Discovering Vulnerabilities using Dimensionality Reduction

Tl;dr: finding new vulnerabilities similar to a known vulnerability. The idea behind this approach is that vulnerabilities are often created by a bad programming pattern, or because vulnerable code is copy + pasted throughout the codebase.

Dimensionality reduction is expressing data from a higher-dimensional space in a lower one, e.g. 3D -> 2D, where the data can be represented with comparable expressiveness but fewer features. There are two main ways to achieve this – feature selection and feature extraction – and feature extraction is the focus of this thesis. Feature extraction is the creation of new features based on existing ones, so as to reduce the amount of data the algorithm has to work on. The newly created features should be as informative and non-redundant as possible.

  • Latent Semantic Analysis is used for dimensionality reduction.
    • Originally used on natural-language texts to identify topics of discussion
    • Can identify code sharing the same common compositions of program constructs
    • Used to denoise the data – to obtain representation less susceptible to artifacts
  • LSA is done in three steps:
    • Embedding – documents are embedded in vector space using a bag of words embedding system, with weighted values
    • Dimensionality reduction – common topics are determined from the corpus through dimensionality reduction. Identify words that commonly occur together.
    • Measure the distance between two ‘documents’ by measuring their vector distance
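The last step in isolation – comparing two embedded documents with cosine similarity over sparse dict vectors (the documents are invented, and the actual dimensionality-reduction/SVD step is skipped here):

```python
import math

def cosine(u, v):
    """Similarity between two sparse vectors (dict: word -> weight)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

a = {"malloc": 1, "memcpy": 2}
b = {"malloc": 2, "memcpy": 4}   # same direction as a
c = {"strcpy": 1}                # shares nothing with a

print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0
```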


  • AST extraction is done on a per-function basis
  • Map each AST to a vector encoding the subtrees contained in the function
  • Identification of structural patterns
  • Now we can extrapolate using these vectors.
    • Find the vector representing a function that contains a known vulnerability, then compare it to the other vectors to discover similar functions
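The extrapolation step as a sketch – rank functions by similarity to the known-vulnerable one (the function names and subtree counts below are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity over sparse dict vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Hypothetical per-function subtree vectors; parse_header is the
# function with the known vulnerability.
funcs = {
    "parse_header": {"decl": 2, "memcpy_call": 1, "if_stmt": 1},
    "parse_footer": {"decl": 2, "memcpy_call": 1, "if_stmt": 2},
    "log_message":  {"decl": 1, "printf_call": 3},
}

def most_similar(known):
    """Rank every other function by similarity to the known one."""
    return sorted((f for f in funcs if f != known),
                  key=lambda f: cosine(funcs[known], funcs[f]),
                  reverse=True)

print(most_similar("parse_header"))  # ['parse_footer', 'log_message']
```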

5.0 Discovering Vulnerabilities using Anomaly Detection

  • Extending the method from the previous section to find code that is anomalous but similar to other observed programming patterns
    • e.g. a check happens 9 out of 10 times for function X, highlight that one anomalous time


  1. Find all points of interest e.g. malloc/strcpy/whatever/source+sinks
  2. Find similar functions in which these sinks/sources are used.
    • We want similar functions because a lot of these source/sink combos are contextually important, e.g. a strcpy reading from a config file is very different from a strcpy of network data
  3. Use lightweight tainting to figure out which checks are relevant to our chosen source or sink
  4. Embed functions in vector space using their source/sink checks instead of their API symbol trees.
  5. Find anomalies, get bitches yo
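The whole pipeline is hard to sketch briefly, but the anomaly step (5) roughly amounts to this (the check names, threshold, and distance measure are my own choices, not the thesis’s):

```python
# Which checks guard the chosen sink in each similar function
# (1 = check present). f4 is the odd one out.
checks = {
    "f1": {"len_check": 1},
    "f2": {"len_check": 1},
    "f3": {"len_check": 1},
    "f4": {},
}

def anomalies(threshold=0.5):
    """Flag functions whose check-vector is far from the group
    average - i.e. the check everyone else does is missing."""
    feats = {k for v in checks.values() for k in v}
    avg = {f: sum(v.get(f, 0) for v in checks.values()) / len(checks)
           for f in feats}
    dist = lambda v: sum(abs(avg[f] - v.get(f, 0)) for f in feats)
    return [name for name, v in checks.items() if dist(v) > threshold]

print(anomalies())  # ['f4']
```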

6.0 Discovering Vulnerabilities using Clustering

  • The output of this section is search patterns/traversals for potentially vulnerable code, which can be fed to an analyst or to the methods from the previous chapters.
  • These outputted patterns aren’t based on previous vulns but are rather generated directly from the source code, making them useful for a virgin codebase.
  • Defines taint-style vulnerabilities as vulnerabilities caused by a failure to restrict attacker-controlled data that reaches a security-sensitive operation, e.g. they didn’t check the args to strcpy(). Taint-style vulnerabilities are tied to the combination of a source and a sink.
  • Step 0: Pick an interesting sink.


  • Generate definition graphs
    • Basically a program slice from an interesting sink, with information about how the sink’s arguments are sanitised and initialised. This also works across functions.
    • The output is ‘compressed’ by storing the graphs of each unique entry instead of for each invocation of the sink.


  • Decompression and clustering
    • Decompression refers to the fact that we store graphs instead of individual invocations of the selected sink, so we have to map back to them.
    • We embed these invocations in vector space and cluster them.
    • The feature map is multi-staged and represents invocations through the clusters of API symbols their initialisers occur in.
  • Creation of Sanitisation Overlays
    • Previously generated graphs don’t contain information about argument sanitisation
    • Vague on this but I think:
    • Find all initialisations of the sink’s arguments that are similar (cluster them after embedding their graphs) and count occurrences of the clusters as opposed to occurrences of individual conditions. This is more robust against slight changes in the way conditions are formulated.
  • Generation of Graph Traversals
    • Graph traversals are generated using the language Gremlin and are simply translated from our existing graphs
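Since I can’t easily mock Gremlin here, a plain-Python stand-in for what a generated traversal checks – the pattern fields and invocation shape are invented:

```python
# A mined pattern: calls to this sink should be guarded by this check.
pattern = {"sink": "memcpy", "required_check": "len_check"}

def make_traversal(pattern):
    """Turn the pattern into a matcher that flags invocations of
    the sink whose required sanitisation is absent."""
    def matches(invocation):
        return (invocation["sink"] == pattern["sink"]
                and pattern["required_check"] not in invocation["checks"])
    return matches

flag = make_traversal(pattern)
print(flag({"sink": "memcpy", "checks": []}))             # True
print(flag({"sink": "memcpy", "checks": ["len_check"]}))  # False
```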

Putting this all together gives us a list of potential vulnerabilities which can be investigated by an auditor more thoroughly.