Building and using a code graph in MotleyCoder
Part 2.
This is the second part of our introduction to MotleyCoder, our toolkit that gives AI agents tools for inspecting and writing code. The first part can be found here. In this installment, we go deeper into how we construct and use a graph of the code structure.
When asking an LLM to answer a question, it's critical to provide it with the right context in the prompt. When the data in question is freeform text, we'd typically use some form of chunking and then retrieve the relevant chunks by embedding similarity. In addition, we might build a knowledge graph of relationships between the different entities in the text, typically using an LLM.
When the source data is not natural language but code, the task is similar, but with important differences. On the one hand, simple text similarity between code snippets, just like naive fixed-size chunking, will not work well, as both ignore the natural structure of the code. On the other hand, code already comes with natural chunking, each chunk corresponding to an entity in the code, such as the definition of a function or a class; and with a natural graph structure, defined by pieces of code referencing other pieces, for example a function calling another function.
Another natural relationship between entities defined in the code is whether they are located in the same file. This relationship is likely to matter less than it does for natural language documents, as the functionality of code typically doesn't depend on entities being located in a specific file; "being in the same file" is thus more of a stylistic relationship, reflecting developers hopefully placing related functionality in the same location.
With that in mind, we studied the algorithm for building a code graph contained in aider. It taught us a lot, in particular the use of tree-sitter, a parser based on formal grammars that supports many popular languages, to extract both entity definitions and entity references from the code. Aider's algorithm takes files as the graph nodes, and references to entities as the edges connecting the files (from the file referencing an entity to the file where that entity is defined).
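To make this concrete, here is a minimal sketch of extracting definitions and references with tree-sitter, using the tree_sitter_languages helper package. The query patterns are deliberately simplified illustrations, not aider's or MotleyCoder's actual queries, and the captures API shown is the older list-of-tuples form, which differs in newer py-tree-sitter releases.

```python
from tree_sitter_languages import get_language, get_parser

# Simplified tree-sitter query; real implementations cover many more node
# types (methods, decorated definitions, attribute calls, etc.).
QUERY = """
(function_definition name: (identifier) @definition.function)
(class_definition name: (identifier) @definition.class)
(call function: (identifier) @reference.call)
"""

language = get_language("python")
parser = get_parser("python")

source = b"""
def helper():
    pass

def main():
    helper()
"""

tree = parser.parse(source)

# Each capture pairs a syntax node with the name it matched under.
for node, capture_name in language.query(QUERY).captures(tree.root_node):
    kind = "definition" if capture_name.startswith("definition") else "reference"
    name = source[node.start_byte:node.end_byte].decode()
    print(kind, name, "at line", node.start_point[0] + 1)
```

Run on the toy source above, this yields two definitions (helper, main) and one reference (the helper() call inside main), which is exactly the raw material for the edges just described.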
To construct a summary description of the codebase (which it calls a repo map) in response to a particular query, aider first parses the query for potential matches to files and entities in the code; the matched files and entities get a higher weight in what follows. It then runs the PageRank algorithm on the file graph and pro-rates the resulting rank of each file node back to the entities on its edges, ranking the entities' importance and thus determining which ones should be included in the repo map for that query.
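In networkx terms, that ranking step looks roughly like the sketch below. The graph and the personalization weights here are illustrative, not aider's exact code.

```python
import networkx as nx

# Illustrative file-level graph: an edge means the source file references
# entities defined in the target file.
G = nx.DiGraph()
G.add_edge("app.py", "db.py", entities=["connect"])
G.add_edge("app.py", "utils.py", entities=["slugify"])
G.add_edge("db.py", "utils.py", entities=["retry"])

# Files matched in the user's query get a higher personalization weight.
ranks = nx.pagerank(G, personalization={"app.py": 1.0, "db.py": 0.1, "utils.py": 0.1})

# Pro-rate each file's rank over its outgoing edges to score the entities.
entity_scores: dict[str, float] = {}
for src, dst, data in G.edges(data=True):
    share = ranks[src] / G.out_degree(src)
    for entity in data["entities"]:
        entity_scores[entity] = entity_scores.get(entity, 0.0) + share

# The highest-scoring entities go into the repo map.
print(sorted(entity_scores.items(), key=lambda kv: -kv[1]))
```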
We felt it would be more natural to treat the individual entities within the code, such as class and function definitions, as the graph nodes. The edges we introduce are of two kinds: firstly, from the definition of an entity to the definition of a sub-entity, for example from a class definition to each of its method definitions; secondly, from an entity definition to the definition of an entity it references, for example from the definition of a function a, whose body calls a function b, to the definition of b.
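A minimal sketch of such an entity graph, with hypothetical node names (MotleyCoder's actual data model differs in its details):

```python
import networkx as nx

# Nodes are entity definitions; edges are either parent -> sub-entity
# ("contains") or definition -> referenced definition ("references").
G = nx.DiGraph()
G.add_node("my_module.Greeter", kind="class")
G.add_node("my_module.Greeter.greet", kind="function")
G.add_node("my_module.format_name", kind="function")

# A class definition contains its method definitions.
G.add_edge("my_module.Greeter", "my_module.Greeter.greet", kind="contains")
# greet() calls format_name(), so its definition references format_name's.
G.add_edge("my_module.Greeter.greet", "my_module.format_name", kind="references")

# Walking outgoing edges finds everything greet() directly depends on.
for _, dst, data in G.out_edges("my_module.Greeter.greet", data=True):
    print(data["kind"], "->", dst)
```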
Compared to aider, we also store some extra fields in each node, in particular the full text of the entity definition. For Python only (as this was our main focus), we additionally use the Python ast package to extract the docstring, if it exists, from each entity definition. This makes it possible to embed that docstring and thus combine embeddings-based lookup with the graph-based one (we don't do that yet, but it will be easy to add in a next iteration, or in your own code built on top of this).
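The docstring extraction itself takes only a few lines with the standard-library ast module, along these lines:

```python
import ast

source = '''
class Greeter:
    """Greets people by name."""

    def greet(self, name):
        """Return a greeting for the given name."""
        return f"Hello, {name}!"
'''

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
        # get_docstring returns None if the definition has no docstring.
        print(node.name, "->", ast.get_docstring(node))
```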
How can this graph be used to provide information to the agent? Firstly, we can use it to produce a summary of the parts of the repository relevant to a particular query, similar to the way aider does it. We haven't yet seen the need to implement PageRank on the graph; instead, we assign weights to potential matches from the query message and then diffuse those weights over the graph. The diffused weights are then used to rank the entities and produce a repo map similar in spirit to aider's.
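A toy version of this diffusion might look as follows; the decay schedule, number of steps, and direction of propagation are design choices, and MotleyCoder's actual implementation may differ.

```python
import networkx as nx

def diffuse_weights(G: nx.DiGraph, seeds: dict[str, float],
                    steps: int = 2, decay: float = 0.5) -> dict[str, float]:
    """Spread initial weights from query-matched nodes to their neighbors.

    seeds maps entities matched in the query to initial weights; each step
    passes a decayed share of every node's weight along its edges.
    """
    weights = dict.fromkeys(G.nodes, 0.0)
    weights.update(seeds)
    for _ in range(steps):
        incoming = dict.fromkeys(G.nodes, 0.0)
        for src, dst in G.edges():
            # Diffuse both ways so callers and callees of a match both gain.
            incoming[dst] += decay * weights[src]
            incoming[src] += decay * weights[dst]
        for node in G.nodes:
            weights[node] += incoming[node]
    return weights

G = nx.DiGraph([("Greeter", "Greeter.greet"), ("Greeter.greet", "format_name")])
weights = diffuse_weights(G, seeds={"Greeter.greet": 1.0})
print(sorted(weights, key=weights.get, reverse=True))  # ranked for the repo map
```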
Secondly, we use the graph when providing a summary of a particular entity, by also including brief information on all the children (in the sense of the graph described above) of that entity, whether they are located in the same file or not. The agent using that tool can thus walk the dependency graph of the code, just like a human would.
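Building on the entity graph sketch above, such a summary tool could walk a node's children along these lines (again a hypothetical sketch, not MotleyCoder's actual API):

```python
import networkx as nx

def summarize_entity(G: nx.DiGraph, node: str) -> str:
    """Summarize an entity plus one-line stubs for its graph children."""
    lines = [f"{G.nodes[node].get('kind', 'entity')} {node}:"]
    for _, child, data in G.out_edges(node, data=True):
        # Children may live in other files; the graph edge finds them anyway.
        stub = G.nodes[child].get("summary", "")
        lines.append(f"  ({data['kind']}) {child} {stub}".rstrip())
    return "\n".join(lines)

G = nx.DiGraph()
G.add_node("Greeter", kind="class")
G.add_node("Greeter.greet", kind="function", summary="- returns a greeting")
G.add_node("format_name", kind="function", summary="- normalizes a name")
G.add_edge("Greeter", "Greeter.greet", kind="contains")
G.add_edge("Greeter.greet", "format_name", kind="references")

print(summarize_entity(G, "Greeter.greet"))
```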
By making the code graph construction available as a standalone tool, in contrast to aider's monolithic codebase optimized for a single use case (a coding assistant), we hope to unlock many other AI applications that require structured inspection of a codebase.