How to hack your dependencies

Suppose you're developing a Python library that you want to share via PyPI. Your library almost certainly has dependencies, such as numpy or scikit-learn. What if some part of a dependency doesn't quite behave as you'd like - maybe a particular argument combination is not implemented, or it doesn't correctly handle the one corner case you care about?

If all you want is to run the code locally, once, you just edit the source code of the dependency on your local machine and that's that. But what if you want other people to use your library?

A naive, and terrible, solution is to fork the dependency, make the changes in your version, and ask people to install your version of it if they want to use your library. If a dependency is really obscure, that might work, but for anything halfway popular, people will almost certainly want to go for the original version, out of laziness if nothing else, and your library won't work.

So what can you do? The best solution is to submit a PR to the dependency repo, to fix whatever it is you don't like. You should always try that - that's why open source packages are so great and are getting better. But what if you can't wait until your PR is released, or also want your library to work with earlier versions of the dependency, or your PR is refused or ignored because the dependency's authors have other ideas?

Python being the wonderfully flexible language that it is, you still have a lot of choices. I'll go through them in turn, along with the situations in which each is a good idea.

The first and simplest case is when your own code instantiates the object (defined in the dependency) that you want to modify. Then the solution is good old inheritance: create a child class of the class in the dependency, and provide your own implementations of the methods you want to change. As far as the dependency is concerned, this child class is interchangeable with the original (if you design your changes carefully, that is 🙂), but it incorporates your changes. So you instantiate your child class instead of the parent, and all is good. A recent example from our work: we realized that LlamaIndex's PGVector wrapper didn't return the embeddings on the nodes it retrieved. So we inherited from it and rewrote the half a dozen or so functions involved so that the retrieved nodes also include their embeddings.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3, 4, 5, 6])

class Child(LinearRegression):
    def __init__(self):
        super().__init__()
    def fit(self, X, y, *args, **kwargs):
        print("I am a child class")
        return super().fit(X, y, *args, **kwargs)
    
reg = Child()
reg.fit(X, y)

The second case, almost as simple, is when you receive instances of the class you want to modify that have already been created elsewhere. Here the robust way is to write a function that creates instances of the modified class from instances of the parent class. You could also make it a class method of the child class for convenience, like

@classmethod
def from_parent(cls, *args, **kwargs):
    ...
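
To make that concrete, here is a minimal sketch of such a class method. The classes `ParentModel` and `ChildModel` are hypothetical stand-ins (not from any real library), the method takes the parent instance as its only argument, and the sketch assumes the parent keeps all its state in the instance `__dict__` (no `__slots__`, no external resources):

```python
import copy


class ParentModel:
    """Hypothetical stand-in for a class defined in a dependency."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.history = []

    def predict(self, x):
        return self.scale * x


class ChildModel(ParentModel):
    """Our subclass, overriding one method."""

    def predict(self, x):
        self.history.append(x)  # extra behaviour we wanted
        return super().predict(x)

    @classmethod
    def from_parent(cls, parent):
        # Bypass __init__, so we don't need to know the constructor arguments
        child = cls.__new__(cls)
        # Deep-copy the state so the new object shares nothing mutable with the original
        child.__dict__.update(copy.deepcopy(parent.__dict__))
        return child


parent = ParentModel(scale=2.0)
child = ChildModel.from_parent(parent)
print(child.predict(3))  # 6.0
print(child.history)     # [3]
print(parent.history)    # [] - the original object is untouched
```

Copying the state leaves the original object alone, which is what makes this more robust than swapping classes in place. Whether a deep copy is affordable depends on how heavy the parent's state is.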

But if your child class has the same members, and only has some methods overridden, the lazy way is just to swap the class of the object:

reg2 = LinearRegression()
# Swap the class in place; fine here because Child adds no new attributes
reg2.__class__ = Child
reg2.fit(X, y)

That's sneaky and potentially confusing, because people reading your code might assume the objects to have their original class, and thus misunderstand what the code is doing, but it's an option.
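
A small sketch of why this can surprise readers: the swap mutates the object in place, so every existing reference to it changes behaviour too. `Base` and `Patched` are hypothetical classes used only for illustration:

```python
class Base:
    """Hypothetical stand-in for a class from a dependency."""

    def greet(self):
        return "base"


class Patched(Base):
    """Our subclass with one overridden method."""

    def greet(self):
        return "patched"


obj = Base()
alias = obj  # some other part of the code holds a reference to the same object

obj.__class__ = Patched  # swap the class in place

print(alias.greet())           # patched
print(type(alias) is Patched)  # True
```

Code that created the object as `Base` now sees `Patched` behaviour without any visible sign at the place where it was created.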

The final and messiest case is when your library's code doesn't directly interact with the function or method you want to modify. For example, you call some function from the dependency that calls another function that ... until something in the depths of that dependency finally calls the thing you want to tweak.

What do you do then? Enter monkey patching. That's the term for modifying the contents of an imported module at runtime, for example like this:

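def my_fit(self, *args, **kwargs):
    print("I'm just a monkey patch!")
    # Return self, as the original fit does, so chained calls keep working
    return self

# Replace the method on the class itself; every instance is affected from now on
LinearRegression.fit = my_fit
reg3 = LinearRegression()
reg3.fit(X, y)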

What happens here is that you create your own modified version of the thing you want to tweak, put it in your library, and after importing the module in which the original version lives, assign that thing to be your version. Then, when anything calls into the library, it's your version that gets called. Keep in mind that the patch applies to the whole process: every other user of that module sees your version too.

The most frequent use of this is replacing a method of a class somewhere deep inside a library that your code doesn't call directly. When all you want is to change what the method does or returns, the approach above, with a fixed new function replacing the old method, is the usual one.
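
If you only need the patch for a limited stretch of code, one option is to scope it with `unittest.mock.patch.object` from the standard library, which restores the original on exit. This is a sketch, not the approach used in our code: `Inner` and `library_entry_point` are hypothetical stand-ins for a class deep inside a dependency and for the function your code actually calls.

```python
from unittest.mock import patch


class Inner:
    """Hypothetical stand-in for a class deep inside a dependency."""

    def compute(self, x):
        return x + 1


def library_entry_point(x):
    """Hypothetical stand-in for the dependency function our code calls."""
    return Inner().compute(x)


# Keep a reference to the original so the patch can build on it
original_compute = Inner.compute


def patched_compute(self, x):
    # Call the original, then tweak only the return value
    return original_compute(self, x) * 10


with patch.object(Inner, "compute", patched_compute):
    inside = library_entry_point(1)  # the patched method is used here

outside = library_entry_point(1)  # the original is back in place

print(inside, outside)  # 20 2
```

Wrapping the original instead of reimplementing it keeps the patch small, and it keeps working if the original's internals change in a later release.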

However, there is a final twist, for the case when you want to retrieve some value that gets created in that method, deep inside some library you're using, and is then discarded. How do you do that? You don't want to change the return type of the method you're monkey patching, as you would then need to change the behaviour of the whole calling chain leading to it - a hassle to write and a nightmare to maintain. Instead, you can monkey patch with a dynamically created function that closes over a local variable. Here's how this might look:

# A top-level class from the library from whose internal workings you want to extract something
class SomeLibraryClass:
    def __call__(self, X, y):
        # The instantiation of LinearRegression is happening deep in code of your dependency
        # Here it's just one call deep, but could be many levels
        reg = LinearRegression().fit(X, y)
        return reg

def my_function_that_extracts_internal_value(X, y):
    
    # Dynamically create a function that stores an internal value in our local variable
    container = {}
    def function_that_creates_a_value_and_throws_it_away(self, X, y):
        # Imagine some complicated calculations leading up to that
        intermediate_value_to_throw_away = np.median(y)
        print("The value I would throw away:", intermediate_value_to_throw_away)
        container["value"] = intermediate_value_to_throw_away
        # imagine some more complicated calculations here that the original function did
        return np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
    
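    # And monkey patch a method of the library class with it, remembering the original
    original_fit = LinearRegression.fit
    LinearRegression.fit = function_that_creates_a_value_and_throws_it_away
    try:
        # Call the top-level thing that somewhere in its guts calls the class we monkey patched
        tmp = SomeLibraryClass()
        tmp(X, y)
    finally:
        # Restore the original method so the patch doesn't leak into other code
        LinearRegression.fit = original_fit

    # The "container" variable now contains the internal value we wanted to extract
    return container["value"]

# Now call the function defined above
my_function_that_extracts_internal_value(X, y)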
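
A variation on the same idea, sketched under assumptions: if the value you're after happens to be the return value of some inner helper, you can wrap that helper instead of reimplementing the whole method. `Inner`, `_score`, `library_entry_point` and `capture_calls` are all hypothetical names made up for this example.

```python
from contextlib import contextmanager


class Inner:
    """Hypothetical stand-in for a class deep inside a dependency."""

    def _score(self, values):
        return sum(values) / len(values)

    def run(self, values):
        score = self._score(values)  # intermediate value, discarded afterwards
        return [v for v in values if v >= score]


def library_entry_point(values):
    """Hypothetical stand-in for the dependency function our code calls."""
    return Inner().run(values)


@contextmanager
def capture_calls(cls, method_name):
    """Temporarily wrap cls.method_name and record everything it returns."""
    captured = []
    original = getattr(cls, method_name)

    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        captured.append(result)  # the closure stores the value in our local list
        return result

    setattr(cls, method_name, wrapper)
    try:
        yield captured
    finally:
        # Always restore the original, even if the wrapped call raises
        setattr(cls, method_name, original)


with capture_calls(Inner, "_score") as scores:
    kept = library_entry_point([1, 2, 6])

print(kept)    # [6]
print(scores)  # [3.0]
```

The return type of the patched method stays the same, so the calling chain is unaffected, and the captured values come out through the closure, just as in the example above.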

Not for the faint of heart to be sure, but we recently used it, for example, to extract more information from pandas.ai that we were wrapping as a tool.

Of course, in coding, as in most other situations, it's best to use the simplest option that works. But hopefully this post gave you a couple more options for when the simplest approaches don't cut it.

You can find the whole code here: https://github.com/ShoggothAI/motleycrew/blob/monkey_patch/examples/hacking dependencies.ipynb
