Update 2019-10-10: we ended up going with a different approach. See the end of the post.

Here’s another fun (C)Python detail. TL;DR: for threads not created from Python, you must acquire the GIL and set up a thread state before running Python code. But you also need to keep a reference to that thread state for the entire lifetime of the thread, which is tricky when you’re jumping in and out of the interpreter, as Arrow Flight does.

First, some context:

For Apache Arrow Flight, we essentially maintain a specialized binding of the gRPC RPC framework for Python. Most of the heavy lifting is done in C++: PyArrow is built on top of Arrow-C++ anyway, so it would be redundant (and almost certainly much less performant) to reimplement Flight on top of gRPC-Python. There also wouldn’t be much benefit in terms of supported features: Flight services, whether in Python or C++, just have to produce some Arrow data and hand it off.

However, on the maintainer side, that glue code between Python and C++ is inconvenient. First off, we can’t leverage the work done for the official gRPC-Python binding:

  • gRPC-Python and gRPC-C++ are both themselves bindings for gRPC-Core. Unfortunately, what gRPC-Core exposes is basically an asynchronous event loop, and so each binding has built its own threading, interceptors, authentication, etc. on top.
  • gRPC-Python doesn’t expose any API to access the underlying native objects anyways.

This means one big feature can’t be supported: running a Flight service alongside other gRPC services on the same port. This is possible in the Java implementation of Flight (with some reflective trickery, as we haven’t made the relevant implementation bits public), and is theoretically possible in C++ if we exported the right types.

Writing the bindings themselves is also rather tedious. In particular, to implement a Flight service in C++, you subclass FlightServerBase. PyArrow is implemented in Cython, which doesn’t let you subclass C++ classes. Instead, in C++, we define a subclass that accepts a Python object reference and a table of function pointers, then in Cython, define a set of C “trampoline” functions that bind the two together. Roughly:

class PyFlightServerBase : public FlightServerBase {
public:
  struct Vtable {
    // Callbacks to implement each RPC method
    std::function<Status(PyObject*, ...)> list_flights;
    // ⋮
  };

  PyFlightServerBase(PyObject* object, Vtable vtable);
};
cdef Status implement_list_flights(void* object, ...) except *:
    # Call Python methods on the Python object, then convert back
    # to C++ objects
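To make the shape of that glue concrete, here is a hedged sketch of how the C++ subclass might dispatch one RPC through the vtable. The member names (server_, vtable_), the inline GIL handling, and the exact RPC signature are illustrative rather than PyArrow’s actual code:

#include <Python.h>

#include "arrow/flight/server.h"

using arrow::Status;
using namespace arrow::flight;

// Sketch only: assumes the constructor stored its arguments in members
// named server_ (the PyObject*) and vtable_ (the Vtable).
Status PyFlightServerBase::ListFlights(const ServerCallContext& context,
                                       const Criteria* criteria,
                                       std::unique_ptr<FlightListing>* listings) {
  // Make sure this thread holds the GIL and has a Python thread state.
  PyGILState_STATE state = PyGILState_Ensure();
  // Call the Cython trampoline registered in the vtable; it invokes the
  // user's Python method and converts the results back to C++ objects.
  Status status = vtable_.list_flights(server_, criteria, listings);
  // (This is also where the C++ side would check for pending Python exceptions.)
  PyGILState_Release(state);
  return status;
}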

There are lots of little details to get right:

  • In C++:
    • We have to check for Python exceptions.
    • We have to make sure to acquire and release the GIL.
    • We have to make sure a Python thread state object is created. (This will become important later!)

    Thankfully, the CPython API provides conveniences for this. In particular, PyGILState_Ensure and PyGILState_Release take care of both the GIL and the Python thread state. (A sketch of an RAII wrapper around this pair follows this list.)

  • In Cython/Python:
    • We have to convert arguments and results between C++ and Python objects.
    • We have to make sure exceptions raised by user code actually propagate (hence the except * on the trampolines above).

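As a concrete illustration of the C++ conveniences mentioned above, an RAII wrapper over this pair of calls might look like the following. This is a minimal sketch with a name I made up; PyArrow’s actual helper differs in its details:

#include <Python.h>

// Minimal RAII guard: PyGILState_Ensure acquires the GIL and creates a
// Python thread state for this OS thread if one does not exist yet;
// PyGILState_Release undoes that when the guard goes out of scope.
class GILGuard {
 public:
  GILGuard() : state_(PyGILState_Ensure()) {}
  ~GILGuard() { PyGILState_Release(state_); }

  // The Ensure/Release calls must stay strictly paired, so forbid copies.
  GILGuard(const GILGuard&) = delete;
  GILGuard& operator=(const GILGuard&) = delete;

 private:
  PyGILState_STATE state_;
};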
Recently, I’ve been working on implementing middleware/interceptors for Flight. In HTTP/REST frameworks like Flask, middleware is used to do something before/after each call, like check that the client is authenticated. Interceptors are the gRPC equivalent, allowing you to hook into the lifecycle of an RPC.

For Flight, the motivation is to be able to integrate with frameworks like OpenTracing and OpenTelemetry. As an extremely oversimplified description, what we want to do is check incoming calls for a special header that uniquely identifies the call, and propagate that value on outgoing calls. By combining these values with some form of logging, we can reconstruct the flow of an operation as it crosses multiple services, each of which may in turn make further requests (whether using Arrow Flight or not).

Manually propagating that unique identifier would get pretty annoying; it would be nice if we could stick it somewhere convenient, like a thread local, and refer to that instead. After implementing middleware for Flight, I wrote up C++ unit tests that did exactly that. However, the same implementation, using threading.local, failed in Python!
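Roughly, the idea in those C++ tests was along the following lines (a sketch with names I made up, not the actual test code): the middleware copies the identifier from the incoming header into a thread-local variable, and the RPC handler reads it back on the same thread.

#include <string>

// One slot per thread: the middleware fills it in, the handler reads it.
thread_local std::string current_trace_id;

// Invoked by the (hypothetical) middleware when a call starts.
void OnCallStarted(const std::string& incoming_trace_header) {
  current_trace_id = incoming_trace_header;
}

// Invoked from the RPC method body, on the same thread, later in the call.
std::string TraceIdForThisCall() { return current_trace_id; }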

Was the thread getting destroyed between when the middleware was called and when the RPC method body ran? That wouldn’t have made sense, since the C++ implementation of middleware just inlines the middleware calls into the RPC method implementations.¹

The culprit is the aforementioned PyGILState_Release and the thread state object. Calls to PyGILState_Ensure and PyGILState_Release must always be paired, so PyArrow uses some RAII helpers to ensure this. But consider the call flow:

  1. The RPC starts, and we call the middleware.
  2. The Python bridge code calls into the interpreter, calling PyGILState_Ensure.
  3. We run our Python code and set a thread local value.
  4. The Python bridge code (through the RAII helper) calls PyGILState_Release.

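In code, that sequence amounts to two independent Ensure/Release pairs on the same gRPC-owned thread, with nothing keeping the Python thread state alive in between. A sketch (not the actual bridge code):

#include <Python.h>

// The same native (gRPC-owned) thread enters the interpreter twice during
// one RPC. Nothing keeps the Python thread state alive between the two.
void HandleCallOnGrpcThread() {
  {
    PyGILState_STATE s = PyGILState_Ensure();  // creates a thread state
    // ... run the Python middleware; it stores a value in threading.local ...
    PyGILState_Release(s);  // last release on this thread: CPython tears the
                            // auto-created thread state down, thread locals included
  }
  {
    PyGILState_STATE s = PyGILState_Ensure();  // brand-new thread state
    // ... run the Python RPC method body; the thread-local value is gone ...
    PyGILState_Release(s);
  }
}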
At that first PyGILState_Release, we’ve basically told Python that we’re done with the thread, and of course, it goes and cleans up any Python state associated with it, thread state (and thread locals) included! Looking at the implementation of PyGILState_Release confirms this. We can further verify it with a sentinel like this:

class Sentinel:
    """Cause a segfault when freed so we can break into a debugger."""

    def __del__(self):
        print("I was freed!")
        import ctypes; ctypes.string_at(0)

Doing so (stashing a Sentinel instance in the thread local) will pretty clearly show a stack trace with PyGILState_Release in it.

The solution, after all this, is fairly simple. Before, we wrapped each Python-defined middleware and registered it directly with Flight. Now, in Cython, we instead define a single middleware instance. When the call starts, it calls PyGILState_Ensure, then runs the actual Python middleware; when the call ends, it runs the actual Python middleware, then calls PyGILState_Release. This way, we keep a reference to the thread state alive until the call is actually over, which means we can use thread locals to store our RPC identifier.
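In C++-flavored pseudocode (the real wrapper lives in Cython and plugs into Flight’s middleware hooks; the class and method names here are made up), the wrapper looks roughly like this:

#include <Python.h>

// Wraps the user's Python middleware. The key point: the PyGILState_STATE
// acquired at call start is not released until the call completes, so the
// thread state (and its thread locals) survives for the whole RPC.
class PyServerMiddlewareWrapper {
 public:
  void CallStarted() {
    state_ = PyGILState_Ensure();
    // ... invoke the user's Python middleware for call start ...
  }

  void CallCompleted() {
    // ... invoke the user's Python middleware for call end ...
    PyGILState_Release(state_);
  }

 private:
  PyGILState_STATE state_;
};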

Update

After review, I settled on a different approach, as manipulating the GIL like this is tricky, and it may be hard for other implementations to guarantee which threads things run on. (Plus, gRPC itself doesn’t make any real guarantees.) We keep a reference to the middleware instances for each call anyway, so instead, we just expose a getter that retrieves middleware instances from within RPC method bodies. While this means more boilerplate for things like the OpenTracing/OpenTelemetry use case, it hopefully makes future Flight implementations easier.
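In rough C++ terms, the getter-based approach looks something like the sketch below; the GetMiddleware method, the "tracing" key, and the server class are assumptions for illustration, so consult the actual Flight API rather than this code:

#include <memory>

#include "arrow/flight/server.h"

using arrow::Status;
using namespace arrow::flight;

// Hypothetical server: GetMiddleware and the "tracing" key are assumptions
// for illustration, not necessarily the real Flight API.
class MyFlightServer : public FlightServerBase {
 public:
  Status DoGet(const ServerCallContext& context, const Ticket& request,
               std::unique_ptr<FlightDataStream>* stream) override {
    // Look up the middleware instance attached to this particular call.
    auto* tracing = context.GetMiddleware("tracing");
    if (tracing != nullptr) {
      // ... read the trace ID off the middleware and propagate/log it ...
    }
    // ... produce the Arrow data for this call ...
    return Status::OK();
  }
};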

  1. We don’t use gRPC interceptors for the server in Flight, as they’re fairly limited. In particular, they can’t communicate with RPC method bodies (except possibly via thread locals). However, gRPC-Java explicitly notes that thread locals are not safe, and gRPC-C++ makes no claims at all—we can use them only because we’ve chosen to implement middleware in this way.