Mar 24, 2022
Apache Arrow Flight: A Primer
David Li, Tom Drabas
Given the speed at which business changes, centralizing data in a single analytics platform impedes flexibility. We design and build composable data systems using best-in-class open source standards. This gives enterprises the flexibility needed to optimize and get the most out of their current infrastructure.
Anyone who has moved houses at least once knows that packing and unpacking boxes is no fun and extremely time-consuming, not to mention expensive if you hire someone to do it. Unfortunately, transferring data can be just as fun as moving.
Sending data from one computer–or device or database or server–to another requires serializing data into an intermediate format that both parties can understand, just like packing belongings into boxes. Serialized data is sent to the receiver, corresponding to transporting the boxes to a new place. Then, the receiver needs to deserialize the data into a format it understands, just like unpacking the boxes at the new destination. Serializing and deserializing data on each end, however, can be responsible for as much as 90% of the total time that it takes to move the data, slowing things down and driving up costs.
To move data quickly, we need to get rid of serialization and deserialization, which requires two things: a data standard that both the sender and receiver understand, so we don’t have to pack and unpack, and an efficient protocol for moving the data. The initial release of Apache Arrow provided the foundation–a columnar, cross-language format for laying out data in memory. However, there was no clear framework for sending Arrow data between computers. Until now.
Enter Arrow Flight
Arrow Flight is a framework for transporting Arrow data efficiently over the network. Arrow eliminates the need to serialize and deserialize data–or pack and unpack the boxes–and Arrow Flight brings those benefits to the network by replacing renting a truck with a conveyor belt: simply put your (data) belongings on one end and pick them up at their destination.
Flight simplifies the job of application developers by providing the ability to send and receive data quickly: a Flight client can stream Arrow data to and from a remote Flight server. No serialization and deserialization, no writing a one-off solution to move the data to another computer. It builds on top of multiple underlying technologies, but applies optimizations that make it more performant compared to custom implementations. In addition, Flight’s implementation is flexible and allows developers to customize server-side methods to accommodate more complex use cases.
At a high level, Flight is a remote procedure call (RPC) framework. RPC is a paradigm for structuring inter-process communications, whether it’s between two processes on the same machine or across the network. As the name implies, RPC models communication as calling remote functions and getting back results, just like in procedural programming—except the code being executed is in some other process.
Accordingly, Flight provides functions that fetch Arrow data from (or send data to) some remote process. It integrates tightly with Arrow libraries to make it easy to use and to find optimization opportunities. First introduced in Arrow 0.11, Flight currently ships with the Arrow libraries in Python, Java, Rust, and many other languages.
Now, let’s dive in a little deeper.
Arrow Flight Stack
The Arrow Flight “tech stack”
Arrow Flight builds on top of several open source libraries. While Flight is an RPC framework, it builds on top of the gRPC RPC framework by Google and the Cloud Native Computing Foundation. Adopted by organizations like Dropbox and Netflix, it’s fast, open source, and supported across a wide variety of programming languages. Flight lets gRPC handle all the low-level details around network communications and layers its own optimizations and integration with Arrow data on top.
As we’ve described, sending messages between processes, especially ones written in different languages, requires some way to describe the format of the data so both parties can understand each other. Protocol Buffers (“Protobuf”), a serialization format and library from Google, often fills that role for applications using gRPC, and it’s also used in Flight for metadata, alongside Arrow itself for data, of course.
Using Arrow Flight
Normally, to use gRPC, we have to agree on RPC methods and their input and output types ahead of time so that the server and client can communicate. Flight does this for you, defining abstract methods for things like describing the schema of a dataset, uploading data to or downloading data from a server, and more. A Flight server overrides only the methods it wants to support, and a client using Flight can connect to a server and call these methods.
Flight’s methods either deal with data or metadata. Everything starts with a FlightDescriptor
, which describes some dataset. For instance, you can pass it to the GetFlightInfo
method to get FlightInfo
metadata describing how to download a dataset. To do so, you can “redeem” the Tickets
inside the FlightInfo
for the data by calling the DoGet
method, which will stream data back to the client.
If you want to upload data instead, just pass the FlightDescriptor
to the DoPut
method and start writing data to the server. And for advanced use cases, use DoExchange
instead, which lets you read and write data simultaneously. Note that all of the “Do” methods are streaming, so you can work with datasets that don’t even fit in memory.
Flight also provides several methods to query various metadata about datasets, like fetching the schema without reading the data, as well as to let applications extend it with custom “actions” that don’t naturally fit into any other method. You can get the full details from the Flight documentation.
Given these methods, you can imagine a variety of ways to put them together into an application. For instance, requests in a simple key-value store might look like this, where the client directly constructs a Ticket message and sends it to the server: