Time's arrow time's cycle

5/13/2023

Sources and destinations are decoupled and communicate via gRPC. This is crucial for adding new destinations and updating existing ones without requiring changes to source plugin code, which would otherwise make the architecture unmaintainable.

Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the API result (JSON or any other format) into a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. To recap, the source plugin sends mainly two things to a destination: 1) the schema and 2) the records that fit that schema. In Arrow terminology, these are a schema and a record batch.

Why Arrow?

Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use cases. For example, in database-to-database replication, we needed to support many more types, including nested types. Also, performance-wise, much of the time spent in an ELT process goes to converting data from one format to another, so we wanted to take a step back and see whether we could avoid building yet another format, as in the famous XKCD comic.

Apache Arrow defines a language-independent columnar format for flat and hierarchical data.
