GoAi to Move LibGDF Into Arrow


Year In Review

Almost a year ago, we formed GoAi, an initiative with the ambitious mission of enabling end-to-end data science on GPUs. Thanks to the diligent work and engagement of our partners — BlazingDB, Graphistry, UC Davis, Anaconda, H2O, and MapD — we have exciting developments to share.

A Home for LibGDF

Our open-spec approach to standardizing the GPU data frame's data format led the community to adopt Apache Arrow's technique for exchanging tabular data between libraries and applications directly on the GPU. This has allowed developers to take full advantage of the high throughput of GPUs across data science workflows. Currently, the GPU data frame's functionality is packaged in LibGDF, a C library. In the interest of consolidating communities and building more bridges for exchanging data between GPU libraries, we've voted to move the functionality of LibGDF into Arrow. This will take a little while, since we need to adapt LibGDF's functions to the Arrow code base, but eventually LibGDF will disappear as a separate library.
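The key to this kind of zero-copy exchange is Arrow's columnar memory layout: each column is a contiguous values buffer plus a validity bitmap, so two libraries that agree on the layout can hand each other a pointer instead of serializing. As a rough, CPU-only illustration (plain Python, not the actual Arrow or LibGDF data structures), a nullable int32 column can be sketched as:

```python
from array import array

# Sketch of an Arrow-style nullable int32 column: a contiguous
# values buffer plus a validity bitmap with one bit per value.
# Illustrative only -- not the real Arrow/LibGDF structures.
values = array("i", [10, 20, 0, 40])   # slot 2 is null; its stored value is undefined
validity = bytearray([0b00001011])     # bit i set => values[i] is valid

def is_valid(i):
    """Check the validity bitmap for slot i (least-significant bit first)."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1

column = [values[i] if is_valid(i) else None for i in range(len(values))]
print(column)  # [10, 20, None, 40]
```

Because both buffers are plain contiguous memory, a consumer that understands the layout can read them in place, whether they live in host RAM or GPU memory.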

Productivity as the Default

The Python programming language has grown in popularity among data scientists for its flexibility, ease of programming, and readability. However, Python is not known for performance, so data scientists have had to turn more and more of their attention away from the problems they’re trying to solve and instead towards implementing their hypotheses in less friendly, “more performant” systems. Luckily, work being done by a number of projects (such as TensorFlow, PyTorch, Numba, Chainer/CuPy and many others) allows Python data science workloads to leverage GPUs and get performance similar to GPU-accelerated C++ and Fortran without writing any low-level code.

We are continuing our work this year to enable efficient use of Arrow on the GPU from languages like Python, via PyGDF, and for distributed Arrow dataframes, via Dask. These projects allow users to write Pandas-like code to manipulate dataframes on the GPU, as well as create user-defined functions that are just-in-time compiled for GPU execution:

import numpy as np
import pygdf

# df is an existing Pandas DataFrame with a 'temp' column
gdf = pygdf.DataFrame.from_pandas(df)

def to_fahrenheit(temp, temp_f_int):
    for i, t in enumerate(temp):
        temp_f_int[i] = 9 / 5 * t + 32.0

# apply_rows JIT-compiles the function for the GPU, reading 'temp'
# and writing the new 'temp_f_int' column
gdf = gdf.apply_rows(to_fahrenheit,
                     incols=['temp'],
                     outcols={'temp_f_int': np.float64},
                     kwargs={})

GoAi members are also working on support for Arrow on platforms like Node.js as well as developing proposals for representing graph networks in Arrow format.

Over the past year, we’ve also observed that there is much we could do to improve interoperability between multidimensional arrays (often called “tensors” by deep learning frameworks) on the GPU. There are many different tensor implementations available to Python data scientists, and we think it would be a huge win for the community if these different tensors could be exchanged between frameworks in a standard way. This is the opposite of the problem we faced with dataframes on the GPU: instead of having no GPU data structures to exchange, we have too many that don’t interoperate with each other.

As a first step, the Numba team is working to define a standard CUDA array interface for Python that would allow libraries to seamlessly share GPU arrays without having to copy and convert data. This would allow Numba and CuPy to work together for GPU-accelerated UDFs on top of CuPy arrays, as one example. We already have a proof of concept of this working with patched versions of the two projects:

import cupy
from numba import cuda

@cuda.jit
def add(x, y, out):
    # Grid-stride loop: each thread steps through the array in
    # increments of the total number of launched threads
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

a = cupy.arange(10)
b = a * 2
out = cupy.empty_like(a)

# CuPy arrays are handed straight to the Numba kernel -- no copies
add[1, 32](a, b, out)

CuPy and Numba are the first libraries working towards this standardization, but we hope other libraries such as PyCUDA and PyTorch will follow in their footsteps in the near future. The goal of this standard interface is to let data scientists leverage the libraries they need to solve their problems without worrying about the development burden and performance cost of building the “glue” that moves data between different libraries.
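Numba's proposal centers on a `__cuda_array_interface__` attribute: a producer describes its device memory with a small dict (shape, a NumPy-style type string, a device pointer, and a version number), and any consumer that understands the protocol can wrap that memory without copying. A minimal, CPU-only sketch of a producer follows; the attribute name and dict keys match the protocol, but the class and the device pointer here are illustrative stand-ins:

```python
# Sketch of a producer exposing the CUDA array interface.
# FakeDeviceArray and the pointer value are hypothetical; a real
# producer would report the address of actual GPU memory.
class FakeDeviceArray:
    def __init__(self, ptr, shape, typestr="<f4"):
        self._ptr = ptr
        self._shape = shape
        self._typestr = typestr

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._shape,        # tuple of ints
            "typestr": self._typestr,    # NumPy-style dtype string
            "data": (self._ptr, False),  # (device pointer, read-only flag)
            "version": 2,
        }

arr = FakeDeviceArray(ptr=0xDEADBEEF, shape=(10,))
iface = arr.__cuda_array_interface__
print(sorted(iface))  # ['data', 'shape', 'typestr', 'version']
```

A consumer such as Numba can inspect this dict and build its own zero-copy view over the same device memory, which is exactly what makes the CuPy-to-Numba hand-off above possible.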

Go Far, Go Together

GoAi has been a great vehicle for us to consolidate our voices to communicate with the data science community. We want to keep it that way, where it will always be the tip of the spear for end-to-end GPU acceleration. While LibGDF is moving into Apache Arrow, and Numba will continue to live on outside of GoAi, we will continue to collaborate on new challenges as they arise in GPU computing. If you want to be a part of our ecosystem, please contribute to the current open source projects that are using GPUs, and suggest new ones we should work on. Our goal is to build more bridges, and never more walls.