jychen7/arrow-datafusion-ruby

DataFusion in Ruby

This is yet another Ruby library that binds to DataFusion, the Apache Arrow in-memory query engine.

This is an alternative to datafusion-contrib/datafusion-ruby. Please refer to the FAQ below.

Quick Start

Gemfile

gem "arrow-datafusion"

App

require "datafusion"

ctx = Datafusion::SessionContext.new
# https://github.com/jychen7/arrow-datafusion-ruby/blob/main/spec/fixtures/test.csv
ctx.register_csv("csv", "test.csv")
results = ctx.sql("SELECT * FROM csv").collect

# results is an array of Datafusion::RecordBatch
results.size # 1
# to_h converts a Datafusion::RecordBatch to a Ruby Hash
results[0].to_h # {"int": [1, 2, 3, 4], "str": ["a", "b", "c", "d"], "float": [1.1, 2.2, 3.3, 4.4]}
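
Since to_h returns one array per column, it can be handy to turn the batches into row-oriented hashes. The following is a minimal sketch, not part of the gem: batches_to_rows is a hypothetical helper, and the sample data is a plain Hash standing in for results[0].to_h, assuming string keys and equal-length column arrays.

```ruby
# Stand-in for results[0].to_h from the Quick Start.
# Assumption: keys are strings and every column array has the same length.
columns = {
  "int"   => [1, 2, 3, 4],
  "str"   => ["a", "b", "c", "d"],
  "float" => [1.1, 2.2, 3.3, 4.4]
}

# Hypothetical helper (not provided by arrow-datafusion):
# takes an array of column-oriented hashes, one per record batch,
# and returns a flat array of row hashes.
def batches_to_rows(batch_hashes)
  batch_hashes.flat_map do |cols|
    names = cols.keys
    # transpose turns [[col1 values], [col2 values], ...] into per-row arrays
    cols.values.transpose.map { |values| names.zip(values).to_h }
  end
end

rows = batches_to_rows([columns])
puts rows.first.inspect
```

With real query results, the same helper could presumably be called as batches_to_rows(results.map(&:to_h)), since collect returns an array of record batches.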

Supported features

SessionContext

  • new
  • register_csv
  • sql
  • register_json
  • register_parquet
  • register_udf

Dataframe

  • new
  • collect
  • schema
  • select_columns
  • select
  • filter
  • aggregate
  • sort
  • limit
  • show
  • join
  • explain

Contribution Guide

Please see Contribution Guide.

FAQ

Why Magnus?

As of 2022-07, there are a few popular Ruby bindings for Rust: Rutie, Magnus, and other alternatives. Magnus was picked because its API seems cleaner and it seems clearer about safe vs. unsafe code. The author of Magnus has a "maybe biased" comparison in this reddit thread. The choice is subjective, and it should not take much effort if we decide to switch to different Ruby bindings for Rust in the future.

Why are the module name and gem name different?

The module name Datafusion follows datafusion and datafusion-python. The gem name datafusion has been taken on rubygems.org since 2016, so our gem is called arrow-datafusion.

The Ruby bindings of Arrow are in a similar situation: the gem is called red-arrow and the module is called arrow.

Why another set of Ruby bindings for Arrow DataFusion?

datafusion-contrib/datafusion-python was the first set of bindings for Arrow DataFusion (Rust). It was implemented using pyo3 for Rust -> Python. Besides Python, the DataFusion community also wants Java and other language bindings. To share development resources, datafusion-contrib/datafusion-c was created and is used by datafusion-contrib/datafusion-ruby. This Rust -> C -> Ruby/Python/Java/etc. implementation is published as the gem "red-datafusion" and is coupled with "red-arrow".

Around the same time "red-datafusion" was created, I wanted to use Arrow DataFusion in Ruby, mainly to query object stores such as S3/GCS, so I created Rust -> Ruby bindings using Magnus. I keep this Rust -> Ruby implementation as an alternative and publish it as the gem arrow-datafusion. To keep it simple, "arrow-datafusion" is not coupled with "red-arrow" at the moment.

P.S. DataFusion Python was coupled with PyArrow. There is a proposal to separate them in the medium to long term. For details, please refer to Can datafusion-python be used without pyarrow?.