[Ballista] Support to access remote object store, like HDFS, S3, etc #6

yahoNanJing · 2022-01-29T08:23:46Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

After introducing the object store API, there are still some gaps in letting Ballista executors access remote object stores. For example, as #22 and #10 mention, Ballista cannot yet read from a remote object store.

Describe the solution you'd like

Our workaround is to make the file path self-describing. For example, a local file path should be file:///tmp/..., and an HDFS file path should be hdfs://localhost:xxx/tmp/...

matthewmturner · 2022-01-29T13:14:54Z

@yahoNanJing fyi @seddonm1 and I have been working on https://github.com/datafusion-contrib/datafusion-objectstore-s3. Still early stages but would be great to have more use cases that we could steer development with.

thinkharderdev · 2022-01-29T14:55:19Z

This PR would support this use case: apache/datafusion#1677

yahoNanJing · 2022-02-07T07:17:25Z

Hi @matthewmturner, thanks for the info. It's a really helpful example for creating another independent crate, like datafusion-objectstore-hdfs. However, it would be better to avoid explicit object store registration in cases such as providing a SQL service, where users have no place to register a store manually. Registering per query or per session is also not a good fit, since the data sources a service provides are quite stable.

yahoNanJing · 2022-02-07T07:20:18Z

> This PR would support this use case: apache/datafusion#1677

Thanks @thinkharderdev. I think it's better for the path to be self-described, like s3://hostname:port/xxx/... or hdfs://hostname:port/xxx/.... Then the object store management service can detect which remote object store will be used.

If possible, it's better to provide a way to avoid remote object store registration.
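To make the self-described path idea concrete, here is a minimal sketch of how a service could pick a store from the path alone. The `StoreKind` enum and `resolve_store` function are hypothetical names for illustration and are not part of DataFusion or Ballista; it only assumes paths of the form scheme://authority/...

```rust
/// Which backend a self-described path points at (hypothetical enum).
#[derive(Debug, PartialEq)]
enum StoreKind {
    Local,
    Hdfs,
    S3,
}

/// Resolve the store kind and the authority ("host:port" or bucket) from a path.
/// Paths without a "scheme://" prefix are treated as local files.
/// Returns None when the scheme is not one we know how to serve.
fn resolve_store(path: &str) -> Option<(StoreKind, &str)> {
    let (scheme, rest) = match path.split_once("://") {
        Some(parts) => parts,
        None => return Some((StoreKind::Local, "")),
    };
    // The authority runs up to the first '/' after "scheme://".
    let authority = rest.split('/').next().unwrap_or("");
    let kind = match scheme.to_ascii_lowercase().as_str() {
        "file" => StoreKind::Local,
        "hdfs" => StoreKind::Hdfs,
        "s3" => StoreKind::S3,
        _ => return None,
    };
    Some((kind, authority))
}

fn main() {
    let paths = [
        "hdfs://namenode:9000/tmp/a.parquet",
        "s3://bucket/key.csv",
        "/tmp/local.csv",
    ];
    for path in paths {
        println!("{} -> {:?}", path, resolve_store(path));
    }
}
```

With something like this on the service side, a client would only need to hand over the path; which store to use follows from the scheme and authority.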

matthewmturner · 2022-02-07T15:06:16Z

@yahoNanJing there actually already is an HDFS extension here: https://github.com/datafusion-contrib/datafusion-hdfs-native.

Understood on your point - in that case what would be your ideal API for accessing S3? I would have thought the service could register the ObjectStore to share with clients and clients could access after registration at service level had been completed.

thinkharderdev · 2022-02-07T20:17:24Z

> > This PR would support this use case: apache/datafusion#1677
>
> Thanks @thinkharderdev. I think it's better for the path to be self-described, like s3://hostname:port/xxx/... or hdfs://hostname:port/xxx/.... Then the object store management service can detect which remote object store will be used.
>
> If possible, it's better to provide a way to avoid remote object store registration.

Not sure I follow. I agree that the object store should be resolved from the file URI, but we would still have to register an ObjectStore in the ExecutionContext in order to use it right?

yahoNanJing · 2022-02-08T02:55:33Z

Hi @matthewmturner and @thinkharderdev, what I'm thinking is to introduce some remote object store extensions as DataFusion features. These extensions would live in their own crates and be managed independently, so that this solution does not add maintenance effort for DataFusion.

However, this solution is blocked by apache/datafusion#1772. Once the datasource module is split into an independent crate, the remote object store extensions can depend on just that crate. Elsewhere, we can configure whether each extension feature is enabled. This way, we avoid a cyclic dependency.

Furthermore, it may be better to make the default object stores in the ObjectStoreRegistry configurable, so that remote object stores do not have to be registered manually.
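As a rough sketch of what "configurable defaults" could look like: a registry that is filled from configuration entries at startup, so nothing has to be registered per query or per session. Everything here is hypothetical (the `Registry` and `StoreHandle` types, the "scheme=endpoint" entry format); it is a stand-in to show the shape, not the actual ObjectStoreRegistry.

```rust
use std::collections::HashMap;

/// Stand-in for a real object store handle (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct StoreHandle {
    scheme: String,
    endpoint: String,
}

/// Hypothetical registry keyed by URI scheme.
struct Registry {
    stores: HashMap<String, StoreHandle>,
}

impl Registry {
    /// Build the registry from entries of the assumed form "scheme=endpoint".
    /// The local "file" store is always present.
    fn from_config(entries: &[&str]) -> Result<Self, String> {
        let mut stores = HashMap::new();
        stores.insert(
            "file".to_string(),
            StoreHandle { scheme: "file".to_string(), endpoint: String::new() },
        );
        for entry in entries {
            let (scheme, endpoint) = entry
                .split_once('=')
                .ok_or_else(|| format!("malformed entry: {}", entry))?;
            stores.insert(
                scheme.to_string(),
                StoreHandle { scheme: scheme.to_string(), endpoint: endpoint.to_string() },
            );
        }
        Ok(Self { stores })
    }

    /// Look up the store for a URI; paths without a scheme fall back to "file".
    fn get_by_uri(&self, uri: &str) -> Option<&StoreHandle> {
        let scheme = uri.split_once("://").map(|(s, _)| s).unwrap_or("file");
        self.stores.get(scheme)
    }
}

fn main() {
    let registry =
        Registry::from_config(&["s3=http://localhost:9000", "hdfs=namenode:9000"]).unwrap();
    println!("{:?}", registry.get_by_uri("s3://bucket/a.parquet"));
    println!("{:?}", registry.get_by_uri("gs://bucket/a.parquet"));
}
```

The scheduler and executors could each build the same registry from the same configuration, which is what removes the manual registration step.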

milenkovicm · 2022-09-01T15:17:03Z

Had a quick look into this issue, and from what I can see, there is nothing missing on the DataFusion side to have this functionality (apart from some hard work :)).

The team did a great job adding object store support to DataFusion:

use std::sync::Arc;

use datafusion::{
    datasource::listing::{ListingTable, ListingTableConfig, ListingTableUrl},
    prelude::SessionContext,
};
use object_store::aws::AmazonS3Builder;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();

    // Build an S3 client pointed at a local MinIO instance.
    let s3 = AmazonS3Builder::new()
        .with_region("us-east-1")
        .with_bucket_name("testbucket")
        .with_access_key_id("MINIO")
        .with_secret_access_key("MINIO/MINIO")
        .with_endpoint("http://localhost:9000")
        .with_allow_http(true)
        .build()
        .unwrap();
    let s3 = Arc::new(s3);

    // Register the store under the scheme and host used in the table URL below.
    // Note: the signature of register_object_store differs between DataFusion versions.
    ctx.runtime_env()
        .register_object_store("s3", "localhost:9000", s3);

    let url = ListingTableUrl::parse("s3://localhost:9000/testpath/").unwrap();

    // Infer the file format and schema from the files under the URL.
    let config = ListingTableConfig::new(url)
        .infer(&ctx.state())
        .await
        .unwrap();

    let table = ListingTable::try_new(config).unwrap();
    ctx.register_table("test", Arc::new(table)).unwrap();

    ctx.sql("SELECT * FROM test")
        .await
        .unwrap()
        .show()
        .await
        .unwrap();
}

I gave it a quick try with Ballista standalone, changing the code a bit to expose RuntimeEnv on the client, scheduler, and executor, and registering the store on each of them manually. In the end, it produced the correct result. Currently, getting to a RuntimeEnv is no "walk in the park"; a few hacks here and there were needed, but it would not be hard to make it easier. It would then be possible to load object store providers from configuration files.

Alternatively, register_object_store could be provided directly on the BallistaContext, and the object store configuration could then somehow be serialized and handled by the other actors in the system. AmazonS3Builder would probably need to be modified so that it can be serialized.
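To illustrate the serialization idea, here is a small sketch of a plain configuration struct that can be turned into text on the client and rebuilt on the scheduler or executor, where the real builder would then be constructed from it. `S3StoreConfig` and its key=value wire format are assumptions made up for this example; they are not an existing Ballista or object_store API, and credentials are deliberately left out on the assumption that each process gets them from its own environment.

```rust
/// Hypothetical, serializable description of an S3 store (no credentials).
#[derive(Debug, Clone, PartialEq)]
struct S3StoreConfig {
    region: String,
    bucket: String,
    endpoint: String,
    allow_http: bool,
}

impl S3StoreConfig {
    /// Encode as "key=value" lines; values must not contain newlines.
    fn encode(&self) -> String {
        format!(
            "region={}\nbucket={}\nendpoint={}\nallow_http={}",
            self.region, self.bucket, self.endpoint, self.allow_http
        )
    }

    /// Decode the format produced by `encode`, rejecting unknown or missing keys.
    fn decode(text: &str) -> Result<Self, String> {
        let (mut region, mut bucket, mut endpoint, mut allow_http) = (None, None, None, None);
        for line in text.lines() {
            // Split at the first '=' only, so URLs in values stay intact.
            let (key, value) = line
                .split_once('=')
                .ok_or_else(|| format!("malformed line: {}", line))?;
            match key {
                "region" => region = Some(value.to_string()),
                "bucket" => bucket = Some(value.to_string()),
                "endpoint" => endpoint = Some(value.to_string()),
                "allow_http" => {
                    allow_http = Some(value.parse::<bool>().map_err(|e| e.to_string())?)
                }
                other => return Err(format!("unknown key: {}", other)),
            }
        }
        Ok(Self {
            region: region.ok_or("missing region")?,
            bucket: bucket.ok_or("missing bucket")?,
            endpoint: endpoint.ok_or("missing endpoint")?,
            allow_http: allow_http.ok_or("missing allow_http")?,
        })
    }
}

fn main() {
    let config = S3StoreConfig {
        region: "us-east-1".to_string(),
        bucket: "testbucket".to_string(),
        endpoint: "http://localhost:9000".to_string(),
        allow_http: true,
    };
    let wire = config.encode();
    let received = S3StoreConfig::decode(&wire).unwrap();
    println!("round trip ok: {}", received == config);
}
```

In a real implementation the wire format would more likely be a protobuf message alongside the rest of the Ballista plan serialization; the point is only that a plain data struct is easier to ship between processes than the builder itself.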

avantgardnerio · 2022-09-01T15:30:32Z

@yahoNanJing this issue seems related to apache/datafusion#3311 where we are working towards allowing users to register delta-rs tables dynamically through SQL at runtime in Ballista.

milenkovicm · 2022-09-01T15:34:26Z

@avantgardnerio SQL would be a perfect fit

avantgardnerio · 2022-09-01T17:10:11Z

> ObjectStore in the ExecutionContext in order to use it right?

I think the problem is that this must happen dynamically in the case of a DataFusion executor in Ballista. The solution I am proposing is a TableProviderFactory in apache/datafusion#3311

Edit: by dynamically, I mean the name of the table and the path to it will not be known at compile time.

saikrishna1-bidgely · 2023-01-04T20:11:27Z

Is loading data from S3 possible with a distributed setup right now? If so, can you provide a small example? I tried with a path on S3 and it failed with a "no object store available for s3" error. This happened after I added the s3 feature to my project. It looks like the s3 feature hasn't been added to the scheduler and executor.

ahmedriza · 2023-02-10T23:18:37Z

@saikrishna1-bidgely, here's an example of what I tried and this worked. I'm not 100% sure that this is how it is supposed to be :-)

Build the scheduler with the s3 feature enabled, i.e. modify its Cargo.toml:

ballista-core = { path = "../core", version = "0.10.0" , features = ["s3"] }

Define the following environment variables (I tested with a MinIO instance), and ensure that they are available to the scheduler and executor processes.

AWS_DEFAULT_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_ENDPOINT

You may need to define additional AWS environment variables depending on your S3 service.
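Since a missing variable in either process is easy to overlook, a small pre-flight check along these lines could be run in the same environment as the scheduler and executor. This is just a sketch: the `missing_vars` helper is a made-up name, and the list only covers the four variables named above.

```rust
use std::env;

/// The variables listed above; your S3 service may need more.
const REQUIRED: [&str; 4] = [
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT",
];

/// Return the required variables that are unset or empty.
/// The lookup is passed in so the check can be exercised without touching
/// the real process environment.
fn missing_vars<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |value| value.is_empty()))
        .collect()
}

fn main() {
    let missing = missing_vars(|name| env::var(name).ok());
    if missing.is_empty() {
        println!("all S3 variables are set");
    } else {
        eprintln!("missing S3 variables: {}", missing.join(", "));
    }
}
```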

Start the scheduler and executor.

I tested with the following sample code:

use ballista::prelude::BallistaContext;
use ballista_core::config::BallistaConfig;
use datafusion::prelude::ParquetReadOptions;

#[tokio::main]
pub async fn main() -> datafusion::error::Result<()> {
    let config = BallistaConfig::builder().build().unwrap();
    // Connect to a scheduler that is already listening on localhost:50050.
    let ctx = BallistaContext::remote("localhost", 50050, &config).await.unwrap();
    let filename = "s3://foo/test.parquet";
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?;
    // Counting rows forces a distributed job that reads the file from S3.
    let rows = df.count().await?;
    println!("rows: {}", rows);
    Ok(())
}

The code correctly returns the number of rows in the Parquet file:

rows: 15309

Last bit of logging from the scheduler process:

2023-02-10T23:16:11.345868Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::display: === [pRdAqhp/2] Stage finished, physical plan with metrics ===
ShuffleWriterExec: None, metrics=[output_rows=1, input_rows=1, repart_time=1ns, write_time=721.748µs]
  AggregateExec: mode=Final, gby=[], aggr=[COUNT(NULL)], metrics=[output_rows=1, elapsed_compute=28.791µs, spill_count=0, spilled_bytes=0, mem_used=0]
    CoalescePartitionsExec, metrics=[]
      ShuffleReaderExec: partitions=1, metrics=[]


2023-02-10T23:16:11.346163Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::state::execution_graph: Job pRdAqhp is success, finalizing output partitions
2023-02-10T23:16:11.346354Z  INFO tokio-runtime-worker ThreadId(05) ballista_scheduler::scheduler_server::query_stage_scheduler: Job pRdAqhp success

From the executor process:

2023-02-10T23:16:11.241689Z  INFO          task_runner ThreadId(22) ballista_executor::metrics: === [pRdAqhp/2/0] Physical plan with metrics ===
ShuffleWriterExec: None, metrics=[output_rows=1, input_rows=1, write_time=721.748µs, repart_time=1ns]
  AggregateExec: mode=Final, gby=[], aggr=[COUNT(NULL)], metrics=[output_rows=1, elapsed_compute=28.791µs, spill_count=0, spilled_bytes=0, mem_used=0]
    CoalescePartitionsExec, metrics=[]
      ShuffleReaderExec: partitions=1, metrics=[]

Hope this helps.

saikrishna1-bidgely · 2023-03-01T20:20:36Z

@ahmedriza could you make this into a PR, please? That would be very helpful.

saikrishna1-bidgely · 2023-03-01T21:10:36Z

@ahmedriza I tried what you suggested. I built the scheduler and executor with the s3 feature added to the ballista-core dependency in Cargo.toml, but I'm getting the same error: Error: DataFusionError(Execution("No object store available for s3://ballista-test-bucket/temp.csv")). I'm running this on Windows 10. Also, I'm running the code in another repo with this as its Cargo.toml:

[package]
name = "ballista-test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
ballista = "0.11.0"
datafusion = "18.0.0"
tokio = "1.0"
parquet = "29.0.0"

YuriyGavrilov · 2023-10-06T20:36:19Z

It would also be nice to have uplink support for the Storj network https://github.com/storj/uplink (Rust bindings for libuplink: https://github.com/storj-thirdparty/uplink-rust). It provides direct access to the Storj network, avoiding the S3 gateway bottleneck (https://github.com/storj/storj/wiki/Libuplink-Walkthrough).

yahoNanJing added the enhancement New feature or request label Jan 29, 2022

yahoNanJing mentioned this issue May 19, 2022

Ballista Enhancement Overview #7

Open

15 tasks

andygrove added the ballista label Feb 5, 2022

yahoNanJing mentioned this issue Feb 8, 2022

The returned path value of get_by_uri should be self-described with entire path apache/datafusion#1779

Merged

andygrove transferred this issue from apache/datafusion May 19, 2022

andygrove removed the ballista label May 20, 2022

This was referenced Sep 21, 2022

Add a feature based object store provider #258

Merged

Introduce the datafusion-objectstore-hdfs in datafusion-contrib as an object store feature #260

Merged

luckylsk34 mentioned this issue Mar 6, 2023

Allow accessing s3 locations in client mode #700

Merged
