Data passing

We use Apache Arrow to pass data between Pipeline steps and across different languages. The Orchest SDK wraps Apache Arrow so that it can be used in Orchest.

See the full data passing API reference for more information.

Python example

In this example, we show how to pass data between different pipeline steps using Python.

Consider a pipeline in which steps 1 and 2 both connect to step 3. We will create and name data in steps 1 and 2, and pass it to step 3.

"""step-1"""
import orchest

data = "Hello, World!"

# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_string")

"""step-2"""
import orchest

data = [3, 1, 4]

# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_list")

The output data from steps 1 and 2 is copied to shared memory so that step 3 can access it. This also lets us access the data in JupyterLab.

"""step-3"""
import orchest

# Get the input for step-3, i.e. the output of step-1 and step-2.
input_data = orchest.get_inputs()

Warning

🚨 Call orchest.transfer.get_inputs and orchest.transfer.output only once in a step. Otherwise your code will break in jobs and overwrite data.

Step 3's input_data will be:

{
 "my_list": [3, 1, 4],
 "my_string": "Hello, World!",
 "unnamed": []
}

We will discuss unnamed in the next section.
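Reading the named entries is ordinary dictionary access. The sketch below uses a hard-coded dictionary as a stand-in for the value returned by orchest.get_inputs(), so that it can run outside Orchest. The helper summarize and the output name "summary" are hypothetical and only serve the illustration. It gathers everything into one result object, so that a real step would need a single output call, in line with the warning above.

```python
# Stand-in for the dictionary that orchest.get_inputs() returns in step 3.
# Hard-coded here so the sketch runs outside Orchest.
input_data = {
    "my_list": [3, 1, 4],
    "my_string": "Hello, World!",
    "unnamed": [],
}


def summarize(inputs):
    """Build one result object from the named inputs (hypothetical helper)."""
    numbers = inputs["my_list"]
    greeting = inputs["my_string"]
    return {
        "sorted": sorted(numbers),
        "total": sum(numbers),
        "greeting_length": len(greeting),
    }


summary = summarize(input_data)
print(summary)

# In a real step, the single output call would follow, for example:
# orchest.output(summary, name="summary")
```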

Passing data without a name

In most cases it's best practice to pass data with a name. However, sometimes you may prefer to receive your data as a list rather than a dictionary. For that reason, naming outputted data is optional.

When passing unnamed data, the receiving step treats the values as an ordered collection (see Ordering unnamed data). In the previous example, step 3 receives input data with a special key called unnamed.

If we change the output of step 1 to:

"""step-1"""
import orchest

data = "Hello, World!"

# Output the data so that step-3 can retrieve it.
# But this time, don't give a name.
orchest.output(data, name=None)

The input_data in step 3 would then be equal to:

{
 "my_list": [3, 1, 4],
 "unnamed": ["Hello, World!"]
}

If we also change step 2 to:

"""step-2"""
import orchest

data = [3, 1, 4]

orchest.output(data, name=None)

The input_data in step 3 would be:

{
 "unnamed": ["Hello, World!", [3, 1, 4]]
}

The unnamed key is populated with all values that were outputted without a name.
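The three cases above can be reproduced with a small model of the behavior. This is an illustrative sketch only, not the SDK's implementation: collect_inputs is a hypothetical function that takes one (data, name) pair per parent step, in connection order, and builds the dictionary that step 3 would see.

```python
def collect_inputs(outputs):
    """Model how step 3's input dictionary is assembled.

    `outputs` is a list of (data, name) pairs, one per parent step, in
    connection order. Illustrative model only, not the SDK's code.
    """
    inputs = {"unnamed": []}
    for data, name in outputs:
        if name is None:
            # Unnamed values are appended in connection order.
            inputs["unnamed"].append(data)
        else:
            # Named values become their own key.
            inputs[name] = data
    return inputs


both_named = collect_inputs([("Hello, World!", "my_string"), ([3, 1, 4], "my_list")])
one_unnamed = collect_inputs([("Hello, World!", None), ([3, 1, 4], "my_list")])
both_unnamed = collect_inputs([("Hello, World!", None), ([3, 1, 4], None)])

print(both_named)
print(one_unnamed)
print(both_unnamed)
```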

Ordering unnamed data

You can set the order of a step's incoming data in the visual pipeline editor. The order is written to the pipeline definition file, and orchest.transfer.get_inputs infers the order of the unnamed values from that file.

Below is a screenshot of step 3's properties from the example above. The list can be reordered with drag and drop.

[Screenshot: step 3's properties, listing its incoming connections in order]

With the connections ordered as in the screenshot, step 3's input_data becomes:

{
 "unnamed": [[3, 1, 4], "Hello, World!"]
}

Top-to-bottom in the visual editor corresponds to left-to-right in unnamed.
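Because the position of each value in unnamed depends on the order set in the editor, code that unpacks the list positionally will silently receive swapped values if someone reorders the connections. A minimal sketch of positional unpacking with an early check follows. The dictionary is a hard-coded stand-in for the result of orchest.get_inputs(), and the type check is a hypothetical safeguard rather than something the SDK does for you.

```python
# Stand-in for orchest.get_inputs() with the connection order shown above.
input_data = {"unnamed": [[3, 1, 4], "Hello, World!"]}

# Unpack positionally: the top connection in the editor comes first.
numbers, greeting = input_data["unnamed"]

# Hypothetical safeguard: fail early if the connections were reordered.
if not isinstance(numbers, list) or not isinstance(greeting, str):
    raise TypeError(
        "Unexpected order of unnamed inputs; check the step's connections."
    )

print(numbers, greeting)
```

When the order matters this much, naming the outputs avoids the dependency on the editor order altogether.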

R example

Tip

👉 Import the example project showcasing R straight into Orchest (see how to import a project).

The Orchest SDK in R works through the reticulate package. To explain its usage, an example project is provided below.

First, create an Orchest environment that uses the orchest/base-kernel-r base image (see the environments documentation for more details). Next, install reticulate and configure access to Python and the Orchest SDK. You can do so by adding a script (say, Install.r) to your project with the following content:

install.packages("reticulate", repos = "http://cran.us.r-project.org")
library(reticulate)

# Dynamically find system Python
python_path <- system("which python", intern=TRUE)
use_python(python_path)

# Pre compile orchest deps
orchest <- import("orchest")

print(orchest)

Then have the environment set-up script run Rscript Install.r. You will then be able to access the Orchest SDK through R in every step that uses this environment. To pass data, for example, you would do the following:

library(reticulate);
python_path <- system("which python", intern=TRUE);
use_python(python_path);
orchest <- import("orchest");
orchest$transfer$output(2, name="Test");

In a child step you will be able to retrieve the output:

library(reticulate);
python_path <- system("which python", intern=TRUE);
use_python(python_path);
orchest <- import("orchest");
# Retrieve all inputs once, then read the value that was output as "Test".
step_inputs <- orchest$transfer$get_inputs();
step_inputs$Test

Julia example

Refer to the example project showcasing Julia in Orchest (see how to import a project).

JavaScript example

Refer to the example project showcasing JavaScript in Orchest (see how to import a project).