We use Apache Arrow to pass data between pipeline steps and across different languages. The Orchest SDK
wraps Apache Arrow so that it can be used in Orchest.
See the full data passing API reference <api transfer>
for more information.
In this example, we show how to pass data between pipeline steps using Python.
The pipeline has three steps, with steps 1 and 2 both connected to step 3.
We will create and name data in steps 1 and 2, and pass it to step 3.
"""step-1"""
import orchest
data = "Hello, World!"
# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_string")
"""step-2"""
import orchest
data = [3, 1, 4]
# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_list")
The output data from steps 1 and 2 is copied to shared memory so that step 3 can access it. This also lets us access the data in JupyterLab.
"""step-3"""
import orchest
# Get the input for step-3, i.e. the output of step-1 and step-2.
input_data = orchest.get_inputs()
Warning
🚨 Only call orchest.transfer.get_inputs and orchest.transfer.output once. Otherwise your code will break in jobs <jobs> and overwrite data.
Step 3's input_data
will be:
{
"my_list": [3, 1, 4],
"my_string": "Hello, World!",
"unnamed": []
}
We will discuss unnamed
in the next section.
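As a minimal sketch of how step 3 might consume this dictionary, the snippet below uses a hard-coded dictionary in place of the value returned by `orchest.get_inputs()`, so that it can run outside Orchest. The function name `summarize_inputs` is hypothetical and not part of the Orchest SDK.

```python
def summarize_inputs(input_data):
    """Build a short message from the named inputs and count the unnamed ones."""
    greeting = input_data["my_string"]
    numbers = input_data["my_list"]
    # In the example above "unnamed" is an empty list because both
    # parent steps gave their output a name.
    unnamed_count = len(input_data.get("unnamed", []))
    return f"{greeting} sum={sum(numbers)} unnamed={unnamed_count}"


# Stand-in for `input_data = orchest.get_inputs()` inside step 3.
input_data = {
    "my_list": [3, 1, 4],
    "my_string": "Hello, World!",
    "unnamed": [],
}

summary = summarize_inputs(input_data)
print(summary)
```

Looking values up by name keeps step 3 independent of the order in which its parent steps are connected.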
It's best practice to pass data with a name in most cases. However, sometimes you may prefer to receive your data as a list rather than as a dictionary, so giving outputted data a name is optional.
When passing unnamed data, the receiving step treats the values as an ordered collection (see order of unnamed data <unnamed order>). In the previous example, step 3 receives input data with a special key called unnamed.
If we change the output of step 1 to:
"""step-1"""
import orchest
data = "Hello, World!"
# Output the data so that step-3 can retrieve it.
# But this time, don't give a name.
orchest.output(data, name=None)
The input_data
in step 3 would then be equal to:
{
"my_list": [3, 1, 4],
"unnamed": ["Hello, World!"]
}
If we also change step 2 to:
"""step-2"""
import orchest
data = [3, 1, 4]
orchest.output(data, name=None)
The input_data
in step 3 would be:
{
"unnamed": ["Hello, World!", [3, 1, 4]]
}
This populates the unnamed key with all values that were outputted without a name.
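The three cases above can be reproduced with a small simulation. This is only a sketch of the behaviour described in the text, not the SDK's implementation: `collect_inputs` is a hypothetical helper, and each parent output is modelled as a `(value, name)` pair listed in connection order.

```python
def collect_inputs(parent_outputs):
    """Group parent outputs into named keys plus an ordered "unnamed" list.

    `parent_outputs` is a list of (value, name) pairs, one per parent step,
    where name is None for unnamed data. Hypothetical helper that mimics the
    behaviour described in the text; it is not Orchest SDK code.
    """
    inputs = {"unnamed": []}
    for value, name in parent_outputs:
        if name is None:
            # Unnamed values are appended in the order they are listed.
            inputs["unnamed"].append(value)
        else:
            inputs[name] = value
    return inputs


both_named = collect_inputs([("Hello, World!", "my_string"), ([3, 1, 4], "my_list")])
one_unnamed = collect_inputs([("Hello, World!", None), ([3, 1, 4], "my_list")])
both_unnamed = collect_inputs([("Hello, World!", None), ([3, 1, 4], None)])
print(both_unnamed)
```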
The visual pipeline editor lets you set the order in which data is passed to a step. This order is written to the pipeline definition file, and orchest.transfer.get_inputs then infers the order from that file.
Below is a screenshot of step 3's properties from the example above. The list can be reordered with drag and drop.
With the connections ordered as shown above, step 3's input_data becomes:
{
"unnamed": [[3, 1, 4], "Hello, World!"]
}
Top-to-bottom in the visual editor corresponds to left-to-right in unnamed.
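The effect of reordering can be illustrated with a short sketch. The step identifiers, the `outputs_by_step` mapping, and the function `unnamed_in_connection_order` are all hypothetical stand-ins; the real pipeline definition file format is not shown here.

```python
def unnamed_in_connection_order(connection_order, outputs_by_step):
    """Return unnamed outputs in the order the connections are listed.

    `connection_order` models the top-to-bottom list in the visual editor.
    Hypothetical helper for illustration only.
    """
    return [outputs_by_step[step] for step in connection_order]


# Unnamed outputs of the two parent steps from the example.
outputs_by_step = {"step-1": "Hello, World!", "step-2": [3, 1, 4]}

default_order = unnamed_in_connection_order(["step-1", "step-2"], outputs_by_step)
reordered = unnamed_in_connection_order(["step-2", "step-1"], outputs_by_step)

# Positional unpacking in the receiving step depends on this order.
numbers, greeting = reordered
print(reordered)
```

Because unnamed values are identified only by position, code that unpacks them has to be updated whenever the connection order changes.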
Tip
👉 Import the example project showcasing R straight in Orchest (how-to-import-a-project <how-to-import-a-project>
).
The Orchest SDK in R works through the reticulate package. To explain its usage, an example project is provided below.
First, create an Orchest environment that uses the orchest/base-kernel-r base image (you can find more details here <environments>). Next, install reticulate and configure access to Python and the Orchest SDK. You can do so with a script (for example, Install.r) in your project with the following content:
install.packages("reticulate", repos = "http://cran.us.r-project.org")
library(reticulate)
# Dynamically find system Python
python_path <- system("which python", intern=TRUE)
use_python(python_path)
# Pre compile orchest deps
orchest <- import("orchest")
print(orchest)
Then have the environment set-up script run Rscript Install.r. You will then be able to access the Orchest SDK through R in every step that uses this environment. To pass data, for example, you would do the following:
library(reticulate);
python_path <- system("which python", intern=TRUE);
use_python(python_path);
orchest <- import("orchest");
orchest$transfer$output(2, name="Test");
In a child step you will be able to retrieve the output:
library(reticulate);
python_path <- system("which python", intern=TRUE);
use_python(python_path);
orchest <- import("orchest")
step_inputs <- orchest$transfer$get_inputs()
step_inputs$Test
Refer to the example project showcasing Julia in Orchest (how-to-import-a-project <how-to-import-a-project>).
Refer to the example project showcasing JavaScript in Orchest (how-to-import-a-project <how-to-import-a-project>).