
[bitnami/airflow] unable to use additional provider package (SFTPHook) #11423

Closed
MainRo opened this issue Jul 29, 2022 · 1 comment
MainRo commented Jul 29, 2022

Name and Version

bitnami/airflow 13.0.0

What steps will reproduce the bug?

  1. Start the chart.
  2. Create and run an Airflow DAG that uses a provider. In my case I want to use SFTPHook, so I have a task that looks like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def _task():
    # This import fails inside the operator's virtualenv:
    from airflow.providers.sftp.hooks.sftp import SFTPHook
    ...

with DAG(
    'my_dump',
    description='Dump',
    schedule_interval="0 1 * * *",
    start_date=datetime(2022, 7, 26),
    catchup=True,
) as dag:
    copy_to_s3 = PythonVirtualenvOperator(
        task_id="copy_to_s3",
        python_callable=_task,
    )

Are you using any custom parameters or values?

My values.yml looks like this:

auth:
  existingSecret: "airflow-ui-auth"
executor: "KubernetesExecutor"
extraEnvVars:
#  - name: AIRFLOW__KUBERNETES__DELETE_WORKER_PODS
#    value: "False"
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "s3://airflow/logs"
  - name: AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID
    value: minio_logs
git:
  # patch for https://github.com/bitnami/charts/issues/9879
  image:
    registry: docker.io
    repository: bitnami/git
    tag: 2.35.1
  dags:
    enabled: true
    repositories:
      - repository: "https://airflow:*****.git"
        branch: "master"
        name: "airflow-dags"
service:
  type: ClusterIP
  port: 8082
ingress:
  apiVersion: networking.k8s.io/v1
  enabled: true
  annotations:
    kubernetes.io/ingress.class: traefik
  hostname: airflow.****.com
  pathType: Prefix
  path: /
postgresql:
  enabled: false  # no standalone postgres
externalDatabase:
  host: "postgresql.postgresql.svc.cluster.local"
  user: airflow
  existingSecret: "airflow-pg-auth"
  database: airflow
  port: 5432
redis:
  auth:
    existingSecret: "airflow-redis-auth"
rbac:
  create: true
serviceAccount:
  create: true
  name: "airflow-worker-serviceaccount"
web:
  image:
    debug: true
worker:
  extraVolumeMounts:
    - name: aws-config
      mountPath: /.aws/
      readOnly: true
    - name: worker-requirements
      mountPath: /bitnami/python/requirements.txt
      subPath: requirements.txt
      readOnly: true
  extraVolumes:
    - name: aws-config
      secret:
        secretName: aws-config
    - name: worker-requirements
      configMap:
        name: worker-requirements

What is the expected behavior?

The import statement in _task should succeed.

What do you see instead?

This error is raised:

[2022-07-29, 13:39:56 UTC] {process_utils.py:165} INFO - Executing cmd: /tmp/venvnmmjt8sm/bin/python /tmp/venvnmmjt8sm/script.py /tmp/venvnmmjt8sm/script.in /tmp/venvnmmjt8sm/script.out /tmp/venvnmmjt8sm/string_args.txt
[2022-07-29, 13:39:56 UTC] {process_utils.py:169} INFO - Output:
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO - Traceback (most recent call last):
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO -   File "/tmp/venvnmmjt8sm/script.py", line 117, in <module>
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO -     res = _task(*arg_dict["args"], **arg_dict["kwargs"])
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO -   File "/tmp/venvnmmjt8sm/script.py", line 38, in _task
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO -     from airflow.providers.sftp.hooks.sftp import SFTPHook
[2022-07-29, 13:39:57 UTC] {process_utils.py:173} INFO - ModuleNotFoundError: No module named 'airflow'
[2022-07-29, 13:39:57 UTC] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/operators/python.py", line 424, in execute
    return super().execute(context=serializable_context)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/operators/python.py", line 474, in execute_callable
    execute_in_subprocess(
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/utils/process_utils.py", line 177, in execute_in_subprocess
    raise subprocess.CalledProcessError(exit_code, cmd)

It looks like the airflow package is not visible from the task when it runs in a PythonVirtualenvOperator. I tried installing the provider package (apache-airflow-providers-sftp):

  • on the worker (via the ConfigMap mounted at /bitnami/python/requirements.txt)
  • as the requirements parameter of PythonVirtualenvOperator (sketched below)

I also tried installing the airflow package itself both ways, without success; I always get the same error.

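For reference, the requirements variant looked roughly like this (same DAG and task as above; the package name is the provider's name on PyPI):

copy_to_s3 = PythonVirtualenvOperator(
    task_id="copy_to_s3",
    python_callable=_task,
    # Installed into the throwaway virtualenv the operator creates for this
    # task; the import still fails with the same ModuleNotFoundError.
    requirements=["apache-airflow-providers-sftp"],
)
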
I do not understand why the airflow package is not visible.
I would appreciate any help.

Additional information

No response

MainRo added the tech-issues label on Jul 29, 2022
bitnami-bot added this to Triage in Support on Jul 29, 2022
bitnami-bot added the triage label on Jul 29, 2022

MainRo commented Jul 29, 2022

I finally understood the origin of the problem:
Since I use the PythonVirtualenvOperator, the task runs in a newly created virtualenv that is isolated from the packages already installed on the worker.
For some reason, installing all the airflow packages into it (with the requirements parameter) does not fix the problem.

I also tried setting the system_site_packages parameter to True, but that does not fix the issue either (sketched below).
This is because the bitnami worker's Python environment is itself a dedicated virtualenv (/opt/bitnami/airflow/venv/), so the packages installed on the worker are not in the default/system Python installation that system_site_packages exposes.

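For reference, that variant looked roughly like this (same DAG and task as above):

copy_to_s3 = PythonVirtualenvOperator(
    task_id="copy_to_s3",
    python_callable=_task,
    # Lets the new virtualenv see the "system" site-packages, but the worker's
    # packages live in /opt/bitnami/airflow/venv, so they are still not found.
    system_site_packages=True,
)
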
The solution is to run the task in this existing virtualenv.
I ended up using the "activate_this.py" script with the PythonOperator, and it works:

def _task():
    # Activate the worker's existing virtualenv so that the packages installed
    # there (including the SFTP provider) become importable in this task.
    exec(open("/opt/bitnami/airflow/venv/bin/activate_this.py").read())

    from airflow.providers.sftp.hooks.sftp import SFTPHook
    ...

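For completeness, the task is then declared with a plain PythonOperator instead of PythonVirtualenvOperator, roughly like this (inside the same DAG as above; only the operator class and its import change):

from airflow.operators.python import PythonOperator

copy_to_s3 = PythonOperator(
    task_id="copy_to_s3",
    python_callable=_task,
)
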
So I guess this is not directly an issue with the chart, but it would be great if the PythonOperator ran in the bitnami venv by default.

For reference, there is also an open Airflow issue about using PythonVirtualenvOperator with an existing virtualenv:
apache/airflow#15286

MainRo closed this as completed on Jul 29, 2022
bitnami-bot moved this from Triage to Solved in Support on Jul 29, 2022
bitnami-bot added the solved label and removed the triage label on Jul 29, 2022
fmulero removed this from Solved in Support on Jan 18, 2023