
Invalid airflow.providers.amazon.aws.transfers.mongo_to_s3.MongoToS3Operator execute function signature #16479

Closed
iiii4966 opened this issue Jun 16, 2021 · 2 comments
Labels
duplicate Issue that is duplicated kind:bug This is a clearly a bug pending-response provider:amazon-aws AWS/Amazon - related issues

Comments

@iiii4966

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version): not used

Environment:

  • Cloud provider or hardware configuration: x86_64
  • OS (e.g. from /etc/os-release): macOS
  • Kernel (e.g. uname -a): Darwin Kernel Version 20.3.0
  • Install tools: installed with the command below
AIRFLOW_VERSION=2.1.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[postgres,google,amazon,sentry,mongo,slack]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
  • Others:
    requirements
    pymongo==3.11.4
    apache-airflow-providers-amazon==1.4.0

What happened: I ran MongoToS3Operator, but the error below occurred.

Traceback (most recent call last):
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/airflow/providers/amazon/aws/transfers/mongo_to_s3.py", line 116, in execute
    results = MongoHook(self.mongo_conn_id).find(
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/airflow/providers/mongo/hooks/mongo.py", line 146, in find
    return collection.find(query, **kwargs)
  File "PycharmProjects/data-pipeline-2/venv/lib/python3.8/site-packages/pymongo/collection.py", line 1523, in find
    return Cursor(self, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'allowDiskUse'
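The failure mode can be reproduced without Airflow or pymongo: any wrapper that forwards **kwargs verbatim into a function whose parameter uses a different casing raises the same TypeError. A minimal stand-in sketch (hypothetical names, not the real pymongo API):

```python
def cursor_init(filter=None, allow_disk_use=None):
    """Stand-in for pymongo.cursor.Cursor.__init__, whose parameter is snake_case."""
    return {"filter": filter, "allow_disk_use": allow_disk_use}

def hook_find(query, **kwargs):
    """Stand-in for MongoHook.find, which forwards **kwargs unchanged."""
    return cursor_init(filter=query, **kwargs)

# Forwarding the camelCase name fails, mirroring the traceback above:
try:
    hook_find({}, allowDiskUse=True)
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'allowDiskUse'

# The snake_case name matches the signature and is accepted:
result = hook_find({}, allow_disk_use=True)
```

Because the hook never inspects the kwargs, the mismatch only surfaces at runtime, deep inside pymongo.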

Looking at the signature of the pymongo.cursor.Cursor class, the corresponding argument is named allow_disk_use, in snake_case.

class Cursor(object):
    """A cursor / iterator over Mongo query results.
    """
    _query_class = _Query
    _getmore_class = _GetMore

    def __init__(self, collection, filter=None, projection=None, skip=0,
                 limit=0, no_cursor_timeout=False,
                 cursor_type=CursorType.NON_TAILABLE,
                 sort=None, allow_partial_results=False, oplog_replay=False,
                 modifiers=None, batch_size=0, manipulate=True,
                 collation=None, hint=None, max_scan=None, max_time_ms=None,
                 max=None, min=None, return_key=False, show_record_id=False,
                 snapshot=False, comment=None, session=None,
                 allow_disk_use=None):  # <- snake_case!!
        """Create a new cursor.

        Should not be called directly by application developers - see
        :meth:`~pymongo.collection.Collection.find` instead.

        .. mongodoc:: cursors
        """

However, the execute method of airflow.providers.amazon.aws.transfers.mongo_to_s3.MongoToS3Operator passes the argument in camelCase:

  def execute(self, context) -> bool:
      """Is written to depend on transform method"""
      s3_conn = S3Hook(self.aws_conn_id)

      # Grab collection and execute query according to whether or not it is a pipeline
      if self.is_pipeline:
          results = MongoHook(self.mongo_conn_id).aggregate(
              mongo_collection=self.mongo_collection,
              aggregate_query=cast(list, self.mongo_query),
              mongo_db=self.mongo_db,
              allowDiskUse=self.allow_disk_use,  # <- camelCase!?
          )

      else:
          results = MongoHook(self.mongo_conn_id).find(
              mongo_collection=self.mongo_collection,
              query=cast(dict, self.mongo_query),
              mongo_db=self.mongo_db,
              allowDiskUse=self.allow_disk_use,  # <- camelCase!?
          )

In other words, the allowDiskUse keyword that MongoHook's aggregate and find forward does not match pymongo's Cursor argument allow_disk_use. The operator works correctly once the argument is renamed, as in the code below.

if self.is_pipeline:
    results = MongoHook(self.mongo_conn_id).aggregate(
        mongo_collection=self.mongo_collection,
        aggregate_query=cast(list, self.mongo_query),
        mongo_db=self.mongo_db,
        allow_disk_use=self.allow_disk_use,
    )

else:
    results = MongoHook(self.mongo_conn_id).find(
        mongo_collection=self.mongo_collection,
        query=cast(dict, self.mongo_query),
        mongo_db=self.mongo_db,
        allow_disk_use=self.allow_disk_use,
    )

I'd appreciate it if you could give me an answer.

@iiii4966 iiii4966 added the kind:bug This is a clearly a bug label Jun 16, 2021
@boring-cyborg

boring-cyborg bot commented Jun 16, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@eladkal
Contributor

eladkal commented Jun 16, 2021

This should have been fixed by #15680.
It hasn't been officially released yet, but we are in the process of an RC.
Please test with pip install apache-airflow-providers-amazon==2.0.0rc1
We would also appreciate feedback in #16456

@eladkal eladkal added pending-response provider:amazon-aws AWS/Amazon - related issues labels Jun 16, 2021
@potiuk potiuk added the duplicate Issue that is duplicated label Jun 16, 2021
@potiuk potiuk closed this as completed Jun 16, 2021