Airflow 2.0.2 --- TypeError: unhashable type: 'AttrDict' while trying to read logs from Elasticsearch #15613
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
@jedcunningham Any ideas here -- I know you worked with logstash & ES a bit on a couple of PRs, does it ring a bell? |
@Pravka does the issue still happen in the latest Airflow version? |
Seems like I saw this error about a year ago, but it didn't appear in newer versions. |
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author. |
If anyone else is facing this issue, I found it was caused by using the wrong 'host_field' under the [elasticsearch] section of airflow.cfg. When I changed it from the default, the logs loaded correctly. |
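For anyone hunting for that setting, here is a minimal sketch of where it lives. The replacement value is an assumption and depends on what your pipeline actually writes; with filebeat's add_host_metadata, "host" typically becomes an object:

```ini
# airflow.cfg (sketch -- host.name is an assumed example, not a confirmed fix)
[elasticsearch]
# Default is "host"; if your log shipper rewrites "host" into an object,
# point this at whichever nested field holds the hostname string.
host_field = host.name
```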
The problem is the combination of `json_format` on the Airflow side and `add_host_metadata` on the filebeat side. Due to the json_format setting, the logs from airflow are JSON and may already contain a "host" field. As the documentation of add_host_metadata says, it will by default just override the "host" field.

I am not seeing the "host" field actually getting written by airflow when using the celery executor. I guess it's only there when using Dask? So in this case the workaround is quite simple: rename the "host" field to something else:

```yaml
# In filebeat.yml
processors:
  - add_host_metadata:        # This (over)writes the "host" field
  - rename:
      fields:
        - from: "host"        # Still want this info, but I can't use the "host" field for it
          to: "host_meta"
```

The longer story: If you actually do want to preserve the original value of the "host" field (I am guessing airflow puts a string there), it gets a bit more complicated. Originally I wanted to preserve the original value like this:
```yaml
# WARNING, THIS WILL NOT WORK!
processors:
  - rename:                   # The next proc will overwrite the host field, but it's needed by the AF webserver ...
      fields:
        - from: "host"        # ... so let's just store it somewhere else
          to: "airflow_log_host"  # With some AF executors there is no host field and this will just be a NoOp
  - add_host_metadata:        # This writes to the "host" field
  - rename:
      fail_on_error: false    # Needed to move host to host_meta even if airflow_log_host doesn't exist
      fields:
        - from: "host"        # Still want this info, but I can't use the "host" field for it
          to: "host_meta"
        - from: "airflow_log_host"  # Move back the original value to the host field
          to: "host"
```

But it turns out that BEFORE applying any processors, filebeat will already overwrite the "host" field with its own built-in value. So JSON expansion has to be turned off on the input and done via a decode_json_fields processor instead, after dropping the built-in field first:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - .... .json
    # JSON expansion is done in processors, DO NOT TURN IT ON HERE!
    # json.keys_under_root: true
    # json.overwrite_keys: true
    # json.add_error_key: true
    # json.expand_keys: true
    # ....

processors:
  - drop_fields:              # First get rid of the "built in" host field
      fields:
        - host
  - decode_json_fields:       # Expand our JSON log to the root
      fields:
        - message             # This holds a line of JSON as a string
      process_array: true
      target: ""              # Store at the root
      overwrite_keys: true    # message attribute will be overwritten with the message from airflow
  - rename:                   # The next proc will overwrite the host field, which is needed by the AF webserver ...
      fields:
        - from: "host"        # ... so let's just store it somewhere else
          to: "airflow_log_host"  # With some AF executors there is no host field and this will just be a NoOp
  - add_host_metadata:        # This writes to the "host" field
  - rename:
      fail_on_error: false    # Needed to move host to host_meta even if airflow_log_host doesn't exist
      fields:
        - from: "host"        # Still want this info, but I can't use the "host" field for it
          to: "host_meta"
        - from: "airflow_log_host"  # Move back the original value to the host field
          to: "host"
```
|
This still exists with |
Does the #15613 (comment) fix the problem for you? It seems that this is not an Airflow problem but a Filebeat one, and you need to apply some fixes to Filebeat. |
Not using Filebeat at all. Only Logstash, reading the log files directly as input. And I'm using CeleryExecutor. |
So I guess you should describe your configuration. From the description above it looks like it was caused by Filebeat. Can you please provide details of your configuration (what and how you have configured it, the exact stack trace, etc.)? That might help when someone investigates the issue. The original issue was raised against 2.0.2, but having evidence from the most recent versions of both the provider and Airflow might be super helpful. It seems that the problem is due to the configuration of some elasticsearch integration and does not exist when you use elasticsearch "as is", so this might at least lead to helping you understand how to change the configuration. Also, previously I think the difficulty was that it was Filebeat and people were not able/did not want to reproduce this issue. If you provide an easily reproducible configuration/circumstances under which it happens, there is a better chance someone will be able to reproduce it. |
Found it! |
Ah cool. I pretty much hoped this would happen once you looked closely. Let me just close this one then, since we have a good solution and confirmed it works. |
Hi,
I am experiencing issues with reading logs from Elasticsearch, not sure if it's a bug or my incompetence!
Apache Airflow version: 2.0.2
Elastic version: v 7.9.3
Kubernetes version: v1.19.6
Environment: Dev Kubernetes
Linux airflow-6d7d4568c-w7plk 4.14.138-rancher #1 SMP Sat Aug 10 11:25:46 UTC 2019 x86_64 GNU/Linux
What happened:
I am running Airflow with the Celery Executor inside a Kubernetes cluster, running Spark jobs via KubernetesPodOperator. I have 2 pods:
Airflow pod consists of airflow-ui, airflow-scheduler, airflow-flower and aws-s3-sync container used to sync DAGs from S3.
Airflow-worker pod consists of airflow-celery-worker and aws-s3-sync containers
For now, I am trying to execute a DAG which runs spark-submit --version using KubernetesPodOperator. The DAG executes and the logs are present in the container stdout.
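For context, a minimal sketch of the kind of DAG involved (image, namespace, and schedule are assumptions, not the actual code; the dag_id/task_id mirror the log_id shown further down):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="spark-submit",
    start_date=datetime(2021, 4, 1),
    schedule_interval=None,
) as dag:
    spark_version = KubernetesPodOperator(
        task_id="spark-submit",
        name="spark-submit",
        namespace="dev",                  # assumption
        image="my-registry/spark:3.0.1",  # assumption: any image with spark-submit on PATH
        cmds=["spark-submit", "--version"],
    )
```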
I use Filebeat to pick up the logs and enrich them with "add_cloud_metadata" and "add_host_metadata". Afterwards, logs are sent to Logstash for field adjustments, as Airflow writes logs to Elasticsearch in one format and tries to read them in another. This particularly applies to the execution_date field. Anyhow, logs are visible in Kibana, so I parsed the fields and assembled the log_id field so that Airflow can read it, which I confirmed by running a query in the console in Kibana.
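A sketch of the kind of Kibana Dev Tools query used for that check (the index pattern is an assumption):

```
GET filebeat-*/_search
{
  "query": {
    "match": {
      "log_id": "spark-submit-spark-submit-2021-04-28T11:03:30.140229+00:00-1"
    }
  }
}
```

If this returns the task's log documents, the log_id assembly itself is correct.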
Follow-up on the execution_date field. It seems that when Airflow writes logs to Elasticsearch while running in Kubernetes, the fields won't be written to elasticsearch as dag_id, log_id, execution_date and try_number, but rather as [kubernetes][labels][dag_id] and so on. So, if I assemble the log_id field manually using the [kubernetes][labels]* fields, the resulting field looks like this:
log_id spark-submit-spark-submit-2021-04-28T110330.1402290000-3c11bfafa-1
which is by default incorrect because, while reading logs, Airflow tries to fetch:
log_id spark-submit-spark-submit-2021-04-28T11:03:30.140229+00:00-1
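That expected value is produced by Airflow's log_id_template; a sketch of the default as shipped in 2.0.x:

```ini
# airflow.cfg
[elasticsearch]
log_id_template = {dag_id}-{task_id}-{execution_date}-{try_number}
```

Note the raw ISO-8601 execution_date (colons, microseconds, and timezone offset intact), which is exactly what the Kubernetes label-safe encoding strips.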
I am not sure whether this is something that needs improving or whether it is expected. IMO, it should not be expected: due to vague documentation with no extensive explanation of what really happens, users have to invest hours in getting to the bottom of the issue and working out a solution on their own.
After parsing execution_date to match what Airflow tries to fetch, I had to enable fielddata on the offset field in elasticsearch, as Airflow couldn't sort the offsets. After that, the error I sent below happened.
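For reference, a sketch of enabling fielddata on an existing text-mapped offset field (the index pattern is an assumption, and this only applies if the pipeline mapped offset as text rather than as a number):

```
PUT filebeat-*/_mapping
{
  "properties": {
    "offset": {
      "type": "text",
      "fielddata": true
    }
  }
}
```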
By following the Airflow logs while it tries to read the log from elasticsearch, the error below pops up:
What you expected to happen: the Airflow UI to display task logs.
How to reproduce it:
Spin up a Kubernetes cluster, deploy Airflow with CeleryExecutor in it, use Filebeat to pick up the logs, and send them through Logstash to Elasticsearch. Run any job using KubernetesPodOperator and try to check the task logs in the Airflow UI. The UI task logs view should spin until timeout, then display a blank page.
Relevant information/configuration settings:
airflow.cfg:
filebeat.yml:
logstash.conf:
Final thoughts:
Not sure whether I have missed something while setting this up following https://airflow.apache.org/docs/apache-airflow-providers-elasticsearch/stable/logging.html or whether the Airflow crew needs to work on improving reading logs from elasticsearch.