
Log files are still being cached, causing ever-growing memory usage when the scheduler is running #27065

Closed
zachliu opened this issue Oct 14, 2022 · 25 comments · Fixed by #27223

@zachliu
Contributor

zachliu commented Oct 14, 2022

Apache Airflow version

2.4.1

What happened

My Airflow scheduler memory usage started to grow after I turned on the dag_processor_manager log by doing

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

see the red arrow below

[screenshot: 2022-10-11_12-06]

By looking closely at the memory usage, as mentioned in #16737 (comment), I discovered that it was the cache memory that kept growing:

[screenshot: 2022-10-12_14-42]

Then I turned off the dag_processor_manager log and memory usage returned to normal (not growing anymore, steady at ~400 MB)

This issue is similar to #14924 and #16737. This time the culprit is the rotating logs under ~/logs/dag_processor_manager/dag_processor_manager.log*.

What you think should happen instead

Cache memory shouldn't keep growing like this

How to reproduce

Turn on the dag_processor_manager log by doing

export CONFIG_PROCESSOR_MANAGER_LOGGER=True

in the entrypoint.sh and monitor the scheduler memory usage
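
For the monitoring part, a minimal sketch (not part of the original report; assumes a Linux host, since it reads /proc/meminfo) that polls the kernel page-cache size while the scheduler runs:

import time

def cached_mb() -> float:
    # The "Cached:" row in /proc/meminfo is the page cache size, reported in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1]) / 1024
    raise RuntimeError("Cached: not found in /proc/meminfo")

while True:
    print(f"page cache: {cached_mb():.1f} MB")
    time.sleep(60)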

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

k8s

Anything else

I'm not sure why the previous fix #18054 has stopped working 🤔

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@potiuk
Member

potiuk commented Oct 23, 2022

I've added a PR that should address it in #27223 @zachliu - would it be possible for you to test it in your installation?

@zachliu
Contributor Author

zachliu commented Oct 24, 2022

> I've added a PR that should address it in #27223 @zachliu - would it be possible for you to test it in your installation?

sure thing 👍

potiuk added a commit to potiuk/airflow that referenced this issue Oct 24, 2022
The RotatingFileHandler is used when you enable it via
`CONFIG_PROCESSOR_MANAGER_LOGGER=True` and it exhibits similar
behaviour as the FileHandler had when it comes to caching
the file on the Kernel level. While it is harmless (the cache
will be freed when needed), it is also misleading for those who
are trying to understand memory usage by Airflow.

The fix is to add a custom non-caching RotatingFileHandler similarly
as in apache#18054.

Note that it will require manually modifying local settings if
the settings were created before this change.

Fixes: apache#27065
potiuk added a commit that referenced this issue Oct 25, 2022
The RotatingFileHandler is used when you enable it via
`CONFIG_PROCESSOR_MANAGER_LOGGER=True` and it exhibits similar
behaviour as the FileHandler had when it comes to caching
the file on the Kernel level. While it is harmless (the cache
will be freed when needed), it is also misleading for those who
are trying to understand memory usage by Airflow.

The fix is to add a custom non-caching RotatingFileHandler similarly
as in #18054.

Note that it will require manually modifying local settings if
the settings were created before this change.

Fixes: #27065
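
For readers following along, here is a condensed, hedged sketch of what such a non-caching handler can look like (the class name mirrors the PR, but the body is illustrative rather than the exact code of #27223): after the rotating handler opens its file, advise the kernel that the file's pages need not stay in the page cache.

import logging.handlers
import os

class NonCachingRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def _open(self):
        stream = super()._open()
        try:
            # Advise the kernel that cached pages for this log file are not
            # needed, so it is free to drop them (POSIX/Linux only).
            os.posix_fadvise(stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (AttributeError, OSError):
            pass  # platform without posix_fadvise: normal caching behaviour
        return stream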
@zachliu
Contributor Author

zachliu commented Oct 27, 2022

@potiuk the cache memory is still growing 😿

[screenshot: 2022-10-27_10-32]

@potiuk
Member

potiuk commented Oct 27, 2022

Maybe the rotating file handler has another place where it copies files and leaves them behind. Not the end of the world (as you know, this is no harm at all and perfectly normal).

Maybe I will take a look soon (or maybe you can, @zachliu - you could see how I've done it, iterate on it, verify it in your test system, and make a PR after you test it? How about that?)

Also, there are ways you can check if this might be the cause. Just delete the rotated files and see if that causes a drop in the cache memory used.

@potiuk
Member

potiuk commented Oct 27, 2022

It would be a great contribution back :)

@zachliu
Contributor Author

zachliu commented Oct 27, 2022

yeah, the root cause of this might be somewhere else. here are the facts:

  • setting CONFIG_PROCESSOR_MANAGER_LOGGER=True does make the cache increase
  • deleting files under ~/logs/dag_processor_manager has no effect on cache memory usage; also, there are cache files i cannot delete:
    [screenshot: 2022-10-27_11-02]

@potiuk
Member

potiuk commented Oct 27, 2022

You can always drop the whole cache to verify what causes it:

https://linuxhint.com/clear_cache_linux/

Also you can do some trial/error to see which files are in the cache as explained in this answer:
https://serverfault.com/questions/278454/is-it-possible-to-list-the-files-that-are-cached

It seems it is not easy to get a list of the files that contribute to the cache, but if you have some guesses you might try to find out by using fntools.
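
To make that trial-and-error concrete, here is a sketch (an illustration, not from the thread) that asks the kernel which pages of a given file are currently resident in the page cache, roughly what fincore-style tools do. Linux-only, via mmap(2)/mincore(2) through ctypes:

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]

PROT_READ, MAP_SHARED = 0x1, 0x1
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def cached_pages(path):
    """Return (pages currently in the page cache, total pages) for one file."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    n_pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, size, PROT_READ, MAP_SHARED, fd, 0)
        if addr in (None, ctypes.c_void_p(-1).value):  # MAP_FAILED
            raise OSError(ctypes.get_errno(), "mmap failed")
        # mincore fills one byte per page; bit 0 set means "resident in cache"
        vec = (ctypes.c_ubyte * n_pages)()
        if libc.mincore(ctypes.c_void_p(addr), size, vec) != 0:
            raise OSError(ctypes.get_errno(), "mincore failed")
        libc.munmap(ctypes.c_void_p(addr), size)
        return sum(b & 1 for b in vec), n_pages
    finally:
        os.close(fd)

# e.g. (path illustrative):
# print(cached_pages(os.path.expanduser(
#     "~/logs/dag_processor_manager/dag_processor_manager.log")))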

@zachliu
Contributor Author

zachliu commented Oct 27, 2022

the 2 .nfs* files at ~/logs/dag_processor_manager do add up to my current cache memory usage 🤔

[screenshot: 2022-10-27_11-36]

.nfsbf1544f7b7af29ea00000002 has been written to an actual file, but it was not removed properly
.nfsd261bc9d96a5d91200000001 keeps growing

@Taragolis
Contributor

.nfs* files should be related to NFS, not to Airflow.

@potiuk
Member

potiuk commented Oct 27, 2022

Very much so. This is a consequence of the choice to use NFS to store logs :)

@zachliu
Contributor Author

zachliu commented Oct 27, 2022

wait, i'm confused: so it is an NFS design choice not to remove the cache file after it's written to an actual file?

[screenshot: 2022-10-27_12-07]

@Taragolis
Contributor

> wait, i'm confused: so it is an NFS design choice not to remove the cache file after it's written to an actual file?

https://nfs.sourceforge.net/#faq_d2

(In short: when a file is removed while a process still holds it open, the NFS client "silly renames" it to a .nfsXXXX file and only really deletes it when the last open handle goes away.)

@potiuk
Member

potiuk commented Oct 27, 2022

silly rename 🤣

@potiuk
Member

potiuk commented Oct 27, 2022

Nothing we can do about it :). But I am not sure if those are the culprits - according to the description, those should be removed when Airflow stops keeping the file open, unless the client crashes.

@Taragolis
Contributor

I also mount the default logs directory on NFS (AWS EFS), so I can only suggest the personal configuration I have used for a long time (the equivalent airflow.cfg form is shown after this list):

  1. Change the default dag processor manager log location to somewhere outside of NFS, e.g. AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION = "/tmp/airflow/logs/dag_processor_manager/dag_processor_manager.log"
  2. Increase the print stats interval, AIRFLOW__SCHEDULER__PRINT_STATS_INTERVAL = 300, which could reduce the final size of the file
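
For reference, the same two settings in airflow.cfg form - this is just the standard AIRFLOW__SECTION__KEY to [section] key mapping, not an extra recommendation from the thread:

[logging]
dag_processor_manager_log_location = /tmp/airflow/logs/dag_processor_manager/dag_processor_manager.log

[scheduler]
print_stats_interval = 300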

@potiuk
Member

potiuk commented Oct 27, 2022

  3. Write a custom handler to rotate the logs outside.
  4. Use an external service for logging (CloudWatch etc.) to store them remotely.

@potiuk
Member

potiuk commented Oct 27, 2022

  5. Stop worrying about it. The fact that the Unix cache grows has only one drawback: it will show up in your monitoring service when you choose to monitor it. You should monitor other parameters - it's perfectly OK for the cache to grow until it uses all available memory; it has no negative consequences.

@zachliu
Contributor Author

zachliu commented Oct 27, 2022

i'm also using AWS EFS 🤝

i think i'll try 1 (2 seems redundant if we're moving it out of NFS) - it seems to be the easiest, except for 5, which involves educating all current/future maintainers to understand memory nuances 😅

@potiuk
Member

potiuk commented Oct 27, 2022

BTW. I've heard VERY bad things about EFS when it is used to share DAGs. It has a profound impact on the stability and performance of Airflow if you have a big number of DAGs, unless you pay big bucks for IOPS. I've heard that from many people.

This is the moment when I usually STRONGLY recommend GitSync instead: https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca

@potiuk
Member

potiuk commented Oct 27, 2022

> 5, which involves educating all current/future maintainers to understand memory nuances 😅

As counterintuitive as it is, I know what you are talking about :)

@Taragolis
Contributor

Taragolis commented Oct 27, 2022

> BTW. I've heard VERY bad things about EFS when it is used to share DAGs. It has a profound impact on the stability and performance of Airflow if you have a big number of DAGs, unless you pay big bucks for IOPS. I've heard that from many people.
> This is the moment when I usually STRONGLY recommend GitSync instead: https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca

It always depends on configuration and monitoring. I personally had this issue, possibly in Airflow 2.1.x, and I do not know whether it was actually related to Airflow itself or to some other stuff. Working with EFS definitely takes more effort than GitSync.

Just for someone who might find this thread in the future: these things might help with EFS performance degradation.

Disable saving Python bytecode inside the NFS (AWS EFS) mount:

  • Mount as read-only
  • Disable Python bytecode by setting PYTHONDONTWRITEBYTECODE=x
  • Or set the location for bytecode with PYTHONPYCACHEPREFIX, for example to /tmp/pycaches

Throughput in Bursting mode at first looks like a miracle, but when all the bursting capacity goes to zero it can turn your life into hell. Each newly created EFS share has about 2.1 TB of BurstCreditBalance.

What could be done here (see the sketch after this list):

  • Switch to Provisioned Throughput mode permanently, which might cost a lot: something like 6 USD per 1 MiB/s without VAT, so 100 MiB/s would cost more than 600 USD per month.
  • Switch to Provisioned Throughput mode only when BurstCreditBalance drops below some amount, like 0.5 TB, and switch back once it recovers towards the 2.1 TB limit. Unfortunately there is no autoscaling, so it would be manual or a combination of CloudWatch alerting + AWS Lambda.
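
A hedged sketch of that second option (the file system ID, throughput figure, and event convention are placeholders, not details from the thread), e.g. as an AWS Lambda handler fired by two CloudWatch alarms on the EFS BurstCreditBalance metric:

import boto3

efs = boto3.client("efs")
FILE_SYSTEM_ID = "fs-12345678"  # placeholder, not a real file system

def handler(event, context):
    # Assumed convention: the low-credit alarm sends {"mode": "provisioned"},
    # the recovered-credit alarm sends {"mode": "bursting"}.
    if event.get("mode") == "provisioned":
        efs.update_file_system(
            FileSystemId=FILE_SYSTEM_ID,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=100,  # placeholder value
        )
    else:
        efs.update_file_system(
            FileSystemId=FILE_SYSTEM_ID,
            ThroughputMode="bursting",
        )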

@potiuk
Member

potiuk commented Oct 27, 2022

This is very close to what I've heard! Good one @Taragolis! And yeah PYTHONDONTWRITEBYTECODE is also my typical recommendation.

@zachliu
Contributor Author

zachliu commented Oct 28, 2022

i moved the dag_processor_manager logs from /user/local/airflow/logs/ (NFS) to /tmp/airflow/logs/ (local), but the cache memory still keeps growing 😿

[screenshot: 2022-10-28_11-31]

so it is still using the cache somewhere, but not via the .nfs*** silly-rename files 🤔

@potiuk
Member

potiuk commented Oct 28, 2022

So I guess the quest should continue :) But let me repeat again: if you have anything that creates files continuously, those files are not deleted, and there is no advice to the kernel not to use the cache, they will increase the cache memory used. You will likely never get it to zero growth over time. Even just sshing into your system and saving your shell history as you type commands will increase the cache used. And this is normal. So you are likely chasing a red herring.

@zachliu
Contributor Author

zachliu commented Oct 29, 2022

worse than a red herring, this is a mirage 😆
can't change the kernel's behavior, i'll just change meself:

# e.g. in a custom log config module; LOGGING_CONFIG is assumed here to be a
# copy of Airflow's default logging config, as in the documented pattern:
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB per log file before rotating
        "backupCount": 3,      # keep at most 3 rotated files
    }
)
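
(One way to wire this up, assuming the documented advanced-logging setup rather than anything stated in this thread: put the snippet in a module on PYTHONPATH and point AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS at its LOGGING_CONFIG, e.g. log_config.LOGGING_CONFIG.)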

this makes the cache memory usage cap at 40-50 MB

ephraimbuddy pushed a commit that referenced this issue Nov 9, 2022
The RotatingFileHandler is used when you enable it via
`CONFIG_PROCESSOR_MANAGER_LOGGER=True` and it exhibits similar
behaviour as the FileHandler had when it comes to caching
the file on the Kernel level. While it is harmless (the cache
will be freed when needed), it is also misleading for those who
are trying to understand memory usage by Airflow.

The fix is to add a custom non-caching RotatingFileHandler similarly
as in #18054.

Note that it will require manually modifying local settings if
the settings were created before this change.

Fixes: #27065
(cherry picked from commit 126b7b8)
ephraimbuddy pushed another commit that referenced this issue Nov 9, 2022, with the same message (cherry picked from commit 126b7b8)