Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory Leak in ONVIF #72544

Closed
shbatm opened this issue May 26, 2022 · 28 comments · Fixed by #74341
Closed

Memory Leak in ONVIF #72544

shbatm opened this issue May 26, 2022 · 28 comments · Fixed by #74341
Assignees

Comments

@shbatm
Copy link
Contributor

shbatm commented May 26, 2022

The problem

I believe the ONVIF integration has a memory leak. Not sure if it's in the core code or module dependencies.

I was experiencing frequent (1-3 times/day) forced restarts of Core by Supervisor, which I was able to trace back to OOM calls on the host vm killing the python process. Tried to narrow down the source as best as I could with help from @bdraco and it looks like we have it narrowed down to the ONVIF integration.

See memory sensor snapshot before and after disabling ONVIF on my main instance:
Prod VM Memory Sensor

To further narrow it down, I started a new Home Assistant OS VM (Proxmox) with only the original ONVIF config entries, profiler, and system resources integrations enabled on top of the default_config. It looks like it is showing the same memory leak here.

Dev VM Memory Sensor

180-sec Py-Spy, profiler.memory and profiler.start outputs from test VM attached under Additional Info below. I can attach from main instance too if needed.

What version of Home Assistant Core has the issue?

2022.5.5

What was the last working version of Home Assistant Core?

~2022.2.x

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ONVIF

Link to integration documentation on our website

https://www.home-assistant.io/integrations/onvif

Diagnostics information

config_entry-onvif-cd09f9939eda6d950f4b80d7b1bac047.json.txt
config_entry-onvif-ef7f4c3b80516522e1dadc5448f7c8a5.json.txt
config_entry-onvif-97da2d4278d518eaa41f437b414cd8d1.json.txt
config_entry-onvif-295e1346abdabb24753cd7425e492c04.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

## System Health

version | core-2022.5.5
-- | --
installation_type | Home Assistant OS
dev | false
hassio | true
docker | true
user | root
virtualenv | false
python_version | 3.9.9
os_name | Linux
os_version | 5.15.41
arch | x86_64
timezone | America/Chicago

<details><summary>Home Assistant Cloud</summary>

logged_in | false
-- | --
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

</details>

<details><summary>Home Assistant Supervisor</summary>

host_os | Home Assistant OS 8.1
-- | --
update_channel | stable
supervisor_version | supervisor-2022.05.3
docker_version | 20.10.14
disk_total | 30.8 GB
disk_used | 3.2 GB
healthy | true
supported | true
board | ova
supervisor_api | ok
version_api | ok
installed_addons | Studio Code Server (5.0.4), Samba share (9.6.1)

</details>

<details><summary>Dashboards</summary>

dashboards | 1
-- | --
resources | 0
mode | auto-gen

</details>

Additional information

Py-Spy and Profiler - Test Instance.zip
Py-Spy and Profiler - Main Instance.zip

Log File from Test VM prior to crash, profiler.start_log_objects running and logging as follows:

logger:
  default: info
  logs:
    homeassistant.components.onvif: debug
    onvif: debug
    wsdiscovery: debug
    zeep: debug

home-assistant.log.1.txt

Potential historical issues related:

@probot-home-assistant
Copy link

onvif documentation
onvif source
(message by IssueLinks)

@probot-home-assistant
Copy link

Hey there @hunterjm, mind taking a look at this issue as it has been labeled with an integration (onvif) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)

@shbatm
Copy link
Contributor Author

shbatm commented May 26, 2022

Left the test system running longer, just to confirm. See updated memory graph:

image

From looking at profiler.start_log_objects output, it looks like weakref type is growing the most. Here is a dump file:
weakref objects in memory.txt

@bdraco
Copy link
Member

bdraco commented May 27, 2022

It looks like weakrefs building up

2022-05-26 07:30:47 CRITICAL (SyncWorker_4) [homeassistant.components.profiler] Memory Growth: [('weakref', 72859, 18)]

@bdraco
Copy link
Member

bdraco commented May 27, 2022

@bdraco
Copy link
Member

bdraco commented May 27, 2022

Screen Shot 2022-05-26 at 17 22 44

@bdraco
Copy link
Member

bdraco commented May 27, 2022

2022-05-26 07:04:29 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:07:47 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:10:44 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:20:01 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:21:51 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:25:48 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:28:44 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':

@bdraco
Copy link
Member

bdraco commented May 27, 2022

@bdraco
Copy link
Member

bdraco commented May 27, 2022

@shbatm I expect the leak is here as each new retry creates a new AsyncClient which triggers the leak in SSLContexts from here encode/httpx#978 (comment)

@bdraco
Copy link
Member

bdraco commented May 27, 2022

Related https://bugs.python.org/issue40727

@bdraco
Copy link
Member

bdraco commented May 27, 2022

Maybe related pyca/pyopenssl#1120

@bdraco
Copy link
Member

bdraco commented May 27, 2022

It would be good to use profiler.dump_log_objects service to see how many ZeepAsyncClient and AsyncClient objects are in memory

@bdraco
Copy link
Member

bdraco commented May 27, 2022

It would be nice if onvif used the shared httpx client so its not creating an AsyncClient for each service.

Here is the suggestion from the httpx authors: encode/httpx#978 (comment)

You don't need to create 2000 clients - create a single client, and reuse it throughout. You only need to create additional clients if you need a different configuration on them.

@shbatm
Copy link
Contributor Author

shbatm commented May 27, 2022

Just restarted my test VM, I will dump the objects later today.

@shbatm
Copy link
Contributor Author

shbatm commented May 27, 2022

After 2 hours:
ZeepAsyncClient Dump.txt

@hunterjm
Copy link
Member

It would be nice if onvif used the shared httpx client so its not creating an AsyncClient for each service.

My first implementation for moving to httpx did just that, but I think Zeep is modifying the AsyncClient under the hood because it did not turn out well across different services, so i had to revert it.

We do cache SOAP bindings to be re-used if the same service gets created more than once, but it does not apply to PullPointSubscription services because we get new endpoints from the camera every time a subscription is requested.

2022-05-26 07:04:29 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:07:47 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:10:44 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:20:01 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:21:51 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:25:48 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':
2022-05-26 07:28:44 WARNING (MainThread) [homeassistant.components.onvif] Failed to fetch ONVIF PullPoint subscription messages for '20:4e:f6:56:ca:5d':

From the logs, it looks like PullPointSubscription is failing frequently and the logic in Home Assistant attempts to restart the subscription automatically on error. This indeed does cause a new httpx AsyncClient to be created and therefore triggers the aforementioned SSL Context leak.

For ONVIF the easiest solution is probably to just add verify=False when creating the AsyncClient as ONVIF is all local over HTTP. I'll push up a package fix tonight or this weekend and let you know when I do.

@shbatm - If you could verify the fix since you have cameras that can reproduce the issue, I'll open a version bump PR to Home Assistant.

@hunterjm
Copy link
Member

As a side note - interested in what cameras you have as while this will technically fix the memory leak - it won't fix the underlying issue of your event feed subscription either not working or getting killed very frequently.

@shbatm
Copy link
Contributor Author

shbatm commented May 27, 2022

@hunterjm Happy to test; let me know when you make the change on the package repo and I can test it locally first.

The cameras are all Hikvision of various models. The bad actor based on the MAC in the logs (and what I would expect) is the Doorbell (DS-HD1).

It's Wifi and has a multitude of issues--I'm not surprised it has issues maintaining connection. I have to restart it every week and it's going in the trash as soon as I can pull POE to the Front Door ;)

@shbatm
Copy link
Contributor Author

shbatm commented May 27, 2022

I'm also having some time syncronization issues with the cameras that I noticed when testing, it doesn't look like it's causing the issue, but just wanted to mention since it whines about possible authentication issues:

The date/time on Back Yard (UTC) is '2022-05-27 11:34:01+00:00', which is different from the system '2022-05-27 10:34:04.033159+00:00', this could lead to authentication issues
The date/time on Front Sidewalk (UTC) is '2022-05-27 11:34:02+00:00', which is different from the system '2022-05-27 10:34:04.052574+00:00', this could lead to authentication issues
The date/time on Back Yard (UTC) is '2022-05-27 11:34:03+00:00', which is different from the system '2022-05-27 10:34:04.041229+00:00', this could lead to authentication issues
The date/time on Front Door Camera (UTC) is '2022-05-27 11:34:03+00:00', which is different from the system '2022-05-27 10:34:04.092536+00:00', this could lead to authentication issues

Still troubleshooting my end with that. The NVR tries to syncronize with an NTP server and then push it to the cameras, but Hikvision has some firmwares that implemented daylight savings backwards (Start in Nov, Stop in March), which causes part of the issue. I've tried setting the NTP/DST on each camera to see if that gets the error to go away.

@hunterjm
Copy link
Member

The time sync issue is a known issue in HikVision firmware. They apply DST settings to their UTC representations as well in the ONVIF response... because why not?

@shbatm
Copy link
Contributor Author

shbatm commented May 27, 2022

I just realized there might be some firewall rules interfering with the connections too. That camera is on its own SSID and VLAN with some very restrictive rules. If I find anything I'll let you know

(sidebar: core-dns really needs to get rid of the Cloudflare hardcoded fallback--forgot to reenable the coredns fix addon and my router has been spammed 4 requests per second with blocked dns packets, once the noise gets filtered out I'll see if theres anything affecting the camera)

@shbatm
Copy link
Contributor Author

shbatm commented May 28, 2022

@hunterjm, ran TCPDump+Wireshark on the camera this morning and found out:

  1. the camera pings the default gateway every 15s with an ICMP ping and ARP request packet
  2. my firewall has a default reject policy on the INPUT table for my camera VLAN and was blocking ICMP requests to the router.
  3. ? camera gets mad it can't ping the router and resets connections ?

tl;dr: Allowing camera to ping the router fixes the frequent connection issue and stops memory leak (but doesn't fix underlying root cause). I'll still be able to turn off the firewall rule and test your fix when you have it ready.

image

This also means that unless someone else is having these same kind of intermittent connection issues on a very frequent basis, then this memory leak is going to be very slow. Still worth fixing if possible, but explains why it hasn't been seen (except for maybe back with #42390.

@hunterjm
Copy link
Member

@shbatm - That makes sense to me. Sorry for the delay, but it's been a long weekend. v1.2.1 of onvif-zeep-async is now up on PyPi with the verify=False change. Would you mind re-doing your Firewall rules and testing with that version installed? Assuming that fixes the leak (although not the underlying problem) I'll try to get a patch in for the 2022.06 release.

@shbatm
Copy link
Contributor Author

shbatm commented Jun 5, 2022

@hunterjm Testing today, will let you know.

@shbatm
Copy link
Contributor Author

shbatm commented Jun 5, 2022

LGTM:

image

@clearwave1
Copy link

I have the EZVIZ DB1 doorbell camera (which is a Hikvision) and this has also been happening to me although my restarts are only once per day.

Has this been fixed and included in a HA release yet? I searched the 2202.6.X release notes and don't see any mention of it.

@shbatm
Copy link
Contributor Author

shbatm commented Jul 2, 2022

No, it doesn't look like the package has been bumped to the new version in Home Assistant yet.

@shbatm
Copy link
Contributor Author

shbatm commented Jul 2, 2022

@hunterjm I just added a PR to make the bump.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants