
Harbor can't recover when the connection to the database drops. #14856

Closed
Vad1mo opened this issue May 11, 2021 · 16 comments · Fixed by #15211

@Vad1mo
Member

Vad1mo commented May 11, 2021

When the connection to the database is lost (unreliable connectivity), Harbor can't recover itself and the health check doesn't notice that there is a problem. Restarting the pod solves the issue immediately, until the next connectivity problem.

This problem occurs multiple times a day.
I am quite confident that this isn't a DB issue, as other Harbor instances in the same network keep working fine and the DB is basically idling.

2021-05-11T21:21:23Z [WARNING] [/common/utils/oidc/helper.go:72]: Failed to refresh configuration, error: failed to load OIDC setting: write tcp 100.1.1.1:49918->51.59.7.7:5432: write: broken pipe
2021-05-11T21:21:20Z [WARNING] [/common/utils/oidc/helper.go:72]: Failed to refresh configuration, error: failed to load OIDC setting: write tcp 100.1.1.1:40860->51.59.7.7:5432: write: connection reset by peer
2021-05-11T21:21:17Z [WARNING] [/common/utils/oidc/helper.go:72]: Failed to refresh configuration, error: failed to load OIDC setting: write tcp 100.1.1.1:49912->51.59.7.7:5432: write: broken pipe
2021-05-11T21:21:15Z [ERROR] [/replication/registry/healthcheck.go:65]: Health check error: write tcp 100.1.1.1:40860->51.59.7.7:5432: write: connection reset by peer

I think the health check should detect this condition.
Harbor 2.2.1 - I don't know if this also happens on older versions. (This is a new setup.)
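
For illustration, a minimal sketch (made-up names and DSN, not Harbor's actual /api/v2.0/health handler) of what a DB-aware health check could look like, so that a broken database connection actually fails the probe:

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

// healthHandler is a hypothetical DB-aware health endpoint: it pings the
// database on every probe, so a broken connection surfaces as a failed
// health check instead of going unnoticed until the next real query.
func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Hypothetical DSN, for the sketch only.
	db, err := sql.Open("postgres", "postgres://harbor:secret@postgres:5432/registry?sslmode=disable")
	if err != nil {
		panic(err)
	}
	http.Handle("/health", healthHandler(db))
	panic(http.ListenAndServe(":8080", nil))
}
```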

@Vad1mo
Member Author

Vad1mo commented May 14, 2021

In addition to that, the job service also does not recover once the connection to the DB is temporarily dropped:

2021-05-05T20:01:30Z [ERROR] [/jobservice/logger/sweeper_controller.go:40]: sweep logs error in *sweeper.DBSweeper at 1620244890: sweep logs in DB failed before 2021-04-21 20:01:30.5578332 +0000 UTC m=+1209600.049316437 with error: write tcp 100.11.12.13:43386->51.59.7.7:5432: write: broken pipe
2021-05-14T17:04:12Z [INFO] [/jobservice/worker/cworker/c_worker.go:76]: Job incoming: {"name":"IMAGE_SCAN","id":"7c482f01f524fb80c0d2712c","t":1621011851,"args":null}
2021-05-14T17:04:12Z [INFO] [/common/config/store/driver/rest.go:31]: get configuration from url: http://harbor-harbor-core:80/api/internal/configurations
2021-05-14T17:04:12Z [INFO] [/common/config/store/driver/rest.go:31]: get configuration from url: http://harbor-harbor-core:80/api/internal/configurations
2021-05-14T17:04:39Z [ERROR] [/jobservice/runner/redis.go:146]: Run job IMAGE_SCAN:7c482f01f524fb80c0d2712c error: runtime error: write tcp 100.11.12.13:43386->51.59.7.7:5432: write: broken pipe; stack: goroutine 124 [running]:
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run.func2(0xc0009397d0, 0xc00050c780)
/harbor/src/jobservice/runner/redis.go:144 +0xa5
panic(0x1bcc520, 0xc00034ae40)
/usr/local/go/src/runtime/panic.go:975 +0x47a
github.com/astaxie/beego/orm.(*DB).QueryRow(0xc000802040, 0xc00099c000, 0xba, 0xc0003ce280, 0x6, 0x8, 0xa1)
/harbor/src/vendor/github.com/astaxie/beego/orm/db_alias.go:182 +0xc5
github.com/astaxie/beego/orm.(*dbBase).InsertOrUpdate(0xc000380d20, 0x20d7d60, 0xc000802040, 0xc0003f5360, 0x1c5a700, 0xc0004a6f80, 0x199, 0xc000498200, 0xc0001195a0, 0x1, ...)
/harbor/src/vendor/github.com/astaxie/beego/orm/db.go:593 +0xc35
github.com/astaxie/beego/orm.(*orm).InsertOrUpdate(0xc000802f80, 0x1b42340, 0xc0004a6f80, 0xc0001195a0, 0x1, 0x1, 0x4550af, 0xc000157200, 0xc000156900)
/harbor/src/vendor/github.com/astaxie/beego/orm/orm.go:223 +0xff
github.com/goharbor/harbor/src/common/dao.CreateOrUpdateJobLog(0xc0004a6f80, 0xc000156900, 0x8b0, 0xc000157200)
/harbor/src/common/
2021-05-14T17:04:39Z [ERROR] [/jobservice/runner/redis.go:111]: Job 'IMAGE_SCAN:7c482f01f524fb80c0d2712c' exit with error: runtime error: write tcp 100.11.12.13:43386->51.59.7.7:5432: write: broken pipe; stack: goroutine 124 [running]:
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run.func2(0xc0009397d0, 0xc00050c780)
/harbor/src/jobservice/runner/redis.go:144 +0xa5

@heww heww self-assigned this May 17, 2021
@ywk253100 ywk253100 self-assigned this May 18, 2021
@dkulchinsky
Contributor

Hey 👋🏼

I was chatting about the same issue with @ywk253100 in Slack; we're experiencing the same problem in both the core and jobservice components with Harbor v2.1.4 & v2.1.5.

We noticed that core is sometimes able to recover, although it seems very inconsistent; jobservice, however, never does and requires a restart of all the jobservice pods to get it working properly again.

In our configuration jobservice stores job logs in the database, and it fails when trying to write the log to the DB.

core log:
[harbor-core-5968789c58-9v2sm core] 2021-05-06T16:45:13Z [ERROR] [/lib/orm/orm.go:72]: begin transaction failed: write tcp 172.26.4.16:48366->172.27.14.179:5432: write: broken pipe
[harbor-core-5968789c58-9v2sm core] 2021-05-06T16:45:13Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/systeminfo request in transaction failed: write tcp 172.26.4.16:48366-\u003e172.27.14.179:5432: write: broken pipe"}]}

jobservice log:

[harbor-jobservice-759d75cdf8-k2z5x jobservice] 2021-05-06T16:54:56Z [ERROR] [/jobservice/runner/redis.go:194]: Close job logger failed: logger: DB, err: write tcp 172.26.2.45:35486->172.27.14.179:5432: write: broken pipe

The Postgres database is deployed as a StatefulSet (using Zalando's postgres-operator) and Harbor is configured to connect via the ClusterIP Service. After a DB failover/switchover the Service points to the now active/primary Postgres pod, and testing connectivity via the Service works just fine. I believe the issue is that the clients (core, jobservice) still try to connect via the "old" session and fail and need to re-establish a new connection when this condition happens.

Happy to provide more information/logs/etc.

@Vad1mo
Member Author

Vad1mo commented May 18, 2021

In our configuration jobservice stores job logs in the database, and it fails when trying to write the log to the DB.

Same set up that we have as well.

I believe the issue is that the clients (core, jobservice) still try to connect via the "old" session and fail and need to re-establish a new connection when this condition happens.

I have a similar feeling. The DB is online and definitely working, but the job service still holds on to a non-existent handle or something. We have an RDS DB outside our K8s.

@ywk253100
Contributor

@Vad1mo Is your database also an external one?

@dkulchinsky I cannot reproduce this issue in my env with the latest master code.
I tried deleting the database pod; both core and jobservice can recover from the broken pipe error.

@dkulchinsky
Contributor

@ywk253100 we're still on v2.1, and I assume master is v2.2? Perhaps there were related changes there; could you test the same with v2.1?

When you say "I tried to delete the database pod" <-- do you have just one DB pod? In our case we have two pods in a StatefulSet, one primary and one standby, and we perform a switchover, which is where the issue happens.

@wk8

wk8 commented May 19, 2021

Hitting this as well, using an RDS DB

@alxbse

alxbse commented May 19, 2021

I'm also seeing this issue in a configuration where harbor v2.2.1 connects through pgbouncer as the connection pooler.

@heww
Contributor

heww commented May 19, 2021

I'm also seeing this issue in a configuration where harbor v2.2.1 connects through pgbouncer as the connection pooler.

Could you share the arch of the DB and the config of the pgbouncer so that we can reproduce this problem? Thanks.

@Vad1mo
Member Author

Vad1mo commented May 19, 2021

@ywk253100, we are on an external Postgres DB similar to AWS Aurora.
We are on 2.2.1, but have also seen this error on master. The thing is, it needs to run for some time.

On the jobservice, logging to the DB needs to be enabled.

I'll try to reproduce the error by dropping the idle connections.

What you can also do is reduce the idle and max connections to 1 and 10, for example; see the sketch below.
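
With database/sql, that would be roughly the following (a sketch with a made-up DSN, not Harbor's actual pgsql.go code):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSN; the point is the tiny pool, which makes a single
	// dropped idle connection show up on the very next query.
	db, err := sql.Open("postgres", "postgres://harbor:secret@postgres:5432/registry?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxIdleConns(1)  // only one idle connection that can go stale
	db.SetMaxOpenConns(10) // small cap on the pool

	if _, err := db.Exec("SELECT 1"); err != nil {
		log.Printf("query failed: %v", err)
	}
}
```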

@flbla
Contributor

flbla commented May 31, 2021

I have the same issue with an Azure Postgres database and Harbor 2.2.1.
Core never recovers; the logs are full of:
[/lib/orm/orm.go:72]: begin transaction failed: write tcp X.X.X.X:36996->X.X.X.X:5432: write: connection reset by peer

The only way I found to fix it is to restart the core pods.

@Vad1mo
Member Author

Vad1mo commented May 31, 2021

@ywk253100 if you want to debug this, I can provide you with a cluster with remote code debugging access so we could take a look at it together. Ping me on Slack so we can set this up! How does that sound?

@lukasmrtvy

lukasmrtvy commented Jun 21, 2021

We have the same issue: Azure Postgres and Harbor 2.2.1, with similar log output from core & jobservice.

@Vad1mo are you also using Azure Redis? (Which tier, and which region?)

@Vad1mo
Member Author

Vad1mo commented Jun 23, 2021

@lukasmrtvy we are using jobservice to forward logs to Postgres and the error only happens when this feature is enabled. Logging to stdout or file does not cause this issue.

@jowko

jowko commented Jun 24, 2021

We encountered this issue a couple of times. There are a lot of logs like this:
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/health request in transaction failed: write tcp <HIDDEN_IP_AND_PORT>-\u003e<HIDDEN_IP_AND_PORT>: write: broken pipe"}]}
[ERROR] [/lib/orm/orm.go:72]: begin transaction failed: write tcp <HIDDEN_IP_AND_PORT>-><HIDDEN_IP_AND_PORT>: write: broken pipe

The server is not responsive in that state! A manual restart of the core service fixes the issue for some time, but because of this there is a time window when our application is not available.
We are using PostgreSQL 12. Harbor is deployed in a Kubernetes 1.19 cluster via the official Helm chart 1.6.2 (Harbor version v2.2.2).

We encountered the same issue in our Java apps and fixed it using connection validators (provided by the WildFly server). A connection validator checks all database connections from the pool at specified intervals and tries to execute a very simple statement, like SELECT 1. If the query fails, that connection is removed from the pool and a new one is created.
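
In Go's database/sql, a rough equivalent of such a validator (just a sketch with made-up names, not Harbor code) could look like this:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// startValidator mimics the WildFly-style connection validator described
// above: it periodically runs a trivial query so broken pooled connections
// are noticed (and can be replaced) before a real request hits them.
// Name and interval are made up for this sketch.
func startValidator(db *sql.DB, interval time.Duration) {
	go func() {
		for range time.Tick(interval) {
			ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
			if _, err := db.ExecContext(ctx, "SELECT 1"); err != nil {
				log.Printf("connection validation failed: %v", err)
			}
			cancel()
		}
	}()
}

func main() {
	db, err := sql.Open("postgres", "postgres://harbor:secret@postgres:5432/registry?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	startValidator(db, 30*time.Second)
	select {} // keep the demo running
}
```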

In my opinion Harbor requires a similar solution, or at least it should remove connections from the pool after some time. I suspect that after a while all database connections in the pool can be broken. I also saw that a similar issue was fixed for MySQL:
Issue: #4713
PR: #4719
Maybe a similar change in https://github.com/goharbor/harbor/blob/master/src/common/dao/pgsql.go will fix the issue for PostgreSQL.
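
For illustration, the kind of change I mean could look roughly like this (a hypothetical sketch, not the actual pgsql.go code or the eventual patch):

```go
package dao

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq"
)

// openPGSQL is a hypothetical version of the connection setup in pgsql.go:
// bounding the lifetime of pooled connections means ones the server has
// already dropped get recycled instead of failing with "broken pipe".
func openPGSQL(connStr string, maxIdle, maxOpen int) (*sql.DB, error) {
	db, err := sql.Open("postgres", connStr)
	if err != nil {
		return nil, err
	}
	db.SetConnMaxLifetime(5 * time.Minute) // assumed value; recycle long-lived connections
	db.SetMaxIdleConns(maxIdle)
	db.SetMaxOpenConns(maxOpen)
	return db, nil
}
```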

@Vad1mo
Member Author

Vad1mo commented Jun 26, 2021

I spent a few hours on resolving this issue; @jowko's recommendation to set SetConnMaxLifetime() is a good idea.

In the long run, a switch to pgx would be the ideal outcome. Hence, please upvote #15209

I'll provide a patch and a test image with the fix on top of 2.3 for you to try.

@jowko

jowko commented Jun 28, 2021

I found out that the Harbor exporter is also affected by this issue. I don't know if the above fix will cover this case. There are a lot of logs like this:
[ERROR] [/pkg/exporter/project_collector.go:236]: get data from DB failure: write tcp <HIDDEN_IP_PORT>-><HIDDEN_IP_PORT>: write: connection reset by peer
[ERROR] [/pkg/exporter/project_collector.go:236]: get project from DB failure: write tcp <HIDDEN_IP_PORT>-><HIDDEN_IP_PORT>: write: connection reset by peer
[ERROR] [/pkg/exporter/project_collector.go:236]: get project member data from DB failure: write tcp <HIDDEN_IP_PORT>-><HIDDEN_IP_PORT>: write: connection reset by peer
[ERROR] [/pkg/exporter/project_collector.go:236]: get project repo data from DB failure: write tcp <HIDDEN_IP_PORT>-><HIDDEN_IP_PORT>: write: connection reset by peer
