Provide performance benchmarks with and without Proxy #574

Open
phoenix2x opened this issue Jul 7, 2023 · 15 comments

phoenix2x commented Jul 7, 2023

Question

Hi there,

We're trying to migrate from RDS to Cloud SQL. The application runs on GKE. We're using cloud-sql-go-connector v1.3.0 to connect to a Postgres instance like this:

	cloudSQLDialer, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithIAMAuthN(), cloudsqlconn.WithDefaultDialOptions(cloudsqlconn.WithPrivateIP()))
	if err != nil {
		panic(errors.Wrap(err, "failed to setup cloud sql dialer"))
	}

	config.Dialer = func(ctx context.Context, _, _ string) (net.Conn, error) {
		span := spans.NewChildSpanFromContext(ctx, "dial-cloudsql") // this span p99 = 300ms
		defer span.Finish()
		// use context.Background() to avoid cancellation
		// canceling context during dial means that connection never makes it to the pool
		// so the next query has to pay the price again, potentially hitting timeout as well
		//nolint: contextcheck
		conn, err := cloudSQLDialer.Dial(context.Background(), connectionName)
		if err != nil {
			span.SetTag(ext.Error, err)
		}

		return conn, err
	}

We noticed that the Dialer takes significantly more time to finish (~300ms, as opposed to ~30ms for RDS). We assumed this was caused by the additional work the Cloud SQL proxy server does, and we confirmed it by connecting directly to the Cloud SQL Postgres instance via IP:5432, where latency stayed at ~70ms.

Since we use connection pooling, this works with no errors most of the time. But during traffic spikes, or after many connections die under heavy load, the added latency causes a lot of errors. When we try to open a significant number of connections at the same time, the dialer latency spikes with every new connection (up to ~4s), which causes the application to open even more connections because it can't get enough to fulfill incoming requests. The end result is that the application opens connections up to the pool limit (500 in this specific test) and produces a lot of errors when it uses cloudsqlconn, whereas it opens only ~100 connections with no errors when connecting to IP:5432 directly. Both tests use the same traffic numbers.

Is there anything we can do to mitigate the issue?

Sorry for a lot of text, just want to make sure I give enough info:)


phoenix2x added the "type: question" label on Jul 7, 2023
enocom assigned enocom and unassigned jackwotherspoon on Jul 7, 2023
enocom (Member) commented Jul 7, 2023

Hey @phoenix2x thanks for the issue.

Some follow-up questions:

  1. Are you using pgxpool?
  2. What settings are you using for the pool?
  3. During the traffic spikes, what CPU and memory usage do you see?

Slightly unrelated, but are you using the built-in traces?

phoenix2x (Author) commented

Thanks for the quick follow-up :)

  1. We use bun, which uses the database/sql pool under the hood.
  2. MaxIdleConns and MaxOpenConns = 20, ConnMaxIdleTime = 15m, ConnMaxLifetime = 1h (see the sketch after this list for how these map onto database/sql).
  3. a. If the question is about application CPU/memory, it's pretty similar in both tests. There are ~36 pods at 250 millicores each running at the time of the spike. HPA is set to a 40% target CPU utilization, so there is plenty of headroom and almost no CPU throttling. Memory stays around 75MB per pod while requests are 250MB, again plenty of headroom.
     b. If the question is about CPU/memory of the database instance, the picture is completely different between tests:
     With cloudsqlconn: CPU 40%, memory component usage 30%, connections 508
     Without cloudsqlconn: CPU 7%, memory component usage 7%, connections 84
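
For concreteness, a minimal sketch of how those pool settings map onto database/sql (the configurePool helper name is illustrative; bun exposes the same *sql.DB underneath):

import (
        "database/sql"
        "time"
)

// configurePool applies the pool settings listed above to a database/sql pool.
func configurePool(db *sql.DB) {
        db.SetMaxIdleConns(20)                  // MaxIdleConns = 20
        db.SetMaxOpenConns(20)                  // MaxOpenConns = 20
        db.SetConnMaxIdleTime(15 * time.Minute) // ConnMaxIdleTime = 15m
        db.SetConnMaxLifetime(time.Hour)        // ConnMaxLifetime = 1h
}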

We do use the built-in traces, but since we can't use them for the no-cloudsqlconn test, we added our own custom span in the dialer for both tests, just to compare apples to apples. Here they are:

[screenshot: custom dial span latencies for both tests]

enocom (Member) commented Jul 7, 2023

Thanks, @phoenix2x this is really helpful.

How many instances does each dialer connect to?

For some background info, the Dialer does this (a minimal sketch follows the list):

  1. On a new connection, it reaches out to the Cloud SQL Admin API to create an ephemeral certificate that lasts for 1 hour.
  2. The certificate and associated TLS configuration are stored in a map protected by a mutex.
  3. Subsequent dial attempts read the TLS configuration from that map.
  4. In the background, a goroutine updates the certificate ~4 minutes before it expires.
  5. If a refresh fails, in the worst case a new connection will recreate the ephemeral certificate before establishing the connection.
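
A minimal sketch of that behavior from the caller's side, assuming the cloudsqlconn package (dialTwice and the instance connection name are illustrative placeholders):

import (
        "context"
        "log"

        "cloud.google.com/go/cloudsqlconn"
)

func dialTwice(ctx context.Context) {
        d, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithIAMAuthN())
        if err != nil {
                log.Fatal(err)
        }
        defer d.Close()

        // First Dial: fetches an ephemeral certificate from the Cloud SQL
        // Admin API and caches the TLS configuration for this instance.
        conn1, err := d.Dial(ctx, "my-project:my-region:my-instance")
        if err != nil {
                log.Fatal(err)
        }
        defer conn1.Close()

        // Second Dial: reuses the cached TLS configuration; no Admin API
        // call is made while the certificate is still fresh.
        conn2, err := d.Dial(ctx, "my-project:my-region:my-instance")
        if err != nil {
                log.Fatal(err)
        }
        defer conn2.Close()
}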

phoenix2x (Author) commented

It's a single Postgres instance in this test.
From the internal spans I can see that all the time is spent in the cloud.google.com/go/cloudsqlconn/internal.Connect span. cloud.google.com/go/cloudsqlconn/internal.InstanceInfo stays extremely fast, and there were no cloud.google.com/go/cloudsqlconn/internal.RefreshConnection spans at the time of the test. I looked into the library code and it looks good to me; I don't think there is anything there that could cause a 4-second delay in opening a new connection.

enocom (Member) commented Jul 11, 2023

These numbers are surprising to me. Let me talk with the backend folks to see if we can shed some light on what's going on.

Are you doing manual load testing for this? How are you doing it?

enocom (Member) commented Jul 11, 2023

Also, if you have a support account, you might consider opening a case so the backend team can look at your instance.

phoenix2x (Author) commented

Yes, we run nightly load tests to prevent regressions. As soon as we switched from RDS to Cloud SQL we noticed this issue. The load test is just a script that hits our services.
Sure, I'll open a ticket with support; I was hoping you'd already seen something like this. I'll post their response here if we're able to solve the issue.

Thank you.

enocom (Member) commented Jul 13, 2023

Those latency numbers are much higher than I'd expect. We've been thinking about publishing some baseline numbers as part of a benchmark and this helps increase the priority of that work.

Otherwise, there might be some insight the backend team can add.

enocom (Member) commented Jul 14, 2023

Meanwhile, I'm going to make this an issue for publishing benchmark numbers with and without the Dialer.

enocom changed the title from "Dial latency" to "Provide performance benchmarks with and without Proxy" on Jul 14, 2023
enocom added the "type: feature request" and "priority: p1" labels on Jul 14, 2023
enocom (Member) commented Jul 14, 2023

Related to GoogleCloudPlatform/cloud-sql-proxy#1871.

enocom (Member) commented Jul 26, 2023

Just to circle back here and provide some information for others who run into this issue:

One thing to keep in mind is that Auto IAM AuthN has a login quota of 3,000 logins to an instance per minute. When traffic spikes hit that threshold, latency can jump way up (as we see above). Generally, though, we expect p99 latency to be much lower, even accounting for the network hops (app with connector -> proxy server -> instance, plus sometimes a call to verify the IAM user).

enocom (Member) commented Aug 1, 2023

@phoenix2x FYI, it's possible to use Auto IAM AuthN without the Go Connector.

You'll need to ensure a few things:

  1. The token has only the sql.login scope (i.e., https://www.googleapis.com/auth/sqlservice.login).
  2. The token isn't transmitted over an unencrypted channel.

We're working on making this path easier for folks, but for now I'll share the mechanics for visibility.

Assuming you're using pgx, you can do this:

import (
        "context"
        "fmt"
        "time"

        "github.com/jackc/pgx/v5"
        "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
        // use instance IP + native port (5432)
        // for best security use client certificates + server cert in DSN
        config, err := pgxpool.ParseConfig("host=INSTANCE_IP user=postgres password=empty sslmode=require")
        if err != nil {
                panic(err)
        }
        config.BeforeConnect = func(ctx context.Context, cfg *pgx.ConnConfig) error {
                // This gets called before a connection is created and allows you to
                // refresh the OAuth2 token here as needed. A fancier implementation would cache the token,
                // and refresh only if the token were about to expire.
                cfg.Password = "mycooltoken"
                return nil
        }

        pool, err := pgxpool.NewWithConfig(context.Background(), config)
        if err != nil {
                panic(err)
        }
        defer pool.Close()

        conn, err := pool.Acquire(context.Background())
        if err != nil {
                panic(err)
        }
        defer conn.Release()

        row := conn.QueryRow(context.Background(), "SELECT NOW()")
        var t time.Time
        if err := row.Scan(&t); err != nil {
                panic(err)
        }

        fmt.Println(t)
}
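
For completeness, here is one way the "mycooltoken" placeholder above could be populated. This is a minimal sketch, not part of the connector or pgx: it assumes golang.org/x/oauth2 with Application Default Credentials, and newIAMBeforeConnect is an illustrative helper name. oauth2.ReuseTokenSource provides the token caching mentioned in the comment above.

import (
        "context"

        "github.com/jackc/pgx/v5"
        "golang.org/x/oauth2"
        "golang.org/x/oauth2/google"
)

// newIAMBeforeConnect returns a BeforeConnect hook that injects an OAuth2
// access token (scoped to sqlservice.login) as the database password.
// oauth2.ReuseTokenSource caches the token and only refreshes it when it is
// about to expire, so we don't hit the token endpoint on every connection.
func newIAMBeforeConnect(ctx context.Context) (func(context.Context, *pgx.ConnConfig) error, error) {
        const sqlLoginScope = "https://www.googleapis.com/auth/sqlservice.login"
        ts, err := google.DefaultTokenSource(ctx, sqlLoginScope)
        if err != nil {
                return nil, err
        }
        cached := oauth2.ReuseTokenSource(nil, ts)
        return func(ctx context.Context, cfg *pgx.ConnConfig) error {
                tok, err := cached.Token()
                if err != nil {
                        return err
                }
                cfg.Password = tok.AccessToken
                return nil
        }, nil
}

You would then assign the hook before building the pool, e.g. hook, err := newIAMBeforeConnect(ctx) followed by config.BeforeConnect = hook.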

phoenix2x (Author) commented

Hi @enocom,

This is very interesting, thank you :)

Is this supposed to target the server-side proxy on port 3307, or the native 5432?

enocom (Member) commented Aug 1, 2023

Native port. We're working on making this more obvious and possibly even providing some helper functions.

phoenix2x (Author) commented

Nice, we should definitely give it a try.

enocom added the "priority: p2" label and removed the "priority: p1" label on Oct 4, 2023
enocom assigned jackwotherspoon and unassigned enocom on May 1, 2024