Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git source validation will be stuck in some cases which causes a pile up of unregistered build objects #659

Closed
xiujuan95 opened this issue Mar 12, 2021 · 18 comments

Comments

@xiujuan95
Copy link
Contributor

xiujuan95 commented Mar 12, 2021

use case

During reconcile, our build controller will validate remote git source. And we are using git ls-remote to check if remote URL exists or not. About this List, it doesn't have a timeout parameter can be set. We can see there has a newUploadPackSession func is initializing a new session. And the new session is generated by a DefaultClient. The default client is nil. That means the timeout in this client is zero, also means no timeout. Then AdvertisedReferences func will do http request besed on this new session. Because the client doesn't have timeout, this causes in some cses, such as github or gitlab outage, http request will be stuck.

Above situation will cause our build controller waits all the time. With the increasing number of builds, there will have many builds queue and wait for being reconciled. If remote URL outage all the time, then this will causes new created builds will not be registered. This situation had happened internally.

Reproduce issue

Create multiple builds with this source which is provided by @HeavyWombat in paraller. Then check their status. Then create a new build. The new build should not be registered.

Expected behavior

Git validation should have a timeout. Within this time, if sourceURL checks doesn't finish, then should timeout and build controller should return some meaningful message, such as checkSourceURLTimeout and update build status so that controller can handle other build objects.

cc/ @zhangtbj

@xiujuan95
Copy link
Contributor Author

xiujuan95 commented Mar 12, 2021

Until now, discuss internally, if git ls-remote List func can support context.Context, then we may can use context.WithTimeout to fix this problem. However, it doesn't support, now.

However, we found there has a pull request to modify AdvertisedReferences func so that it can use context. That would be nice for us.

But this PR does't care about List func. For List func, he hardcode context to be context.TODO(). I think this is not our expectation. We also need context can be passed by ourself.

I have some discussion with him. He would like a new PR to solve ourselves' issue. However, now his PR is still not merged. Becasue we have a duplicate with him, to some extent, we need to wait his PR is merged so that we can do based on it.

But, to be honest, it's the priority 1 issue internally. So not sure if we can rewrite git ls-remote package to fix this issue first? Then if go-git community's PR is ready, we can submit a new one to fix it. Finally remove git ls-remote pkg written by ourself.

Do you agree it?

@imjasonh
Copy link
Contributor

imjasonh commented Mar 12, 2021

I agree this is something we need to address. Ideally it would be possible to disable git validation in the controller entirely, using feature flags in a ConfigMap, so users can disable this behavior themselves if it's causing problems (e.g., during an GitHub outage).

Either way, we should enforce a timeout. As a stopgap I don't think we even need to wait for go-git to support it:

func validateWithTimeout(url string) error {
  ch := make(chan error)
  d := time.Second
  go func() {
    select {
    case ch <- validateWithoutTimeout(url):
    case <-time.After(d):
      ch <- fmt.Errorf("timeout after %v", d)
    }
  }()
  return <-ch
}

(this is just pseudocode to demonstrate the idea, play with this in the Go Playground)

But even if we do enforce a timeout, a service outage could make every reconcile take the maximum amount of time, which would be pretty bad. Users should have some way to disable this behavior entirely, and just have affected BuildRuns fail at run-time.

@zhangtbj
Copy link
Contributor

I agree with Jason,

We think about this problem recently and there are three solutions:

  • Push go-git community accept the PR plumbing: wire up contexts for Transport.AdvertisedReferences go-git/go-git#246 or we provide a timeout related fix PR, but it will take more time (maybe more than 2 weeks)
  • Solve in the Shipwright side, change to use a patched go-git remote pkg or add a goroutine with timeout like Jason mentioned above, but it is not good and we have to rollback once we have the official fix
  • Switch the git validation as disable by default

The first two solutions seem cannot solve the problem soon, so I also suggest to disable the git validation by default first. and enable it once the problem is solved. It will make us safer to avoid the controller outage.

But also would like hear other comments or discuss next Monday together.

@zhangtbj
Copy link
Contributor

zhangtbj commented Mar 12, 2021

I agree with Jason,

We think about this problem recently and there are three solutions:

  • Push go-git community to accept the PR plumbing: wire up contexts for Transport.AdvertisedReferences go-git/go-git#246 or we provide a timeout related fix PR, but it will take more time (maybe more than 2 weeks, I think)
  • Solve in the Shipwright side, change to use a patched go-git remote pkg or add a goroutine with timeout like Jason mentioned above, but it is not best way and we have to rollback it once we have the official fix
  • Switch the git validation as disable by default

The first two solutions seem cannot solve the problem soon, so I also suggest to disable the git validation by default first. and enable it once the problem is solved. It will make us safer to avoid the controller outage.

But also would like hear other comments or discuss next Monday together.

@qu1queee
Copy link
Contributor

Note that

Switch the git validation as disable by default

is by design like that. Unless users specify this in their Builds.

@imjasonh
Copy link
Contributor

Note that

Switch the git validation as disable by default

is by design like that. Unless users specify this in their Builds.

Oh, I wasn't aware of that setting, thanks for pointing that out. Found it here:

// AnnotationBuildVerifyRepository tells the Build Controller to check a remote repository. If the annotation is not set
// or has a value of 'true', the controller triggers the validation. A value of 'false' means the controller
// will bypass checking the remote repository.
AnnotationBuildVerifyRepository = BuildDomain + "/verify.repository"

This means that validation is on by default, and can be disabled on a per-build basis. I think it could be useful to have an installation-wide option for disabling this in a ConfigMap, which would allow Shipwright operators to disable validation even if it's requested. This could be either temporarily ("GitHub is down, let's disable all validations until it comes back") or permanently ("we don't ever want to validate"). Personally as an operator I'd like to disable controller-side validation entirely, since it can cause reconciliation to take a long time, and slow down processing other unrelated Builds.

@qu1queee
Copy link
Contributor

yes, my bad, this is on by default! @xiujuan95 corrected me today. @imjasonh i think you are looking for #651, is this the same?

@imjasonh
Copy link
Contributor

Yeah, a ConfigMap to configure the controller sounds best to me.

@zhangtbj
Copy link
Contributor

I think we can discuss and plan the #651 in the next milestone, because we have more and more configurations, like: https://github.com/shipwright-io/build/blob/master/docs/configuration.md

We can decide if we want to keep using environment properties or switch to configmap way.

@xiujuan95
Copy link
Contributor Author

Go through above your discussions, my understanding is most of us is agree with disable git validation by default.

Now we enable it by default. Yep, maybe we can configure it via a configmap, but as @zhangtbj said, we have more and more configurations now, so maybe it needs more time to discuss. However, for our initernal situation, this issue is urgent.

And although @SaschaSchwarze0 makes MaxConcurrentReconciles be configured, this still can't ensure everything is ok. Especiallly, if someone wants to hack our environment, or most of build objects are created by users.

I also think we need to solve the root cause rather than workaround it. However, about go-git community side, they have a pull request to use context in git fetch and git push. We can utilize it to configure timeout. But it doesn't include git ls-remote, we need to add this part by ourselves:go-git/go-git#246 (comment). Yes, we can PR it. But we need to PR based on another PR. However, this PR has lasted for two months, so this also needs more time to finish.

So I prefer to disable git validation by default in our code. Namely, this part should be:

switch s.Build.GetAnnotations()[build.AnnotationBuildVerifyRepository] {
		case "true":
			err := git.ValidateGitURLExists(s.Build.Spec.Source.URL)
			if err != nil {
				s.MarkBuildStatus(s.Build, build.RemoteRepositoryUnreachable, err.Error())
			}
			return err
		case "", "false":
			ctxlog.Info(ctx, fmt.Sprintf("the annotation %s is set to %s, nothing to do", build.AnnotationBuildVerifyRepository, s.Build.GetAnnotations()[build.AnnotationBuildVerifyRepository]))
			return nil
		default:
			var annoErr = fmt.Errorf("the annotation %s was not properly defined, supported values are true or false", build.AnnotationBuildVerifyRepository)
			ctxlog.Error(ctx, annoErr, namespace, s.Build.Namespace, name, s.Build.Name)
			s.MarkBuildStatus(s.Build, build.RemoteRepositoryUnreachable, annoErr.Error())
			return annoErr
		}

At the same time, we can monitor the progress of community side. Once this PR is merged, we can PR what we want so that we can solve this issue from the root.

Do you agree?

@qu1queee
Copy link
Contributor

I agree on this. I think we have other issues for providing more flexibility on how to globally disable a validation, e.g. configmaps.

I´m wondering if we should have a separate issue for a timeout for the Build reconcile space. At the moment, during a Build reconciliation, all validations are in-cluster, the only one to the outside is the git one. If we have the git validation disabled by default and also a configmap that allow us to disable it globally, Im wondering if the timeout is something we still want?

@SaschaSchwarze0
Copy link
Member

Im wondering if the timeout is something we still want?

In general I would say yes. Any network communication should imo be done with connect and read timeouts in place.

If we decide to put a timeout around this (= for the overall Build reconciliation), I would be okay if it is guaranteed that - if the timeout happens - it is ensured that the hanging network connection gets closed.

@qu1queee
Copy link
Contributor

An additional thing on timeouts, all in-cluster validations in the Build reconcile do a client call with a context, this context already have a timeout, see https://github.com/shipwright-io/build/blob/master/pkg/reconciler/build/build.go#L50 . The only call we have without a timeout is the one that goes out to the internet (git).

@gabemontero
Copy link
Member

fwiw, when the git validation first dropped, I believe I was the one who originally insisted on it minimally having an on/off switch. At the time, I chose to be "un-opinionated" on what the default should be. But perhaps I should have been more opinionated :-) (my personal preference would have been off by default), and I am certainly +1 on switching the default to off now.

@qu1queee
Copy link
Contributor

@gabemontero this is true, that was certainly a great feedback.

@xiujuan95
Copy link
Contributor Author

I submit a PR:#672 to disable sourceURL validation by default, pls take a look, thanks!

@qu1queee
Copy link
Contributor

@xiujuan95 can we close this issue?

@xiujuan95
Copy link
Contributor Author

@qu1queee Yes, we can close it, now. Also I will submit a PR to go-git repo in order to add timeout context for git ls-remote command. Once this PR is ready, I think we can fix this issue from the root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants