
Consistent Logging #3152

Closed
pydctw opened this issue Feb 4, 2022 · 37 comments · Fixed by #3776
Labels
kind/documentation · priority/backlog · triage/accepted

Comments

@pydctw
Contributor

pydctw commented Feb 4, 2022

I've found that the use of log formats and logging levels is inconsistent across the repo.

A few examples:

  1. Different log levels are used across reconcilers.
s.scope.V(2).Info("Reconciling network for cluster", "cluster-name", s.scope.Name(), "cluster-namespace", s.scope.Namespace())
s.scope.V(2).Info("Reconciling VPC")
s.scope.Info("Reconciling subnets")
s.scope.V(2).Info("Reconciling security groups")
  2. Log formats are different for the main controllers.
I0203 23:10:34.968187    1165 awsmachine_controller.go:414]  "msg"="Reconciling AWSMachine" 

I0203 23:12:40.884859    1165 awscluster_controller.go:202] controller/awscluster "msg"="Reconciling AWSCluster" "cluster"="quick-start-cghvfu" "name"="quick-start-cghvfu" "namespace"="quick-start-d0tfno" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster" 
  3. Logging for AWS resource creation/deletion is inconsistent. The reason for using V(0) for "Deleted association between route table and subnet" while using V(2) for "Deleted security group" is not clear to me.
s.scope.Info("Deleted association between route table and subnet", "route-table-id", *rt.RouteTableId, "subnet-id", *as.SubnetId)
s.scope.V(2).Info("Deleted security group", "security-group-id", sg.ID, "kind", typ)

See #3143.

What do you expect?

  • We should have a guideline/documentation about logging formats and levels to improve consistency (see the sketch after this list for an illustration).
  • Create follow-up issues if there are improvements that can be implemented.
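
For illustration, the kind of rule such a guideline could state, applied to the snippets above (a sketch only; V(2) is a placeholder level, not a decided convention):

    // Hypothetical: the same reconciler messages, all at one agreed verbosity.
    s.scope.V(2).Info("Reconciling network for cluster", "cluster-name", s.scope.Name(), "cluster-namespace", s.scope.Namespace())
    s.scope.V(2).Info("Reconciling VPC")
    s.scope.V(2).Info("Reconciling subnets")
    s.scope.V(2).Info("Reconciling security groups")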

Related Issues

@k8s-ci-robot added the needs-priority and needs-triage labels Feb 4, 2022
@k8s-ci-robot
Contributor

@pydctw: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sedefsavas added the kind/documentation label Feb 4, 2022
@sedefsavas added this to the v1.x milestone Feb 4, 2022
@sedefsavas added the priority/backlog and triage/accepted labels and removed the needs-triage and needs-priority labels Feb 4, 2022
@richardcase
Member

@pydctw - we also have #3101 which is related. I know @Skarlso was looking at standardizing log levels as part of that work. It would be worth you both chatting.

@Skarlso
Contributor

Skarlso commented Feb 7, 2022

Oh nice! I have been working on a POC to replace that stuff. It's a bit difficult to figure out, though, which level was supposed to do what. And it will be a rather large change, I'm afraid.

@pydctw
Contributor Author

pydctw commented Feb 7, 2022

@Skarlso, I would love to hear what you are thinking. Let's sync up.

@Skarlso
Contributor

Skarlso commented Feb 7, 2022

@pydctw Sure, will write up a detailed POC and link a WIP PR tomorrow. :)

@sedefsavas
Contributor

We should also consider adding JSON support during this refactoring, which is being discussed in CAPI:
kubernetes-sigs/cluster-api#5571 (comment)

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

@sedefsavas I would rather do a single thing per PR. This will be a bit of a pain as it is anyway. :D I worked out a clean API for logging, but I have to replace a LOT of log lines. :))))) We can do it in a follow-up PR? :)

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

Hmm, I read the whole thing. I guess I can use a different logger, like zerolog?

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

@pydctw added the WIP here: #3191 with a description and an API I'm suggesting. :)

@sedefsavas
Contributor

Thanks for making progress on this @Skarlso

Since we are doing refactoring, it might be a good time to think about having consistency with cluster-api so that users won't have to grep differently to collect logs from different managers in the same management cluster.

It might be worth discussing this further, including other providers, to see if a common approach could be adopted instead of having custom logic here.
cc @sbueringer given that you started a discussion already during community meeting.

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

For sure, I'm open to suggestions as to which logger to adopt. :) I can use whatever we want as long as it's consistent with everything else, I guess.

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

To be clear, the reason for the dedicated interface is, of course, abstraction. I think it's sane to keep some kind of common interface and use whatever we want in the background. I would like to avoid having to change the logger again in so many places, if that's okay. We can easily change the implementation underneath if we keep a sane common interface.

@sbueringer
Member

sbueringer commented Feb 10, 2022

Hey,
just a short note for now. In core CAPI we are only (or at least in almost all cases) logging via the standard log interface used in controller-runtime (which is logr.Logger).

So we don't really have our own log interface/implementation in core CAPI like CAPA has (as far as I know).

Core CAPI currently uses klog as the underlying implementation of that interface (here we set klogr as the controller-runtime logger).

We are currently thinking about changing that underlying implementation to component-base/logs (which is used in kube-apiserver, kube-scheduler, ...). The effect is that the underlying implementation (behind logr.Logger) would still log via klog, but when the JSON logging format is enabled we would use the JSON logger implementation from component-base/logs. We would also inherit the log flags from component-base/logs.
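
As a rough sketch of that wiring (not actual CAPI/CAPA code; it assumes k8s.io/klog/v2/klogr and controller-runtime's SetLogger):

    // Minimal sketch: route controller-runtime's logr.Logger output through klog.
    package main

    import (
        "k8s.io/klog/v2/klogr"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    func main() {
        ctrl.SetLogger(klogr.New())
        ctrl.Log.Info("logging via klog now")
    }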

I think the best summary from a core CAPI perspective is this comment: kubernetes-sigs/cluster-api#5571 (comment)

To be honest, I don't know why CAPA currently has its own log interface, or what requirements make that (presumably) necessary.

@Skarlso
Contributor

Skarlso commented Feb 10, 2022

@sbueringer Hey Stefan. We didn't have one until now. :) CAPA is also using klog, the context logger, and logr.

However, it's not very user friendly. It's difficult to grep for keywords, and the notion of writing log.V(5).Info is just not very developer friendly. And then you have to remember which log level you used for what output.

I was in the process of making a unified interface which still uses logr in the background, but at least from a logging perspective we'd use things like log.Debug, log.Info, log.Warning, and then just add a grep-able key/value pair to each entry.
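
Roughly the kind of thin wrapper meant here (a sketch only; the method set and the mapping of names to V levels are illustrative, not the PoC's actual API):

    // Hypothetical thin wrapper over logr: named levels instead of V(i),
    // plus a grep-able "level" key/value pair on every entry.
    package log

    import "github.com/go-logr/logr"

    type Logger struct{ log logr.Logger }

    func (l Logger) Info(msg string, kv ...interface{}) {
        l.log.V(0).Info(msg, append(kv, "level", "info")...)
    }

    func (l Logger) Warning(msg string, kv ...interface{}) {
        l.log.V(1).Info(msg, append(kv, "level", "warning")...)
    }

    func (l Logger) Debug(msg string, kv ...interface{}) {
        l.log.V(4).Info(msg, append(kv, "level", "debug")...)
    }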

@sbueringer
Member

sbueringer commented Feb 11, 2022

Ah, got your point.

So it's about having a thin layer on top of go-logr which essentially provides the usual info/debug/... log levels instead of V(i) (in the code, as well as an additional "level": "x" k/v pair). I think I need more time to think this over :). We were kind of happy that we could potentially get rid of our own thin layer on top of go-logr that we've used in the ClusterClass implementation.

cc @fabriziopandini

P.S. I think this is orthogonal to replacing the logger "below" the interface, but I think it's relevant to some other log improvements we're thinking about making in core CAPI.

@fabriziopandini
Member

I don't have strong opinions about logging interfaces, because all of them have both pros and cons.
However, I'm strongly in favor of consistency across the ecosystem, and in CAPI we are using the following documents as a north star

@Skarlso
Contributor

Skarlso commented Feb 11, 2022

Regarding this bit:

  • klog.InfoS() has multiple levels:
    • klog.V(0) - Generally useful for this to ALWAYS be visible to an operator
      • Programmer errors
      • Logging extra info about a panic
      • CLI argument handling
    • klog.V(1) - A reasonable default log level if you don't want verbosity.
      • Information about config (listening on X, watching Y)
      • Errors that repeat frequently that relate to conditions that can be corrected (pod detected as unhealthy)
    • klog.V(2) - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
      • Logging HTTP requests and their exit code
      • System state changing (killing pod)
      • Controller state change events (starting pods)
      • Scheduler log messages
    • klog.V(3) - Extended information about changes
      • More info about system state changes
    • klog.V(4) - Debug level verbosity
      • Logging in particularly thorny parts of code where you may want to come back later and check it
    • klog.V(5) - Trace level verbosity
      • Context to understand the steps leading up to errors and warnings
      • More information for troubleshooting reported issues

I'm struggling to see the benefit of this. :) You trade readability for what, exactly? Switches like --v 4 or --v 3 ask the user to remember at what level a given message is logged, as opposed to something like --v debug or --v trace, which is easier to remember. And in the code, you have to explicitly remember this mapping. Why put this burden on the user AND the developer? I've read Dave's post, and I understand what he means by the proliferation of information in the logs. But in that case there is no point in using these either. :) Also, these log entries are anything but greppable or searchable for information such as names, actions, filtering for errors, or other relevant information. But maybe there is a point in not doing that at all, or not caring at all, which would be more in line with Dave's post.

And I've read the other point about why klog disagrees with Dave, but I don't actually like that direction because it's not very human friendly, I believe.

In any case. I bow my head to conform, but I feel like I had to share my opinion about the matter and why I wrote a wrapper interface in the first place.
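
For concreteness, the convention quoted above maps onto klog's structured calls roughly like this (a minimal sketch; the messages and keys are made up, not CAPA code):

    // Sketch of the quoted verbosity convention using k8s.io/klog/v2.
    package main

    import "k8s.io/klog/v2"

    func main() {
        klog.InfoS("starting controller", "controller", "awscluster")               // V(0): always visible
        klog.V(2).InfoS("reconciled security group", "security-group-id", "sg-123") // steady-state change
        klog.V(4).InfoS("requeueing reconcile", "requeueAfter", "30s")              // debug detail
        klog.V(5).InfoS("AWS API call finished", "operation", "DescribeSubnets")    // trace detail
    }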

@fabriziopandini
Member

@Skarlso I personally appreciate your opinion and am truly thankful for it, and I have no problem saying there are a couple of points I agree with. My intent above was just to provide some context on how we are tackling this problem in CAPI. The final call here is clearly up to the CAPA community, not to me.

@sedefsavas
Contributor

sedefsavas commented Feb 11, 2022

@Skarlso I have some concerns regarding diverging from the cluster-api ecosystem. But I think it'd be worth discussing this during our community meeting on Feb 21st at 9 am PT.

@sedefsavas
Contributor

Thanks all for chiming in and sharing your opinions on this.

@Skarlso
Contributor

Skarlso commented Feb 11, 2022

Cool, and thanks to all! 😊👍

@sbueringer
Member

sbueringer commented Feb 17, 2022

I'm with Fabrizio on that one. I also see some benefit in other log levels, but I prefer to stay consistent with the Kubernetes ecosystem.

We also have to take into account that we are using a lot of upstream libraries like controller-runtime and client-go. They all log with the .V(x) option. I'm not sure if client-go is doing it already, but they at least want to use the logger we pass into calls via ctx (https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/3077-contextual-logging). Of course we could pass in a logger which translates the ~ V(0)-V(10) messages to our log levels, but I personally would really prefer not to do that, in favor of conformance.
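
For context, "contextual logging" in that KEP means passing the logger through ctx, roughly like this (a sketch using go-logr helpers, not client-go's exact plumbing):

    // Sketch: attach a logr.Logger to a context and retrieve it downstream,
    // as the contextual logging KEP proposes for client-go and friends.
    package main

    import (
        "context"

        "github.com/go-logr/logr"
        "k8s.io/klog/v2/klogr"
    )

    func main() {
        ctx := logr.NewContext(context.Background(), klogr.New())
        doSomething(ctx)
    }

    func doSomething(ctx context.Context) {
        log := logr.FromContextOrDiscard(ctx)
        log.V(2).Info("doing something", "step", "example")
    }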

But I also agree with Fabrizio, it's a good discussion to have, it's just my opinion and I think it's something that the CAPA community should decide.

@pydctw
Contributor Author

pydctw commented Mar 8, 2022

@Skarlso, we wanted to discuss this issue during yesterday's CAPA office hours, but since you were not there, I'm following up here.

This is my understanding from reading the comments above, but to confirm: are you OK with CAPA following the k8s ecosystem's logging convention of using V(0)-V(10) levels instead of the logging interface from the PoC?

@Skarlso
Contributor

Skarlso commented Mar 10, 2022

Thanks, sorry, I was out on vacation. :) Will take a look.

@sedefsavas
Contributor

A related issue:
#2944

@Skarlso
Contributor

Skarlso commented Mar 13, 2022

@sedefsavas Okay, so... I'm okay with following this convention, but that will result in log entries that aren't very searchable, which is what the initial problem also concluded. So I will have to wrap it in something that sets up searchable entries anyway. :)

@richardcase
Member

And even if we follow the verbosity levels, we still need guidance and constants in the code that help people know which verbosity level they should use for their situation.

@Skarlso
Contributor

Skarlso commented Mar 15, 2022

Absolutely.

@richardcase
Member

@Skarlso - will ping you on slack.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 14, 2022
@richardcase
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 12, 2022
@Skarlso
Contributor

Skarlso commented Oct 12, 2022

/remove-lifecycle rotten

Okay, so now that we conform more with CAPI and use proper logging formats and initialization, I think we can refactor some internal things and have a unified logging interface. Whether that's just a bunch of constants like:

    // Named verbosity levels to pass to logr's V() instead of bare numbers.
    const (
        logLevelInfo    = iota // 0
        logLevelError          // 1
        logLevelWarning        // 2
        logLevelDebug          // 3
        logLevelTrace          // 4
    )

or a complete interface like using Debug instead of V(logLevelDebug) is for us to decide. :)
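
Either way, call sites would look something like one of these (a sketch; nothing here is decided, and the Debug method is hypothetical):

    // Option 1: keep logr's V(), but name the level with a constant.
    log.V(logLevelDebug).Info("Deleted security group", "security-group-id", sg.ID)

    // Option 2: a thin interface that hides V() entirely.
    log.Debug("Deleted security group", "security-group-id", sg.ID)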

@k8s-ci-robot removed the lifecycle/rotten label Oct 12, 2022
@richardcase
Copy link
Member

I like the idea of Debug() instead of V(logLevelDebug) personally.

@Skarlso
Contributor

Skarlso commented Oct 12, 2022

I second that.

@Skarlso
Contributor

Skarlso commented Oct 13, 2022

/assign
