Code review comments from @ialidzhikov #5

Open · 23 of 29 tasks

ialidzhikov opened this issue Feb 20, 2024 · 0 comments

ialidzhikov commented Feb 20, 2024

First, you could address the findings from Oliver's review against the main branch by creating a PR that resolves his comments.


Mid

Minor

  • Drop the .docforge dir and switch to the central manifest. After Use central and new manifest format documentation#431, the repos no longer need to define a .docforge dir and the manifests are maintained centrally. See the linked issue for more details. Additionally, as a consequence
    .PHONY: check-docforge
    check-docforge: $(DOCFORGE)
    	@$(REPO_ROOT)/hack/gardener-util/check-docforge.sh $(REPO_ROOT) $(REPO_ROOT)/.docforge/manifest.yaml ".docforge/;docs/" $(NAME) false
    has to be dropped. Also drop the make check-docforge target; it should no longer be needed.: Fix make verify #11
  • make check is reporting golangci-lint findings. You could fix them.: Fix make verify #11
  • make format is failing because there is no test/ dir.: Fix make verify #11
  • make generate is not implemented (
    .PHONY: generate
    generate: $(CONTROLLER_GEN) $(GEN_CRD_API_REFERENCE_DOCS) $(HELM) $(YQ)
    	echo "Code generation is currently not implemented"
    	# @$(REPO_ROOT)/hack/gardener-util/generate.sh ./cmd/... ./pkg/... ./test/...
    	# $(MAKE) format
    ): Does the project need code generation at all? If not, let's remove it.
  • ############# base image
    # TODO: Andrey: P1: Move to distroless
    • +1, let's use distroless instead of alpine. It is also part of the component checklist - the component should not run as a root user, if possible.: Fix make verify #11
  • The pkg/version pkg - we usually don't define such a pkg in other repos and rather reuse the k8s.io/component-base/version/verflag pkg. You should already be familiar with it, as in Add support for a --version command line flag gardener-extension-runtime-gvisor#38 you used this pkg and eliminated a custom version pkg in the runtime-gvisor extension (a minimal verflag sketch follows after this list): Move from GCR to artifact registry #10
  • All source files are missing a REUSE license header: Switch to use REUSE license format  #12
  • testIsolation metricsClientTestIsolation // Provides indirections necessary to isolate the unit during tests
    : Instead of having metricsClientTestIsolation you could directly have a field that is rest.HTTPClient. When you instantiate in non-test code, you pass the real client to a constructor func such as NewMetricsClient(httpClient). When you instantiate in test code, you pass a fake/mock client. (See the constructor-injection sketch after this list.)
  • You could move the copied code from gardener/gardener from ./pkg/util/gardener to /third_party/. Example: gardener/gardener@a1eb2fb: Drop vendoring #13
  • log.V(app.VerbosityInfo).Info("Creating client set")
    if _, err := k8sclient.GetClientSet(appOptions.RestOptions.Kubeconfig); err != nil {
    	return &log, nil, nil, fmt.Errorf("create client set: %w", err)
    }
    : A ClientSet is created but then it is not used. What is the rationale behind it? I assume we can drop it.
  • func Wrap(prefixMessage string, err error, varargs ...any) error {
    	if err == nil {
    		return nil
    	}
    	return fmt.Errorf(prefixMessage+": %w", append(varargs, err)...)
    }
    : Why do we need such a utility func, and why is it not possible to use fmt.Errorf natively, as in every other place in the gardener code-base? (A plain fmt.Errorf sketch follows after this list.)
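For the pkg/version item above, a minimal sketch of what reusing k8s.io/component-base/version/verflag could look like. The command name "gardener-custom-metrics" and the use of cobra are assumptions for illustration, not the component's actual wiring; only the verflag calls are from component-base:

    package main

    import (
    	"github.com/spf13/cobra"
    	"k8s.io/component-base/version/verflag"
    )

    func main() {
    	cmd := &cobra.Command{
    		Use: "gardener-custom-metrics",
    		RunE: func(cmd *cobra.Command, args []string) error {
    			// Print the version and exit if --version was passed.
    			verflag.PrintAndExitIfRequested()
    			// ... start the component ...
    			return nil
    		},
    	}
    	// Registers the --version flag on the command's flag set.
    	verflag.AddFlags(cmd.Flags())
    	_ = cmd.Execute()
    }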
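For the metricsClientTestIsolation item, a sketch of the suggested constructor injection; the metricsClient and NewMetricsClient names are assumptions, only rest.HTTPClient is taken from client-go:

    package metrics

    import (
    	"net/http"

    	"k8s.io/client-go/rest"
    )

    type metricsClient struct {
    	// The HTTP client is the only test seam needed; no separate
    	// metricsClientTestIsolation struct is required.
    	httpClient rest.HTTPClient
    }

    // Production code passes a real *http.Client (it satisfies rest.HTTPClient);
    // tests pass a fake or mock implementation.
    func NewMetricsClient(httpClient rest.HTTPClient) *metricsClient {
    	return &metricsClient{httpClient: httpClient}
    }

    var _ rest.HTTPClient = (*http.Client)(nil)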
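And for the Wrap helper, a sketch of the plain fmt.Errorf wrapping used elsewhere in the gardener code-base; the surrounding function is illustrative, not taken from the repo:

    package metrics

    import (
    	"fmt"
    	"net/http"
    )

    func doRequest(client *http.Client, req *http.Request) (*http.Response, error) {
    	resp, err := client.Do(req)
    	if err != nil {
    		// The same "%w" wrapping that Wrap produces, without the extra indirection.
    		return nil, fmt.Errorf("making http request: %w", err)
    	}
    	return resp, nil
    }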

Nits (really, really, really minor)

Questions:

  • Why is the repo private?: The repo has been made public.
  • metricsUrl := fmt.Sprintf("https://%s/metrics", pod.Status.PodIP)
    : What happens when pod.Status.PodIP is empty? According to the doc string of the field, pod.Status.PodIP will be empty if not yet allocated.
    • [andrerun-new]: See the log entry below. It's a bit ugly - a project outsider may have a hard time figuring out what's going on. @ialidzhikov, a pod gets stuck without an IP address every now and then, right? It's not an extremely rare event? If so, I think I should add special handling for this case and log a nicer message (one possible shape is sketched after these questions). [under-discussion]
      ERROR	gardener-custom-metrics.input.scraper	Kapi metrics retrieval failed	{"op": "scrape", "namespace": "shoot--local--local", "pod": "kube-apiserver-5588c58789-crm72", "error": "metrics client: making http request: Get \"https:///metrics\": http: no Host in request URL"}
      
  • MetricsUrl string // The URL where metrics for the pod can be scraped
    : Why do we store the whole MetricsUrl? Instead, you could only store the Pod IP and construct the metrics URL when fetching the metrics (see the sketch after these questions).
  • // information necessary to scrape such metrics.
    • [andrerun-new]: The design reason is that I want to keep the decision of 'where to scrape' outside of the scraper. There's also a minor runtime concern - I prefer less object creation/GC churn.

3: Storing the same Pod labels would waste a lot of memory. I see that you need the Pod labels to allow selecting metrics by object labelSelector. Maybe the whole model has to be adapted. We can, for example, accept that Pod labels are immutable and store them only once, not for every new metric value. [under-discussion]

  • : IIUC, the benefit of running 2 replicas is only that the 2nd Pod waits in "stand by" mode and, on issues with the leader replica, the "stand by" can take over faster. By faster - we don't have to wait for a new Pod to be scheduled and started. Updating the Endpoint manually to influence the traffic to go to the leader replica looks hacky. We were running metrics-server for Shoots and ManagedSeeds for years with a single replica and I don't recall us having issues related to it. https://github.com/kubernetes-sigs/metrics-server/tree/master?tab=readme-ov-file#high-availability: metrics-server seems to have a real HA mode where 2 of the replicas are serving (?). We can check what they do and how. And I agree with Proposed #3 (comment) - this approach is quite error-prone.
    • [andrerun]: The main benefit I see in the second replica is that it ties up compute resources in another AZ, so it guards against an AZ resource shortage disrupting failover. Overall, I have my reservations regarding the need for a second replica, considering the intended use of the component, but that was a hard requirement introduced by the GEP review process. I'll elaborate offline. [under-discussion]
  • I didn't manage to test the component in a local setup at all (due to missing docs/instructions), but I wanted to ask how it behaves on restarts and whether the HPA acting on the custom metric is fine with it. I assume that on Pod restart the leader will change and the newly elected replica won't report any metrics (or will report 0-ed metric values). Is HPA able to deal with unavailability of the gardener-custom-metrics component?
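A sketch covering the two metrics-URL questions above (storing only the Pod IP and handling an empty pod.Status.PodIP with a clearer error); the podMetricsURL and ErrPodIPNotAllocated names are illustrative assumptions, not the component's actual API:

    package scraper

    import (
    	"errors"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    )

    // ErrPodIPNotAllocated lets the scraper log a clearer message than the current
    // `Get "https:///metrics": http: no Host in request URL` while the IP is not yet allocated.
    var ErrPodIPNotAllocated = errors.New("pod IP not yet allocated")

    // podMetricsURL builds the scrape URL at fetch time from the Pod IP,
    // instead of keeping a precomputed MetricsUrl field.
    func podMetricsURL(pod *corev1.Pod) (string, error) {
    	if pod.Status.PodIP == "" {
    		return "", fmt.Errorf("%w: %s/%s", ErrPodIPNotAllocated, pod.Namespace, pod.Name)
    	}
    	return fmt.Sprintf("https://%s/metrics", pod.Status.PodIP), nil
    }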

Final notes: I didn't do a deep dive into non-trivial packages like ./pkg/input/metrics_scraper.
