Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change merge strategy to preserve Plugins when importing Configs #7347

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rayburgemeestre
Copy link

Hi guys,

Hope you are all doing well. At work we ran into the following issue. Which is also being discussed here: #5837. This PR tries to propose a solution.

Example

Given the following config.toml for containerd:

# cat /cm/local/apps/containerd/var/etc/config.toml 
version = 2
imports = ["/cm/local/apps/containerd/var/etc/conf.d/*.toml"]

With multiple files that match this import:

# cat /cm/local/apps/containerd/var/etc/conf.d/cri.toml
[plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/cm/local/apps/kubernetes/current/bin/cni"
    
# cat /cm/local/apps/containerd/var/etc/conf.d/registry.toml 
[plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/cm/local/apps/containerd/var/etc/certs.d"

The current merge strategy for imports will overwrite the entire plugin io.containerd.grpc.v1.cri.

This means that depending on the filenames, containerd ends up with either just the cni config, or just the registry config. This was a conscious decision in the past. I hope however we can slightly modify this behavior in containerd, to merge the Plugin configuration instead? 😇

Current behavior

The two imports from the example effectively result in containerd having this configuration for the plugin:

[plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/cm/local/apps/containerd/var/etc/certs.d"

Note the absence of the cni bin_dir configuration, that was overwritten.

New proposed behavior

The change in this PR will have the merge result in the following configuration instead:

[plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/cm/local/apps/kubernetes/current/bin/cni"
    
[plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/cm/local/apps/containerd/var/etc/certs.d"

There are also two extra unit tests that attempt to demonstrate the change.

Why this change?

Whether the proposed behavior is better is probably subjective, but reasons why I am in favor of it ...

Let's say we have three configurations, cni.toml (for cni bin), registry.toml (for cert config path) and nvidia.toml (for changing the default runtime to nvidia).

Then it's nice to have separate smaller files, also arguably easier to get rid of certain configuration (get rid of nvidia being default runtime, rm -rf nvidia.toml). The problem now is that they all add something to the io.containerd.grpc.v1.cri plugin, and the last one wins.

If you have existing configuration files, it's in my opinion extra burden for the user to have the user first figure out which of the files already configures something for the same plugin before they can add their configuration without removing existing config.

In our case we try to automate this, so our code becomes more complex because of this as well 😛

How the old merge works

The following function seems to take care of the merging:

err := mergo.Merge(to, from, mergo.WithOverride, mergo.WithAppendSlice)

The config.Config being merged, has Plugins map[string]toml.Tree and is the only field in the struct that includes the toml.Tree struct. The above mentioned merge function will handle the actual merging of this struct here:

case reflect.Struct:
if hasMergeableFields(dst) {
for i, n := 0, dst.NumField(); i < n; i++ {
if err = deepMerge(dst.Field(i), src.Field(i), visited, depth+1, config); err != nil {
return
}
}

This will already have the effect of overwriting the whole io.containerd.grpc.v1.cri plugin, which surprised me! Because that means that the following for loop is unnecessary (which is why I removed it in this PR):

// Replace entire sections instead of merging map's values.
for k, v := range from.Plugins {
to.Plugins[k] = v
}

How the new merge works

In my PR I used the Transformer feature from mergo (see https://github.com/imdario/mergo#transformers)

With that feature we can hook into toml.Tree's merge from src to dst with a callback function. Since it will overwrite the struct completely, the best idea I could come up with is to preserve the values that would have been discarded by copying them from destination to source. This is slightly confusing, but knowing that mergo.merge will overwrite destination with source later, the values need to be present in source 😃

How I tested this code change

In order to avoid cluttering this already long PR description, I will add a text document as an attachment that shows a "real" example, how I tested if this code is also working on a real machine, and not just in unit tests.

Please let me know if this approach makes sense, and if not, how to do it differently.
It would make my life quite a bit easier if this would be the merge behavior for importing configs, but would be even better if others agree and see it as an improvement as well.

Thanks,
Ray

manual_testing.txt

@k8s-ci-robot
Copy link

Hi @rayburgemeestre. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rayburgemeestre rayburgemeestre force-pushed the change-merge-strategy-for-containerd-plugins branch 3 times, most recently from 90bd1f4 to fc7c9bc Compare August 30, 2022 19:58
@estesp
Copy link
Member

estesp commented Sep 2, 2022

Thanks for the detailed description and extra tests to validate behavior. Overall I understand what is being improved here and it seems valuable.

I guess one (lazy) question: does this now mean that load order is important for imports which may all have defined settings for the same entry? I haven't ever looked closely at mergo or our specific use of it, but I guess the open question in my mind is whether we now have a potential ordering/"race", and if so, maybe more importantly, is that ordering reliable or random (if I run merge one time and file 3 wins over file 2, will that always be true, or will sometimes file 2 win over file 3 for the same setting)? Mostly I just want to understand so we know whether or not we need to define expected behavior. I think with our current less "intelligent" implementation this just wasn't an issue as a single file could only impact an entire plugin and there was no true merging of line-by-line settings within a tree branch.

@rayburgemeestre
Copy link
Author

Hi Phil, thanks for the feedback!

You're right, it depends on the order in which the files are listed in the "imports" array. Luckily everything is all sorted and deterministic. The following in config.toml:

imports = ["/path/data1.toml", "/path/data2.toml"]

Will have data1.toml merge over config.toml, and then data2.toml over (already updated) config.toml.

Globs can also be used in this imports array. Those are luckily expanded to files in lexicographical order, I confirmed it with a debugger and some extra tests. This function (resolveImports) is where the glob is handled when loading the config:

for _, path := range imports {
if strings.Contains(path, "*") {
matches, err := filepath.Glob(path)
if err != nil {
return nil, err
}
out = append(out, matches...)

Zooming in on filepath.Glob eventually a sort is found to guarantee lexicographical order:
https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/path/filepath/match.go;l=347;drc=ccab2fbc30b0553fce54646a4da0a8645eda40a3

One combined example to demo all the behavior:

imports = ["/foo/data1.toml", "/bar/*.toml", "/aaa/a.toml"]

Let's say there are a few files in /bar: a.toml, b.toml and c.toml. Then the result becomes:

imports = ["/foo/data1.toml", "/bar/a.toml", "/bar/b.toml", "/bar/c.toml", "/aaa/a.toml"]

The code that is responsible for merging each config can be seen in the caller of resolveImports here:

imports, err := resolveImports(path, config.Imports)

@jimmidyson
Copy link

@estesp Is there anything I can do to help get this PR merged? I think it's an important one generally for configuration, it's important to me at least!

@dkoshkin
Copy link

This looks like a great change, we have very similar usecases adding nvidia and registry drop in files at runtime and this would enable us to remove old-standing workarounds.

@samuelkarp samuelkarp added this to New in Code Review via automation Sep 27, 2022
@samuelkarp samuelkarp moved this from New to Needs Discussion in Code Review Sep 27, 2022
@dmcgowan dmcgowan added the status/needs-discussion Needs discussion and decision from maintainers label Aug 28, 2023
@dmcgowan dmcgowan removed this from Needs Discussion in Code Review Aug 28, 2023
@dmcgowan dmcgowan added this to the 2.0 milestone Aug 28, 2023
Added some unit tests that show why we might want to get rid of it.

Signed-off-by: Ray Burgemeestre <rayb@nvidia.com>
@rayburgemeestre rayburgemeestre force-pushed the change-merge-strategy-for-containerd-plugins branch from fc7c9bc to a37ddca Compare February 8, 2024 11:45
@zvonkok
Copy link

zvonkok commented Feb 8, 2024

@rayburgemeestre
Copy link
Author

rayburgemeestre commented Feb 8, 2024

Sorry if the history of this PR now looks a little messy/verbose, it's because in the meantime the code base has changed.
I removed my transformer (I'll keep it as a text file here for the sake of archiving: TomlTreeTransformer.txt), in favor of the existing sliceTransformer that was recently introduced here:

1571301

I've left the minimally required fix needed at this point + unit tests. This makes at least the diff for this PR much simpler.

I'll also include the output for the tests without and without the fix (just to show how the added tests would fail without the change), for convenience. Please let me know if I need to change something more!

@elezar
Copy link
Contributor

elezar commented Feb 8, 2024

@rayburgemeestre since cri-o already supports drop-in files for overriding the config, does following a similar mechanism here work too? See https://github.com/cri-o/cri-o/blob/e0e17ee187c9f52d870b80cee9116c4fd5ca279e/pkg/config/config.go#L699

I haven't dug too much into whether only leaves of the config tree are replaced this way, but I know that this is commonly used to configure the NVIDIA runtime through a drop-in file, for example.

@dmcgowan
Copy link
Member

dmcgowan commented Feb 8, 2024

@elezar You can specific an import path or glob for loading configurations files which can override the base config. This is just changing the logic to merge the plugin configs rather than overriding the entire config. We also support dynamic drop in files for stuff like registry host configuration and cni.

I think this change is fine to accept as is. There are probably some surprise cases which we should make sure we document. I also found that the equal check in arrays of structs didn't work as well as I would have hoped. We can probably add more tests around that.

A separate put related change to this, we discussed changing config_path = "/cm/local/apps/containerd/var/etc/certs.d" to an actual array type. This would fit well with this feature if you were to drop in a whole configuration with its own directory of registry configs. Currently it uses a string split on a path separator, which is not great and doesn't allow anyway to merge.

})
}

func TestMergingTwoPluginConfigsRecursively(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a config with an array, possibly array of structs you can include? Those are the cases which hit the more complex logic today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you mean, I will try to update this PR with that idea (hopefully!) tomorrow. :)
I've been a bit quiet past few days, but it's taking me a bit longer to figure some things out, I've been trying to test a lot of real configurations today, mostly from actual configuration files that we use, and I think I found a few edge cases that were broken after I rebased this PR last week. The unit tests succeed, but I believe they are not testing enough.

EDIT: I have updated this test now to use a new test helper function that should include an array of configs. I think it is now more readable also to see what is being tested (especially because of switching to string for expected)

@dmcgowan dmcgowan removed the status/needs-discussion Needs discussion and decision from maintainers label Feb 8, 2024
…rged configs.

Added extra tests, and test helpers to make the tests more readable.

Signed-off-by: Ray Burgemeestre <rayb@nvidia.com>
@rayburgemeestre rayburgemeestre force-pushed the change-merge-strategy-for-containerd-plugins branch from bd8c292 to ca76cdb Compare February 13, 2024 15:11
@rayburgemeestre
Copy link
Author

After quite a bit of confusion because of switching back and forth between version 1.7.13 for work and 'master' for this PR. It turned out that containerd config dump did not seem to give reliable output for me. I tried to address it in my second/latest commit, I'm curious if my change makes sense to you as well @dmcgowan.

Some changes I made are:

  • outputConfig was always writing to os.Stdout. I added a unit test so I wanted to assert on the output, so I changed it to use generateConfig under the hood, which takes an io.Writer for the customizing the output.
  • generateConfig (in my PR, previously outputConfig) did some plugin loading and overwriting. However that would result in basically removing previously loaded stuff (LoadConfigs done in "dump" and "migrate" commands). My suggested fix is to do the loading and registration of plugins as it pleases, but then after that, we should merge the config that was read from disk on top of that, using the existing mergeConfig logic. Which I had to make public by renaming to MergeConfig for this reason, because it came from a different package.

The original change from earlier commits is still there and needed in my opinion.
Restoring for example this block of replacing entire plugins will nicely show what the effect is now with the unit tests, see: test_fails.txt

The simplest failure is from the first test here: https://github.com/containerd/containerd/pull/7347/files#diff-9dcf619ce07838a6db58d9519023dc77b0e4356c8e3e56ddea8029e43b937d93R257-R276. The [cni] part would be gone without it, as it is overwritten when the config for [registry] was merged.

// this command, generate the max configuration version
config.Version = srvconfig.CurrentConfigVersion

return toml.NewEncoder(os.Stdout).SetIndentTables(true).Encode(config)
// Now merge on top of this, the actual user config that is present on disk
if err = srvconfig.MergeConfig(config, userConfig); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to under the role the Merge is taking here. the plugin Decode is important to run on the user/default configuration as that will replace the interface{} initially decoded as map[string]interface{} into the struct provided by the plugin. I suspect when this Merge is called we may have the to case as the default plugin config struct and the from as the map[string]interface{} from dump (or nil from default)

@dmcgowan
Copy link
Member

dmcgowan commented Feb 13, 2024

It turned out that containerd config dump did not seem to give reliable output for me.
...
My suggested fix is to do the loading and registration of plugins as it pleases, but then after that, we should merge the config that was read from disk on top of that, using the existing mergeConfig logic.

Calling merge config after plugin decode seems like it may cause issues with the types not matching. The Decode takes the map[string]interface{}, marshals it into bytes then unmarshals it into the config struct provided by the plugin. One side of the merge would have had this extra processing while the other did not. If this behavior was needed, then the same marshal/unmarshal back into map[string]interface{} would likely be need to make the merge consistent.

Do you have more details on what you were seeing that made it unreliable?

@rayburgemeestre
Copy link
Author

rayburgemeestre commented Feb 14, 2024

I see, I think I did overlook that something changed. When trying to 'demo' in a concise manner I see that the original code gives warnings that are no longer shown after my changes. I will dig deeper into into. But in the meantime I'll share what was my motivation for the change.

--snip-- (cutting out part of my message, because I discovered I am not sure what I wrote is correct)

EDIT: Sorry, I picked an example that customized plugins."io.containerd.grpc.v1.cri".cni but that has changed to plugins."io.containerd.grpc.v1.runtime".cni since 1.7.x so I will recheck and update later.

Copy link
Author

@rayburgemeestre rayburgemeestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so a few hours later...

I finally understand what the original function is doing, also thanks to your clarification w/r/t Decode @dmcgowan 🙏. Now I realize my previous commit actually broke the correct behavior, sorry.

So what I learned is that go has init() functions that normally load stuff before main(). So plugins that are loaded during execution of containerd config dump were not being loaded in my newly added unit test. So in my unit tests more values were actually accepted than would have been in practice. Even though I got the correct result in my unit test, it was actually not the correct desired behavior, because the marshalling back and forth wasn't happening. Anyway, that is now undone in my third/last commit.

I've updated my unit test, so now it actually does test similarly to the "config dump" command, and I've updated my inputs for the test to be correct. E.g., not using [plugins.'io.containerd.grpc.v1.cri'.cni] but [plugins.'io.containerd.grpc.v1.runtime'.cni] instead.

Hope the changes + tests make more sense now again :)

{Expected: true, Value: "/my-custom-certs.d-config-path"},
{Expected: true, Value: expected_runtimes},
}
testMergeConfig(t, []string{data1, data2, data3, data4, data5, data6, data7, data8, data9}, asserts)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was worth keeping this test because as opposed to the unit tests in server/config/config_test.go, this one also tests the mapping back and forth (and removal of fields that are not mapping to the structs).

If you do not agree it's that useful, I can remove them. (If we want to remove, I can also undo the splitting of outputConfig into outputConfig & generateConfig in command/config.go)

… test with fixes.

Signed-off-by: Ray Burgemeestre <rayb@nvidia.com>
@rayburgemeestre
Copy link
Author

rayburgemeestre commented Mar 21, 2024

Hi all, I thought long and hard about this, but I decided to propose a replacement PR.
I've created this one already:

#9982

Some context has changed since I've opened this PR:

  • We now have a 2.0.0 release coming up, which perhaps makes it 'easier' to justify making such a potentially semi-breaking change.
  • There has been some back & forth in this PR, because I initially made it before a sliceTransformer was added to the main branch, taking away the majority of the proposed changes, and also rendering the branch-name for this PR slightly wrong now.
  • I also made some mistakes, and learned some new things, had to reverted some commits, etc.

Then finally also potentially too many additional unit tests, that even though I might find them useful, perhaps all of it make this PR too hard to review. Let me know if this makes sense, then I'll close this PR, and kindly request a review on that one instead 👼 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Needs Reviewers
Development

Successfully merging this pull request may close these issues.

None yet

9 participants