
Windows CI: Add support for testing with containerd #41479

Merged
merged 2 commits into from Aug 24, 2021

Conversation

@olljanat (Contributor) commented Sep 21, 2020

- What I did
Set Windows Server Preview Build 20295 and later to use containerd as the default runtime (as agreed in #41455 (comment)) and provided CI for it (by modifying the Win 2022 CI added in #39846).

Updated to work with #42528.

- How I did it

  • Some preparatory changes to the tests were done in Prepare tests for Windows containerd support #42164
  • Moved Windows + containerd support out of experimental mode.
  • Added containerd support to Windows CI, including disabling TestExecWithCloseStdin and TestPsListContainersFilterHealth, which got stuck forever.
    - Set Windows builds greater than or equal to 20295 to default to containerd, and enabled CI for it.
  • Disabled the tests TestAPIStatsNoStreamGetCpu, TestAPIStatsNetworkStats, TestCommitAfterContainerIsDone and TestRunSetMacAddress, which look to be broken after updating to the latest containerd version.
  • Included containerd logs in the Jenkins artifacts.

- How to verify it
Pass CI on Win 2022 both with and without containerd.

- What is left to later PRs

- A picture of a cute animal (not mandatory but encouraged)

Relates to #41455

@olljanat olljanat marked this pull request as draft September 22, 2020 06:30
@olljanat (Contributor, Author)

Integration API tests start running, but some of them are failing with timeouts:

[2020-09-22T07:26:26.176Z] === Failed
[2020-09-22T07:26:26.176Z] === FAIL: github.com/docker/docker/integration/container TestExecWithCloseStdin (600.23s)
[2020-09-22T07:26:26.176Z] panic: test timed out after 10m0s

[2020-09-22T07:36:29.628Z] === Failed
[2020-09-22T07:36:29.628Z] === FAIL: github.com/docker/docker/integration/image TestRemoveImageOrphaning (600.18s)
[2020-09-22T07:36:29.628Z] panic: test timed out after 10m0s

But the integration CLI run most probably crashes containerd.exe, or dockerd.exe loses its connection to it, so the tests just hang until they time out:

[2020-09-22T07:37:00.441Z] INFO: Integration CLI tests being run from the host:
[2020-09-22T07:37:00.441Z] INFO: gotestsum --format=standard-verbose --jsonfile=..\\bundles\\go-test-report-intcli-tests.json --junitfile=..\\bundles\\junit-report-intcli-tests.xml -- "-tags" "autogen" "-test.timeout" "200m" 
[2020-09-22T07:37:09.060Z] INFO: Windows Base image is  mcr.microsoft.com/windows/servercore:ltsc2019
[2020-09-22T07:37:09.060Z] INFO: Testing against a local daemon
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIClientVersionOldNotSupported
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPICreateDeletePredefinedNetworks
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIErrorJSON
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIErrorNotFoundJSON
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIErrorNotFoundPlainText
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIErrorPlainText
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIGetEnabledCORS
[2020-09-22T07:37:09.060Z] === RUN   TestDockerSuite/TestAPIImagesDelete
[2020-09-22T07:37:10.035Z] === RUN   TestDockerSuite/TestAPIImagesFilter
[2020-09-22T07:37:10.035Z] === RUN   TestDockerSuite/TestAPIImagesHistory
[2020-09-22T07:37:10.498Z] === RUN   TestDockerSuite/TestAPIImagesImportBadSrc
[2020-09-22T07:37:10.961Z] === RUN   TestDockerSuite/TestAPIImagesSaveAndLoad
[2020-09-22T08:48:08.315Z] Sending interrupt signal to process
[2020-09-22T08:48:26.570Z] Sending interrupt signal to process
[2020-09-22T08:48:28.316Z] After 20s process did not stop

@olljanat (Contributor, Author) commented Mar 1, 2021

Looks like I found the most problematic test cases that made CI time out. These tests are still failing and need investigation:

[2021-02-28T20:29:28.366Z] === RUN   TestPauseFailsOnWindowsServerContainers
[2021-02-28T20:29:30.541Z] --- FAIL: TestPauseFailsOnWindowsServerContainers (2.55s)
[2021-02-28T20:29:30.541Z]     pause_test.go:65: assertion failed: expected error to contain "cannot pause Windows Server Containers", got "Error response from daemon: Cannot pause container d39844a5b34d0a8e7c2782e1d0e8ffb6e7082c36ef1b2824830f004ae4b92b05: not implemented"
[2021-02-28T20:29:30.541Z]         Error response from daemon: Cannot pause container d39844a5b34d0a8e7c2782e1d0e8ffb6e7082c36ef1b2824830f004ae4b92b05: not implemented

[2021-02-28T20:30:07.542Z] === RUN   TestResize
[2021-02-28T20:30:09.718Z] --- FAIL: TestResize (2.59s)
[2021-02-28T20:30:09.718Z]     resize_test.go:32: assertion failed: error is not nil: Error response from daemon: exec: 'c86c9ca26f5c60f033655fd89027cc188473b60b81fb478a3fb7e33a2135beb7' in task: 'c86c9ca26f5c60f033655fd89027cc188473b60b81fb478a3fb7e33a2135beb7' is not a tty: failed precondition

@TBBle (Contributor) commented Mar 1, 2021

Looks like I found the most problematic test cases that made CI time out. These tests are still failing and need investigation:

[2021-02-28T20:29:28.366Z] === RUN   TestPauseFailsOnWindowsServerContainers
[2021-02-28T20:29:30.541Z] --- FAIL: TestPauseFailsOnWindowsServerContainers (2.55s)
[2021-02-28T20:29:30.541Z]     pause_test.go:65: assertion failed: expected error to contain "cannot pause Windows Server Containers", got "Error response from daemon: Cannot pause container d39844a5b34d0a8e7c2782e1d0e8ffb6e7082c36ef1b2824830f004ae4b92b05: not implemented"
[2021-02-28T20:29:30.541Z]         Error response from daemon: Cannot pause container d39844a5b34d0a8e7c2782e1d0e8ffb6e7082c36ef1b2824830f004ae4b92b05: not implemented

This appears to just be a change in error message from locally-generated "cannot pause Windows Server Containers" to "Cannot pause container ...: not implemented" which might bubble up from hcsshim (although I can't track down the exact source in either hcsshim or containerd).

So checking for "annot pause" would match both error messages, I guess. Or "cannot pause Windows Server Containers|Cannot pause container .*: not implemented" if you want to get more regexy about it, and still check for the specific error text.

It'd be easier and more consistent if both old and new errors were wrapping a NotImplemented or something, but that's water under the bridge.

@TBBle (Contributor) commented Mar 1, 2021

Looks like I found the most problematic test cases that made CI time out. These tests are still failing and need investigation:

[2021-02-28T20:30:07.542Z] === RUN   TestResize
[2021-02-28T20:30:09.718Z] --- FAIL: TestResize (2.59s)
[2021-02-28T20:30:09.718Z]     resize_test.go:32: assertion failed: error is not nil: Error response from daemon: exec: 'c86c9ca26f5c60f033655fd89027cc188473b60b81fb478a3fb7e33a2135beb7' in task: 'c86c9ca26f5c60f033655fd89027cc188473b60b81fb478a3fb7e33a2135beb7' is not a tty: failed precondition

I had a quick look since I was looking at the other failure, and I believe this error is coming from ResizePty but I haven't traced through why we hit that, or even looked at the test.

func (he *hcsExec) ResizePty(ctx context.Context, width, height uint32) error {
	he.sl.Lock()
	defer he.sl.Unlock()
	if !he.io.Terminal() {
		return errors.Wrapf(errdefs.ErrFailedPrecondition, "exec: '%s' in task: '%s' is not a tty", he.id, he.tid)
	}

	if he.state == shimExecStateRunning {
		return he.p.Process.ResizeConsole(ctx, uint16(width), uint16(height))
	}
	return nil
}

Hmm, this might just be a fault in TestResize, perhaps

cID := container.Run(ctx, t, client)

should be

cID := container.Run(ctx, t, client, container.WithTty(true))

and we've never noticed that non-WCOW containers will accept a ResizeConsole call when they don't have a TTY? Or they are getting a TTY by default, and WCOW doesn't, perhaps. Or even hcs v1 and hcs v2 APIs differ in the 'default TTY' state. I don't know, but either way, it seems that in all cases, a test of TTY resize should explicitly request a TTY.

It might be interesting to have an integration test to see what happens with container.WithTty(false), as some platforms might pass but do nothing, and some might fail.

The other two tests would have the same problem, except that TestResizeWithInvalidSize is already blocked on WCOW (for no documented reason, maybe just an oversight in 66a37b4#diff-bfbb6624ee4b0a55ee7b8d18da70f0c7dc9c152a3981ad79bc80bc23fac31401), and TestResizeWhenContainerNotStarted probably generates the existing failure before it hits (he *hcsExec) ResizePty.

@olljanat olljanat force-pushed the ci-win-containerd-support branch 7 times, most recently from 04943af to 4c99dc7 Compare March 18, 2021 16:21
@olljanat (Contributor, Author)

@thaJeztah @StefanScherer I think it is time to discuss how we want to set up CI for this one.

Should we have two RS5 builds running in parallel? One as it is currently, and another with the environment variable:
DOCKER_WINDOWS_CONTAINERD_RUNTIME='1'

Also note that I split the modified tests out to #42164 and will rebase this one after it is merged.

@StefanScherer (Contributor)

@olljanat A parallel step with the environment variable sounds good to me.

@olljanat olljanat force-pushed the ci-win-containerd-support branch 2 times, most recently from 38b9708 to 07120bf Compare July 26, 2021 11:19
@olljanat olljanat changed the title Default to ContainerD on Windows Server 2022 Windows CI: Add support for testing with containerd Jul 26, 2021
@olljanat (Contributor, Author)

@thaJeztah rebased, updated to match #42528, and skipped the default runtime change. PTAL

Review threads:
  • integration-cli/requirements_test.go (outdated, resolved)
  • daemon/start_windows.go (resolved)
  • integration-cli/docker_api_stats_test.go (outdated, resolved)
  • integration-cli/docker_api_stats_test.go (outdated, resolved)
  • integration-cli/docker_cli_commit_test.go (outdated, resolved)
  • integration-cli/docker_cli_run_test.go (outdated, resolved)
  • integration-cli/requirements_test.go (outdated, resolved)
  • Dockerfile.windows (outdated, resolved)
@@ -252,6 +255,16 @@ RUN `
Remove-Item C:\binutils.zip; `
Remove-Item C:\gitsetup.zip; `
`
Write-Host INFO: Downloading containerd; `
Install-Package -Force 7Zip4PowerShell; `
Member:
Hm, we should ask them to upload a .zip for Windows.

(But should no longer be an issue if we start uploading containerd binaries to download.docker.com)

Contributor:

Huh, just noticed this. Windows has included tar for a long time, can we not just use that directly?

Member:

TIL! (It's clearly been a while since I worked on Windows.) Yes, if we don't need to install anything, that'd be great.

Contributor Author:

Windows Server 2016 does not include tar, so win-RS1 (which is still enabled on master branch builds) would stop working if we add a requirement for it.

Contributor Author:

Updated, but it looks like I still missed a couple of containerd references which need to be updated, and Win 2022 without containerd hit #42612. What do we want to do with this one?

Options are:

  1. Stay on 7zip.
  2. Switch to tar and add logic to the Dockerfile so containerd is only installed if the OS version is at least RS5.

Member:

Would an equivalent of command -v tar work (check if the command exists, and if not, download it)? A quick Google search brought me to this page: https://www.shellhacks.com/windows-which-equivalent-cmd-powershell/, and a Super User thread: https://superuser.com/questions/34492/powershell-equivalent-to-the-unix-which-command

I'm OK with doing it in a follow-up if it's too complicated, though.

@TBBle (Contributor) commented Aug 9, 2021:

The win-RS1 parallel in the build was disabled or inactive last time I looked, and I recall someone trying it briefly in one of the related PRs and discovering it to be non-working, or at least to need a bunch of tests skipped.

Not that I'm advocating breaking it further, but there was a discussion about dropping support for it in the 22.x release while moving to containerd anyway, since Server LTSC2016 falls out of mainstream support in January 2022.

I would not be shocked if containerd doesn't support Windows Server LTSC 2016 and no one noticed. There's stuff in containerd master (in the snapshotter) that doesn't seem to work on LTSC 2019, but I've never proven this in isolation as the tests that trigger it are part of my WIP, and trigger other issues as well.

Contributor Author:

RS1 still runs on the master branch after a PR is merged to master. You can see the build results by browsing the commit history at https://github.com/moby/moby/commits/master and clicking the green dot / red x there. Also, AFAIK those builds are still used as part of Docker EE packaging.

Contributor Author:

But CI is green 🟢, so maybe we go with this one for now?

hack/ci/windows.ps1 (resolved)
@olljanat olljanat force-pushed the ci-win-containerd-support branch 2 times, most recently from 76c65e2 to 90930f7 Compare August 9, 2021 11:31
@thaJeztah (Member) left a review:

LGTM!

@thaJeztah (Member)

@cpuguy83 PTAL

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@olljanat (Contributor, Author)

Rebased, as required by #42720.

@thaJeztah (Member) left a review:

still LGTM

@cpuguy83 @tianon PTAL

@@ -616,6 +638,15 @@ Try {
Write-Host -ForegroundColor Green "INFO: Args: $dutArgs"
New-Item -ItemType Directory $env:TEMP\daemon -ErrorAction SilentlyContinue | Out-Null

# Start containerd first
if (-not ("$env:DOCKER_WINDOWS_CONTAINERD_RUNTIME" -eq "")) {
Start-Process "$env:TEMP\binary\containerd.exe" `
Member:

Do we need to make sure containerd-shim-runhcs-v1.exe is in PATH, or is containerd.exe smart enough to look next to itself for that? (So we don't accidentally get one from the host.)

@TBBle (Contributor) commented Aug 19, 2021:

https://github.com/containerd/containerd/blob/v1.5.5/runtime/v2/shim/util.go#L66-L100 uses os/exec.LookPath first, and then if not found, it checks next to the containerd binary. So yes, I guess it would find a system-$PATH-installed containerd shim before the one next to containerd.exe.

Per https://github.com/containerd/containerd/blob/v1.5.5/docs/managed-opt.md it might make sense to put the shim binary in $env:ProgramData\containerd\root\opt (or rather, the test containerd's isolated root directory), although I didn't notice code in containerd to ensure that's searched before $PATH, and I haven't actually experimented with this myself.

Late edit: I checked, "managed opt" works by prepending itself to the PATH. So ignore that idea here, it doesn't bring anything more to the table.

Contributor Author:

Good catch. $env:PATH already gets overwritten in

$env:PATH="$env:TEMP\go\bin;$env:PATH"

and

$env:PATH="$env:TEMP\binary;$env:PATH;" # Force to use the test binaries, not the host ones.

so combining those and moving them to run before containerd.exe is started should be enough.

Will verify that and update the PR.

Contributor Author:

Fixed in the latest commit.

Contributor Author:

@tianon is that good now?

Member:

https://github.com/containerd/containerd/blob/v1.5.5/runtime/v2/shim/util.go#L66-L100 uses os/exec.LookPath first, and then if not found, it checks next to the containerd binary. So yes, I guess it would find a system-$PATH-installed containerd shim before the one next to containerd.exe.

Hm, this reminds me of a security fix in Go 1.15. Opened containerd/containerd#5906 to fix that 😅

…v1.exe is used

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@tianon (Member) left a review:

👍

@thaJeztah (Member)

😞 looks like it timed out after 2 hours; let me kick CI again to see if that was a one-off

@thaJeztah thaJeztah added this to the 21.xx milestone Aug 23, 2021
@olljanat (Contributor, Author)

😞 looks like it timed out after 2 hours; let me kick CI again to see if that was a one-off

Should be, as it already passed on runs 37 and 38: https://ci-next.docker.com/public/blue/organizations/jenkins/moby/activity?branch=PR-41479 (assuming no breaking changes have been merged to master since then).

@olljanat (Contributor, Author) commented Aug 24, 2021

Hmm, it timed out a second time. @StefanScherer, were there changes on the build servers between those runs?

EDIT: Ah, it was not a timeout; it was cancelled for other reasons (some Jenkins logic, I guess). So it just needs one more try.

@StefanScherer (Contributor)

Yes, we've updated the Windows Server 2022 machine to the LTSC version.
Job run 37 was with the Insider build and Docker 20.10.7; the aborted job run 40 was with the LTSC version and Docker 20.10.8. Those are the main differences.

But yes, let's kick off another build...

@olljanat (Contributor, Author)

CI 💚

@thaJeztah (Member)

Let's get this merged 👍 Thanks everyone!
