multi g-w: Interactive username cltbld does not match task user task_171042065159733 #6952

Open
aerickson opened this issue Apr 3, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@aerickson
Contributor

aerickson commented Apr 3, 2024

Describe the bug
g-w multiuser workers get stuck in a reboot loop with the message:

"description": "Interactive username cltbld does not match task user task_171042065159733 from file \"/opt/worker/next-task-user.json\"",

To Reproduce

Steps to reproduce the behavior:

Unsure how they are getting to this state. Does the error mean that someone submitted a job trying to be interactive as the 'cltbld' user (which isn't valid, since we're in multiuser mode)? If that's what's going on, it should definitely be fatal for the job, but it shouldn't be fatal for the worker.

Expected behavior
The worker would keep working.

We resolve this with sudo rm /opt/worker/*user.json followed by a reboot. It seems like g-w could detect this state and do the same, so that we don't have to intervene manually.
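
For reference, the manual workaround above can be sketched as a small operator-side script. This is only an illustration of the `rm /opt/worker/*user.json` step, not something Generic Worker does itself; the function name `remove_stale_user_files` is hypothetical, and it would need to run with root privileges on a real worker. The reboot is deliberately left to the operator.

```python
import glob
import os


def remove_stale_user_files(worker_dir="/opt/worker"):
    """Delete every file matching <worker_dir>/*user.json.

    Mirrors the manual `sudo rm /opt/worker/*user.json` workaround.
    Returns the sorted list of paths that were removed.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(worker_dir, "*user.json"))):
        os.remove(path)
        removed.append(path)
    return removed
```

Calling `remove_stale_user_files()` with no arguments targets `/opt/worker`; passing another directory makes it easy to try out on a scratch copy first.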

Taskcluster version
generic-worker (multiuser engine) 60.3.4 [ revision: https://github.com/taskcluster/taskcluster/commits/943a6f2b0d14fa0270280bc6f23acc2945d0fe45 ]

Platform (please complete the following information):

Mac OS X

[aerickson@macmini-r8-255.test.releng.mdc1.mozilla.com ~]$ sw_vers
ProductName:		macOS
ProductVersion:		13.6
BuildVersion:		22G120

Additional context

aerickson added the bug label Apr 3, 2024
@rcurranmoz
Contributor

Upvote

@petemoore
Member

This means Generic Worker ran, created the user task_171042065159733 to run the next task, and then rebooted the machine. When Generic Worker starts up again, it sees that the next task should be run by task_171042065159733, so it waits for that user to log into the desktop. However, it discovers that the user cltbld has logged into the desktop instead, and it does not know what to do, so it gives up. Perhaps it should reboot the machine, but perhaps someone has logged in on purpose and is doing something. It can't really know, which is why it raises the error.
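
The startup check described above can be illustrated roughly as follows. This is a simplified sketch, not Generic Worker's actual code: the function name `check_interactive_user` is hypothetical, the assumption that the file stores the username under a `"name"` key is mine, and the real worker also waits for the desktop login, which is skipped here.

```python
import json


def check_interactive_user(next_task_user_file, interactive_username):
    """Compare the logged-in desktop user with the expected task user.

    ASSUMPTION: the JSON file stores the task username under the key "name".
    Returns the expected username on a match, raises RuntimeError otherwise.
    """
    with open(next_task_user_file) as f:
        expected = json.load(f)["name"]
    if interactive_username != expected:
        # The worker cannot tell whether the other login was deliberate,
        # so it reports the conflict instead of trying to repair it.
        raise RuntimeError(
            f"Interactive username {interactive_username} does not match "
            f'task user {expected} from file "{next_task_user_file}"'
        )
    return expected
```

With the file from this issue, a call with `interactive_username="cltbld"` would raise, while a call with the task user's own name would return normally.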

What is the reason for the cltbld account being logged into the machine? Is there a loaner process, whereby users are granted access to workers via this user account? There is a generic-worker interactive feature that would allow users to run interactive tasks via generic-worker. That also has the benefit that the interactive task will run as a real task user.

I think the fix here is not to log into the machine as the cltbld user. If logging in is needed to perform some administrative activity that can't be done over ssh, then sudo rm /opt/worker/*user.json is indeed the correct approach.

Note, Generic Worker can't really decide to self-fix this issue, because the state itself demonstrates that something is wrong. Someone or something in a trusted environment has logged in as the wrong user and done something that conflicts with Generic Worker. This is an error, so it panics. It created the user and rebooted the machine, so it expects that user to be logged in. If it had an auto-fix that cleaned up and rebooted whenever a different user was logged in, you would never be able to log in as cltbld, because Generic Worker would immediately reboot the machine.

I think the underlying problem is not Generic Worker behaviour; it is that something/someone is logging in as cltbld, which interferes with the worker workflow. If this is for interactive tasks, is there a reason users can't use the taskcluster interactive feature directly? That is guarded by scopes, does not require that anyone share passwords with users, and has no administrative burden. It guards access to workers, and makes sure any changes users apply occur in an isolated environment. If they need to be able to do things as root, the task user can be granted privileges by adding osGroups to the task payload, e.g. to make it capable of sudo if that is required.
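
As a rough illustration of the osGroups suggestion, a task payload could be assembled along these lines. This is a sketch under assumptions: the helper `build_task_payload` is hypothetical, the `admin` group in the usage example is only a guess at a sudo-capable group on macOS, and the `features.interactive` field is my understanding of how the interactive feature is enabled. The field names should be checked against the payload schema of the deployed generic-worker version.

```python
import json


def build_task_payload(command, os_groups=(), interactive=False, max_run_time=3600):
    """Build a generic-worker style task payload as a plain dict.

    ASSUMPTIONS: "command" is a list of argument lists, "osGroups" is a
    list of group names, and the interactive feature is switched on via
    features.interactive. Verify against the worker's payload schema.
    """
    payload = {
        "command": [list(command)],
        "maxRunTime": max_run_time,
    }
    if os_groups:
        payload["osGroups"] = list(os_groups)
    if interactive:
        payload["features"] = {"interactive": True}
    return payload


if __name__ == "__main__":
    example = build_task_payload(
        ["/bin/bash", "-c", "id"],
        os_groups=["admin"],  # hypothetical group name
        interactive=True,
    )
    print(json.dumps(example, indent=2))
```

The task would still need whatever scopes the deployment requires for the requested groups and for the interactive feature.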

@petemoore
Member

Note, the fact that it is a reboot loop is probably beyond the scope of Generic Worker. Generic Worker just exits with a particular exit code; presumably something else detects this and then reboots the machine, causing the boot loop. If that thing is Worker Runner, that is a separate Worker Runner bug, which will be gone when #6229 lands.
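
To make the distinction concrete, whatever wraps Generic Worker could in principle treat some exit codes as "stop and wait for a human" rather than "reboot". The sketch below is purely hypothetical: `next_action` and `halt_exit_codes` are invented names, it does not reflect how Worker Runner or any existing launcher behaves, and the actual exit code for this error is not stated in this thread, so the caller has to supply it.

```python
def next_action(exit_code, halt_exit_codes):
    """Decide what a wrapper script does after the worker process exits.

    HYPOTHETICAL policy: exit codes listed in halt_exit_codes signal a
    state that a reboot cannot fix (such as the user mismatch above),
    so the wrapper halts instead of rebooting into the same error.
    """
    if exit_code in halt_exit_codes:
        return "halt"
    return "reboot"
```

A wrapper using this policy would break the loop for the configured exit codes while keeping the usual reboot behaviour for everything else.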
