Builds can be triggered with old configuration, and not saved into the builds table, or appear anywhere in the UI #8915

Open
geofffranks opened this issue Feb 13, 2024 · 0 comments

Summary

Steps to reproduce

This will be tricky to reproduce, so these steps are a best guess based on what we observed when this occurred for us.

  1. Configure a pipeline to commit a "test-file-this-is-the-wrong-name" file to a git repo, and push it up to origin/main. Trigger this job every minute. If there's nothing to commit, that's ok, just end the job with success.
  2. Kick off a long-running job in a separate pipeline (ours had been running from Jan 31 through Feb 9th, trying to claim a locked pool-resource).
  3. In the meantime, modify the configuration of the first pipeline to commit "test-file-this-is-the-right-name" instead of the wrongly named file.
  4. Wait a few hours, or days, then manually remove the "test-file-this-is-the-wrong-name" file from the repo, and push the change up to origin/main.
  5. Wait for a ghost build to create "test-file-this-is-the-wrong-name" and add it back to the repo.
  6. Kill the long-running job.
  7. Repeat.

Expected results

Step 5 should never occur, and all builds kicked off for the pipeline should appear in the Concourse UI.

Actual results

Step 5 occurs, and the build does not show up in the Concourse UI. Searching the Concourse database for a record of the build also returns nothing.
However, the ATC logs show that the build was in fact triggered, even though the build ID it used cannot be found anywhere in the database.

Additional context

We have 3 web nodes and 1 database node. The job names differed between the test-file-this-is-the-wrong-name and test-file-this-is-the-right-name configurations, which made it easier to find the logs for the ghost build.

This may have to do with database locking, but I don't know. I would have dismissed this issue as too far-fetched to be real if I had not seen the builds in our logs and confirmed they are missing from our database.

ATC logs from our environment:

34551:{"timestamp":"2024-02-08T17:31:17.952396875Z","level":"info","source":"atc","message":"atc.tracker.notify.run.put-step.finished","data":{"build":"1","build_id":311872714,"exit-status":0,"job":"bump-healthchecker-in-networking-release","job-id":72956,"pipeline":"sandbox","session":"28.74164.3.73","step-name":"cf-networking-repo","team":"wg-arp-networking","version-info":{"version":{"ref":"60252ecaec6b65310415664163ff333b78933891"},"metadata":[{"name":"commit","value":"60252ecaec6b65310415664163ff333b78933891"},{"name":"author","value":"App Platform Runtime Working Group CI Bot"},{"name":"author_date","value":"2024-02-08 17:31:02 +0000"},{"name":"committer","value":"App Platform Runtime Working Group CI Bot"},{"name":"committer_date","value":"2024-02-08 17:31:02 +0000"},{"name":"message","value":"Upgrade silk-healthchecker\n"},{"name":"url","value":"https://github.com/cloudfoundry/cf-networking-release/commit/60252ecaec6b65310415664163ff333b78933891"}]}}}

web.dfabe417-7b47-4301-89f1-2db6ae4ada96.2024-02-09-21-21-39/web/web.stdout.log.4
108792:{"timestamp":"2024-02-08T17:31:31.843101381Z","level":"info","source":"atc","message":"atc.tracker.notify.run.put-step.finished","data":{"build":"1","build_id":311872719,"exit-status":0,"job":"bump-package-golang","job-id":72955,"pipeline":"sandbox","session":"28.74173.3.104","step-name":"cf-networking-repo","team":"wg-arp-networking","version-info":{"version":{"ref":"60252ecaec6b65310415664163ff333b78933891"},"metadata":[{"name":"commit","value":"60252ecaec6b65310415664163ff333b78933891"},{"name":"author","value":"App Platform Runtime Working Group CI Bot"},{"name":"author_date","value":"2024-02-08 17:31:02 +0000"},{"name":"committer","value":"App Platform Runtime Working Group CI Bot"},{"name":"committer_date","value":"2024-02-08 17:31:02 +0000"},{"name":"message","value":"Upgrade silk-healthchecker\n"},{"name":"url","value":"https://github.com/cloudfoundry/cf-networking-release/commit/60252ecaec6b65310415664163ff333b78933891"}]}}}
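Pulling the build IDs out of log excerpts like these can be scripted. The following is a minimal sketch, assuming each log entry is a single JSON object that may carry a grep-style line-number prefix (such as `34551:`), as in the excerpts above. The function name `extract_build_ids` is hypothetical and is not part of Concourse.

```python
import json
import re

# Matches an optional "NNN:" grep line-number prefix in front of the JSON payload.
_PREFIX = re.compile(r"^\d+:(?=\{)")


def extract_build_ids(lines, commit_ref):
    """Return {build_id: job} for put-step.finished events whose version ref is commit_ref."""
    found = {}
    for line in lines:
        line = _PREFIX.sub("", line.strip())
        if not line.startswith("{"):
            continue  # skip log file names and other non-JSON lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if not entry.get("message", "").endswith("put-step.finished"):
            continue
        data = entry.get("data", {})
        ref = data.get("version-info", {}).get("version", {}).get("ref")
        build_id = data.get("build_id")
        if ref == commit_ref and build_id is not None:
            found[build_id] = data.get("job")
    return found
```

Passing the lines of the web nodes' stdout logs together with the commit SHA of the unexpected commit would yield the build IDs to look up in the database.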

Searching our ATC database for the build IDs mentioned in those logs:

atc=> select * from builds where id in ('311872719', '311872714');
 id | name | status | scheduled | start_time | end_time | schema | private_plan | completed | job_id | reap_time | team_id | manually_triggered | interceptible | nonce | public_plan | pipeline_id | drained | create_time | aborted | rerun_of | rerun_number | inputs_ready | needs_v6_migration | span_context | resource_id | resource_type_id | created_by
----+------+--------+-----------+------------+----------+--------+--------------+-----------+--------+-----------+---------+--------------------+---------------+-------+-------------+-------------+---------+-------------+---------+----------+--------------+--------------+--------------------+--------------+-------------+------------------+------------
(0 rows)
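The same check can be automated for a larger set of logged build IDs. This is only a sketch: it uses Python's built-in `sqlite3` with an in-memory stand-in table so the example is self-contained, whereas the real ATC database is PostgreSQL and would need a PostgreSQL driver and its placeholder syntax. The helper name `find_ghost_build_ids`, the two-column table, and the ID `311872000` are assumptions made for illustration.

```python
import sqlite3


def find_ghost_build_ids(conn, logged_ids):
    """Return the logged build IDs that have no row in the builds table."""
    ids = sorted(set(logged_ids))
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id FROM builds WHERE id IN ({placeholders})", ids
    ).fetchall()
    present = {row[0] for row in rows}
    return [build_id for build_id in ids if build_id not in present]


# Demo against a stand-in table (NOT the real Concourse schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (id INTEGER PRIMARY KEY, name TEXT)")
# 311872000 is a made-up ID representing a build that was saved normally.
conn.execute("INSERT INTO builds (id, name) VALUES (?, ?)", (311872000, "1"))

ghosts = find_ghost_build_ids(conn, [311872714, 311872719, 311872000])
print(ghosts)  # the two IDs from the logs above, which have no row
```

Any ID returned by the helper was logged by the ATC but never persisted, which is the symptom described in this issue.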

Screenshot of commit not being listed as an output of any builds:

Screenshot 2024-02-13 at 10 44 01 AM

Timeline:

  1. Long-running job kicked off on Jan 31st.
  2. Commit generated by a valid job prior to changing our config (Feb 5th).
  3. Pipeline config fixed on Feb 8th at ~16:45 UTC.
  4. Commit generated by a valid job after changing our config (Feb 8th at 16:48 UTC).
  5. Commit generated by the ghost job (Feb 8th at 17:31 UTC).
  6. Long-running job canceled on Feb 9th.

Triaging info

  • Concourse version: 7.10.0
  • Browser (if applicable):
  • Did this used to work?