[ci.jio/infra.ci.jio] datadog plugin destroyed the build data. Can we stop using it? #4080

Closed
dduportal opened this issue May 3, 2024 · 12 comments

Comments

@dduportal
Contributor

dduportal commented May 3, 2024

Service(s)

ci.jenkins.io, infra.ci.jenkins.io, Datadog

Summary

Version 7.0.0 of the Jenkins datadog plugin was released 3 days ago, and we deployed it to the 2 controllers that use it: ci.jenkins.io and infra.ci.jenkins.io.

Again, we see the build history destroyed:

  • Builds are marked as "1 Jan 1970"
  • Some builds are stuck (despite having finished hours or days ago)
  • We cannot stop a stuck build except through the Console: clicking on "stop build" only produces an error stack trace in the logs
  • Any restart/reload spams the logs with XML parsing errors like:
could not load /var/jenkins_home/jobs/Tools/jobs/bom/branches/master/builds/2463
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Premature end of file.
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
	at com.thoughtworks.xstream.io.xml.StaxReader.pullNextEvent(StaxReader.java:58)
Caused: com.thoughtworks.xstream.io.StreamException: 
	at com.thoughtworks.xstream.io.xml.StaxReader.pullNextEvent(StaxReader.java:74)
	at com.thoughtworks.xstream.io.xml.AbstractPullReader.readRealEvent(AbstractPullReader.java:148)
	at com.thoughtworks.xstream.io.xml.AbstractPullReader.readEvent(AbstractPullReader.java:141)
	at com.thoughtworks.xstream.io.xml.AbstractPullReader.move(AbstractPullReader.java:118)
	at com.thoughtworks.xstream.io.xml.AbstractPullReader.moveDown(AbstractPullReader.java:103)
	at com.thoughtworks.xstream.io.xml.StaxReader.<init>(StaxReader.java:45)
	at com.thoughtworks.xstream.io.xml.StaxDriver.createStaxReader(StaxDriver.java:173)
	at com.thoughtworks.xstream.io.xml.StaxDriver.createReader(StaxDriver.java:100)
	at hudson.XmlFile.unmarshal(XmlFile.java:196)
Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/Tools/jobs/bom/branches/master/builds/2463/build.xml
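
A parse error at [row,col]:[1,1] with "Premature end of file" usually means the build.xml file has no content at all. Here is a minimal Python sketch, assuming it runs on a host with read access to the jobs directory, that lists such files so the affected builds can be identified before a restart. The function name and the 4 KiB whitespace check are illustrative choices, not part of Jenkins or of the datadog plugin:

```python
from pathlib import Path


def find_empty_build_files(jobs_root):
    """Return build.xml files under jobs_root that contain no data.

    A file counts as empty when its first 4 KiB hold only whitespace
    (a zero-byte file is the case that fails at row 1, column 1).
    """
    empty = []
    for path in Path(jobs_root).rglob("build.xml"):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(4096)
        if not head.strip():
            empty.append(path)
    return sorted(empty)
```

Calling find_empty_build_files("/var/jenkins_home/jobs") (path taken from the log excerpt above) would return the build.xml paths to inspect. The sketch deliberately does not parse the XML, since it only targets the empty-file symptom.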

@MarkEWaite did warn us in https://matrix.to/#/!JLUOInpEYmxJIYXlzs:matrix.org/$X94sIpk_CEMCxnnu-JmIzR1aHg_6uG0-Crl6SMXCD_Y?via=g4v.dev&via=gitter.im&via=matrix.org about another user with the same problem. He mentioned the following long-running issue: https://issues.jenkins.io/browse/JENKINS-66328

Related: #4079

Notes

  • In order to stop a "zombie" (stuck) build, an administrator needs to run the following code in the Administration Console:
Jenkins.instance
.getItemByFullName('<job name>')
.getBranch('<branch or PR name>')
.getBuildByNumber(<build ID>)
.finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build"));

The job names can be determined with:

Jenkins.instance.getAllItems(AbstractItem.class).each {
    println(it.fullName)
}
@dduportal
Contributor Author

Update:

  • Took 2 snapshots of the 2 controllers so we can try to restore the data once the problem is fixed, or provide data to the datadog plugin developers
  • Worked on "BOM testing build is stuck" (#4079) and stopped the stuck builds
  • Removed all visible stuck builds on infra.ci and triggered new builds: terraform jobs, kubernetes, website jobs and packer-image. There might be others

Todo:

  • Clean up the ci.jenkins.io build queue by fixing each job one after the other

@dduportal
Contributor Author

Opened jenkinsci/datadog-plugin#423 to make the plugin maintainers aware

@dduportal
Contributor Author

A quick discussion/research by the team (thanks @lemeurherve !) shows the use cases we have on the Jenkins infra with the datadog plugin.
The goal for us is to understand the pros and cons of removing the plugin to protect ourselves from such failures, as it is the 3rd time in the past 2 months. We would keep the datadog-agent infrastructure collection: it is only the Jenkins integration through the plugin that we consider disabling.

  • We have a monitor of the build queue which will be impacted (as the metric won't be reported anymore): https://github.com/jenkins-infra/datadog/blob/main/datadog-monitors.tf#L194-L213
    • Removing the datadog plugin from ci.jio or infra.ci.jio would need us to disable/delete this monitor
  • ci.jenkins.io uses the datadog CI/CD observability (with the metrics + traces collected by datadog) as per "Add observability for the build agents" (#2769)
    • Removing the datadog plugin from ci.jio would decrease the observability around the BOM. Please note that agents have their metrics and logs collected even if we remove the plugin, so we won't be completely blind here
  • infra.ci.jenkins.io most probably has the plugin for the same reason as ci.jio (observability of pipelines), but we don't use it there: we should consider no longer using the plugin on this private controller.

@dduportal dduportal removed the triage Incoming issues that need review label May 3, 2024
@dduportal
Contributor Author

Update: stopped all the builds apparently stuck on ci.jenkins.io (e.g. from the build queue, checking each build and, if older than 2 hours, checking the associated branch + history)
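
The "older than 2 hours" rule of thumb above can be sketched as a small filter. This is only an illustration in Python: the function name, the shape of the input (a mapping from build name to start time) and the build names are hypothetical, and the real check was done by hand in the Jenkins UI:

```python
from datetime import datetime, timedelta, timezone


def builds_to_review(running_since, now, max_age=timedelta(hours=2)):
    """Return names of builds that started more than max_age before now.

    running_since maps a build name to its timezone-aware start time.
    """
    return sorted(
        name for name, started in running_since.items()
        if now - started > max_age
    )


# Hypothetical data: one build running for 5 hours, one for 30 minutes.
now = datetime(2024, 5, 3, 12, 0, tzinfo=timezone.utc)
running = {
    "Tools/bom/master#2463": now - timedelta(hours=5),
    "Core/jenkins/PR-1#7": now - timedelta(minutes=30),
}
print(builds_to_review(running, now))
```

Only the builds returned by the filter would then need the manual branch and history check.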

@jonesbusy

What about https://plugins.jenkins.io/opentelemetry/ ?

Not sure if the collector can send metrics to datadog but might be an option to keep observability of bom builds

@lemeurherve
Member

For the record,

The 7.0.0 release of the plugin has been removed from the Update Center

Originally posted by @nikita-tkachenko-datadog in jenkinsci/datadog-plugin#423 (comment)

@MarkEWaite

Confirmed that it has been removed from the update center by opening

https://updates.jenkins.io/latest/datadog.hpi?mirrorlist

That URL opens to the 6.0.3 release of the plugin.
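
The same check could be scripted. The Python sketch below is an assumption-heavy illustration: it assumes that requesting the "latest" URL without ?mirrorlist redirects to a download URL of the form .../plugins/<name>/<version>/<name>.hpi, which this thread does not confirm. Both function names are hypothetical:

```python
import re
import urllib.request

# Assumed URL layout: .../plugins/<plugin>/<version>/<plugin>.hpi
_VERSION_IN_URL = re.compile(
    r"/plugins/(?P<plugin>[^/]+)/(?P<version>[^/]+)/(?P=plugin)\.hpi"
)


def version_from_download_url(url):
    """Extract the plugin version from a resolved download URL, or None."""
    match = _VERSION_IN_URL.search(url)
    return match.group("version") if match else None


def latest_served_version(plugin, base="https://updates.jenkins.io/latest"):
    """Follow redirects for <base>/<plugin>.hpi and parse the final URL.

    The response body is never read; only the final URL is inspected.
    """
    with urllib.request.urlopen(f"{base}/{plugin}.hpi", timeout=30) as response:
        return version_from_download_url(response.geturl())
```

If the assumed layout holds, latest_served_version("datadog") would return the version string currently served; if it does not, the function returns None rather than guessing.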

@dduportal
Contributor Author

For the record,

The 7.0.0 release of the plugin has been removed from the Update Center

Originally posted by @nikita-tkachenko-datadog in jenkinsci/datadog-plugin#423 (comment)

Confirmed that it has been removed from the update center by opening

https://updates.jenkins.io/latest/datadog.hpi?mirrorlist

That URL opens to the 6.0.3 release of the plugin.

As such, we'll try to revert the plugin to 6.0.3 to check if it works again without blocking builds or destroying histories.

The following operations will be performed:

  • Set ci.jenkins.io in "maintenance" mode to limit the amount of changed files (which will be lost if backup is restored)
  • Take a snapshot of the JENKINS_HOME's data disk to serve as backup if things go wrong
  • Upload the 6.0.3 datadog plugin HPI manually from the UI and restart
  • Then:
    • If the downgrade goes well and solves the 1970 dates (and the stuck builds), then we continue with the plugin for now, as we cannot currently spend the effort to move away from datadog on ci.jenkins.io, and we'll be careful with upcoming upgrades of datadog
    • Else, if it fails terribly, then we restore the backup and try the same with the plugin removed

@dduportal
Contributor Author

dduportal commented May 6, 2024

Update:

  • Downgrade of datadog plugin done with success
  • The datetime of 1970 is not solved by the downgrade.
  • However, no more stuck builds: triggered builds ran as expected.
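
On the "1970" symptom: 1 Jan 1970 is what a timestamp of zero renders as, since Unix timestamps count from that date. That suggests the start time of the affected builds was lost or never written, which is an assumption about the cause that this thread does not confirm. A minimal Python sketch of the rendering (the function name is illustrative; Jenkins stores build timestamps in milliseconds):

```python
from datetime import datetime, timezone


def render_build_date(timestamp_ms):
    """Format a millisecond Unix timestamp as a day-month-year string in UTC."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%d %b %Y")


print(render_build_date(0))  # 01 Jan 1970
```

This would also explain why the downgrade alone did not repair the dates: if the value on disk is already zero or missing, changing the plugin version does not bring it back.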

@dduportal
Contributor Author

Next step: removing datadog plugin from infra.ci.jenkins.io.

Ping @smerle33: don't forget there is a datadog configuration to remove before upgrading infra.ci with an image without the plugin (otherwise it will crash at startup).

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

Update:

A patch fixing the problem has been published for the datadog plugin: version 7.0.1. Worth upgrading on ci.jenkins.io (needs planning, announcement, backup)

This was done successfully yesterday:

  • Jenkins home snapshot taken to back-up data
  • Plugin upgrade went well: no more stuck builds or "1970" builds.

We can close this issue!
