Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

main: journald support with external correlation #4047

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lzap
Copy link
Contributor

@lzap lzap commented Mar 28, 2024

This patch adds native journald support through logrus which enables to work with structured logs. It uses a dependency that is already in the project (coreos/systemd-go) which is a pure Go implementation. New log type is introduced named "journal" and it is now the default with fallback to "text" if journald is not present on the system.

Structured logging enabled me to work with logrus fields that can help with correlating log records on standalone or hosted deployments. Already existing "operation_id" field (random string) is now fully propagated via echo context logger and new logrus hook extracts it from Go standard context.

A new field called "external_id" is extracted from HTTP header X-External-Id when present. The idea is to set this to X-Edge-Request-Id on hosted image builder so logs can be easily correlated across both services.

Additionally, each cloud API request is now logged using the built-in echo logging middleware which creates entries like Processed request GET /path. Together with journald this can be tested end-to-end:

# curl -H "X-External-Id: XXXX" --unix-socket /run/cloudapi/api.socket -XGET http://localhost/api/image-builder-composer/v2/openapi

This will create the following entries:

# journalctl -e -o json-pretty --output-fields=OPERATION_ID,MESSAGE,EXTERNAL_ID
{
        "OPERATION_ID" : "2eJwe1mdLa9s2rB3PVZbVAttQi1",
        "EXTERNAL_ID" : "XXXX",
        "MESSAGE" : "Returning OpenAPI document"
}
{
        "MESSAGE" : "Processed request GET /api/image-builder-composer/v2/openapi",
        "EXTERNAL_ID" : "XXXX",
        "OPERATION_ID" : "2eJwe1mdLa9s2rB3PVZbVAttQi1",
}

Users can easily dig logs per request or operation:

# journalctl -u osbuild-composer OPERATION_ID=2eJwe1mdLa9s2rB3PVZbVAttQi1 -o cat
Returning OpenAPI document
Processed request GET /api/image-builder-composer/v2/openapi

TODO

  • adequate testing for the new functionality or fixed issue
  • adequate documentation informing people about the change such as
    • submit a PR for the READMEs listed here
    • submit a PR for the osbuild.org website repository if this PR changed any behavior not covered by the automatically updated READMEs

HACKING.md Show resolved Hide resolved
internal/cloudapi/v2/handler.go Outdated Show resolved Hide resolved
Copy link
Contributor

@kingsleyzissou kingsleyzissou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability it would be really nice if this could get broken up into smaller patches :)

Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a decent idea and will make log reading a lot less unpleasant.
I'd like to see the PR broken into more readable smaller commits though. Some of the inline comments could be in-code since they clarify things that are happening implicitly and aren't going to be very useful as PR comments in the long run.
Also, some of the inline comments are nice explanations for why a change happened, which is often very useful to have as a commit message for that standalone patch.

HACKING.md Show resolved Hide resolved
cmd/osbuild-composer/main.go Show resolved Hide resolved
cmd/osbuild-composer/main.go Show resolved Hide resolved
internal/cloudapi/v2/server.go Show resolved Hide resolved
@lzap
Copy link
Contributor Author

lzap commented Apr 4, 2024

Moved all comments into the codebase, broken into three commits:

  • Documentation and gitignore update
  • Adding journald native support
  • Propagating operation and external id through context to logs (this one is hard to break up into two as I had to do updates to existing operation id in order to efficiently implement external id)

achilleas-k
achilleas-k previously approved these changes Apr 5, 2024
Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tidying up. LGTM!

Tiny note about the documantation, but can be changed in a follow-up, or even ignored.

HACKING.md Outdated Show resolved Hide resolved
HACKING.md Outdated Show resolved Hide resolved
@lzap lzap force-pushed the correlated-journal branch 3 times, most recently from f00871d to 14cfd17 Compare April 8, 2024 07:16
Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lzap We don't do merge commits in these repos. Can you drop the merge commit and rebase instead?

@lzap
Copy link
Contributor Author

lzap commented Apr 9, 2024

Damn that was a misclick, amended a rebase.

achilleas-k
achilleas-k previously approved these changes Apr 9, 2024
@lzap
Copy link
Contributor Author

lzap commented Apr 9, 2024

+ sudo composer-cli blueprints depsolve ostree
ERROR: Depsolve Error: Get "http://localhost/api/v1/blueprints/depsolve/ostree": EOF

and then an AWS test which I don’t understand why fails

@achilleas-k
Copy link
Member

and then an AWS test which I don’t understand why fails

iot-bootable-container failed with

dial tcp: lookup cdn03.quay.io: no such host

which looks like a plain old network hiccup. Annoying.

@ezr-ondrej ezr-ondrej force-pushed the correlated-journal branch 2 times, most recently from 90404cf to 7d2c53a Compare April 12, 2024 09:58
@lzap
Copy link
Contributor Author

lzap commented Apr 12, 2024

@ezr-ondrej please do not rebase, @achilleas-k plans to only restart failed tests. is that right? thanks.

@achilleas-k
Copy link
Member

@ezr-ondrej please do not rebase, @achilleas-k plans to only restart failed tests. is that right? thanks.

It's a tricky decision. It's good to rebase when there's a lot of changes in main and there's a chance the rebase might break things, but OTOH the CI runtimes are long and annoying.

A merge queue would be great but that doesn't work with flaky testing.

@ezr-ondrej
Copy link
Contributor

Yeah, sorry! I've got new buttons in Gitlab now, I'll push those instead xD

@lzap
Copy link
Contributor Author

lzap commented Apr 15, 2024

Only schulzbot is red, seems only 8.4 is red, likely not caused by this PR.

@ezr-ondrej
Copy link
Contributor

likely not caused by this PR.

Yeah, it's a timout, very unlikely to be caused by this PR, but it means the tests on 8.4 didn't run on this PR :(

@achilleas-k
Copy link
Member

There's a known issue with that test but I thought we fixed it already.

@lzap
Copy link
Contributor Author

lzap commented Apr 17, 2024

Rebased.

@ezr-ondrej
Copy link
Contributor

The main pipeline is green!! Woooot!

@lzap
Copy link
Contributor Author

lzap commented Apr 22, 2024

Well bunch of tests failed, someone help me to understand if these are relevant. These logs are hard to read.

@achilleas-k
Copy link
Member

Since all the edge tests are failing but the main gitlab CI pipeline succeeded, it might something with the edge tests themselves or something in the PR changed something the tests rely on.

I'm seeing at least one instance of the issue described in #4094

@achilleas-k
Copy link
Member

achilleas-k commented Apr 22, 2024

Oh, we're hitting the old:

error: Postprocessing and committing: Finalizing rootfs: Hardlinking rpmdb to base location: Hardlinking /usr/share/rpm to /usr/lib/sysimage/rpm-ostree-base-db: Analyzing /usr/share/rpm/ content: File exists (os error 17)\
Traceback (most recent call last):\
  File \\\"/run/osbuild/bin/org.osbuild.ostree.commit\\\", line 68, in <module>\
    r = main(args[\\\"inputs\\\"],\
  File \\\"/run/osbuild/bin/org.osbuild.ostree.commit\\\", line 55, in main\
    subprocess.run(argv,\
  File \\\"/usr/lib64/python3.9/subprocess.py\\\", line 528, in run\
    raise CalledProcessError(retcode, process.args,\
subprocess.CalledProcessError: Command '['rpm-ostree', 'compose', 'commit', '--repo=/run/osbuild/tree/repo', '--add-metadata-string=version=9-stream', '--add-metadata-string=rpmostree.inputhash=1e1524b04aaad2bbdc26c3d2a1d43e1bf5c95ae327bf399af3815389b277f64d', '--write-composejson-to=/run/osbuild/tree/compose.json', '/tmp/tmp1f6ir9sg.json', '/run/osbuild/tree/tmp22cqxixc']' returned non-zero exit status 1.\
\", \"error\": null}}}]" jobId=b11a4709-a108-40d8-8f81-69a07e79ac94

Commit 6f69cac updated the sources which introduces a version of rpm-ostree that causes an issue we're aware of. The issue has been fixed but we've been waiting for it to reach Fedora 39 and CS9 repos before updating sources. But since edge tests aren't required for merging and we needed newer sources for other reasons, I suppose all our edge tests are failing on F39 and CS9 now.

Copy link

This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants