Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] clang -fuse-ld=lld output is low priority #102000

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jonpryor
Copy link
Member

@jonpryor jonpryor commented May 7, 2024

Fixes: #101727

Context: 410aa0a
Context: https://github.com/dotnet/msbuild/blob/1c3b240ce7417223672c62862a6ff7e884e6997a/src/Tasks/Exec.cs#L385
Context: https://github.com/dotnet/msbuild/blob/1c3b240ce7417223672c62862a6ff7e884e6997a/src/Shared/TaskLoggingHelper.cs#L1380

Imagine you want to build a NativeAot library for linux-bionic-arm64:

% dotnet new classlib -n HelloBionicSharedLib
% cd HelloBionicSharedLib
% dotnet publish -c Release -r linux-bionic-arm64 -p:PublishAotUsingRuntimePack=true -p:PublishAot=true

but your environment is wrong. Because your environment is wrong, it will error out. (It should error out!)

Are the generated errors useful?

% dotnet publish -c Release -r linux-bionic-arm64 -p:PublishAotUsingRuntimePack=true -p:PublishAot=true
…
clang : error : invalid linker name in argument '-fuse-ld=lld'
$HOME/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.4/build/Microsoft.NETCore.Native.Unix.targets(236,5): error :
Symbol stripping tool ('llvm-objcopy' or 'objcopy') not found in PATH. Try installing appropriate package for llvm-objcopy or objcopy to resolve the problem or set the StripSymbols property to false to disable symbol stripping.

There are two errors here: one from clang, and one from the microsoft.dotnet.ilcompiler NuGet package. The first one from clang is useless ("what does it mean?! how do I fix it?!"), and the latter message is the useful error message.

Additionally, the first error shouldn't even be emitted! It comes from commit 410aa0a, which attempts to get the lld version:

clang -fuse-ld=lld -Wl,--version

This check is supposed to be optional; if it fails, that's "fine."" The wrapping <Exec/> captures stdout and the error code, and wants the output to be Low priority so that it isn't be shown by default.

The problem is that it wasn't really fine; if there is no lld, then the above command will fail:

% clang -fuse-ld=lld -Wl,--version
[stderr] clang: error: invalid linker name in argument '-fuse-ld=lld'
[exit code is 1]

and the default "warning error format" checks for the strings error and warning. Meaning because
clang -fuse-ld=lld -Wl,--version writes a string containing error to stderr, that is converted into an MSBuild error.

Because this invocation is supposed to be optional, update the <Exec/> to set Exec.IgnoreStandardErrorWarningFormat=true and Exec.StandardErrorImportance=Low:

<Exec Command="&quot;$(CppLinker)&quot; -fuse-ld=lld -Wl,--version"
    …
    IgnoreStandardErrorWarningFormat="true"
    StandardErrorImportance="Low"
/>

This ensures that the error messages from clang are not reported as errors, removing the unnecessary error message from the build.

Review the other <Exec/> invocations and set
IgnoreStandardErrorWarningFormat="true" and
StandardErrorImportance="Low" when Exec.IgnoreExitCode=true or Exec.StandardOutputImportance=Low.

Fixes: dotnet#101727

Context: 410aa0a
Context: https://github.com/dotnet/msbuild/blob/1c3b240ce7417223672c62862a6ff7e884e6997a/src/Tasks/Exec.cs#L385
Context: https://github.com/dotnet/msbuild/blob/1c3b240ce7417223672c62862a6ff7e884e6997a/src/Shared/TaskLoggingHelper.cs#L1380

Imagine you want to build a NativeAot library for linux-bionic-arm64:

	% dotnet new classlib -n HelloBionicSharedLib
	% cd HelloBionicSharedLib
	% dotnet publish -c Release -r linux-bionic-arm64 -p:PublishAotUsingRuntimePack=true -p:PublishAot=true

…*but your environment is wrong*.  Because your environment is wrong,
it will error out.  (It *should* error out!)

Are the generated errors *useful*?

	% dotnet publish -c Release -r linux-bionic-arm64 -p:PublishAotUsingRuntimePack=true -p:PublishAot=true
	…
	clang : error : invalid linker name in argument '-fuse-ld=lld'
	$HOME/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.4/build/Microsoft.NETCore.Native.Unix.targets(236,5): error :
	Symbol stripping tool ('llvm-objcopy' or 'objcopy') not found in PATH. Try installing appropriate package for llvm-objcopy or objcopy to resolve the problem or set the StripSymbols property to false to disable symbol stripping.

There are two errors here: one from `clang`, and one from the
`microsoft.dotnet.ilcompiler` NuGet package.  The first one from
`clang` is useless ("what does it mean?! how do I fix it?!"),
and the latter message is the *useful* error message.

Additionally, the first error *shouldn't even be emitted*!  It comes
from commit 410aa0a, which attempts to get the `lld` version:

	clang -fuse-ld=lld -Wl,--version

This check is supposed to be *optional*; if it fails, that's "fine.""
The wrapping `<Exec/>` captures stdout and the error code, and wants
the output to be Low priority so that it isn't be shown by default.

The problem is that it wasn't *really* fine; if there is no `lld`,
then the above command will fail:

	% clang -fuse-ld=lld -Wl,--version
	[stderr] clang: error: invalid linker name in argument '-fuse-ld=lld'
	[exit code is 1]

and the [default "warning error format"][0] checks for the strings
`error` and `warning`.  Meaning *because*
`clang -fuse-ld=lld -Wl,--version` writes a string containing `error`
to stderr, that is converted into an MSBuild error.

Because this invocation is supposed to be optional, update the
`<Exec/>` to set [`Exec.IgnoreStandardErrorWarningFormat`][1]=true
and [`Exec.StandardErrorImportance`][2]=Low:

	<Exec Command="&quot;$(CppLinker)&quot; -fuse-ld=lld -Wl,--version"
	    …
	    IgnoreStandardErrorWarningFormat="true"
	    StandardErrorImportance="Low"
	/>

This ensures that the error messages from `clang` are *not* reported
as errors, removing the unnecessary error message from the build.

Review the other `<Exec/>` invocations and set
`IgnoreStandardErrorWarningFormat="true"` and
`StandardErrorImportance="Low"` when `Exec.IgnoreExitCode`=true
or `Exec.StandardOutputImportance`=Low.

[0]: https://github.com/dotnet/msbuild/blob/1c3b240ce7417223672c62862a6ff7e884e6997a/src/Shared/CanonicalError.cs#L77
[1]: https://learn.microsoft.com/dotnet/api/microsoft.build.tasks.exec.ignorestandarderrorwarningformat?view=msbuild-17-netcore
[2]: https://learn.microsoft.com/dotnet/api/microsoft.build.utilities.tooltask.standarderrorimportance?view=msbuild-17-netcore
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 7, 2024
Copy link
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

agocke
agocke previously approved these changes May 8, 2024
Copy link
Member

@agocke agocke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@am11
Copy link
Member

am11 commented May 8, 2024

This check is supposed to be optional; if it fails, that's "fine."" The wrapping <Exec/> captures stdout and the error code, and wants the output to be Low priority so that it isn't be shown by default.

This check is not optional. It is needed to ensure that lld versions > 12 includes linker script to get a successful build.

If this PR is enabling cross-building for linux-bionic on macOS, then we need to treat it as such; "enabler" PR. First, these errors are real and they are supposed to fail the build fast (early on). Lets examine it:

  1. Why lld version check is needed?

Since clang 13 on linux and freebsd, we need sections.ld version script to keep symbols like __modules. Without it, we were running into these errors: #84493.

  1. Why llvm-objcopy condition is necessary?

Because it's only logical to ensure we have a working stripping tool when StripSymbols is enabled. You can turn off StripSymbols in your project if you don't have the stripping tool on the build machine.

I think we should keep the current behavior as we generally do not support cross-os-build (but only the cross-arch-build) for NativeAOT. linux-bionic is an exception because Android NDK provides a complete toolchain (compiler, libs and the whole nine) and we just invoke NDK toolchain. So we should figure out why it's not working as expected on macOS. You always have option to use cross-compile on linux in your setup (we have several docker images are available to give you a head start).

cc @jkotas, @sbomer, who may also remember some details.

@agocke agocke dismissed their stale review May 8, 2024 21:19

re-evaluating

@agocke
Copy link
Member

agocke commented May 8, 2024

Is it possible that some of these checks should be blocking and some shouldn't be?

For instance,

<Exec Command="command -v &quot;$(Xcrun)&quot;" IgnoreExitCode="true" StandardOutputImportance="Low" StandardErrorImportance="Low" IgnoreStandardErrorWarningFormat="true" Condition="'$(_IsiOSLikePlatform)' == 'true'">
      <Output TaskParameter="ExitCode" PropertyName="_WhereXcrun" />
    </Exec>
    <Error Condition="'$(_WhereXcrun)' != '0' and '$(_IsiOSLikePlatform)' == 'true'"
      Text="'$(Xcrun)' not found in PATH. Make sure '$(Xcrun)' is available in PATH." />

It doesn't seem like a problem if that first Exec fails, since the job of the Error is to produce the appropriate failure.

Copy link
Member

@MichalStrehovsky MichalStrehovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know when grep/command -v would write to stderr? Wondering under what circumstances these are needed.

Otherwise LGTM, thanks!

<Output TaskParameter="ExitCode" PropertyName="_WhereXcrun" />
</Exec>
<Error Condition="'$(_WhereXcrun)' != '0' and '$(_IsiOSLikePlatform)' == 'true'"
Text="'$(Xcrun)' not found in PATH. Make sure '$(Xcrun)' is available in PATH." />

<Exec Command="&quot;$(Xcrun)&quot; --sdk $(_AppleSdkName) --show-sdk-path" IgnoreExitCode="true" StandardOutputImportance="Low" Condition="'$(SysRoot)' == '' and '$(_IsiOSLikePlatform)' == 'true'" ConsoleToMsBuild="true">
<Exec Command="&quot;$(Xcrun)&quot; --sdk $(_AppleSdkName) --show-sdk-path" IgnoreExitCode="true" StandardOutputImportance="Low" StandardErrorImportance="Low" IgnoreStandardErrorWarningFormat="true" Condition="'$(SysRoot)' == '' and '$(_IsiOSLikePlatform)' == 'true'" ConsoleToMsBuild="true">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit suspicious about the pre-existing IgnoreExitCode here. @ivanpovazan do we want to ignore exit code here or was it just a copy-paste mistake? If we want ignored exit code, the addition looks reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems like a copy-paste mistake (good catch, thanks!).

If this command fails, we shouldn't be continuing the build as SysRoot has to be resolved when '$(_IsiOSLikePlatform)' == 'true'.

I suggest we remove IgnoreExitCode="true" here.

@MichalStrehovsky
Copy link
Member

If this PR is enabling cross-building for linux-bionic on macOS, then we need to treat it as such; "enabler" PR.

Cross-building for linux-bionic from macOS already works. The top post in this PR describes how this was hit - if you have a misconfigured toolchain (the clang on the PATH is not the clang from the bionic SDK), on a mac you'll hit the codepath where clang -fuse-ld=lld -Wl,--version will return a non-zero exit code and prints an error message to stderr. Writing to error to stderr leads to msbuild promoting to an error and also relogging the message from clang. We don't want the message in non-diagnostic msbuild logs because we handle the non-zero exit code. We also don't want Exec task to be treated as error when error was written to stderr.

The fix/bug itself is not bionic specific - this codepath is hittable on Linux-targeting-Linux as well. It just happened to be found with misconfigured toolchain on mac targeting bionic.

@am11
Copy link
Member

am11 commented May 9, 2024

@MichalStrehovsky, macOS has ld64 linker which is used by AppleClang. If lld is needed for cross-build, then the stock toolchain needs to be installed from homebrew. It does not hit on linux because we have appropriate conditions in place. Ignoring these errors is not the right thing to do here. These errors were intentional as I give a breakdown in my previous comment.

@MichalStrehovsky
Copy link
Member

@MichalStrehovsky, macOS has ld64 linker which is used by AppleClang. If lld is needed for cross-build, then the stock toolchain needs to be installed from homebrew. It does not hit on linux because we have appropriate conditions in place. Ignoring these errors is not the right thing to do here. These errors were intentional as I give a breakdown in my previous comment.

This PR is touching more lines than was strictly needed for this bug. The discussion is about this line:

<Exec Command="&quot;$(CppLinker)&quot; -fuse-ld=lld -Wl,--version" Condition="'$(LinkerFlavor)' == 'lld'" IgnoreExitCode="true" StandardOutputImportance="Low" ConsoleToMSBuild="true">
<Output TaskParameter="ExitCode" PropertyName="_LinkerVersionStringExitCode" />
<Output TaskParameter="ConsoleOutput" PropertyName="_LinkerVersionString" />
</Exec>

Currently if clang doesn't understand -fuse-ld=lld MSBuild will print clang: error: invalid linker name in argument '-fuse-ld=lld' and it will also show as an error due to the presence of error string token. Do you consider that by design? Maybe I don't understand why we check/handle the non-zero exit code (which should already cover the command failing for whatever reason - the extra output to stderr just looks like a diagnostic bonus).

@am11
Copy link
Member

am11 commented May 9, 2024

I think we would need:

-  <Exec Command="&quot;$(CppLinker)&quot; -fuse-ld=lld -Wl,--version" Condition="'$(LinkerFlavor)' == 'lld'" IgnoreExitCode="true" StandardOutputImportance="Low" ConsoleToMSBuild="true"> 
+ <Exec Command="&quot;$(CppLinker)&quot; -fuse-ld=lld -Wl,--version 2&gt;/dev/null" Condition="'$(LinkerFlavor)' == 'lld'" IgnoreExitCode="true" StandardOutputImportance="Low" ConsoleToMSBuild="true"> 

to match what we use in other places, e.g.

command -v ninja 2>/dev/null || command -v ninja-build 2>/dev/null || { echo "Unable to locate ninja!"; exit 1; }

(it will still captures the exit code)

Reminds me of this feature request opened a while back: dotnet/msbuild#7970 to idiomatically filter out selective errors in <Exec, but in the meanwhile we can control it from the executing script.

@MichalStrehovsky
Copy link
Member

I think we would need:

What's the difference between that and adding StandardErrorImportance="Low" IgnoreStandardErrorWarningFormat="true"? Doing it the MSBuild way lets one see stderr in binlogs still, whereas if it goes to /dev/null, it's gone.

@am11
Copy link
Member

am11 commented May 9, 2024

The current error mode of -fuse-ld=lld -Wl,--version is: lld is not installed and lld was requested. If we don't like the error message and want to print something nicer, then we will need to add an error based on _LinkerVersionStringExitCode, otherwise we will run into other errors later #84493. In this case 2>/dev/null makes sense. But, if we are trying to escape this error mode as a whole, which I think this PR is really about (and lack of 2>/dev/null/IgnoreStandardErrorWarningFormat was gating it) we shouldn't be doing that because that is problematic for other cross build combinations.

@MichalStrehovsky
Copy link
Member

We currently have this handling:

  • The command exits with zero and we have the requested version on stdout
  • The command exits with non-zero and nothing is printed to stderr. We ignore this because of IgnoreExitCode=true. We then never set _LinkerVersion. Now that I look at it, that looks like a bug.
  • The command exits with non-zero and error is printed to stderr. MSBuild logs this as an error in the build log, but apparently proceeds further based on the failure mode description in the top post (MSBuild only stopped because it couldn't find objcopy later, but that's unrelated; if there was objcopy, we'd probably take the unhandled "_LinkerVersion is empty" codepath).

Is the bug in IgnoreExitCode=true then?

@am11
Copy link
Member

am11 commented May 9, 2024

The command exits with non-zero and nothing is printed to stderr.

I think this does not happen because we always get an error in case lld was not located by clang driver, but requested by user either explicitly (or e.g. on FreeBSD, where it's set as default). Yes, this is not the explicit way of handing it and we are relying on MSBuild's error-detection behavior. As it stands in main branch, msbuild will error out.

The command exits with non-zero and error is printed to stderr. MSBuild logs this as an error in the build log, but apparently proceeds further based on the failure mode description in the top post (MSBuild only stopped because it couldn't find objcopy later, but that's unrelated;

objcopy has separate conditions which select llvm-objcopy for llvm toolchain and falls back to binutil's objcopy. We had a discussion to enhance this in order to align with our cmake implementation which searches for exact toolchain matching objcopy, e.g. llvm-objcopy-18 (convention followed by all major linux distros) and not just the un-suffixed one on PATH llvm-objcopy (convention followed by our manually compiled toolchain in CBL-mariner/AzureLinux); but our locate_toolchain_exec() cmake function handles both:

locate_toolchain_exec(objcopy CMAKE_OBJCOPY NO)

At the end it was getting too complicated so we ended up handling it differently by providing ObjCopyName msbuild property so user can set toolchain specific names .

Back on _LinkerVersion front, if it was empty because lld was not installed (but requested), the clang build command will error out eventually (invalid linker name in argument '-fuse-ld=lld' or some such).


If linux-bionic compilation works with ld64 only choice provided by AppleClang (which is a bit surprising), we can extend the condition from Condition="'$(LinkerFlavor)' == 'lld'" to Condition="'$(LinkerFlavor)' == 'lld' and '$(_IsApplePlatform)' != 'true'". Otherwise, it seems we have a few options:

@MichalStrehovsky
Copy link
Member

I'll ask this way: what would break if we delete the IgnoreExitCode=true part and related _LinkerVersionStringExitCode logic? We were only sent on the StandardErrorImportance="Low" path because it looked like we're okay if this Exec fails, but it doesn't look like we are okay with it failing. The build currently doesn't stop because of IgnoreExitCode, it continues to whatever is the next failure and then reports both failures that may or may not be related.

What we want to fix is for the build to not fail with two different errors at the same time.

It looked like we need to suppress error message from a handled error condition, but it looks like the error condition is not actually handled and should fail the build instead.

@am11
Copy link
Member

am11 commented May 9, 2024

It looked like we need to suppress error message from a handled error condition, but it looks like the error condition is not actually handled and should fail the build instead.

Yes, we are not subsequently checking for error and that result in odd looking error message later because it's ultimately a no-go. e.g. on Ubuntu 18 with clang-7 and lld-7 installed, this is what we get with LinkerFlavor=lld:

$ dotnet9 publish -p:PublishAot=true -p:LinkerFlavor=lld | tee
...
  wapi1 -> /wapi1/bin/Release/net9.0/linux-arm64/wapi1.dll
clang : error : invalid linker name in argument '-fuse-ld=lld' [/wapi1/wapi1.csproj]
  Generating native code
/root/.nuget/packages/microsoft.dotnet.ilcompiler/9.0.0-preview.5.24256.1/build/Microsoft.NETCore.Native.targets(349,96): error MSB4086: A numeric comparison was attempted on "$(_LinkerVersion)" that evaluates to "" instead of a number, in condition "'$(LinkerFlavor)' == 'lld' and '$(_LinkerVersion)' > '12'". [/wapi1/wapi1.csproj]

We can either:

  • keep IgnoreExitCode=true and introduce an explicit error to the effect "Error: lld was requested but not found by $(CppLinker). Set <LinkerFlavor> property in your project to the name of available linker on your system." (preferred option IMO)
  • or treat '$(_LinkerVersion)' == '' as '$(_LinkerVersion)' == '0' (which will still fail the 'clang' command, so not very robust option)

@sbomer
Copy link
Member

sbomer commented May 10, 2024

Sorry I'm a bit late to the discussion. Just to clarify a few points that @am11 already made (I'm only talking about the linker version check, not sure if this applies to the other checks you've touched):

  • If lld is requested but not available, the build should fail.
  • If lld is requested and available, we use the version to determine whether to pass some extra arguments.

Which of these issues are you intending to solve @jonpryor?

  1. Multiple errors are reported (unavailable lld and some downstream, potentially unrelated, errors)
  2. The error message about missing lld is not friendly

If we only want to solve 1) then I'd suggest removing IgnoreExitCode=true (and adding IgnoreStandardErrorWarningFormat="true" as you have) so that only one error is reported when lld is unavailable. To solve 2) we would want to keep IgnoreExitCode=true, and instead report a custom error message when the exit code is non-zero.

@MichalStrehovsky
Copy link
Member

Reading through #101727 I think the main point was about "1. Multiple errors are reported (unavailable lld and some downstream, potentially unrelated, errors)". If we switch this to IgnoreExitCode=false, the failure mode will change from:

clang : error : invalid linker name in argument '-fuse-ld=lld'
$HOME/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.4/build/Microsoft.NETCore.Native.Unix.targets(236,5): error :
Symbol stripping tool ('llvm-objcopy' or 'objcopy') not found in PATH. Try installing appropriate package for llvm-objcopy or objcopy to resolve the problem or set the StripSymbols property to false to disable symbol stripping.

(First line has an error message but absolutely zero context about it and it makes it look like the subsequent line is context)

To:

clang : error : invalid linker name in argument '-fuse-ld=lld'
$HOME/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.4/build/Microsoft.NETCore.Native.targets(XXX): error MSB3073: The command ""clang" XXXXX" exited with code 1.

This is enough for people to start troubleshooting because the message is now relevant to the failing MSBuild line.

I would not go as far as "2. The error message about missing lld is not friendly" because that has other questions like "why don't we do similar detection when the linker is not lld?", "why don't we add detection for other misconfigured toolchains - like trying to target MUSL but using a GLIBC toolchain, or missing libz-dev, or any of the many ways the native toolchain could be misconfigured". We have to rely on an externally supplied toolchain that could be in all sorts of bad state for what the user wants to do. I worry about adding various handling for this because we could introduce bugs in it and things that would compile fine will get blocked because we wanted to go above and beyond with messages and detect things that could be wrong. The message we produce could also be just misleading because we don't really know what's wrong.

@am11
Copy link
Member

am11 commented May 13, 2024

/root/.nuget/packages/microsoft.dotnet.ilcompiler/9.0.0-preview.5.24256.1/build/Microsoft.NETCore.Native.targets(349,96): error MSB4086: A numeric comparison was attempted on "$(_LinkerVersion)" that evaluates to "" instead of a number, in condition "'$(LinkerFlavor)' == 'lld' and '$(_LinkerVersion)' > '12'". [/wapi1/wapi1.csproj]

It's better to handle the lld<v13 case explicitly.

@MichalStrehovsky
Copy link
Member

I've extracted the "don't IgnoreExitCode for LLD version check" part to #102278.

MichalStrehovsky added a commit that referenced this pull request May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-NativeAOT-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
6 participants