Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GB18030] Fix "Unable to translate Unicode character" #10063

Merged
merged 5 commits into from Apr 25, 2024

Conversation

ilonatommy
Copy link
Member

@ilonatommy ilonatommy commented Apr 24, 2024

Fixes #10066
[GB18030] Failed to run the WebAPI project that named as GB18030 provided level3 strings

Context

OS: Windows 11 Enterprise 22H2 ZH-CN
Affected Build: 9.0.0-preview.4.24209.5, Aso repro on 8.0.300-preview.24203.14(8.0.2)

Steps to reproduce:

Use CLI to create WebAPI project copying GB18030 characters, for example:

dotnet new webapi -controllers -o 㐇𠁠𪨰𫠊𫦠𮚮⿕#
cd 㐇𠁠𪨰𫠊𫦠𮚮⿕#
dotnet run

Without this PR:

dotnet run
Building...
Restore complete (0.5s)
You are using a preview version of .NET. See: https://aka.ms/dotnet-support-policy
  㐇𠁠𪨰𫠊𫦠��⿕# failed with 2 error(s) (2.9s) → bin/Debug/net9.0/㐇𠁠𪨰𫠊𫦠��⿕#.dll
    /workspaces/runtime/.dotnet/sdk/9.0.100-preview.3.24204.13/Microsoft.Common.CurrentVersion.targets(5578,5): error MSB3491: Could not write lines to file "obj/Debug/net9.0/㐇𠁠𪨰𫠊𫦠��⿕#.csproj.FileListAbsolute.txt". Unable to translate Unicode character \\uD86E at index 2368 to specified code page.
    /workspaces/runtime/.dotnet/sdk/9.0.100-preview.3.24204.13/Microsoft.Common.CurrentVersion.targets(5823,5): error MSB3491: Could not write lines to file "obj/Debug/net9.0/㐇𠁠𪨰𫠊𫦠��⿕#.csproj.FileListAbsolute.txt". Unable to translate Unicode character \\uD86E at index 2368 to specified code page.

Build failed with 2 error(s) in 3.8s

The build failed. Fix the build errors and run again.

With this PR:

dotnet run
Building...
Restore complete (0.5s)
You are using a preview version of .NET. See: https://aka.ms/dotnet-support-policy
  㐇𠁠𪨰𫠊𫦠��⿕# succeeded (0.6s) → bin/Debug/net9.0/㐇𠁠𪨰𫠊𫦠��⿕#.dll

Build succeeded in 1.5s
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://localhost:5210
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /workspaces/reproGB/㐇𠁠𪨰𫠊𫦠��⿕#
^Cinfo: Microsoft.Hosting.Lifetime[0]
      Application is shutting down...

Changes Made

Changed the way we create a substring to prevent cutting surrogates in half.

Testing

Added a test for new substring function.

Regerssion

Yes, after #9346.

@ladipro
Copy link
Member

ladipro commented Apr 24, 2024

This change looks more like a workaround. Do we understand why \uD86E fails to be encoded to UTF-8? For what it's worth, this test program runs fine for me:

var defaultEncoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
File.WriteAllText("C:\\temp\\encoded.txt", "\u3407\ud840\udc60\ud86a\ude30\ud86e\udc0a\ud86e\udda0\ud879\udeae\u2fd5\u0023", defaultEncoding);

defaultEncoding is the default used by the WriteLinesToFile task and the string is equivalent to the problematic project name, as far as I can tell.

@ilonatommy
Copy link
Member Author

ilonatommy commented Apr 24, 2024

True, actually smallest repro does not help here either:

  <Target Name="MyTask" AfterTargets="Build">
    <WriteLinesToFile
      File="㐇𠁠𪨰𫠊𫦠𮚮⿕#\msEncoding.txt"
      Lines="㐇𠁠𪨰𫠊𫦠𮚮⿕#"
      Overwrite="true"/>
  </Target>

works.
Do you have an idea how to add a test to this repo to debug it better?

Edit:
it's comes down to a problem with MSBuildCopyMarkerName contents. You're right that the fix was not addressing the underlying issue. I will update the PR when I find the proper fix.

@ilonatommy
Copy link
Member Author

ilonatommy commented Apr 24, 2024

If we moved

<MSBuildCopyMarkerName Condition="'$(MSBuildCopyMarkerName.Length)' &gt; '17'">$(MSBuildProjectFile.Substring(0,8)).$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("X8"))</MSBuildCopyMarkerName>

to C#:

public sealed class ShortenAndHashProjectName : Task
    {
        [Required]
        public string ProjectName { get; set; }

        [Output]
        public string ShortProjectName { get; set; }

        public override bool Execute()
        {
            if (ProjectName.Length <= 17)
            {
                ShortProjectName = ProjectName;
                return true;
            }

            // if the last char of string is a surrogate, cutting it in half would confuse encoder
            int length = char.IsHighSurrogate(ProjectName[7]) ? 9 : 8;
            string truncatedProjectName = ProjectName.Substring(0, length);
            string originalProjectNameHash = StableStringHash(ProjectName);
            ShortProjectName = $"{truncatedProjectName}.{originalProjectNameHash}".ToString("X8");
            return true;
    }

it should fix the issue. MsBuild does not provide a way to detect surrogates and if the last char is a beginning of surrogate, it cuts it in half.
Why is shortening the project name necessary?

@ladipro
Copy link
Member

ladipro commented Apr 24, 2024

Tagging @JanKrivanek as this is an unfortunate regression introduced with the recent stable hash changes. Jan, do you think we can get away without shortening? if not, it would probably make sense to add a string shortening property function which would be safe with respect to surrogates.

@JanKrivanek
Copy link
Member

Oh good find!

Shortening was one of the requests that the change was addressing. Let's move it to prop function

@ladipro
Copy link
Member

ladipro commented Apr 24, 2024

@ilonatommy, rather than a task, we would prefer the functionality to be exposed as an intrinsic property function. Here's a PR where one was recently added: #9665

@ilonatommy
Copy link
Member Author

@JanKrivanek could you please take over? I'm not sure how to test the changes and how to access StableStringHash method correctly.

@ilonatommy ilonatommy removed the request for review from maririos April 24, 2024 15:20
Copy link
Member

@JanKrivanek JanKrivanek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thnak you!

@ilonatommy ilonatommy marked this pull request as ready for review April 25, 2024 08:05
Copy link
Member

@ladipro ladipro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@ladipro
Copy link
Member

ladipro commented Apr 25, 2024

The new prop function should be added to the doc page: https://learn.microsoft.com/en-us/visualstudio/msbuild/property-functions?view=vs-2022#msbuild-property-functions. Not urgent but we should make sure we don't forget.

@ladipro ladipro merged commit d4b9bd8 into dotnet:main Apr 25, 2024
10 checks passed
Comment on lines -391 to +392
<!-- For a long MSBuildProjectFile let's shorten this to 17 chars - using the first 8 chars of the filename and a filename hash. -->
<MSBuildCopyMarkerName Condition="'$(MSBuildCopyMarkerName.Length)' &gt; '17'">$(MSBuildProjectFile.Substring(0,8)).$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("X8"))</MSBuildCopyMarkerName>
<!-- For a long MSBuildProjectFile let's shorten this to 17 chars - using the first 8 codepoints of the filename and a filename hash. -->
<MSBuildCopyMarkerName Condition="'$(MSBuildCopyMarkerName.Length)' &gt; '17'">$([MSBuild]::SubstringByTextElements($(MSBuildProjectFile), 0, 8)).$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("X8"))</MSBuildCopyMarkerName>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are now text elements rather than codepoints. Which has consequences:

  • $(MSBuildCopyMarkerName.Length) can still be greater than 17 after this shortening.
  • If $(MSBuildProjectFile) is 👨‍👨‍👦‍👦👨‍👨‍👦‍👦.proj, then StringInfo.SubstringByTextElements(0, 8) will throw ArgumentOutOfRangeException. That string has 27 UTF-16 code units but only 7 text elements.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to have an [MSBuild]::SubstringByTextElements intrinsic function, then either it should check for out-of-range arguments itself, or there should be an [MSBuild]::CountTextElements function as well. Otherwise it's too difficult to use reliably in projects.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think what this really needs is a property function that returns all the text elements that entirely fit within the specified substring, with arguments given in UTF-16 code units. So that the resulting string never has more UTF-16 code units than requested. Because those are what matters for file name lengths in NTFS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are excellent points. We may get away with this functionality to be just a best-effort "make the file name reasonably long" but we definitely must fix the possible ArgumentOutOfRangeException and it's worth thinking about a more useful property function too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider moving the entire operation to a new, dedicated property function so we could just do

<MSBuildCopyMarkerName>$([MSBuild]::ShortUniqueProjectName())</MSBuildCopyMarkerName>

or similar?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still keep it general for other strings as well - there seems to be other usages in org on path args that might need revision: https://github.com/search?q=repo%3Adotnet%2Fsdk+path%3A**%2F*.targets+substring&type=code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GB18030] UpToDate marker file name shortening doesn't respect surrogate chars
5 participants