Rule request: Use Ascii #1999

iRon7 · 2024-04-30T08:29:45Z

Summary of the new feature

I remember the days behind my TRS-80, whose first versions only had a 6-bit character set of 64 characters.
A few years later, I extended the character set with 32 more (mainly lowercase) characters by soldering an additional chip piggybacked onto the original character set chip.

Nowadays, there are many code page extensions, resulting in thousands of characters. This is nice for human language support in output and comments, but it often causes issues in the code itself, especially since some non-ASCII characters that end up in code (e.g. smart quotes and em dashes) are generally unintended.

UseBOMForUnicodeEncodedFile

The UseBOMForUnicodeEncodedFile rule is quite useless if the author has no intention of using anything other than ASCII characters.
It only mentions that there is a non-ASCII character somewhere in the code, but where it resides is often a mystery.

Note that:

- Some non-ASCII characters are difficult to recognize (e.g. an em dash) or can't be recognized at all (e.g. a no-break space).
- The current version of VSCode highlights some extended characters, but not all (e.g. not double smart quotes or characters with diacritics).
- Some characters might result in a ParseError, which causes the parser (and anything that relies on it, such as PSScriptAnalyzer) to stop processing.
- There are different code page recommendations and defaults for Windows PowerShell and PowerShell (Core).
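Since such characters are hard to spot by eye, a plain regex scan can reveal them. The following is only a minimal sketch, assuming the script text is already available as a string; the sample text and the variable names are made up for illustration, and the en dash is built with [char] so the snippet itself stays ASCII-only.

```powershell
# Hypothetical sample: the second line uses an en dash (U+2013) instead of a hyphen.
$Text = "Write-Host 'ok'`nGet-ChildItem $([char]0x2013)Recurse"

$LineNumber = 0
$Found = foreach ($Line in $Text -split "`n") {
    $LineNumber++
    # Match every character outside the 7-bit ASCII range
    foreach ($Match in [regex]::Matches($Line, '[^\x00-\x7F]')) {
        [PSCustomObject]@{
            Line      = $LineNumber
            Column    = $Match.Index + 1      # 1-based column
            CodePoint = 'U+{0:X4}' -f [int][char]$Match.Value
        }
    }
}
$Found
```

This reports the line, column and code point of each offending character, which is the information the UseBOMForUnicodeEncodedFile warning leaves out.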

Human language vs programming language

Where humans might not even notice a difference between certain characters and continue to understand the contents, a parser or a program might react unexpectedly. (Take PSScriptAnalyzer with the suggested prototype as an example: Invoke-ScriptAnalyzer -CustomRulePath .\UseASCII.psm1 -ScriptDefinition "Write-Host 'coöperate'". Why does this work in PowerShell 7 but throw a "Cannot convert" error in Windows PowerShell?)
The argument "the whole file is checked without considering if it's actual code or not" makes some sense, but the main goal of a PowerShell script (.ps1) file is to run a script, and there are several other ways to deal with statements that require non-ASCII characters (usually for output only):

- As far as I know, there are no general cmdlet, method, or operator names that require non-ASCII characters.
- Variable names with non-ASCII characters should, in my opinion, always be avoided.
- (Output) strings that require non-ASCII characters might be substituted with:
  - "co`u{00F6}perate" (from PowerShell version 6) or
  - "co$([char]0x00F6)perate" (from PowerShell version 3)
- For non-ASCII characters in comments:
  - (Large) comments (e.g. comment-based help) might be moved to a separate (.md, .xml) file or referenced on the web (HelpUri=).
  - Comment in English.
  - Simply leave out the single non-ASCII characters or diacritics.
  - Suppress the analyzer warning.
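The string substitution mentioned above could be automated. Below is a rough sketch of that idea; ConvertTo-AsciiLiteral is a hypothetical helper name (not an existing cmdlet), and it emits the $([char]0x....) form because that works from PowerShell version 3 onwards.

```powershell
# Hypothetical helper: turns a string into an ASCII-only, double-quoted literal.
function ConvertTo-AsciiLiteral ([string]$Text) {
    $Parts = foreach ($Character in $Text.ToCharArray()) {
        if ([int]$Character -gt 0x7F) {
            # Replace the non-ASCII character with a subexpression
            '$([char]0x{0:X4})' -f [int]$Character
        }
        elseif ('"$`'.IndexOf($Character) -ge 0) {
            # Escape characters that are special inside double quotes
            '`' + $Character
        }
        else {
            [string]$Character
        }
    }
    '"' + (-join $Parts) + '"'
}

$Original = 'co' + [char]0x00F6 + 'perate'
$Literal  = ConvertTo-AsciiLiteral $Original
$Literal   # "co$([char]0x00F6)perate"
```

Evaluating the resulting literal (for example with Invoke-Expression on trusted input) should give back the original string, so the output of the script stays the same while the source file remains plain ASCII.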

Proposed technical implementation details (optional)

This proposed rule covers rule requests:

Prototype

To capture any non-ascii character:

- The AST parser might break due to a ParseError caused by specific characters in specific PowerShell versions.
- Even the Tokenize method can't be fully used, as it doesn't capture specific characters (e.g. smart quotes).

In my opinion, this means that the only way to capture all potentially undesired characters is to scan the full text:

PSUseASCII

#Requires -Version 3.0

function Measure-UseASCII {
<#
    .SYNOPSIS
    Use ASCII characters
    .DESCRIPTION
    Validates that only ASCII characters are used and reveals the position of any violation.
    .INPUTS
    [System.Management.Automation.Language.ScriptBlockAst]
    .OUTPUTS
    [Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.DiagnosticRecord]
#>

    [CmdletBinding()]
    [OutputType([Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.DiagnosticRecord])]
    Param (
        [Parameter(Mandatory = $true)]
        [ValidateNotNullOrEmpty()]
        [System.Management.Automation.Language.ScriptBlockAst]
        $ScriptBlockAst
    )
    Begin {
        function GetNonASCIIPositions ([String]$Text) {
            $LF  = [Char]0x0A
            $DEL = [Char]0x7F
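            # The column counter is incremented before each character is checked,
            # so it starts at 0 to make the first column of the first line 1.
            $LineNumber = 1; $ColumnNumber = 0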
            for ($Offset = 0; $Offset -lt $Text.Length; $Offset++) {
                $Character = $Text[$Offset]
                if ($Character -eq $Lf) {
                    $LineNumber++
                    $ColumnNumber = 0
                }
                else {
                    $ColumnNumber++
                    if ($Character -gt $Del) {
                        [PSCustomObject]@{
                            Character    = $Character
                            Offset       = $Offset
                            LineNumber   = $LineNumber
                            ColumnNumber = $ColumnNumber
                        }
                    }
                }
            }
        }

        function CharToHex([Char]$Char) {
            ([Int][Char]$Char).ToString('x4')
        }
        function SuggestedAscii([Char]$Char) {
            switch ([Int]$Char) {
                0x00A0 { ' ' }
                0x1806 { '-' }
                0x2010 { '-' }
                0x2011 { '-' }
                0x2012 { '-' }
                0x2013 { '-' }
                0x2014 { '-' }
                0x2015 { '-' }
                0x2016 { '-' }
                0x2212 { '-' }
                0x2018 { "'" }
                0x2019 { "'" }
                0x201A { "'" }
                0x201B { "'" }
                0x201C { '"' }
                0x201D { '"' }
                0x201E { '"' }
                0x201F { '"' }
                Default {
                    $Ascii = $Char.ToString().Normalize([System.text.NormalizationForm]::FormD)[0]
                    if ($Ascii -le 0x7F) { $Ascii } else { '_' }
                }

            }
        }
    }

    Process {
        # Like the AST parser, Tokenize doesn't capture (smart) quotes:
        # $Tokens = [System.Management.Automation.PSParser]::Tokenize($ScriptBlockAst.Extent.Text, [ref]$null)
        # $Violations = $Tokens.where{ $_.Content -cMatch '[^\x00-\x7F]' }
        $Violations = GetNonASCIIPositions $ScriptBlockAst.Extent.Text
        [Collections.Generic.List[Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.DiagnosticRecord]]@(
            Foreach ($Violation in $Violations) {
                $Text = $ScriptBlockAst.Extent.Text
                For ($i = $Violation.Offset - 1; $i -ge 0; $i--) { if ($Text[$i] -NotMatch '\w') { break } }
                $Start = $i + 1
                For ($i = $Violation.Offset + 1; $i -lt $Text.Length; $i++) { if ($Text[$i] -NotMatch '\w') { break } }
                $Length = $i - $Start
                $Word = $Text.SubString($Start, $Length)

                $StartPosition = [System.Management.Automation.Language.ScriptPosition]::new(
                    $Null,
                    $Violation.LineNumber,
                    $Violation.ColumnNumber,
                    $ScriptBlockAst.Extent.Text
                )
                $EndPosition = [System.Management.Automation.Language.ScriptPosition]::new(
                    $Null,
                    $Violation.LineNumber,
                    ($Violation.ColumnNumber + 1),
                    $ScriptBlockAst.Extent.Text
                )
                $Extent = [System.Management.Automation.Language.ScriptExtent]::new($StartPosition, $EndPosition)
                $Character = $Violation.Character
                $UniCode   = "U+$(CharToHex $Character)"
                $SuggestedAscii = SuggestedAscii $Character
                $AscCode   = "U+$(CharToHex $SuggestedAscii)"
                [Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.DiagnosticRecord]@{
                    Message              = "Non-ASCII character $UniCode found in: $Word"
                    Extent               = $Extent
                    RuleName             = 'PSUseASCII'
                    Severity             = 'Information'
                    RuleSuppressionID    = $Word
                    SuggestedCorrections = [System.Collections.ObjectModel.Collection[Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.CorrectionExtent]](
                        [Microsoft.Windows.PowerShell.ScriptAnalyzer.Generic.CorrectionExtent]::New(
                            $Violation.LineNumber,
                            $Violation.LineNumber,
                            $Violation.ColumnNumber,
                            ($Violation.ColumnNumber + 1),
                            $SuggestedAscii ,
                            "Replace '$Character' ($UniCode) with '$SuggestedAscii' ($AscCode)"
                        )
                    )
                }
            }
        )
    }
}
Export-ModuleMember -Function Measure-*
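The Default branch of SuggestedAscii above relies on Unicode canonical decomposition (FormD): a letter with a diacritic decomposes into a base letter followed by a combining mark, so the first decomposed character is often a usable ASCII suggestion. A small standalone sketch of that behavior, with code points chosen only for illustration:

```powershell
$FormD = [System.Text.NormalizationForm]::FormD

# U+00F6 (o with diaeresis), U+00E9 (e with acute), U+00DF (sharp s, no decomposition)
$Suggestions = foreach ($Code in 0x00F6, 0x00E9, 0x00DF) {
    $Decomposed = ([char]$Code).ToString().Normalize($FormD)
    $First      = $Decomposed[0]
    # Fall back to an underscore when the first character is still non-ASCII
    $Suggestion = if ([int]$First -le 0x7F) { $First } else { '_' }
    '{0:X4} -> {1}' -f $Code, $Suggestion
}
$Suggestions
# 00F6 -> o
# 00E9 -> e
# 00DF -> _
```

Characters without a canonical decomposition, such as the sharp s, end up as the underscore placeholder, which is why the explicit switch cases for dashes and quotes are still needed.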

"Spot the 10 non-ascii characters:"

<#
    .SYNOPSIS
    Use ASCII test
    .DESCRIPTION
    The main use of diacritics in Latin script is to change the sound-values of the letters to which they are added.
    Historically, English has used the diaeresis diacritic to indicate the correct pronunciation of ambiguous words,
    such as "coöperate", without which the <oo> letter sequence could be misinterpreted to be pronounced
#>

# [System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSUseAscii', 'coöperate')]
Param()

Write-Host “test” –ForegroundColor ‘Red’ -BackgroundColor ‘Green’
Write-Host 'No-break space'

Analyzer results

Invoke-ScriptAnalyzer -CustomRulePath .\UseASCII.psm1 .\Test.ps1

RuleName                            Severity     ScriptName Line  Message
--------                            --------     ---------- ----  -------
PSUseASCII                          Information  Test.ps1   7     Non-ASCII character U+00f6 found in: coöperate
PSUseASCII                          Information  Test.ps1   10    Non-ASCII character U+00f6 found in: coöperate
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+201c found in: “test
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+201d found in: test”
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+2013 found in: –ForegroundColor
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+2018 found in: ‘Red
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+2019 found in: Red’
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+2018 found in: ‘Green
PSUseASCII                          Information  Test.ps1   13    Non-ASCII character U+2019 found in: Green’
PSUseASCII                          Information  Test.ps1   14    Non-ASCII character U+00a0 found in: break space

Note that I have commented out the SuppressMessageAttribute in the example PowerShell file.
This is because of a known bug (#1686), which causes several of the following errors to occur:

Invoke-ScriptAnalyzer: Suppression Message Attribute error at line 10 in script definition : Cannot find any DiagnosticRecord with the Rule Suppression ID coöperate.

Also for this reason I would like to see a formal (disabled by default) rule for this.

What is the latest version of PSScriptAnalyzer at the point of writing: 1.22.0


SydneyhSmith · 2024-05-07T22:27:27Z

Thanks @iRon7 we'd love more community discussion on this issue

iRon7 added a commit to iRon7/PSRules that referenced this issue May 1, 2024

Use ASCII

b04b04d

See: PowerShell/PSScriptAnalyzer#1999

SydneyhSmith added Issue - Discussion Issue-Enhancement Issue - New Rule labels May 7, 2024
