Running indentation fixes adds whitespace to various code segments #1304

tunetheweb · 2021-08-20T14:48:07Z

As a follow on from #1302 and #1303 (and likely related to #1149 and possibly others) I have noticed that when running sqlfluff fix on SQL which requires indentation fixes, the extra white space is added to the code, rather than as addition whitespace segments.

So the function_name_identifier of UNNEST can become ••••UNNEST (where • represents a space).

This can then cause unexpected consequences when the next rule runs in that test as suddenly the parsed tree is different than what it should be.

Expected Behaviour

Whitespace should not be added to other identifiers/segments as this can break expectations.

Observed Behaviour

Whitespace is added other identifiers/segments, resulting in the next rule in that run potentially not running properly.

Steps to Reproduce

Add the below print statement to src/sqlfluff/core/rules/analysis/select.py:

    for function_name in table_expr.recursive_crawl("function_name"):
        print('LOOK AT THIS:%s:' % function_name.raw.lower())
        # Other rules can increase whitespace in the function name, so use strip to remove
        # See: https://github.com/sqlfluff/sqlfluff/issues/1304
        if function_name.raw.lower() in dialect.sets("value_table_functions"):
            return True
    return False

Create a test.sql file with this:

SELECT
    category,
    value
FROM
    table1,
UNNEST(1, 2, 3) AS category

Then run this command:

sqlfluff fix --dialect bigquery test.sql --fixed-suffix FIXED --force --rules L003,L025

And note that even though we are only printing the function name, the spaces are included:

==== finding fixable violations ====
LOOK AT THIS:    unnest:
LOOK AT THIS:    unnest:
LOOK AT THIS:    unnest:
...etc.

I've confirmed it's not a bigquery or function_name specific issue as using this SQL:

SELECT
    category,
    value
FROM
    table1,
table2

And adding this to the _has_value_table_function:

    for table_ref in table_expr.recursive_crawl("table_reference"):
        print("LOOK AT THIS:%s:" % table_ref.raw)

Then running this command:

sqlfluff fix test.sql --fixed-suffix FIXED --force --rules L003,L025

Leads to the same issue:

==== finding fixable violations ====
LOOK AT THIS:table1:
LOOK AT THIS:    table2:
LOOK AT THIS:table1:
LOOK AT THIS:    table2:
== [test.sql] FAIL
L:   6 | P:   1 | L003 | Indentation not consistent with line #5

Dialect

all

Version

Include the output of sqlfluff --version along with your Python version

0.6.3 - python 3.8

Configuration

No .sqlfluff, so default config.

The text was updated successfully, but these errors were encountered:

barrywhart · 2021-08-22T20:36:50Z

@alanmcruickshank: I wonder if you may have any insights on this issue. After some initial investigation, I have learned that rule L003 is adding spaces before the UNNEST here, around line 175:

        # If we don't have any indent and we should, then add a single
        elif len("".join(elem.raw for elem in current_indent_buffer)) == 0:
            fixes = [
                LintFix(
                    "create",
                    current_anchor,
                    WhitespaceSegment(
                        raw=desired_indent,
                    ),
                )
            ]

The problem is, the WhitespaceSegment is being added to an area of the parse tree that shouldn't contain whitespace -- a FunctionNameSegment. Inside LintedFile.fix_string(), the updated tree contains:

(Pdb) self.tree.segments[0].segments[0].segments[3].segments[-1].segments[1].segments[0].segments[0].segments[0]
<FunctionNameSegment: ([L:  6, P:  1])>
(Pdb) [(s.type, s.raw) for s in self.tree.segments[0].segments[0].segments[3].segments[-1].segments[1].segments[0].segments[0].segments[0].segments]
[('whitespace', '    '), ('raw', 'UNNEST')]

If I understand correctly, it shouldn't be possible for a FunctionNameSegment to contain a whitespace segment (this is from dialect_ansi.py):

@ansi_dialect.segment()
class FunctionNameSegment(BaseSegment):
    """Function name, including any prefix bits, e.g. project or schema."""

    type = "function_name"
    match_grammar = Sequence(
        # Project name, schema identifier, etc.
        AnyNumberOf(
            Sequence(
                Ref("SingleIdentifierGrammar"),
                Ref("DotSegment"),
            ),
        ),
        # Base function name
        OneOf(
            Ref("FunctionNameIdentifierSegment"),
            Ref("QuotedIdentifierSegment"),
        ),
        allow_gaps=False,
    )

My question is, would you consider this a bug in L003 for adding a whitespace segment somewhere it's not allowed, or is it a bug in the core linter for not inserting the whitespace at the correct location in the parse tree.

For reference, this parse output shows where the whitespace should be, i.e. between the two from_expressions, rather than placing it 5 levels deeper:

from_expression
[META] indent
from_expression_element
table_expression
function
function_name

sqlfluff parse --dialect bigquery ./test/fixtures/linter/autofix/bigquery/005_unnest_spacing/after.sql
[L:  1, P:  1]      |file:
[L:  1, P:  1]      |    statement:
[L:  1, P:  1]      |        select_statement:
[L:  1, P:  1]      |            select_clause:
...
[L:  4, P:  1]      |            from_clause:
[L:  4, P:  1]      |                keyword:                                      'FROM'
[L:  4, P:  5]      |                newline:                                      '\n'
[L:  5, P:  1]      |                whitespace:                                   '    '
[L:  5, P:  5]      |                from_expression:
...
[L:  5, P: 11]      |                comma:                                        ','
[L:  5, P: 12]      |                newline:                                      '\n'
[L:  6, P:  1]      |                whitespace:                                   '    '  # <<<--- HERE
[L:  6, P:  5]      |                from_expression:
[L:  6, P:  5]      |                    [META] indent:
[L:  6, P:  5]      |                    from_expression_element:
[L:  6, P:  5]      |                        table_expression:
[L:  6, P:  5]      |                            function:
[L:  6, P:  5]      |                                function_name:
[L:  6, P:  5]      |                                    function_name_identifier:  'UNNEST'
[L:  6, P: 11]      |                                bracketed:
[L:  6, P: 11]      |                                    start_bracket:            '('
[L:  6, P: 12]      |                                    [META] indent:
[L:  6, P: 12]      |                                    expression:
[L:  6, P: 12]      |                                        literal:              '1'
[L:  6, P: 13]      |                                    comma:                    ','
[L:  6, P: 14]      |                                    whitespace:               ' '
[L:  6, P: 15]      |                                    expression:
[L:  6, P: 15]      |                                        literal:              '2'
[L:  6, P: 16]      |                                    comma:                    ','
[L:  6, P: 17]      |                                    whitespace:               ' '
[L:  6, P: 18]      |                                    expression:
[L:  6, P: 18]      |                                        literal:              '3'
[L:  6, P: 19]      |                                    [META] dedent:
[L:  6, P: 19]      |                                    end_bracket:              ')'
[L:  6, P: 20]      |                        whitespace:                           ' '
[L:  6, P: 21]      |                        alias_expression:
[L:  6, P: 21]      |                            keyword:                          'AS'
[L:  6, P: 23]      |                            whitespace:                       ' '
[L:  6, P: 24]      |                            identifier:                       'category'
[L:  6, P: 32]      |                    [META] dedent:
[L:  6, P: 32]      |    newline:                                                  '\n'

alanmcruickshank · 2021-08-24T13:30:54Z

I think this is the latter, i.e. bug in ... core linter for not inserting the whitespace at the correct location in the parse tree. In theory the fixing routines should know where they can and can't add whitespace. I don't think individual rules should have to worry about that level of restriction.

I do wonder whether they're a bit naive to do this effectively at the moment and could do with some work.

The way I think it should work is that the position of UNNEST is effectively the same as the position of from_expression because they start at the same point. The fixing routine should evaluate from_expression first (as it's higher in the tree) and then consume the fix at that point, inserting the whitespace at the correct point, rather than waiting until we get to UNNEST.

I wonder whether the fixing routines are being too selective though and not matching on from_expression properly and instead only matching when we hit UNNEST. 🤔

That leaves us with two options I think:

Revisit the matching of fixes to segments. This will definitely have changed when I did the refactor of positioning a few months ago - so it's very possible there's a bug in here which can be fixed to achieve the desired result.
If we find that we can't initially position the whitespace in the right place in the first pass (without breaking something else) - then we could also do a second pass through the tree after any fixes are applied to "promote" and inappropriately positioned whitespace up to the correct point in the tree. It's a bit less neat - but could also work quite well.

barrywhart · 2021-08-25T00:22:15Z

I wonder whether the fixing routines are being too selective though and not matching on from_expression properly and instead only matching when we hit UNNEST.

I think this is exactly the behavior. I did some study of the code, and it's literally checking for object identity, I.e. what specific segment object did the lint rule specify as the anchor in the LintFix object. It's not considering the character position at all.

Will it always be correct to apply the fix at the highest possible location in the parse tree? 🤔

tunetheweb · 2021-08-27T11:55:22Z

@barrywhart / @alanmcruickshank I did some more digging on this (didn't get the answer yet btw before you get too excited!)

First of all I confirmed it was affecting all dialects and wasn't anything to do with the special case for UNNEST. I updated the issue above to show a simple use case in ANSI with two tables in FROM statement.

Secondly it broke in 0.5.3 - it works in 0.5.2. But can't see anything obvious in the 0.5.3 release notes to explain this: https://github.com/sqlfluff/sqlfluff/releases/tag/0.5.3

Here's the diffs: 0.5.2...0.5.3

tunetheweb · 2021-08-27T12:00:51Z

OK this is the commit that broke it: 99f8bc6

Unfortunately it's quite a big one so a lot of change to go through :-(

tunetheweb · 2021-08-27T14:03:43Z

OK so 99f8bc6 broke it for UNNEST statement, but it was still broken for regular table from clauses (the second example in first issue) prior to that (and I think possibly forever to be honest!)

But the change in that commit does give us some further clues. Basically using a simple structure works and prevents whitespaces being added in the wrong place:

    FunctionNameSegment=RegexParser(
        r"[A-Z][A-Z0-9_]*",
        CodeSegment,
        name="function_name",
        type="function_name",
    ),

But using a more complex structure class doesn't prevent this and so gives the bug:

@ansi_dialect.segment()
class FunctionNameSegment(BaseSegment):
    """Function name."""

    type = "function_name"
    match_grammar = RegexParser(
        r"[A-Z_][A-Z0-9_]*",
        CodeSegment,
    )

Shouldn't these basically be the same?

This basically changed in that commit for FunctionNameSegment hence why it broke UNNEST (which is basically a function). However tables were already a complex class so they were broken already by the time of that commit.

Getting kind of stuck with this now so any pointers greatly appreciated!

barrywhart · 2021-08-30T21:10:57Z

The SQLFluff parser makes a distinction between segments and grammars, with grammars being a lower-level thing. IIUC, grammars behave a bit like macros in C; they're more like a preprocessor thing in that they do not appear in the final parse tree; only the underlying things do. OTOH, segments do appear in the parse tree. So in the second case, the parse tree would have both a function_name and the underlying string nested under it.

Your analysis helps make sense of this. In the above case, the grammar only contains one segment.

I don't know exactly how we want to fix this, but one possible idea is that when applying fixes, SQLFluff should treat the lowest-level segments in the tree as "atomic": i.e., don't make changes within segments; instead, move up the parse tree and apply the fix "adjacent to" the segment. This is a heuristic and may not avoid all possible problems, but I think it would reduce the likelihood of problems.

I think the bug is located here: It's applying the fix exactly on the anchor, without considering the structural heuristic above.

                        # Look for identity not just equality.
                        # This handles potential positioning ambiguity.
                        if f.anchor is seg:

To be clear, I haven't worked on this area of the code and am not familiar with it. But when I was investigating this issue recently by looking through code with the test case in this issue, this is the area I found where it was applying the fix and it made me think, this needs to be smarter.

I see some code in this file (linted_file.py) that appears to apply some policies about how or whether to apply certain fixes. I don't know if or how this code relates to the file above. See the comments for the fix_string() function.

barrywhart · 2021-10-16T00:18:38Z

@alanmcruickshank: Assigning this to you for a closer look, if you don't mind. I found another instance of this issue today: #1668.

tunetheweb · 2022-02-05T15:04:29Z

So the problem is we want to add whitespace in a different section of the parse tree than we’re currently analysing. The white space is inserted in the correctly place in the SQL, but not in the parse tree (which has several layers of abstraction applied on top of that SQL).

While it would be ideal to insert it in the right place where possible, sometimes (like the example here), that’s basically not possible (or at least very tricky!).

So I wonder if instead of trying to solve that, we should instead just reparse between each fix iteration to get the correct parse tree? That would probably go a long way to solving most issues - like that raised in #2134 . It would slow down fix but not lint and only for bad files that need fixes (or lots of iterations of fixes) applied so I think worth it.

Thoughts @barrywhart / @alanmcruickshank ?

barrywhart · 2022-02-05T16:08:11Z

I'd like to either do what you suggest or make the core linter smarter, so if a rule gives it a "bad" fix, it'll figure out the right place to apply it. If the latter is possible, it fixes the issue without the performance hit of re-parsing the file each time a fix is applied.

tunetheweb · 2022-02-05T16:22:00Z

Btw I was suggesting re-parsing for every iteration (of which we currently do a max of 10), rather than every fix within that iteration. Just think that’ll be simpler than fixing the parser (and potentially more future proof for future rules that aren’t (or can’t be) implemented 100% right).

barrywhart · 2022-02-05T16:39:27Z

re-parsing for every iteration

I understand. I don't know the inner details, but our hands may be tied, e.g. if we only re-parse after every iteration, will we still have issues? I think you're probably correct, though -- IIRC, for each iteration, all the rules "see" the same parse tree, and the fixes are only applied at the end of the iteration.

tunetheweb added the bug Something isn't working label Aug 20, 2021

tunetheweb added the rule bug A rule is not working as intended, either missing errors or incorrectly highlighting non-errors label Aug 25, 2021

barrywhart assigned alanmcruickshank Oct 16, 2021

barrywhart mentioned this issue Oct 16, 2021

L027 bizarre log output: "Unqualified reference '\\n *' found in select" #1668

Closed

This was referenced Nov 1, 2021

TSQL linting fix: L019+L034 breaks the code #1734

Closed

Prevent L019 plus L034 corrupting SQL #1803

Merged

barrywhart mentioned this issue Dec 5, 2021

L008 refactor #2004

Merged

1 task

jpy-git mentioned this issue Dec 18, 2021

Result of fix changes on subsequent runs (fix is non-deterministic) #2134

Closed

jpers36 mentioned this issue Dec 19, 2021

Lint spaces in qualified names #2130

Merged

barrywhart unassigned alanmcruickshank Feb 3, 2022

This was referenced Feb 8, 2022

fix succeeds even though lint fails with PRS error #2345

Closed

fix doesn't settle on the first time #2584

Closed

barrywhart mentioned this issue Feb 15, 2022

Core linter enhancement: Check for successful parse after applying fixes #2657

Merged

1 task

barrywhart self-assigned this Feb 15, 2022

barrywhart mentioned this issue Feb 15, 2022

L003 (indentation rule): Improve anchors for fixes to avoid corrupting parse tree #2664

Closed

1 task

barrywhart closed this as completed in #2657 Feb 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running indentation fixes adds whitespace to various code segments #1304

Running indentation fixes adds whitespace to various code segments #1304

tunetheweb commented Aug 20, 2021 •

edited

barrywhart commented Aug 22, 2021 •

edited

alanmcruickshank commented Aug 24, 2021 •

edited by tunetheweb

barrywhart commented Aug 25, 2021

tunetheweb commented Aug 27, 2021

tunetheweb commented Aug 27, 2021

tunetheweb commented Aug 27, 2021

barrywhart commented Aug 30, 2021

barrywhart commented Oct 16, 2021

tunetheweb commented Feb 5, 2022

barrywhart commented Feb 5, 2022

tunetheweb commented Feb 5, 2022

barrywhart commented Feb 5, 2022 •

edited

Running indentation fixes adds whitespace to various code segments #1304

Running indentation fixes adds whitespace to various code segments #1304

Comments

tunetheweb commented Aug 20, 2021 • edited

Expected Behaviour

Observed Behaviour

Steps to Reproduce

Dialect

Version

Configuration

barrywhart commented Aug 22, 2021 • edited

alanmcruickshank commented Aug 24, 2021 • edited by tunetheweb

barrywhart commented Aug 25, 2021

tunetheweb commented Aug 27, 2021

tunetheweb commented Aug 27, 2021

tunetheweb commented Aug 27, 2021

barrywhart commented Aug 30, 2021

barrywhart commented Oct 16, 2021

tunetheweb commented Feb 5, 2022

barrywhart commented Feb 5, 2022

tunetheweb commented Feb 5, 2022

barrywhart commented Feb 5, 2022 • edited

tunetheweb commented Aug 20, 2021 •

edited

barrywhart commented Aug 22, 2021 •

edited

alanmcruickshank commented Aug 24, 2021 •

edited by tunetheweb

barrywhart commented Feb 5, 2022 •

edited