Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running indentation fixes adds whitespace to various code segments #1304

Closed
tunetheweb opened this issue Aug 20, 2021 · 12 comments · Fixed by #2657
Closed

Running indentation fixes adds whitespace to various code segments #1304

tunetheweb opened this issue Aug 20, 2021 · 12 comments · Fixed by #2657
Assignees
Labels
bug Something isn't working rule bug A rule is not working as intended, either missing errors or incorrectly highlighting non-errors

Comments

@tunetheweb
Copy link
Member

tunetheweb commented Aug 20, 2021

As a follow on from #1302 and #1303 (and likely related to #1149 and possibly others) I have noticed that when running sqlfluff fix on SQL which requires indentation fixes, the extra white space is added to the code, rather than as addition whitespace segments.

So the function_name_identifier of UNNEST can become ••••UNNEST (where • represents a space).

This can then cause unexpected consequences when the next rule runs in that test as suddenly the parsed tree is different than what it should be.

Expected Behaviour

Whitespace should not be added to other identifiers/segments as this can break expectations.

Observed Behaviour

Whitespace is added other identifiers/segments, resulting in the next rule in that run potentially not running properly.

Steps to Reproduce

Add the below print statement to src/sqlfluff/core/rules/analysis/select.py:

    for function_name in table_expr.recursive_crawl("function_name"):
        print('LOOK AT THIS:%s:' % function_name.raw.lower())
        # Other rules can increase whitespace in the function name, so use strip to remove
        # See: https://github.com/sqlfluff/sqlfluff/issues/1304
        if function_name.raw.lower() in dialect.sets("value_table_functions"):
            return True
    return False

Create a test.sql file with this:

SELECT
    category,
    value
FROM
    table1,
UNNEST(1, 2, 3) AS category

Then run this command:

sqlfluff fix --dialect bigquery test.sql --fixed-suffix FIXED --force --rules L003,L025

And note that even though we are only printing the function name, the spaces are included:

==== finding fixable violations ====
LOOK AT THIS:    unnest:
LOOK AT THIS:    unnest:
LOOK AT THIS:    unnest:
...etc.

I've confirmed it's not a bigquery or function_name specific issue as using this SQL:

SELECT
    category,
    value
FROM
    table1,
table2

And adding this to the _has_value_table_function:

    for table_ref in table_expr.recursive_crawl("table_reference"):
        print("LOOK AT THIS:%s:" % table_ref.raw)

Then running this command:

sqlfluff fix test.sql --fixed-suffix FIXED --force --rules L003,L025

Leads to the same issue:

==== finding fixable violations ====
LOOK AT THIS:table1:
LOOK AT THIS:    table2:
LOOK AT THIS:table1:
LOOK AT THIS:    table2:
== [test.sql] FAIL
L:   6 | P:   1 | L003 | Indentation not consistent with line #5

Dialect

all

Version

Include the output of sqlfluff --version along with your Python version

0.6.3 - python 3.8

Configuration

No .sqlfluff, so default config.

@tunetheweb tunetheweb added the bug Something isn't working label Aug 20, 2021
@barrywhart
Copy link
Member

barrywhart commented Aug 22, 2021

@alanmcruickshank: I wonder if you may have any insights on this issue. After some initial investigation, I have learned that rule L003 is adding spaces before the UNNEST here, around line 175:

        # If we don't have any indent and we should, then add a single
        elif len("".join(elem.raw for elem in current_indent_buffer)) == 0:
            fixes = [
                LintFix(
                    "create",
                    current_anchor,
                    WhitespaceSegment(
                        raw=desired_indent,
                    ),
                )
            ]

The problem is, the WhitespaceSegment is being added to an area of the parse tree that shouldn't contain whitespace -- a FunctionNameSegment. Inside LintedFile.fix_string(), the updated tree contains:

(Pdb) self.tree.segments[0].segments[0].segments[3].segments[-1].segments[1].segments[0].segments[0].segments[0]
<FunctionNameSegment: ([L:  6, P:  1])>
(Pdb) [(s.type, s.raw) for s in self.tree.segments[0].segments[0].segments[3].segments[-1].segments[1].segments[0].segments[0].segments[0].segments]
[('whitespace', '    '), ('raw', 'UNNEST')]

If I understand correctly, it shouldn't be possible for a FunctionNameSegment to contain a whitespace segment (this is from dialect_ansi.py):

@ansi_dialect.segment()
class FunctionNameSegment(BaseSegment):
    """Function name, including any prefix bits, e.g. project or schema."""

    type = "function_name"
    match_grammar = Sequence(
        # Project name, schema identifier, etc.
        AnyNumberOf(
            Sequence(
                Ref("SingleIdentifierGrammar"),
                Ref("DotSegment"),
            ),
        ),
        # Base function name
        OneOf(
            Ref("FunctionNameIdentifierSegment"),
            Ref("QuotedIdentifierSegment"),
        ),
        allow_gaps=False,
    )

My question is, would you consider this a bug in L003 for adding a whitespace segment somewhere it's not allowed, or is it a bug in the core linter for not inserting the whitespace at the correct location in the parse tree.

For reference, this parse output shows where the whitespace should be, i.e. between the two from_expressions, rather than placing it 5 levels deeper:

  • from_expression
  • [META] indent
  • from_expression_element
  • table_expression
  • function
  • function_name
sqlfluff parse --dialect bigquery ./test/fixtures/linter/autofix/bigquery/005_unnest_spacing/after.sql
[L:  1, P:  1]      |file:
[L:  1, P:  1]      |    statement:
[L:  1, P:  1]      |        select_statement:
[L:  1, P:  1]      |            select_clause:
...
[L:  4, P:  1]      |            from_clause:
[L:  4, P:  1]      |                keyword:                                      'FROM'
[L:  4, P:  5]      |                newline:                                      '\n'
[L:  5, P:  1]      |                whitespace:                                   '    '
[L:  5, P:  5]      |                from_expression:
...
[L:  5, P: 11]      |                comma:                                        ','
[L:  5, P: 12]      |                newline:                                      '\n'
[L:  6, P:  1]      |                whitespace:                                   '    '  # <<<--- HERE
[L:  6, P:  5]      |                from_expression:
[L:  6, P:  5]      |                    [META] indent:
[L:  6, P:  5]      |                    from_expression_element:
[L:  6, P:  5]      |                        table_expression:
[L:  6, P:  5]      |                            function:
[L:  6, P:  5]      |                                function_name:
[L:  6, P:  5]      |                                    function_name_identifier:  'UNNEST'
[L:  6, P: 11]      |                                bracketed:
[L:  6, P: 11]      |                                    start_bracket:            '('
[L:  6, P: 12]      |                                    [META] indent:
[L:  6, P: 12]      |                                    expression:
[L:  6, P: 12]      |                                        literal:              '1'
[L:  6, P: 13]      |                                    comma:                    ','
[L:  6, P: 14]      |                                    whitespace:               ' '
[L:  6, P: 15]      |                                    expression:
[L:  6, P: 15]      |                                        literal:              '2'
[L:  6, P: 16]      |                                    comma:                    ','
[L:  6, P: 17]      |                                    whitespace:               ' '
[L:  6, P: 18]      |                                    expression:
[L:  6, P: 18]      |                                        literal:              '3'
[L:  6, P: 19]      |                                    [META] dedent:
[L:  6, P: 19]      |                                    end_bracket:              ')'
[L:  6, P: 20]      |                        whitespace:                           ' '
[L:  6, P: 21]      |                        alias_expression:
[L:  6, P: 21]      |                            keyword:                          'AS'
[L:  6, P: 23]      |                            whitespace:                       ' '
[L:  6, P: 24]      |                            identifier:                       'category'
[L:  6, P: 32]      |                    [META] dedent:
[L:  6, P: 32]      |    newline:                                                  '\n'

@alanmcruickshank
Copy link
Member

alanmcruickshank commented Aug 24, 2021

I think this is the latter, i.e. bug in ... core linter for not inserting the whitespace at the correct location in the parse tree. In theory the fixing routines should know where they can and can't add whitespace. I don't think individual rules should have to worry about that level of restriction.

I do wonder whether they're a bit naive to do this effectively at the moment and could do with some work.

The way I think it should work is that the position of UNNEST is effectively the same as the position of from_expression because they start at the same point. The fixing routine should evaluate from_expression first (as it's higher in the tree) and then consume the fix at that point, inserting the whitespace at the correct point, rather than waiting until we get to UNNEST.

I wonder whether the fixing routines are being too selective though and not matching on from_expression properly and instead only matching when we hit UNNEST. 🤔

That leaves us with two options I think:

  1. Revisit the matching of fixes to segments. This will definitely have changed when I did the refactor of positioning a few months ago - so it's very possible there's a bug in here which can be fixed to achieve the desired result.
  2. If we find that we can't initially position the whitespace in the right place in the first pass (without breaking something else) - then we could also do a second pass through the tree after any fixes are applied to "promote" and inappropriately positioned whitespace up to the correct point in the tree. It's a bit less neat - but could also work quite well.

@barrywhart
Copy link
Member

I wonder whether the fixing routines are being too selective though and not matching on from_expression properly and instead only matching when we hit UNNEST.

I think this is exactly the behavior. I did some study of the code, and it's literally checking for object identity, I.e. what specific segment object did the lint rule specify as the anchor in the LintFix object. It's not considering the character position at all.

Will it always be correct to apply the fix at the highest possible location in the parse tree? 🤔

@tunetheweb tunetheweb added the rule bug A rule is not working as intended, either missing errors or incorrectly highlighting non-errors label Aug 25, 2021
@tunetheweb
Copy link
Member Author

@barrywhart / @alanmcruickshank I did some more digging on this (didn't get the answer yet btw before you get too excited!)

First of all I confirmed it was affecting all dialects and wasn't anything to do with the special case for UNNEST. I updated the issue above to show a simple use case in ANSI with two tables in FROM statement.

Secondly it broke in 0.5.3 - it works in 0.5.2. But can't see anything obvious in the 0.5.3 release notes to explain this: https://github.com/sqlfluff/sqlfluff/releases/tag/0.5.3

Here's the diffs: 0.5.2...0.5.3

@tunetheweb
Copy link
Member Author

OK this is the commit that broke it: 99f8bc6

Unfortunately it's quite a big one so a lot of change to go through :-(

@tunetheweb
Copy link
Member Author

OK so 99f8bc6 broke it for UNNEST statement, but it was still broken for regular table from clauses (the second example in first issue) prior to that (and I think possibly forever to be honest!)

But the change in that commit does give us some further clues. Basically using a simple structure works and prevents whitespaces being added in the wrong place:

    FunctionNameSegment=RegexParser(
        r"[A-Z][A-Z0-9_]*",
        CodeSegment,
        name="function_name",
        type="function_name",
    ),

But using a more complex structure class doesn't prevent this and so gives the bug:

@ansi_dialect.segment()
class FunctionNameSegment(BaseSegment):
    """Function name."""

    type = "function_name"
    match_grammar = RegexParser(
        r"[A-Z_][A-Z0-9_]*",
        CodeSegment,
    )

Shouldn't these basically be the same?

This basically changed in that commit for FunctionNameSegment hence why it broke UNNEST (which is basically a function). However tables were already a complex class so they were broken already by the time of that commit.

Getting kind of stuck with this now so any pointers greatly appreciated!

@barrywhart
Copy link
Member

The SQLFluff parser makes a distinction between segments and grammars, with grammars being a lower-level thing. IIUC, grammars behave a bit like macros in C; they're more like a preprocessor thing in that they do not appear in the final parse tree; only the underlying things do. OTOH, segments do appear in the parse tree. So in the second case, the parse tree would have both a function_name and the underlying string nested under it.

Your analysis helps make sense of this. In the above case, the grammar only contains one segment.

I don't know exactly how we want to fix this, but one possible idea is that when applying fixes, SQLFluff should treat the lowest-level segments in the tree as "atomic": i.e., don't make changes within segments; instead, move up the parse tree and apply the fix "adjacent to" the segment. This is a heuristic and may not avoid all possible problems, but I think it would reduce the likelihood of problems.

I think the bug is located here: It's applying the fix exactly on the anchor, without considering the structural heuristic above.

                        # Look for identity not just equality.
                        # This handles potential positioning ambiguity.
                        if f.anchor is seg:

To be clear, I haven't worked on this area of the code and am not familiar with it. But when I was investigating this issue recently by looking through code with the test case in this issue, this is the area I found where it was applying the fix and it made me think, this needs to be smarter.

I see some code in this file (linted_file.py) that appears to apply some policies about how or whether to apply certain fixes. I don't know if or how this code relates to the file above. See the comments for the fix_string() function.

@barrywhart
Copy link
Member

@alanmcruickshank: Assigning this to you for a closer look, if you don't mind. I found another instance of this issue today: #1668.

@tunetheweb
Copy link
Member Author

So the problem is we want to add whitespace in a different section of the parse tree than we’re currently analysing. The white space is inserted in the correctly place in the SQL, but not in the parse tree (which has several layers of abstraction applied on top of that SQL).

While it would be ideal to insert it in the right place where possible, sometimes (like the example here), that’s basically not possible (or at least very tricky!).

So I wonder if instead of trying to solve that, we should instead just reparse between each fix iteration to get the correct parse tree? That would probably go a long way to solving most issues - like that raised in #2134 . It would slow down fix but not lint and only for bad files that need fixes (or lots of iterations of fixes) applied so I think worth it.

Thoughts @barrywhart / @alanmcruickshank ?

@barrywhart
Copy link
Member

I'd like to either do what you suggest or make the core linter smarter, so if a rule gives it a "bad" fix, it'll figure out the right place to apply it. If the latter is possible, it fixes the issue without the performance hit of re-parsing the file each time a fix is applied.

@tunetheweb
Copy link
Member Author

Btw I was suggesting re-parsing for every iteration (of which we currently do a max of 10), rather than every fix within that iteration. Just think that’ll be simpler than fixing the parser (and potentially more future proof for future rules that aren’t (or can’t be) implemented 100% right).

@barrywhart
Copy link
Member

barrywhart commented Feb 5, 2022

re-parsing for every iteration

I understand. I don't know the inner details, but our hands may be tied, e.g. if we only re-parse after every iteration, will we still have issues? I think you're probably correct, though -- IIRC, for each iteration, all the rules "see" the same parse tree, and the fixes are only applied at the end of the iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working rule bug A rule is not working as intended, either missing errors or incorrectly highlighting non-errors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants