Running indentation fixes adds whitespace to various code segments #1304
Comments
@alanmcruickshank: I wonder if you may have any insights on this issue. After some initial investigation, I have learned that rule L003 is adding spaces before the
The problem is, the
If I understand correctly, it shouldn't be possible for a
My question is: would you consider this a bug in L003 for adding a whitespace segment somewhere it's not allowed, or a bug in the core linter for not inserting the whitespace at the correct location in the parse tree? For reference, this parse output shows where the whitespace should be, i.e. between the two
|
I think this is the latter, i.e. I do wonder whether they're a bit naive to do this effectively at the moment and could do with some work. The way I think it should work is that the position of I wonder whether the fixing routines are being too selective though and not matching on That leaves us with two options I think:
|
I think this is exactly the behavior. I studied the code, and it's literally checking for object identity, i.e. which specific segment object the lint rule specified as the anchor in the LintFix object. It's not considering the character position at all. Will it always be correct to apply the fix at the highest possible location in the parse tree? 🤔 |
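To make the distinction concrete, here is a minimal sketch of matching an anchor by object identity versus by character position. The `Seg` class and both helper functions are hypothetical stand-ins invented for illustration; they are not SQLFluff's actual segment classes or fix-application code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False keeps Python's default identity-based equality
class Seg:
    """Toy parse-tree node (hypothetical; not SQLFluff's BaseSegment)."""

    kind: str
    raw: str = ""
    start: int = 0  # character offset where this node begins
    children: List["Seg"] = field(default_factory=list)


def find_by_identity(root: Seg, anchor: Seg) -> Optional[Seg]:
    """Return the node that *is* the anchor object (depth-first search)."""
    if root is anchor:
        return root
    for child in root.children:
        found = find_by_identity(child, anchor)
        if found is not None:
            return found
    return None


def find_by_position(root: Seg, start: int) -> List[Seg]:
    """Return every node that begins at `start`, outermost first."""
    hits = [root] if root.start == start else []
    for child in root.children:
        hits.extend(find_by_position(child, start))
    return hits


# "FROM UNNEST": three nested nodes all begin at character offset 5.
identifier = Seg("function_name_identifier", "UNNEST", start=5)
name = Seg("function_name", start=5, children=[identifier])
function = Seg("function", start=5, children=[name])
root = Seg(
    "from_expression",
    start=0,
    children=[Seg("keyword", "FROM", 0), Seg("whitespace", " ", 4), function],
)

print(find_by_identity(root, identifier).kind)
print([seg.kind for seg in find_by_position(root, 5)])
```

Identity matching yields exactly one node, the innermost one the rule happened to hold. Position matching yields every nested node that starts at the same offset, which is why the question of which level to apply the fix at comes up.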
@barrywhart / @alanmcruickshank I did some more digging on this (didn't get the answer yet btw before you get too excited!) First of all I confirmed it was affecting all dialects and wasn't anything to do with the special case for Secondly it broke in 0.5.3 - it works in 0.5.2. But can't see anything obvious in the 0.5.3 release notes to explain this: https://github.com/sqlfluff/sqlfluff/releases/tag/0.5.3 Here's the diffs: 0.5.2...0.5.3 |
OK this is the commit that broke it: 99f8bc6 Unfortunately it's quite a big one so a lot of change to go through :-( |
OK so 99f8bc6 broke it for UNNEST statement, but it was still broken for regular table from clauses (the second example in first issue) prior to that (and I think possibly forever to be honest!) But the change in that commit does give us some further clues. Basically using a simple structure works and prevents whitespaces being added in the wrong place:

    FunctionNameSegment=RegexParser(
        r"[A-Z][A-Z0-9_]*",
        CodeSegment,
        name="function_name",
        type="function_name",
    ),

But using a more complex structure class doesn't prevent this and so gives the bug:

    @ansi_dialect.segment()
    class FunctionNameSegment(BaseSegment):
        """Function name."""
        type = "function_name"
        match_grammar = RegexParser(
            r"[A-Z_][A-Z0-9_]*",
            CodeSegment,
        )

Shouldn't these basically be the same? This basically changed in that commit for
Getting kind of stuck with this now so any pointers greatly appreciated! |
The SQLFluff parser makes a distinction between segments and grammars, with grammars being a lower-level thing. IIUC, grammars behave a bit like macros in C; they're more like a preprocessor thing in that they do not appear in the final parse tree; only the underlying things do. OTOH, segments do appear in the parse tree. So in the second case, the parse tree would have both a
Your analysis helps make sense of this. In the above case, the grammar only contains one segment.
I don't know exactly how we want to fix this, but one possible idea is that when applying fixes, SQLFluff should treat the lowest-level segments in the tree as "atomic": i.e., don't make changes within segments; instead, move up the parse tree and apply the fix "adjacent to" the segment. This is a heuristic and may not avoid all possible problems, but I think it would reduce the likelihood of problems.
I think the bug is located here: It's applying the fix exactly on the anchor, without considering the structural heuristic above.
To be clear, I haven't worked on this area of the code and am not familiar with it. But when I was investigating this issue recently by looking through code with the test case in this issue, this is the area I found where it was applying the fix and it made me think, this needs to be smarter. I see some code in this file ( |
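The "move up the parse tree and apply the fix adjacent to the segment" idea can be sketched roughly as follows. Everything here (`Node`, `hoist_anchor`, `insert_before`) is a hypothetical toy, not SQLFluff's real tree or fix routine, and the climbing rule (keep going while the anchor is the first child of its parent) is just one assumed reading of the heuristic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Toy tree node with a parent pointer (hypothetical, for illustration)."""

    kind: str
    raw: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *kids: "Node") -> "Node":
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self

    def text(self) -> str:
        return self.raw + "".join(child.text() for child in self.children)


def hoist_anchor(anchor: Node) -> Node:
    """Climb while the node is its parent's first child (never up to the root)."""
    node = anchor
    while (
        node.parent is not None
        and node.parent.parent is not None
        and node.parent.children[0] is node
    ):
        node = node.parent
    return node


def insert_before(anchor: Node, new: Node) -> None:
    """Insert `new` as a sibling immediately before `anchor`."""
    parent = anchor.parent
    index = next(i for i, c in enumerate(parent.children) if c is anchor)
    new.parent = parent
    parent.children.insert(index, new)


identifier = Node("function_name_identifier", "UNNEST")
name = Node("function_name").add(identifier)
function = Node("function").add(name, Node("bracketed", "(x)"))
root = Node("file").add(Node("keyword", "FROM"), Node("newline", "\n"), function)

# The rule anchors on the identifier; the fix is applied next to `function`.
target = hoist_anchor(identifier)
insert_before(target, Node("whitespace", "    "))

print(repr(root.text()))
print(repr(name.text()))
```

In this toy tree the indent ends up as a sibling of the whole function node, so reading the function name still gives the bare identifier. As the comment above notes, this is a heuristic and may not be correct for every tree shape.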
@alanmcruickshank: Assigning this to you for a closer look, if you don't mind. I found another instance of this issue today: #1668. |
So the problem is we want to add whitespace in a different section of the parse tree than we’re currently analysing. The whitespace is inserted in the correct place in the SQL, but not in the parse tree (which has several layers of abstraction applied on top of that SQL). While it would be ideal to insert it in the right place where possible, sometimes (like the example here), that’s basically not possible (or at least very tricky!). So I wonder if instead of trying to solve that, we should instead just reparse between each fix iteration to get the correct parse tree? That would probably go a long way to solving most issues - like that raised in #2134 . It would slow down Thoughts @barrywhart / @alanmcruickshank ? |
I'd like to either do what you suggest or make the core linter smarter, so if a rule gives it a "bad" fix, it'll figure out the right place to apply it. If the latter is possible, it fixes the issue without the performance hit of re-parsing the file each time a fix is applied. |
Btw I was suggesting re-parsing for every iteration (of which we currently do a max of 10), rather than every fix within that iteration. Just think that’ll be simpler than fixing the parser (and potentially more future proof for future rules that aren’t (or can’t be) implemented 100% right). |
I understand. I don't know the inner details, but our hands may be tied, e.g. if we only re-parse after every iteration, will we still have issues? I think you're probably correct, though -- IIRC, for each iteration, all the rules "see" the same parse tree, and the fixes are only applied at the end of the iteration. |
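The re-parse-per-iteration proposal could look roughly like the loop below. This is a sketch under assumptions: `fix_with_reparse`, `toy_parse`, and `toy_lint` are hypothetical names, fixes are modelled as plain (offset, text) insertions, and the only detail taken from the discussion is the cap of 10 iterations.

```python
from typing import Callable, List, Tuple

Fix = Tuple[int, str]  # (character offset, text to insert) - assumed shape


def fix_with_reparse(
    sql: str,
    parse: Callable[[str], object],
    lint: Callable[[object], List[Fix]],
    max_loops: int = 10,
) -> Tuple[str, int]:
    """Apply fixes to the text, re-parsing once per iteration."""
    for loop in range(max_loops):
        tree = parse(sql)  # a fresh tree, so all rules see a consistent structure
        fixes = lint(tree)
        if not fixes:
            return sql, loop
        # Apply right to left so earlier offsets stay valid.
        for offset, text in sorted(fixes, reverse=True):
            sql = sql[:offset] + text + sql[offset:]
    return sql, max_loops


def toy_parse(sql: str) -> List[Tuple[int, str]]:
    """Hypothetical 'parser': a list of (start offset, line) pairs."""
    lines, offset = [], 0
    for line in sql.split("\n"):
        lines.append((offset, line))
        offset += len(line) + 1
    return lines


def toy_lint(tree: List[Tuple[int, str]]) -> List[Fix]:
    """Hypothetical rule: every line after the first needs a 4-space indent."""
    return [
        (offset, "    ")
        for offset, line in tree[1:]
        if line and not line.startswith("    ")
    ]


fixed, loops = fix_with_reparse("FROM\nUNNEST(x)", toy_parse, toy_lint)
print(repr(fixed), loops)
```

Because the fixes are applied to the text and the tree is rebuilt from scratch, a fix that would have landed at the wrong level of the old tree cannot leave a malformed tree behind for the next iteration. The cost is one extra parse per iteration.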
As a follow on from #1302 and #1303 (and likely related to #1149 and possibly others) I have noticed that when running sqlfluff fix on SQL which requires indentation fixes, the extra whitespace is added to the code, rather than as additional whitespace segments.
So the function_name_identifier of UNNEST can become ••••UNNEST (where • represents a space).
This can then cause unexpected consequences when the next rule runs in that test, as suddenly the parsed tree is different from what it should be.
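A toy illustration of the symptom: two trees that render to identical SQL text but give different answers when a later rule reads the function name. The nested-tuple tree format and the exact shape of the "observed" tree are assumptions made for this sketch, not real SQLFluff parse output.

```python
def render(tree):
    """Concatenate the raw text of every leaf, left to right."""
    _kind, payload = tree
    if isinstance(payload, str):
        return payload
    return "".join(render(child) for child in payload)


def find(tree, kind):
    """Return the first subtree of the given kind, or None."""
    if tree[0] == kind:
        return tree
    if isinstance(tree[1], str):
        return None
    for child in tree[1]:
        hit = find(child, kind)
        if hit is not None:
            return hit
    return None


function_name = ("function_name", [("function_name_identifier", "UNNEST")])

# Expected: the indent is a sibling whitespace segment outside the function.
expected = (
    "from_expression",
    [("whitespace", "    "), ("function", [function_name, ("bracketed", "(x)")])],
)

# Observed (assumed shape): the indent ends up inside the function name.
padded_name = (
    "function_name",
    [("whitespace", "    "), ("function_name_identifier", "UNNEST")],
)
observed = ("from_expression", [("function", [padded_name, ("bracketed", "(x)")])])

print(render(expected) == render(observed))
print(repr(render(find(expected, "function_name"))))
print(repr(render(find(observed, "function_name"))))
```

The rendered SQL is the same either way, so the file on disk looks right; only code that inspects the tree, such as a rule comparing the function name to UNNEST, sees the difference.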
Expected Behaviour
Whitespace should not be added to other identifiers/segments as this can break expectations.
Observed Behaviour
Whitespace is added to other identifiers/segments, resulting in the next rule in that run potentially not running properly.
Steps to Reproduce
Add the below print statement to src/sqlfluff/core/rules/analysis/select.py:
Create a test.sql file with this:
Then run this command:
And note that even though we are only printing the function name, the spaces are included:
I've confirmed it's not a bigquery or function_name specific issue as using this SQL:
And adding this to the _has_value_table_function:
Then running this command:
Leads to the same issue:
Dialect
all
Version
Include the output of sqlfluff --version along with your Python version
0.6.3 - python 3.8
Configuration
No .sqlfluff, so default config.