L008 refactor #2004

jpy-git · 2021-11-30T00:43:41Z

Brief summary of the change made

This PR fixes #2001. The column reference following the comma had multiple child segments, so each child segment detected the same comma and re-triggered the rule. I fixed this by using memory to keep a log of which commas have previously been fixed, and added a test case to verify that each comma is only counted once.
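As a rough illustration of the double counting and the memory-based workaround, here is a minimal sketch using a toy model rather than SQLFluff's actual rule API. The names `check_segment` and `memory` are hypothetical: each child segment of the column reference looks back at the same preceding comma, and a set of already-reported comma positions suppresses the repeats.

```python
# Toy model (not SQLFluff's real API): each child segment after a comma
# looks back, finds the same comma, and would otherwise report it again.
from typing import Optional, Set


def check_segment(preceding_comma_pos: int, memory: Set[int]) -> Optional[str]:
    """Report a comma once; `memory` holds comma positions already reported."""
    if preceding_comma_pos in memory:
        return None  # already flagged via an earlier child segment
    memory.add(preceding_comma_pos)
    return f"L008: issue after comma at position {preceding_comma_pos}"


# Three child segments of one column reference all see the comma at position 8.
memory: Set[int] = set()
results = [check_segment(8, memory) for _ in range(3)]
violations = [r for r in results if r is not None]
print(len(violations))  # 1
```

Without the `memory` check, the same comma would be reported three times in this toy setup.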

Are there any other side effects of this change that we should be aware of?

No

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with python test/generate_parse_fixture_yml.py or by running tox locally).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2021-11-30T00:57:17Z

Codecov Report

Merging #2004 (a19ac2f) into main (2785e84) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2004   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          148       148           
  Lines        10492     10495    +3     
=========================================
+ Hits         10492     10495    +3

Impacted Files	Coverage Δ
src/sqlfluff/core/parser/segments/base.py	`100.00% <100.00%> (ø)`
src/sqlfluff/rules/L008.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jpy-git · 2021-12-02T20:24:10Z

@barrywhart could you take a look at this, please? It uses memory to log fixed commas so we don't double-flag them.

barrywhart · 2021-12-02T21:19:26Z

I can believe that this works, but it doesn't quite feel like a "fix" to me. My instinct is that there's a design flaw in the rule. Is there a way to write the rule so that it only checks for the issue for certain segments, so that by design, it only detects the error once? E.g. the first segment after the comma or the first raw segment after the comma?

In particular, this code seems odd to me: IIUC, rather than processing the segment that was passed by the linter, it looks at another segment related to that one.

Would love your thoughts on this. I can take a closer look if my advice seems too vague or wrongheaded (always a possibility!!).

jpy-git · 2021-12-03T00:32:40Z

@barrywhart makes sense, let me have a look at a more comprehensive fix and get back to you on this one tomorrow 👍

jpy-git

@barrywhart this is ready for review now 😄 Hopefully a much more robust implementation of L008

src/sqlfluff/core/parser/segments/base.py

jpy-git · 2021-12-05T14:59:47Z

src/sqlfluff/rules/L008.py

+        # Raw stack is appropriate as the only segments we can care about are
+        # comma, whitespace, newline, and comment, which are all raw.
+        # Using the raw_segments allows us to account for possible unexpected
+        # parse tree structures resulting from other rule fixes.


FYI this is the sort of thing I'm referring to above (from fixing test/fixtures/linter/autofix/bigquery/002_templating/before.sql)

One of the other rules is adding newlines and whitespace within a function_name segment. The approach I've taken in L008 should work regardless of issues like this 👍
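The idea of working on raw segments can be sketched roughly as follows. This is a toy model, not the code in this PR: `Seg`, `raw_segments` and `commas_missing_space` are hypothetical stand-ins. It flattens the tree into its leaves and inspects each comma exactly once, so the result does not depend on how the segments around the comma happen to be nested.

```python
# Toy model (not SQLFluff's real classes): a segment is either raw (a leaf
# with text and a type) or a parent holding child segments.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Seg:
    type: str
    raw: str = ""
    children: List["Seg"] = field(default_factory=list)

    def raw_segments(self) -> List["Seg"]:
        """Flatten the tree into its leaf segments, in source order."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.raw_segments()]


def commas_missing_space(root: Seg) -> List[int]:
    """Return leaf indexes of commas not followed by whitespace/newline/comment."""
    leaves = root.raw_segments()
    allowed = {"whitespace", "newline", "comment"}
    bad = []
    for i, leaf in enumerate(leaves):
        if leaf.type != "comma":
            continue
        nxt = leaves[i + 1] if i + 1 < len(leaves) else None
        if nxt is not None and nxt.type not in allowed:
            bad.append(i)
    return bad


# "a,b.c": the column reference after the comma has several children,
# but the comma is still visited exactly once.
tree = Seg("select_clause", children=[
    Seg("identifier", "a"),
    Seg("comma", ","),
    Seg("column_reference", children=[
        Seg("identifier", "b"), Seg("dot", "."), Seg("identifier", "c"),
    ]),
])
print(commas_missing_space(tree))  # [1]
```

Because the check is driven by the comma itself rather than by whichever segment follows it, there is no need for memory to de-duplicate results.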

It sounds like you may be seeing an occurrence of #1304. I would really like to get all these rule bugs fixed (I suspect many of them involve misuse of LintFix type edit) and add a linter check to prevent such bugs in the future.

Other related issues:
- L027 bizarre log output: "Unqualified reference '\\n *' found in select" #1668: Another apparent rule bug
- Option to automatically re-parse the SQL after "fix" #1012: Vague idea about a general way of detecting all such issues

@barrywhart so I think a lot of these issues aren't necessarily an edit thing; you would get a similar issue with create_before or create_after, as they all work in a similar way.

Reading through the apply_fixes logic (sqlfluff/src/sqlfluff/core/parser/segments/base.py, line 938 in 0624a84: `def apply_fixes(self, fixes):`), a fix is applied by:

- Search down the parse tree, at each level appending each segment you find to a segment buffer.
- If you find the anchor segment (of the fix, not the result), then either:
  - edit: add the new segments to the buffer, but not the anchor.
  - create_before: add the new segments to the buffer and then the anchor segment.
  - create_after: add the anchor segment to the buffer and then the new segments.
- Subsequent segments are then added to the buffer.
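The steps above can be sketched with a small toy model. This is not SQLFluff's real `apply_fixes`; the `Seg` and `Fix` classes and the function below are simplified, hypothetical stand-ins that only show how the buffer is built and why new segments end up as siblings of the anchor.

```python
# Toy model of the buffer-based fix application described above.
# Seg, Fix and apply_fixes are simplified stand-ins, not SQLFluff's classes.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Seg:
    type: str
    raw: str = ""
    children: List["Seg"] = field(default_factory=list)

    def text(self) -> str:
        if not self.children:
            return self.raw
        return "".join(c.text() for c in self.children)


@dataclass
class Fix:
    edit_type: str  # "edit", "create_before" or "create_after"
    anchor: Seg
    new: List[Seg]


def apply_fixes(seg: Seg, fixes: List[Fix]) -> Seg:
    """Rebuild `seg`, inserting fix segments at the anchor's own tree level."""
    buffer: List[Seg] = []
    for child in seg.children:
        # Match anchors by identity, not by value.
        fix = next((f for f in fixes if f.anchor is child), None)
        if fix is None:
            buffer.append(apply_fixes(child, fixes))  # keep searching downwards
        elif fix.edit_type == "edit":
            buffer.extend(fix.new)  # the anchor itself is dropped
        elif fix.edit_type == "create_before":
            buffer.extend(fix.new + [child])
        elif fix.edit_type == "create_after":
            buffer.extend([child] + fix.new)
        else:
            raise ValueError(f"unknown edit type: {fix.edit_type}")
    return Seg(seg.type, seg.raw, buffer)


comma = Seg("comma", ",")
tree = Seg("file", children=[Seg("identifier", "a"), comma, Seg("identifier", "b")])
fixed = apply_fixes(tree, [Fix("create_after", comma, [Seg("whitespace", " ")])])
print(fixed.text())  # a, b
```

In this sketch the new whitespace segment becomes a direct sibling of the comma, whatever level of the tree the comma sits at.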

The key thing to note here is that all fix segments are inserted at the SAME level as the fix anchor_segment. The reason we seem to get a lot of issues like the one above is that some logic (in this case I think it's L016) finds a segment to analyse, determines whether a fix needs to be applied, and then naively applies the fix at the same tree level even if that is not appropriate.

So in this case L016 probably determines that it can split a line by looking at the first raw segment after a space, which is the function_name_identifier, and then inserts the newline and whitespace before that. Something like LintFix("create_before", <function_name_identifier_segment>, [NewlineSegment(), WhitespaceSegment()]), when in fact, from the parse tree, the appropriate fix is actually LintFix("create_before", <expression_segment>, [NewlineSegment(), WhitespaceSegment()]).
So there needs to be a bit of extra logic/consideration to ensure the fix is actually placed in the correct place in the parse tree, rather than just getting the output to appear correct in the simple case. That being said, it is sometimes hard to implement.
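To make the difference concrete, here is a toy comparison of the two anchors. The `Seg` class and the `create_before` and `parent_type_of` helpers are hypothetical stand-ins, not SQLFluff code, and the tree shape is an assumption for illustration. Both placements give the same raw SQL, but the newline lands inside function_name in one case and next to the expression in the other.

```python
# Toy model: same raw output, different tree placement of the inserted newline.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Seg:
    type: str
    raw: str = ""
    children: List["Seg"] = field(default_factory=list)

    def text(self) -> str:
        return self.raw if not self.children else "".join(c.text() for c in self.children)


def create_before(root: Seg, anchor: Seg, new: List[Seg]) -> Seg:
    """Return a copy of `root` with `new` inserted as siblings before `anchor`."""
    out: List[Seg] = []
    for child in root.children:
        if child is anchor:
            out.extend(new)
            out.append(child)
        else:
            out.append(create_before(child, anchor, new))
    return Seg(root.type, root.raw, out)


def parent_type_of(root: Seg, wanted: str) -> str:
    """Type of the first segment (depth-first) that directly holds a `wanted` child."""
    if any(c.type == wanted for c in root.children):
        return root.type
    for c in root.children:
        found = parent_type_of(c, wanted)
        if found:
            return found
    return ""


name_id = Seg("function_name_identifier", "foo")
expr = Seg("expression", children=[
    Seg("function", children=[
        Seg("function_name", children=[name_id]),
        Seg("bracketed", "(x)"),  # kept as a leaf for brevity
    ]),
])
root = Seg("select_clause", children=[Seg("keyword", "SELECT"), Seg("whitespace", " "), expr])
new = [Seg("newline", "\n"), Seg("whitespace", "    ")]

naive = create_before(root, name_id, new)   # anchor on the raw identifier
better = create_before(root, expr, new)     # anchor on the expression

print(naive.text() == better.text())      # True
print(parent_type_of(naive, "newline"))   # function_name
print(parent_type_of(better, "newline"))  # select_clause
```

A rule that only compares the output text would not notice the difference, which is why later rules can be handed a surprising tree.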

A good example of this is the recent PR we did for #1979, in which we detect the linting error at the final code segment of the tree but apply the fix at the FileSegment in order to get the correct parse structure.

^That's basically the cause of these bugs. However, given that it's sometimes hard to create logic to determine the correct fix placement (especially when the parse grammar itself can be inconsistent), I do think that #1012 is an interesting idea.

If you were, for example, to take the messed-up post-fix parse tree structure shown above and re-parse the updated <file_segment>.raw before supplying it to subsequent fix rounds, you would maintain the fixes whilst giving subsequent rules the correct parse tree. (I guess it would be between loops in lint_fix_parsed.)

It might not be the most efficient, but it could be more practical given that we are unlikely to have a completely well-defined grammar anytime soon?
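A rough sketch of the re-parse idea, under heavy assumptions: the "parser" and "rule" below are toys, and `fix_with_reparse` is a hypothetical name, not the real lint_fix_parsed loop. The point is only the control flow: each round starts from a tree freshly parsed from the fixed raw string, so rules never see a tree left in an odd shape by an earlier fix.

```python
# Toy control-flow sketch: re-parse the fixed raw string between fix rounds.
from typing import Callable, List

Tree = List[str]  # toy "parse tree": a flat list of tokens


def parse(raw: str) -> Tree:
    """Toy parser: split into comma tokens and runs of other characters."""
    tokens: Tree = []
    current = ""
    for ch in raw:
        if ch == ",":
            if current:
                tokens.append(current)
            tokens.append(",")
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def space_after_comma(tree: Tree) -> Tree:
    """Toy rule: ensure each comma is followed by a token starting with a space."""
    out: Tree = []
    for i, tok in enumerate(tree):
        out.append(tok)
        nxt = tree[i + 1] if i + 1 < len(tree) else None
        if tok == "," and nxt is not None and not nxt.startswith(" "):
            out.append(" ")
    return out


def fix_with_reparse(raw: str, rules: List[Callable[[Tree], Tree]], max_rounds: int = 10) -> str:
    """Run fix rounds, re-parsing the fixed raw string before each round."""
    for _ in range(max_rounds):
        tree = parse(raw)  # every round starts from a freshly parsed tree
        for rule in rules:
            tree = rule(tree)
        fixed = "".join(tree)
        if fixed == raw:  # nothing changed, so the result is stable
            break
        raw = fixed
    return raw


print(fix_with_reparse("a,b,c", [space_after_comma]))  # a, b, c
```

The cost is one extra parse per round, which is the efficiency trade-off mentioned above.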

I have also been wondering if we could run the relevant match() functions to detect and complain if a rule does something invalid to the parse tree. I think I need to look at some of the bugs more closely to decide if it's reasonable to flag/fix them.

Re-parsing is an alternative -- it should definitely work, but it feels like cheating, and in order to be 100% sure, we'd potentially need to re-parse after every fix (or every rule run)? It seems like it'd be useful to know which rules have these issues so we only re-parse after running a problematic rule.

src/sqlfluff/rules/L008.py

barrywhart · 2021-12-05T19:32:26Z

Looks good! A few small questions, just about ready for merge.

jpy-git · 2021-12-05T20:19:05Z

@barrywhart Implemented the feedback 😄

barrywhart

Looks great! 🎉

jpy-git · 2021-12-05T20:21:03Z

@barrywhart I won't steal your merge this time! 😄

barrywhart · 2021-12-05T20:29:03Z

Whee!! Now my weekend is complete.

Add memory to log previously fixed commas in L008

8ce2eaf

jpy-git added 3 commits December 1, 2021 18:28

Merge branch 'main' into triple_count_l008

3abcd97

Merge branch 'main' into triple_count_l008

19d1c99

Merge branch 'main' into triple_count_l008

eb7ff4b

jpy-git and others added 4 commits December 3, 2021 00:32

Merge branch 'main' into triple_count_l008

fa715c0

Merge branch 'main' into triple_count_l008

fef2925

Add new L008 logic

4a04532

Use raw_segments for most robust fix

460e93e

jpy-git commented Dec 5, 2021

jpy-git and others added 2 commits December 5, 2021 15:02

coverage

66f7d63

Merge branch 'main' into triple_count_l008

dfd2471

jpy-git changed the title from "Add memory to log previously fixed commas in L008" to "L008 refactor" Dec 5, 2021

jpy-git added 2 commits December 5, 2021 15:26

satisfy coverage

8d88408

Extra unit test for coverage of unused method

22678c0

barrywhart reviewed Dec 5, 2021

jpy-git and others added 2 commits December 5, 2021 20:07

Implement review feedback

1adfa61

Merge branch 'main' into triple_count_l008

a19ac2f

barrywhart approved these changes Dec 5, 2021

barrywhart merged commit 0624a84 into sqlfluff:main Dec 5, 2021

jpy-git deleted the triple_count_l008 branch December 5, 2021 20:28
