Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

L008 refactor #2004

Merged
merged 14 commits into from
Dec 5, 2021
Merged

L008 refactor #2004

merged 14 commits into from
Dec 5, 2021

Conversation

jpy-git
Copy link
Contributor

@jpy-git jpy-git commented Nov 30, 2021

Brief summary of the change made

This PR fixes #2001. The column reference following the comma had multiple child segments, therefore each child segment would detect the same comma and re-trigger the rule. I fixed this by using memory to keep a log of which commas have previously been fixed and added a test case to verify the single counting.

Are there any other side effects of this change that we should be aware of?

No

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with python test/generate_parse_fixture_yml.py or by running tox locally).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@codecov
Copy link

codecov bot commented Nov 30, 2021

Codecov Report

Merging #2004 (a19ac2f) into main (2785e84) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main     #2004   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          148       148           
  Lines        10492     10495    +3     
=========================================
+ Hits         10492     10495    +3     
Impacted Files Coverage Δ
src/sqlfluff/core/parser/segments/base.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L008.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2785e84...a19ac2f. Read the comment docs.

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 2, 2021

@barrywhart could you take a look at this please, it uses memory to log fixed commas so we don't double flag them

@barrywhart
Copy link
Member

I can believe that this works, but it doesn't quite feel like a "fix" to me. My instinct is that there's a design flaw in the rule. Is there a way to write the rule so that it only checks for the issue for certain segments, so that by design, it only detects the error once? E.g. the first segment after the comma or the first raw segment after the comma?

In particular, this code seems odd to me: IIUC, rather than processing the segment that was passed by the linter, it looks at another segment related to that one.

Would love your thoughts on this. I can take a closer look if my advice seems too vague or wrongheaded (always a possibility!!).

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

@barrywhart makes sense, let me have a look at a more comprehensive fix and get back to you on this one tomorrow 👍

Copy link
Contributor Author

@jpy-git jpy-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barrywhart this is ready for review now 😄 Hopefully a much more robust implementation of L008

src/sqlfluff/core/parser/segments/base.py Show resolved Hide resolved
Comment on lines +51 to +54
# Raw stack is appropriate as the only segments we can care about are
# comma, whitespace, newline, and comment, which are all raw.
# Using the raw_segments allows us to account for possible unexpected
# parse tree structures resulting from other rule fixes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this is the sort of thing I'm referring to above (from fixing test/fixtures/linter/autofix/bigquery/002_templating/before.sql)
Screenshot 2021-12-05 at 14 17 53

One of the other rules is adding newlines and whitespace within a function_name segment. The approach I've taken in L008 should work regardless of issues like this 👍

Copy link
Member

@barrywhart barrywhart Dec 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you may be seeing an occurrence of #1304. I would really like to get all these rule bugs fixed (I suspect many of them involve misuse of LintFix type edit) and add a linter check to prevent such bugs in the future.

Other related issues:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barrywhart so I think a lot of these issues aren't necessarily an edit thing, you would get a similar issue with create_before or create_after as they all work in a similar way.

Reading through the apply_fixes logic (

def apply_fixes(self, fixes):
), a fix is applied by:

  • Search down the parse tree, at each level appending each segment you find to segment buffer.
  • If you find the anchor segment (of the fix not the result) then either
    edit: add the new segments to the buffer, but not the anchor.
    create_before: add the new segment to the buffer and then the anchor segment
    create_after: add the anchor segment to the buffer and then add the new segment
  • then subsequent segments are added to the buffer.

The key thing to note here is that all fixes segments are inserted at the SAME level as the fix anchor_segment. The reason we seem to get a lot of issue like the one above is some logic (in this case I think it's L016) finds a segment to analyse and determine if a fix needs to be applied and then naively just applies the fix at the same tree level even if not appropriate.

So in this case L016 probably determines that it can split a line by looking at the first raw segment after a space, which is the function_name_identifier, and then inserts the newline and whitespace before that. Something like LintFix("create_before", <function_name_identifier_segment>, [NewlineSegment(), WhitespaceSegment()]), when in fact from the parse tree the appropriate fix is actually LintFix("create_before", <expression_segment>, [NewlineSegment(), WhitespaceSegment()]).
So there needs to be bit of extra logic/consideration to ensure the fix is actually placed in the correct place in the parse tree rather than to just get the output appearing correct in the simple case. That being said it is hard sometimes to implement.

A good example of this is the recent PR we did for #1979, in which we detect the linting error at the final code segment of the tree but apply the fix at the FileSegment in order to get the correct parse structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^That's basically the cause of these bugs, however, given that it's sometimes hard to create logic to determine the correct fix placement (especially when the parse grammar itself can be inconsistent) I do think that #1012 is an interesting idea.

If you were, for example, to take the messed up post-fix parse tree structure shown above and re-parse the updated <file_segment>.raw to then supply to subsequent fix-rounds you would maintain the fixes whilst giving subsequent rules the correct parse tree. (guess it would be between loops in lint_fix_parsed)

It might not be the most efficient but could be more practical given that we are unlikely to have completely well defined grammar anytime soon?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also been wondering if we could run the relevant match() functions to detect and complain if a rule does something invalid to the parse tree. I think I need to look at some of the bugs more closely to decide if it's reasonable to flag/fix them.

Re-parsing is an alternative -- it should definitely work, but it feels like cheating, and in order to be 100% sure, we'd potentially need to re-parse after every fix (or every rule run)? It seems like it'd be useful to know which rules have these issues so we only re-parse after running a problematic rule.

src/sqlfluff/rules/L008.py Outdated Show resolved Hide resolved
@jpy-git jpy-git changed the title Add memory to log previously fixed commas in L008 L008 refactor Dec 5, 2021
src/sqlfluff/rules/L008.py Show resolved Hide resolved
src/sqlfluff/rules/L008.py Show resolved Hide resolved
src/sqlfluff/rules/L008.py Outdated Show resolved Hide resolved
src/sqlfluff/rules/L008.py Outdated Show resolved Hide resolved
@barrywhart
Copy link
Member

Looks good! A few small questions, just about ready for merge.

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 5, 2021

@barrywhart Implemented the feedback 😄

Copy link
Member

@barrywhart barrywhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 🎉

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 5, 2021

@barrywhart I won't steal your merge this time! 😄

@barrywhart barrywhart merged commit 0624a84 into sqlfluff:main Dec 5, 2021
@jpy-git jpy-git deleted the triple_count_l008 branch December 5, 2021 20:28
@barrywhart
Copy link
Member

Whee!! Now my weekend is complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Combination of L008 and L019 causes L008 triple count
2 participants