Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

Open
anadon opened this issue Jul 26, 2023 · 8 comments
Labels
Enhancement ✨ Improvement to a component Needs design proposal 🔒 This is a huge feature, some discussion should happen before a PR is proposed Needs specification 🔐 Accepted as a potential improvement, and needs to specify edge cases, message names, etc.

Comments

@anadon
Copy link

anadon commented Jul 26, 2023

Current problem

Currently, the following is not caught as duplicate code. This is sanitized from a production codebase I work on.

if cond:
    stmt_1
    ...
    stmt_n
    return
stmt_1
...
stmt_n

Desired solution

I would expect R0801 to be flagged for these two code segments.

Additional context

Distinct but possibly related to #1457 . This seems to be an issue with the fact that for comparing code fragments, it does so using the absolute indentation level, and not converting each code segment to use a relative indentation level. Now, in order to make this change it does cause a considerable increase in algorithmic complexity and would require the current hash based approach to actually use a number of different hashes per line, one per indentation level. However, I have found it much easier to argue for better coding when a tool says that code should be changed than when I say it.

@anadon anadon added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Jul 26, 2023
@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component Needs design proposal 🔒 This is a huge feature, some discussion should happen before a PR is proposed Needs specification 🔐 Accepted as a potential improvement, and needs to specify edge cases, message names, etc. and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Jul 26, 2023
@Pierre-Sassoulas
Copy link
Member

Using a hash without the leading indentation spaces could lead to false positives, multiplying the number of check on a line by the number of indentation level would be disastrous performance wise especially on large code base where the base algo already compare every line to every other line which is inherentely costly and scale poorly.

I added the design proposal needed label because I think the enhancement makes sense but details matter and jumping to implementation on this one will go especially badly. Imo a state of the art of what other duplicate finders do is required, probably a scientific litterature review too.

@anadon
Copy link
Author

anadon commented Jul 26, 2023 via email

@anadon
Copy link
Author

anadon commented Jul 26, 2023 via email

@Pierre-Sassoulas
Copy link
Member

Could you give an example of what you mean by relative indentation @anadon ? (Also note that we have option to ignore some kind of things like docstrings, function signatures, imports, etc.)

@anadon
Copy link
Author

anadon commented Jul 26, 2023

With regards to my initial comment,

0: if cond:
1:     stmt_1
2:     ...
3:     stmt_n
4:     return
5: stmt_1
6: ...
7: stmt_n

0 -> if cond:
1 -> stmt_1, stmt_1
2 -> ..., ...
3 -> stmt_n, stmt_n
5 -> stmt_1
6 -> ...
7 -> stmt_n

lines 1 to 3 at depth 1 match lines 5 to 7 at depth 0. If these were to be added to a suffix tree or suffix array in a tagged manner, all matches could be detected in O(n) time or O(nlog(n)) time and O(n) space. It just so happens that O(nlog(n)) time implementations tend to be faster for reasons I've not been able to study to sufficient depth. I quick search reveals that the project landscape has changed considerable, so library selection would require some research and testing.

@anadon
Copy link
Author

anadon commented Jul 26, 2023

After work I'll give a bigger and more clear example of input and how it could be treated.

@anadon
Copy link
Author

anadon commented Jul 26, 2023

VAR = None

def func_1(*args):
    if not args: 
        stmt_1
        ...
        stmt_n
        return
    else:
        ...
        if args[0]:
            ...
            stmt_1
            ...
            stmt_n
    ...
    stmt_1
    ...
    stmt_n
    ...
    return

stmt_1
...
stmt_n

Can be taken as

VAR = None

def func_1(*args):
    if not args: 
        stmt_1
        ...
        stmt_n
        return
    else:
        ...
        if args[0]:
            ...
            stmt_1
            ...
            stmt_n
    ...
    stmt_1
    ...
    stmt_n
    ...
    return

stmt_1
...
stmt_n
if not args: 
    stmt_1
    ...
    stmt_n
    return
else:
    ...
    if args[0]:
        ...
        stmt_1
        ...
        stmt_n
...
stmt_1
...
stmt_n
...
return
stmt_1
...
stmt_n
return
...
if args[0]:
    ...
    stmt_1
    ...
    stmt_n
...
stmt_1
...
stmt_n

As can be seen by example, each block able to share the same level of contiguous indentation is represented as that continuous block in the same way the top level file is. These kind of virtual/mapped files can then be used directly to detect repeated code blocks independent of absolute level of indentation while preserving matching blocks with respect to their relative indentation. This does increase resource usage, but I think it is worth it given how much more effective duplicate code detection would be.

@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Jul 30, 2023

If these were to be added to a suffix tree or suffix array in a tagged manner, all matches could be detected in O(n) time or O(nlog(n)) time and O(n) space

Without looking too deep into the the practical aspect (and in particular the python specific treatment we do) this sounds acceptable, however this also sounds like a complete overhaul that is going to be a big time commitment (and perf benchmark are lacking for the duplicate-code check so it's also going to involves a lot of testing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component Needs design proposal 🔒 This is a huge feature, some discussion should happen before a PR is proposed Needs specification 🔐 Accepted as a potential improvement, and needs to specify edge cases, message names, etc.
Projects
None yet
Development

No branches or pull requests

2 participants