Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow cops to share context between files #7968

Closed
ajorgensen opened this issue May 13, 2020 · 21 comments
Closed

Allow cops to share context between files #7968

ajorgensen opened this issue May 13, 2020 · 21 comments
Assignees

Comments

@ajorgensen
Copy link

ajorgensen commented May 13, 2020

Is your feature request related to a problem? Please describe.

tl;dr It would be great to have a way to have a cop run across all processed sources to compare between files.

On a project I work on we make use of rails engines to separate different parts of our codebase into smaller components. As such we have a use case where we need to make sure there are no duplicate controller names in any of the rails engines so for example two separate rails engines should not be allowed to create the duplicate class name FooController.

Initially this was solved with a class level cache which would store all of the class names that had been seen and add an offense in there were any duplicates. This was a bit of a hack but since Rubocop creates a new cop instance per processed source this was the only way to share information between processed source files. When we added rubocop-daemon this became a problem because the cache now persists between runs due to the long running process. This leads to false positives because the cache will likely already contain all of the class names that we saw on the last run.

I know this isn't necessarily Rubocop's issue since its really the long running process via rubocop-daemon that creates the problem but I think that the issue can only really be solved within Rubocop itself.

Describe the solution you'd like

There are three solutions I can think of. The easiest might be to provide a run identifier that can uniquely identify each different rubocop run. This would allow the user to check this ID during cop execution and determine if they want to clear the cache wherever it is stored. This might be the easiest and is essentially how I solved it for our cop (i'll explain my current solution further down).

A second solution would be to have a context object that is created at the beginning of the run and allow cops to store and share context between each of the cops. In my case, this would allow my cop to store the different class names which can be checked in the next instance of the cop for the next processed file. This would probably be the most flexible and fit into the current model.

Thirdly would be to change the current contract of a cop to allow it to be long lived. This would be to create all of the cop instances up front and then use those instance for each processed file. This would also likely require the addition of a few hooks that can be used by cops before_run, after_run, etc to allow the cop to manually clean up any state that it has.

Describe alternatives you've considered

I have currently worked around the issue essentially implementing idea #1. In the cop we still have the class level cache but we also keep track of the object_id of the Rubocop::Runner instance.

def _run_id
  ObjectSpace.each_object(RuboCop::Runner).to_a.last&.object_id || 1
end

This instance is long lived for the life of the run and is recreated in rubocop-daemon for each run. This gives us a unique identifier that can be checked against for each processed file. If the object_id has changed then we know its a new run and can clear the cache.

Additional context

None

@ajorgensen ajorgensen changed the title Compare Across Files Compare Across Processed Sources May 13, 2020
@ajorgensen ajorgensen changed the title Compare Across Processed Sources Allow cops to share context between files May 13, 2020
@marcandre
Copy link
Contributor

marcandre commented May 20, 2020

Easiest to implement (for RuboCop and you) would probably be class level callbacks that any Cop class can optionally define (e.g. def MyCop.before_run and def MyCop.after_run).

An easier alternative to your _run_id might be to monkey patch Runner#initialize and call your cache clearing method.

@ajorgensen
Copy link
Author

@marcandre that's an interesting idea, I always get a little nervous amount monkey patching given the wild west ruby used to be back in the day 😄. Those class level callback are a cool idea I hadn't considered but as you pointed out that would probably be one of the easier solutions to implement.

I know this is a weird feature request since its not technically Rubocop's issue since without rubocop-daemon it works as expected. The solution I posted to look for the last instance of the Runner instance seems to work fairly well, although having to call to_a on an enumerable always makes me a little nervous but what can you do.

I was actually a bit surprised I couldn't find any other examples of Cops that look for validations across files, I sort of would have thought this use case would be more common.

If there is interest beyond my use case I'd be happy to put together an implementation of the idea you proposed, if not I'm also happy to close this and let it be a point of reference for someone else trying to do something similar.

@tas50
Copy link
Contributor

tas50 commented May 20, 2020

This is something that would be incredibly useful for the Cookstyle project which uses the RuboCop engine. We're analyzing Chef cookbooks and we've been pretty successful at it with the isolated file design of RuboCop. There's a lot of things that would be really great to build out that need to take into consideration the entire cookbook and therefore need a way to check the ast of multiple files. I'm currently walking the directory tree to find our metadata file and then reparsing that, but it's pretty slow to parse that file over and over again. Having some sort of way to grab that file's parsed ast out of cache and use a matcher against it would be really amazing.

@marcandre
Copy link
Contributor

I'm slaving on a refactoring of the add_offense / autocorrect API in #7868 and I will most likely introduce Cop#source_buffer=. Currently, some cops are simply setting @source_buffer directly. I will have a legacy support mode (on by default), cop writers will have to specify explicitly if they use the newer API. Since existing cops might rely somehow on the instances being recreated for each source file, maybe the legacy mode could maintain that while the modern API would imply the notion that a Cop is created once for the full run, with multiple calls to Cop#source_buffer= (and possible a dedicated callback?).
This would make it easy to cache stuff accross files (and automatically not between runs even with the daemon).

@bbatsov how does instances of cops being reused for different files sound like? That would impact only Cops what explicitly sign in to API I'm working on

@ajorgensen
Copy link
Author

@tas50 super interesting use case, I knew I couldn't have been the only one who had a use case that required looking across source files. I'm not sure what the most extensible solution would be here, the class level callbacks would likely only satisfy my use case and help with rubocop-daemon type environments. If there's anything I can do to help with planning/implementation of a proposed solution I'd be happy to contribute.

@tas50
Copy link
Contributor

tas50 commented May 21, 2020

I think I've mostly hit the wall with what I can do one file at a time so I'm thinking about what I can achieve with helpers and ast caching at this point. Give me some more time and I'm sure I'll have more opinions on how this could potentially look in a perfect world. I have 190 cops for Chef now that we ship to all our customers. It's been a ton of fun to write these. Great platform overall and I think I can bend it to achieve the next phase of what I need.

@marcandre
Copy link
Contributor

My PR now includes modern Cops being persisted accross files. The callbacks (now named on_begin_walk and on_end_walk) would allow you to store the current processed source, e.g.

def initialize(*)
  @processed = {}
  super
end

def on_end_walk
  @processed[processed_source.path] = processed_source
end

It's not clear if this persistence is on for all modern cops (inheriting from Cop::Base instead of Cop::Cop) or only opt-in. Now that I think of it, opt-in might simply be easier as persistence might be more trouble than boon for most cops.

@bbatsov
Copy link
Collaborator

bbatsov commented Jun 6, 2020

Makes sense to me. A bit of historical context - originally we went for the share-nothing approach mostly because in Ruby you can't really make much meaningful cross-file analysis - you don't know what code is loaded in the file you're inspecting (unlike in a language like Python, Java or Clojure), you don't know in which order the source files were loaded, etc. That meant other than looking for code duplications, there wasn't much potential for cross-file analysis and so we focused on analyzing files in isolation, which as a bonus was simpler to implement.

@marcandre
Copy link
Contributor

Given that the majority of cops (ours or custom) don't need to keep context and that it is difficult to test that a cop is clearing it's cache (if any), I'm thinking it would be best that persistence be opt-in, say:

def self.multiple_source_support?
  true
end

Otherwise cops get re-created as before. Sounds good?

Note: our mixins should still support it and clear their cache though. As much as my Cache helper is cute, there's a bug in JRuby that makes it unusable for mixins, so I'll get rid of it.

@pirj
Copy link
Member

pirj commented Aug 3, 2020

use case where we need to make sure there are no duplicate controller names in any of the rails engines

Let me ask you a couple of questions @ajorgensen

Do you namespace your engines?
Do you use isolate_namespace?
What happens if you extract an engine to a gem, how would you make sure the new engine from your monolith doesn't overlap with it?

Are there any cops that would benefit from this feature?

@ajorgensen
Copy link
Author

Do you namespace your engines?
In general the engines are namespaced but there are some cases where the classes are at the top level.

Do you use isolate_namespace?
Looks like 2 / ~150 of the engines use isolate_namespace

What happens if you extract an engine to a gem, how would you make sure the new engine from your monolith doesn't overlap with it?
My understanding is the rubocop definition to detect duplicate class names (the motivation for this feature request) checks to make sure there are no collisions. We have ~25 custom cops that do a variety of things.

Are there any cops that would benefit from this feature?
I'm sorry im not sure what you mean by this question. In our use case, the checking for duplicate class names across ruby files would benefit from the ability to keep context between cop instances. I'm not sure I have another specific use case off the top of my head though.

@pirj
Copy link
Member

pirj commented Aug 3, 2020

Do you use isolate_namespace?

Looks like 2 / ~150 of the engines use isolate_namespace

I could find more than that https://github.com/search?l=Ruby&q=isolate_namespace&type=Code
One of the first references a problem related to it:

# isolate_namespace is causing problems accessing the app's url helpers when the app
# mixes in methods to a controller.
#   <%= logout_path %> <-- this will crash, no url helpers are accessible

but I can't tell how fair is this though.

What happens if you extract an engine to a gem, how would you make sure the new engine from your monolith doesn't overlap with it?

Let me reiterate this. What if you have gem 'myengine' in your Gemfile, how would you expect RuboCop to find name collisions in this case?

My understanding is the rubocop definition to detect duplicate class names

RuboCop is a static code analyzer.
There's nothing that prevents you to define app/models/user.rb and app/workflows/user.rb defining the same User class, where one would just reopen it and throw potentially overlapping methods into it. No Rails engines involved. I might be mistaken, but this doesn't seem like a task for RuboCop to me.

Are there any cops that would benefit from this feature?

I'm not sure I have another specific use case off the top of my head though.

That was the answer I was hoping to get, thanks.

@marcandre
Copy link
Contributor

marcandre commented Aug 3, 2020

Are there any cops that would benefit from this feature?

Seems to me that some cops looking for duplications could definitely benefit from it. Duplicated spec, duplicated constant, duplicated method, duplicated algorithm, ...

@pirj
Copy link
Member

pirj commented Aug 3, 2020

Duplicated spec

A spec for the same class in a different folder?

duplicated constant

rubocop-rubycw doesn't handle that, but there's a runtime warning:

1.rb:3: warning: already initialized constant A::X
1.rb:2: warning: previous definition of X was here

duplicated method

ruby -cw doesn't seem to care about that even in the scope of a single file. Keeping in mind method definitions in conditionals, define_method with interpolated method name, undef this seems like a very complicated and error-prone task. Not to say when the source is spread over.

duplicated algorithm

https://github.com/seattlerb/flay

@ajorgensen
Copy link
Author

ajorgensen commented Aug 3, 2020

@pirj Im not sure I understand your point. In the codebase that I am working on, and the motivation for this feature request, we have an explicit need for finding duplicate class names across many rails engines. Im sorry if I am misunderstanding your intention but I am not advocating for a general purpose duplicate class finder to be built into Rubocop, simply the ability to share context between cop instance during a run to allow any cross file type analysis.

Rubocop today does not support comparing across different files in general as it has a "share nothing" model for the cops at runtime, which totally makes sense and makes the design much easier to reason about. For the custom cop we wrote for the codebase I work on we hacked around this limitation by using a class variable but it comes with its drawbacks specifically when attempting to cache cop runs to speed up execution, as I've mentioned this is more of a rubocop-daemon problem not Rubocop itself but I think it can really only be solved at the Rubocop level by providing a mechanism to share context between cop instances.

@marcandre
Copy link
Contributor

Duplicated spec

A spec for the same class in a different folder?

Could be, or testing the same thing (say if there's a shared_example that should be extracted), or the same let() {} appearing more than X times, but I guess that all falls under the more genera duplicated method/algorithm.

duplicated constant

rubocop-rubycw doesn't handle that, but there's a runtime warning:

1.rb:3: warning: already initialized constant A::X
1.rb:2: warning: previous definition of X was here

Sorry, I wasn't clear, I meant duplicated values, say the same %[list of identical keywords that could be factorized]

duplicated method

ruby -cw doesn't seem to care about that even in the scope of a single file. Keeping in mind method definitions in conditionals, define_method with interpolated method name, undef this seems like a very complicated and error-prone task. Not to say when the source is spread over.

duplicated algorithm

https://github.com/seattlerb/flay

Exactly, for flay-like processing...

@pirj
Copy link
Member

pirj commented Aug 3, 2020

What I mean is that bending RuboCop to do cross-file checks may not overweigh the added complexity given that apart from this single use case there are other tools that do this job better (flay) or even no other dedicated tool dares to do that.

Keeping in mind that Rails engines can be isolated out of the box if written correctly, I would say a cop in rubocop-rails that checks for the presence of isolate_namespace and defined namespaces for engine code would do just fine.

@pirj
Copy link
Member

pirj commented Aug 3, 2020

I'm not trying to shot down the idea, just not too convinced that it is really worth it.

@marcandre
Copy link
Contributor

What I mean is that bending RuboCop to do cross-file checks may not overweigh the added complexity

It's basically done, I just have to test it and close this issue, and it really wasn't complex.

@dvandersluis
Copy link
Member

@marcandre were you still working on this?

@marcandre
Copy link
Contributor

Totally slipped my mind, and I don't intend to, so I'm closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants