Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faker-ruby has too many generators to maintain #2689

Open
stefannibrasil opened this issue Jan 24, 2023 · 12 comments
Open

Faker-ruby has too many generators to maintain #2689

stefannibrasil opened this issue Jan 24, 2023 · 12 comments

Comments

@stefannibrasil
Copy link
Contributor

faker-ruby has a ton of generators. Although it can be an easy way for new contributors to help, it adds too much work for maintainers.

Here are some problems that I've encountered for the past few months:
a) generators relying on 3rd party APIs. Those services get shut down, or could even return unexpected values. It requires us to deprecate and remove generators often. We should avoid relying on those services as as much as we can.
b) generators being added without having a measurement of its need. It's hard to know which generators are the most popular and which ones aren't used at all. But some of them are not worth of being maintained because they are so niche.
c) there are so many generators, it's hard to keep track of them, and know about them. I've spent time reviewing PRs only to realize there is one already, and it wasn't straightforward to find it.
d) there are lots of generators, but the translators are not up to date. In my perspective, it makes sense for us to focus on translating the generators that we want to keep.

This is not a popular issue, I'd assume. But to keep faker in a stable manner, I'd propose some changes related to the points above:

  • delete some generators, and create a guideline that helps maintainers decide when to add a new generator. It's hard to measure which generators are not being used, but we can scan the ones that are clearly too niche to justify maintain them. For now, add to the Contributing guides that new generators need to have an Issue first instead of already opening a PR. And, require more details from the contributor that there isn't a generator for it already.
  • decrease the number of generators that rely on 3rd party APIs. Ideally, only a handful of them.
  • create a guideline for new generators: have a rubric to decide when to merge a new one.
  • identify the generators we don't want to keep anymore, deprecate them, and release another major version after all of that.
  • open issues for translations, and other locales improvements.

The reason why I'm opening this issue is because working on new generators is where most of us spend our time. It's time I'd like to be using for solving critical bugs, and stabilizing the gem with the generators we want to keep, with the translations up to date.

What are your thoughts? How has your experience been? What do you agree, or disagree?

cc @vbrazo @stympy @koic @thdaraujo @psibi @Zeragamba @connorshea @Newman101

@psibi
Copy link
Member

psibi commented Jan 24, 2023

I think there are two types of generators we have:

  • Yaml based ones which is self contained.
  • Third party one which relies on some web service. Faker developers don't have much control over it.

My opinion on the generators: We should have as many yaml generators as possible. Infact, I try to review most of the new Yaml based generators PR that are added. I also find that it increases the value of faker-ruby.

Regarding third party API, I have no strong opinions and I don't mind removing them since we don't have control over them anyway. Historically, these web API's have gone down which has affected faker-ruby and it's stability.

identify the generators we don't want to keep anymore, deprecate them, and release another major version after all of that.

I very strongly believe we should not remove any yaml based generators. Infact, the Haskell package which I have authored (https://hackage.haskell.org/package/fakedata) depends on faker for it and removing any yaml would be a breaking change there.

@Zeragamba
Copy link
Contributor

We can also potentially revisit the idea of splitting Faker up into multiple gems, which was explored a bit in #1539

@stefannibrasil
Copy link
Contributor Author

We should have as many yaml generators as possible. Infact, I try to review most of the new Yaml based generators PR that are added. I also find that it increases the value of faker-ruby.

I still think we could have a rubric of some sort to evaluate a generator before adding it. Simply because the generator is Yaml based does not seem like a strong enough point to accept any PR that is adding a generator. To prove my point, the more generators there are, the harder it is to know if one exists already, especially because sometimes the names are not so easy to search for. It's easy to end up with duplicated generators, for example.

Regarding third party API, I have no strong opinions and I don't mind removing them since we don't have control over them anyway. Historically, these web API's have gone down which has affected faker-ruby and it's stability.

I agree. I am going to open a new issue with a list of them for to have a sense of how many there are.

I very strongly believe we should not remove any yaml based generators. In fact, the Haskell package which I have authored (https://hackage.haskell.org/package/fakedata) depends on faker for it and removing any yaml would be a breaking change there.

Fair. As a heads up, even if we want to remove yaml based generators, the Changing logs clarify the breaking changes before bumping the new version. That said, it seems a good starting point is to list the generators that rely on external services for now.

I'm not suggesting we remove all generators. My point with this discussion is to bring awareness to the fact that simply adding any new generators without a rubric, and focusing only on generators, makes it harder to improve the existing ones.

@stefannibrasil
Copy link
Contributor Author

We can also potentially revisit the idea of splitting Faker up into multiple gems, which was explored a bit in #1539

I believe @vbrazo has shared that with me. It looks like the discussion there didn't go anywhere, and it wasn't clear to me what difference it would make in maintaining faker.

I'm thinking of starting with the generators that rely on 3rd parties, and see how it goes from there. What do you think?

@andycroll
Copy link

andycroll commented Feb 28, 2023

Baring in mind the speed issues reported in #2719. Is there an appetite for an approach where you might have a minimal (or zero) faker instantiation and then bring in generators as required by the end user?

This would be similar to the a configuration pattern used in Rails.

i.e.

require "rails/all"

vs.

require "rails"
# Pick the frameworks you want:
require "active_model/railtie"
require "active_job/railtie"
require "active_record/railtie"
# require "active_storage/engine"
require "action_controller/railtie"
require "action_mailer/railtie"
# require "action_mailbox/engine"
# require "action_text/engine"
require "action_view/railtie"
# require "action_cable/engine"
require "rails/test_unit/railtie"

..in the config/application.rb.

Folks who just want to "require & forget" can use faker as is, and those who prefer to specify can do so.

Something like...

gem "faker", require: "faker/minimal"

Then some sort of initialiser.

Thoughts?

@stoivo
Copy link

stoivo commented Feb 28, 2023

hi @andycroll (I watch a few of you talks on youtube :D, thanks). That look like a smart suggestion to configurable this way. From my testing requiring all the files is just a small postion of the time(100ms) to load. Still ig could be speedup, and it creates more constans then whay you might want. Most of the time is spent parsing yaml(1400ms), so the biggest gain would be if we could reduce the amount of yaml being parsed.

I don't know if it make any sence to try to eager load and parse all yaml files if the gem is eager loaded. The alternative is to parse yaml files needed on demand and cache in memory. It would make it alot simplier to only load on demand since then we don't need to load all locales.

Just want to have it said: I think there is very little to gain by splitting it up into multiple gems.
Also, I am not the maintainer so I dont' feel like mo opinion should have that much weight.

@psibi
Copy link
Member

psibi commented Feb 28, 2023

I don't know if it make any sence to try to eager load and parse all yaml files if the gem is eager loaded. The alternative is to parse yaml files needed on demand and cache in memory. It would make it alot simplier to only load on demand since then we don't need to load all locales.

This is what I do in fakedata (The haskell version of faker, but uses faker's yaml file directly) - Do it on demand and cache it in memory so that the same yaml file is never read again. I wrote a more detailed report about it here: https://psibi.in/posts/2019-09-23-fakedata-perf.html

@jremes-foss
Copy link
Contributor

Some YAML-based generators may be sort of generators which someone may find redundant. But some other people may actually find them useful. What I would suggest is to look at each generator and think about their use case. If there is actual practical use case, then the generator can be kept.

What comes to generators which are relying on third party API's, I would strongly suggest we deprecate those. Reason is simple: Faker should be as much self-contained as possible. Faker developers (and those using Faker) should have full control of the library. Therefore, I don't view code relying on third party API's, or any other web services for that matter, a good thing.

@m-vz
Copy link

m-vz commented Mar 31, 2023

I have just begun using Faker, pitching in here on the off-chance that the perspective of someone new to the gem is useful in this discussion.

When browsing through the generators available, I was surprised to find extremely niche data like Myst quotes and World Cup stadiums. In my experience, gems that provide some form of content are usually split into a core where the main functionality is found and other gems that provide said content. (Take i18n for example, which only adds functionality while other gems like rails-i18n, devise-i18n, etc. provide the actual translation data.)

My blind guess is that the vast majority of people using Faker only use a very small subset of its generators to create usernames, emails, addresses and other common test data. However, as @jremes-foss mentioned, noone can know for certain which generators that look redundant at first may actually be the very reason why someone is using Faker.

Looking over the generators I would say there are a few main categories:

  • User Data (addresses, names, phone numbers, ...)
  • Cultural (books, movies, tv shows, games, ...)
  • Things – couldn't think of a good name for this (coffee, electrical components, vehicle, ...)
  • Places (mountains, nations, universities, restaurants, ...)
  • Others?

I understand how splitting up faker into multiple gems adds complexity and I agree that one gem per namespace as discussed in #1539 is probably overkill. However, let's say categories of less common generators were moved into gems like faker-culture or faker-places. This would free up the discussion in the main repo to be purely about faker functionality and "core" generators (the ones most likely used for simple test data like Faker::Internet and Faker::Name) while new generator proposals and translations could be managed in their specific content repos.

Additionally, only having a few of the most common generators in the core faker gem would almost certainly improve the performance of using faker (#2719) for most people. And people that need niche cultural content would at least know that they are going to include a lot of additional data to load each test when they include faker-culture.

@m-vz
Copy link

m-vz commented Mar 31, 2023

There is of course also the option of removing all generators from the core repo and having it depend on a base content gem like faker-common, while extra content could be included from optional gems like the ones above.

A base content gem like this could be useful as a template for anyone who wants to create additional generator gems.

@robyurkowski
Copy link

This issue hasn't seen any update for a long while, but I'd like to add two cents not as a maintainer but as a long-time user of the library in favour of breaking up the gem or at very least namespacing things a bit more frequently.

Every time I use Faker, I wind up coming to the list of generators and it's a bit of a search to find what I need. Sometimes, the generators that produce the data that I need don't map directly to name (looking at you in particular, Faker::Internet.email), and in those cases, finding the right generator is more difficult than desired.

Moreover, I feel like the discoverability of important and highly useful generators is significantly harmed by the presence of so many others without some basic level of categorization. (If there is such a categorized list, I would be in your debt if you'd link it to me.)

I understand that taxonomy is difficult, but even a poor first attempt would make it significantly easier to find the right data (and to understand what's available).

Thank you all for your ongoing work on the project. :)

@stefannibrasil
Copy link
Contributor Author

For reference: in #2852 we verified there are no generators relying on external services.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants