Performance: Reduce the time complexity of all Faker::Crypto methods #2938

Open
wants to merge 2 commits into
base: main

Contributor

@alextaujenis alextaujenis commented Apr 18, 2024

Motivation / Background

This Pull Request has been created because the Faker::Crypto methods generate far longer random strings than necessary as input to each hashing algorithm. The performance of each method within Faker::Crypto can be improved by up to 80% by matching the length of the random input string to the output size of each hash algorithm.

Additional Information

The MD5 hex digest is a string of 32 characters drawn from "a - f" and "0 - 9", an alphabet of 16 characters. The total number of possible digests is therefore 16^32 (that is, 2^128).

The method Faker::Crypto.md5 computes the MD5 hash of a random string like this:

OpenSSL::Digest::MD5.hexdigest(Lorem.characters)

Digging a little deeper we see that Lorem.characters returns a default number of 255 Alphanumeric characters:

def characters(number: 255, min_alpha: 0, min_numeric: 0)
  Alphanumeric.alphanumeric(number: number, min_alpha: min_alpha, min_numeric: min_numeric)
end

Going further, we find that Faker::Alphanumeric.alphanumeric returns random characters within the range of "a - z" and "0 - 9", an alphabet of 36 characters. The total number of possible strings with this setup is 36^255, which is vastly larger than 16^32 (the total number of possible MD5 digests).
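To put the two sizes on a common scale, both can be expressed in bits. The following is a minimal sketch in plain Ruby (no Faker required); the helper name `bits` is illustrative and not part of the library.

```ruby
# Express a space of alphabet_size**length possibilities in bits.
# `bits` is a hypothetical helper used only for this comparison.
def bits(alphabet_size, length)
  length * Math.log2(alphabet_size)
end

input_bits  = bits(36, 255) # default Lorem.characters input
digest_bits = bits(16, 32)  # MD5 hex digest

puts format('input:  %.1f bits', input_bits)
puts format('digest: %.1f bits', digest_bits)
```

The default input carries roughly ten times as many bits as an MD5 digest can represent.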

Optimization

The number of random characters passed to the MD5 hashing algorithm within Faker::Crypto can be safely reduced from 255 to the smallest length whose input space still exceeds the digest space. That length is the minimum value of x satisfying 36^x > 16^32. With this choice there are still more possible input strings than possible MD5 digests, so an MD5 collision remains more likely than drawing the same (non-unique) string twice from Faker::Lorem.characters.
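The inequality can also be solved directly with logarithms: 36^x > 16^n holds exactly when x > n * log(16) / log(36). A minimal sketch, assuming floating-point precision is sufficient for digest lengths of this size (the method name `min_input_length` is hypothetical):

```ruby
# Smallest integer x with alphabet_size**x > 16**hex_length.
# For the default alphabet of 36 characters, 36**x can never equal a
# power of 16 (36 has a factor of 3), so floor + 1 is the strict minimum.
def min_input_length(hex_length, alphabet_size: 36)
  (hex_length * Math.log(16) / Math.log(alphabet_size)).floor + 1
end

{ MD5: 32, SHA1: 40, SHA256: 64, SHA512: 128 }.each do |name, hex_length|
  puts "#{name}: #{min_input_length(hex_length)}"
end
```

For the four digest lengths used in this PR, this yields 25, 31, 50 and 100, which matches the values suggested here.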

We can find that minimum length for each hash algorithm by running this Ruby solver:

# Hex digest length (in characters) for each algorithm.
algorithms = { MD5: 32, SHA1: 40, SHA256: 64, SHA512: 128 }
algorithms.each do |algorithm, hash_length|
  puts "Algorithm: #{algorithm} - Hash Length: #{hash_length}"
  target = 16**hash_length # number of possible hex digests
  i = 0
  number = nil
  # Increase the input length until the input space exceeds the digest space.
  while number.nil?
    i += 1
    greater_than = 36**i > target
    difference = 36**i - target
    number = i if greater_than
    puts "36^#{i} #{greater_than ? '>' : '<'} 16^#{hash_length} by #{difference}"
  end
  output = "OpenSSL::Digest::#{algorithm}.hexdigest(Lorem.characters(number: #{number})) SUGGESTED"
  puts output, '*' * output.length
end

Which provides this output:

Algorithm: MD5 - Hash Length: 32
36^1 < 16^32 by -340282366920938463463374607431768211420
36^2 < 16^32 by -340282366920938463463374607431768210160
36^3 < 16^32 by -340282366920938463463374607431768164800
36^4 < 16^32 by -340282366920938463463374607431766531840
36^5 < 16^32 by -340282366920938463463374607431707745280
36^6 < 16^32 by -340282366920938463463374607429591429120
36^7 < 16^32 by -340282366920938463463374607353404047360
36^8 < 16^32 by -340282366920938463463374604610658304000
...
36^24 < 16^32 by -317830109213583906223287396307975536640
36^25 > 16^32 by 467998910543825597179764993024768081920
OpenSSL::Digest::MD5.hexdigest(Lorem.characters(number: 25)) SUGGESTED
...

From this output we can see the point at which 36^x > 16^32:

36^24 < 16^32 by -317830109213583906223287396307975536640
36^25 > 16^32 by 467998910543825597179764993024768081920

This shows that Faker::Lorem.characters(number: 24) has 317830109213583906223287396307975536640 fewer possible combinations than the MD5 hash algorithm, while Faker::Lorem.characters(number: 25) has 467998910543825597179764993024768081920 more possible combinations than the MD5 hash algorithm.

We can therefore reduce the number of random characters to 25 for Faker::Crypto.md5 without making the input string the limiting factor for uniqueness. The same calculation gives these values for every algorithm:

OpenSSL::Digest::MD5.hexdigest(Lorem.characters(number: 25))
OpenSSL::Digest::SHA1.hexdigest(Lorem.characters(number: 31))
OpenSSL::Digest::SHA256.hexdigest(Lorem.characters(number: 50))
OpenSSL::Digest::SHA512.hexdigest(Lorem.characters(number: 100))
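
The effect of these lengths can be tried without installing Faker. The sketch below is not the code in this PR: `random_characters` is a hypothetical stand-in for Lorem.characters, and `INPUT_LENGTHS` and `random_digest` are names invented for illustration.

```ruby
require 'openssl'

# Stand-in alphabet for Lorem.characters: a-z and 0-9 (36 characters).
ALPHABET = (('a'..'z').to_a + ('0'..'9').to_a).freeze

# Hypothetical replacement for Lorem.characters(number: n).
def random_characters(number)
  Array.new(number) { ALPHABET.sample }.join
end

# Input lengths taken from the list above.
INPUT_LENGTHS = { MD5: 25, SHA1: 31, SHA256: 50, SHA512: 100 }.freeze

# Hash a random input of the suggested length with the given algorithm.
def random_digest(algorithm)
  input = random_characters(INPUT_LENGTHS.fetch(algorithm))
  OpenSSL::Digest.const_get(algorithm).hexdigest(input)
end

INPUT_LENGTHS.each_key do |algorithm|
  puts "#{algorithm}: #{random_digest(algorithm)}"
end
```

The digest length is fixed by the algorithm, so shortening the input changes only how much random data is generated and hashed per call.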

Performance Benchmark

You can see from the benchmark below that after reducing the input length the Faker::Crypto methods are up to 80% faster, while the Faker::Omniauth methods are up to 50% faster, depending on how heavily they rely on the Faker::Crypto methods.

Performance Benchmark - Faker Gem (April 18, 2024)
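
A comparable measurement can be approximated locally. This is a rough sketch, not the benchmark used for the figures above: it times a hypothetical stand-in generator (`random_characters`) rather than Faker itself, and the helper `time_md5` and the iteration count are assumptions. Absolute timings will vary by machine.

```ruby
require 'openssl'

CHARS = (('a'..'z').to_a + ('0'..'9').to_a).freeze

# Hypothetical stand-in for Lorem.characters(number: n).
def random_characters(number)
  Array.new(number) { CHARS.sample }.join
end

# Seconds spent generating and MD5-hashing `iterations` random strings
# of `length` characters, measured with a monotonic clock.
def time_md5(length, iterations: 2_000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { OpenSSL::Digest::MD5.hexdigest(random_characters(length)) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

[255, 25].each do |length|
  puts format('%3d chars: %.3fs', length, time_md5(length))
end
```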

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug, refactor something, or add a feature.
  • Tests and Rubocop are passing before submitting your proposed changes.
  • Double-check the existing generators documentation to make sure the new generator you want to add doesn't already exist.
  • You've reviewed and followed the Contributing guidelines.

@alextaujenis
Contributor Author

@stefannibrasil

Contributor

@stefannibrasil stefannibrasil left a comment

I think this is great, thank you so much for working on this! Just left one small suggestion.

@@ -15,7 +15,14 @@ class << self
#
# @faker.version 1.6.4
def md5
OpenSSL::Digest::MD5.hexdigest(Lorem.characters)
# The MD5 algorithm will experience a collision much sooner

What do you think of moving those comments to the top of the file? The individual calculations can remain above each method, but since this comment applies to all generators (only the number differs), it makes sense to keep the shared explanation at the top.

3 participants