[Proposal] Rarity score from RegEx #238

Open
nodtem66 opened this issue Dec 2, 2021 · 2 comments

Comments

@nodtem66
Contributor

nodtem66 commented Dec 2, 2021

Is your feature request related to a problem? Please describe.
Rarity is used to sort PyWhat's results.
However, it feels like a subjective value with no formal way to define it.
Currently, I look at the rarity of neighboring RegExps, go with my gut, and wait for someone to confirm or reject it.

This is the current definition of rarity on the wiki page:

rarity (float between 1.0 and 0.0): How unlikely is it to be a false-positive? Think about how big of a chance there is of something completely different to match your regex. Choose 1 for very unlikely, 0 for very likely.

Some tips on how to pick rarity:

1 - contains a word that is unique to it
0.7 - matches to a specific pattern and characters
0.5 - mostly matching to specific characters
0.3 - pretty broad, has only a few specific characters
0.2 - broad, almost no specific characters
0 - matches to almost everything

I need some way to calculate the rarity of RegExps or tokens.

Describe the solution you'd like
A deterministic way to calculate rarity.

Describe alternatives you've considered
N/A (I can't figure out the alternatives)

Additional context

  • I use Shannon entropy, Hamming distance, Levenshtein distance, and a custom script that parses the RegEx.
  • See Google Colab
  • Preliminary results:
+----------------------------------------------------------+--------+--------+---------+----------------+-----------------------+-----------------------+------------------------+------------------+
| Name                                                     | Rarity | Length | Entropy | Entropy/Length |  Levenshtein distance |       Fuzz ratio      | Hamming distance ratio |     Regexp Score |
+----------------------------------------------------------+--------+--------+---------+----------------+-----------------------+-----------------------+------------------------+------------------+
| Note                                                     |        |        |         |                | (0=Different, 1=Same) | (0=Different, 1=Same) | (0=Same, 1=Different)  | High=High Rarity |
| PGP Public Key                                           |   1    |    125 |   5.016 |       0.040131 |          0.66         |          0.54         |          0.65          |            49.12 |
| PGP Private Key                                          |   1    |    125 |   5.062 |       0.040495 |          0.72         |          0.62         |          0.64          |            51.12 |
| SSH RSA Public Key                                       |   1    |     16 |   3.578 |       0.223614 |          0.53         |          0.58         |          0.47          |             6.05 |
| PEM-formatted Private Key                                |   1    |     68 |   3.617 |       0.053192 |          0.84         |          0.79         |          0.75          |            37.28 |
| SSH ECDSA Public Key                                     |   1    |     28 |   4.236 |       0.151283 |          0.63         |          0.68         |          0.37          |            13.27 |
| SSH ED25519 Public Key                                   |   1    |     18 |   3.837 |       0.213144 |          0.66         |          0.72         |          0.31          |             6.05 |
| Access-Control-Allow-Header                              |   1    |     23 |   3.762 |       0.163577 |          0.96         |          0.96         |          0.04          |            18.05 |
| TryHackMe Flag Format                                    |   1    |     20 |   4.022 |       0.201096 |          0.53         |          0.55         |          0.56          |             6.02 |
| HackTheBox Flag Format                                   |   1    |     21 |   4.107 |       0.195553 |          0.4          |          0.44         |          0.79          |             6.52 |
| Capture The Flag (CTF) Flag                              |   1    |      6 |   2.585 |       0.430827 |          0.57         |          0.63         |          0.6           |             3.69 |
| YouTube Video                                            |   1    |     27 |   4.078 |       0.151053 |          0.58         |          0.61         |          0.79          |             16.6 |
| Bitcoin Cash (BCH) Wallet Address                        |   1    |     42 |   4.529 |        0.10783 |          0.24         |          0.26         |          0.97          |             8.07 |
| Heroku API Key                                           |   1    |     42 |   4.148 |        0.09876 |          0.43         |          0.43         |          0.78          |             6.15 |
| Slack API Key                                            |   1    |     76 |    4.48 |       0.058948 |          0.38         |          0.38         |          0.84          |             3.16 |
| Slack Webhook                                            |   1    |     78 |   5.109 |       0.065502 |          0.52         |          0.52         |          0.57          |            28.14 |
| Amazon Web Services Simple Storage (AWS S3) URL          |   1    |     28 |   4.155 |       0.148394 |          0.59         |          0.65         |          0.89          |            13.24 |
| Amazon Web Services Simple Storage (AWS S3) Internal URL |   1    |     16 |   3.578 |       0.223614 |          0.35         |          0.39         |          0.68          |             1.25 |
| Square Application Secret                                |   1    |     29 |    4.28 |       0.147594 |          0.24         |          0.26         |          0.87          |             7.27 |
| Square Access Token                                      |   1    |     64 |   5.254 |       0.082086 |          0.18         |          0.22         |          0.96          |             5.06 |
| Stripe API Key                                           |   1    |     32 |     4.5 |       0.140625 |          0.35         |          0.35         |          0.76          |             5.06 |
| GitHub Access Token                                      |   1    |     25 |   3.075 |       0.123003 |          0.54         |          0.62         |          0.95          |             9.04 |
| Amazon Resource Name (ARN)                               |   1    |     38 |   4.468 |       0.117569 |          0.27         |          0.28         |          0.87          |             3.11 |
| Facebook Secret Key                                      |   1    |     36 |    3.65 |        0.10138 |          0.34         |          0.35         |          0.9           |             5.08 |
| Facebook Client ID                                       |   1    |     30 |   4.257 |       0.141885 |          0.41         |          0.42         |          0.82          |             5.07 |
| Twitter Secret API Key                                   |   1    |     44 |   4.497 |       0.102195 |          0.35         |          0.35         |          0.81          |             7.09 |
| Twitter Client ID                                        |   1    |     35 |   4.536 |       0.129608 |          0.34         |          0.34         |          0.76          |             7.08 |
| Node Package Manager (NPM) Token                         |   1    |     40 |   4.796 |       0.119911 |          0.25         |          0.25         |          0.89          |             3.06 |
| GitHub Personal Access Token                             |   1    |     40 |   4.753 |       0.118826 |          0.25         |          0.25         |          0.89          |             3.06 |
| GitHub OAuth Access Token                                |   1    |     40 |   4.715 |       0.117883 |          0.25         |          0.25         |          0.89          |             3.06 |
| GitHub App Token                                         |   1    |     40 |   4.765 |       0.119133 |          0.24         |          0.24         |          0.91          |             2.07 |
| GitHub Refresh Token                                     |   1    |     12 |   3.585 |       0.298747 |          0.42         |          0.44         |          0.6           |             3.02 |
| LinkedIn Client ID                                       |   1    |     20 |   3.884 |       0.194209 |          0.46         |          0.48         |          0.58          |             8.04 |
| LinkedIn Secret Key                                      |   1    |     24 |   4.252 |       0.177151 |          0.44         |          0.46         |          0.65          |             8.05 |
| Stripe Restricted API Token                              |   1    |     32 |   4.625 |       0.144531 |          0.36         |          0.36         |          0.74          |             6.05 |
| Stripe Standard API Token                                |   1    |     32 |   4.562 |       0.142578 |          0.37         |          0.37         |          0.73          |             6.05 |
| Square OAuth Token                                       |   1    |     50 |   5.149 |       0.102975 |          0.29         |          0.29         |          0.85          |             5.26 |
| PayPal/Braintree Access Token                            |   1    |     73 |    4.61 |       0.063156 |          0.52         |          0.52         |          0.64          |            21.11 |
| MWS Auth Token                                           |   1    |     45 |   3.959 |        0.08797 |          0.48         |          0.48         |          0.65          |             7.14 |
| Picatic API Key                                          |   1    |     37 |   4.209 |       0.113755 |          0.29         |          0.29         |          0.87          |             2.07 |
| Google OAuth Access Key                                  |   1    |     69 |   5.151 |       0.074658 |          0.25         |          0.25         |          0.92          |             2.48 |
| Google OAuth ID                                          |   1    |     59 |    4.87 |       0.082534 |          0.49         |          0.49         |          0.59          |            24.04 |
| StackHawk API Key                                        |   1    |     46 |   4.719 |       0.102582 |          0.27         |          0.27         |          0.86          |             4.08 |
| NuGet API Key                                            |   1    |     46 |   4.621 |        0.10046 |          0.28         |          0.28         |          0.92          |             2.25 |
| SendGrid Token                                           |   1    |     69 |   5.376 |       0.077916 |          0.23         |          0.23         |          0.93          |              2.1 |
| Zoho Webhook Token                                       |   1    |     55 |   4.773 |       0.086774 |          0.79         |          0.82         |          0.4           |            31.11 |
| Zapier Webhook Token                                     |   1    |     51 |   4.309 |       0.084494 |          0.78         |          0.78         |          0.57          |            32.12 |
| Datadog API Key                                          |   1    |     32 |   3.593 |       0.112286 |          0.32         |          0.33         |          0.94          |             0.04 |
| Datadog Client Token                                     |   1    |     35 |    3.74 |       0.106854 |          0.38         |          0.38         |          0.85          |             3.04 |
| New Relic Admin API Key                                  |   1    |     32 |   3.992 |       0.124742 |          0.42         |          0.42         |          0.8           |             4.05 |
| New Relic Insights API Key                               |   1    |     37 |   4.628 |       0.125084 |          0.26         |          0.26         |          0.86          |             3.06 |
| New Relic REST API Key                                   |   1    |     47 |   4.046 |        0.08609 |          0.43         |          0.43         |          0.85          |             4.06 |
| New Relic Synthetics Location Key                        |   1    |     40 |   4.125 |       0.103127 |          0.41         |          0.41         |          0.81          |             4.07 |
| New Relic User API Key                                   |   1    |     32 |   4.539 |       0.141841 |          0.33         |          0.33         |          0.81          |             4.05 |
| Microsoft Teams Webhook                                  |   1    |    245 |   5.807 |       0.023702 |          0.36         |          0.37         |          0.77          |            43.32 |
| Google FCM Server Key                                    |   1    |    152 |   5.659 |       0.037229 |          0.22         |          0.21         |          0.95          |             4.18 |
| Google Calendar URI                                      |   1    |     57 |   4.629 |       0.081208 |          0.83         |          0.86         |          0.15          |            38.08 |
| Discord Webhook                                          |   1    |    123 |   5.471 |       0.044479 |          0.44         |          0.44         |          0.86          |            34.72 |
| Cloudinary Credentials                                   |   1    |     35 |   4.458 |       0.127359 |          0.53         |          0.55         |          0.56          |            10.08 |
| PyPI Upload Token                                        |   1    |     30 |   4.415 |       0.147169 |          0.75         |          0.82         |          0.2           |            18.22 |
| Shopify Private App Access Token                         |   1    |     38 |   4.294 |       0.112989 |          0.41         |          0.4          |          0.81          |             5.05 |
| Shopify Custom App Access Token                          |   1    |     38 |   4.041 |       0.106334 |          0.39         |          0.39         |          0.81          |             5.05 |
| Shopify Access Token                                     |   1    |     38 |   4.274 |       0.112467 |          0.38         |          0.38         |          0.81          |             5.05 |
| Shopify Shared Secret                                    |   1    |     38 |   4.201 |       0.110559 |          0.39         |          0.39         |          0.82          |             5.05 |
| Dynatrace Token                                          |   1    |     96 |   4.885 |       0.050884 |          0.28         |          0.28         |          0.92          |             2.35 |
| Amazon SNS Topic                                         |   1    |     29 |   4.018 |       0.138546 |          0.56         |          0.59         |          0.51          |             9.21 |
| Notion Note URI                                          |   1    |     73 |   4.922 |       0.067429 |          0.52         |          0.53         |          0.64          |            16.13 |
| Notion Team Note URI                                     |   1    |     71 |   4.931 |       0.069456 |          0.51         |          0.51         |          0.82          |            15.12 |
| Nano (NANO) Wallet Address                               |   1    |     65 |   4.606 |       0.070867 |          0.3          |          0.3          |          0.92          |             3.59 |
| Time-Based One-Time Password (TOTP) URI                  |   1    |     31 |   4.365 |       0.140807 |          0.5          |          0.55         |          0.91          |            10.07 |
| SSHPass Clear Password Argument                          |   1    |     28 |   4.209 |        0.15032 |          0.44         |          0.48         |          0.62          |             8.03 |
| Mount Command With Clear Credentials                     |   1    |     42 |    4.66 |       0.110954 |          0.47         |          0.47         |          0.81          |            25.06 |
| CIFS Fstab Entry With Clear Credentials                  |   1    |     46 |   4.855 |       0.105543 |          0.46         |          0.47         |          0.85          |            20.06 |
| Google Cloud Platform API Key                            |  0.8   |     26 |   3.979 |       0.153042 |          0.3          |          0.31         |          0.88          |             0.07 |
| Mailchimp API Key                                        |  0.8   |     37 |   3.766 |       0.101771 |          0.37         |          0.37         |          0.85          |             2.06 |
| Notion Integration Token                                 |  0.8   |     50 |   5.109 |       0.102175 |          0.28         |          0.28         |          0.85          |             6.06 |
| Digital Object Identifier (DOI)                          |  0.7   |     36 |    4.35 |       0.120839 |          0.47         |          0.54         |          0.85          |             9.01 |
| Internet Protocol (IP) Address Version 6                 |  0.7   |     24 |   3.935 |        0.16394 |          0.28         |          0.31         |          0.93          |             0.75 |
| Uniform Resource Locator (URL)                           |  0.7   |    163 |   5.732 |       0.035163 |          0.19         |          0.27         |          0.97          |             7.04 |
| Internet Protocol (IP) Address Version 4                 |  0.7   |     12 |   2.855 |       0.237949 |          0.49         |          0.54         |          0.89          |             0.48 |
| Bitcoin (₿) Wallet Address                               |  0.7   |     26 |   4.316 |       0.165993 |          0.16         |          0.17         |          0.98          |             0.92 |
| Latitude & Longitude Coordinates                         |  0.7   |     30 |    3.67 |       0.122319 |          0.35         |          0.38         |          0.92          |             0.57 |
| EUI-48 Identifier (Ethernet, WiFi, Bluetooth, etc)       |  0.5   |     17 |   2.934 |       0.172586 |          0.19         |          0.19         |          0.9           |             0.04 |
| Dogecoin (DOGE) Wallet Address                           |  0.5   |     34 |   4.735 |       0.139251 |          0.2          |          0.2          |          0.96          |             1.05 |
| Email Address                                            |  0.5   |     71 |   4.896 |       0.068962 |          0.25         |          0.28         |          0.96          |             0.24 |
| Phone Number                                             |  0.5   |     31 |   3.652 |        0.11781 |          0.35         |          0.38         |          0.92          |             1.14 |
| American Social Security Number                          |  0.5   |     10 |   2.322 |       0.232193 |          0.24         |          0.23         |          0.97          |             0.24 |
| Bitly Secret Key                                         |  0.5   |     40 |   3.775 |       0.094377 |          0.35         |          0.35         |          0.94          |             0.05 |
| Visual Studio App Center API Token                       |  0.5   |     40 |   3.747 |       0.093677 |          0.35         |          0.35         |          0.94          |             0.05 |
| YouTube Channel ID                                       |  0.5   |     24 |   4.085 |       0.170207 |          0.22         |          0.22         |          0.89          |             2.04 |
| Discord Bot Token                                        |  0.5   |     59 |   5.315 |       0.090079 |          0.19         |          0.2          |          0.95          |             0.12 |
| UUID                                                     |  0.5   |     36 |   3.694 |       0.102622 |          0.34         |          0.34         |          0.84          |             0.14 |
| United States Postal Service (UPS) Tracking Number       |  0.5   |     18 |   2.927 |       0.162636 |          0.31         |          0.31         |          0.83          |             1.24 |
| Turkish License Plate Number                             |  0.4   |      7 |   2.807 |       0.401051 |          0.25         |          0.24         |          0.86          |             0.17 |
| Date of Birth                                            |  0.4   |  ERROR |       - |          ERROR |           -           |         ERROR         |           -            |            ERROR |
| Monero (XMR) Wallet Address                              |  0.3   |     95 |   5.463 |       0.057504 |          0.22         |          0.22         |          0.97          |             0.12 |
| Litecoin (LTC) Wallet Address                            |  0.3   |     34 |   4.499 |        0.13233 |          0.23         |          0.23         |          0.95          |             0.05 |
| Ripple (XRP) Wallet Address                              |  0.3   |     34 |   4.055 |       0.119251 |          0.23         |          0.23         |          0.94          |             1.04 |
| American Express Card Number                             |  0.3   |     17 |   3.029 |       0.178155 |          0.44         |          0.42         |          0.78          |             0.27 |
| BCGlobal Card Number                                     |  0.3   |     16 |   2.578 |       0.161114 |          0.47         |          0.48         |          0.77          |             0.82 |
| Carte Blanche Card Number                                |  0.3   |     14 |   2.842 |       0.203026 |          0.44         |          0.44         |          0.79          |             0.43 |
| Diners Club Card Number                                  |  0.3   |     15 |   2.923 |       0.194882 |          0.43         |          0.43         |          0.86          |             0.38 |
| Discover Card Number                                     |  0.3   |     17 |   2.581 |       0.151824 |          0.38         |          0.38         |          0.79          |             0.75 |
| MasterCard Number                                        |  0.3   |     19 |   3.511 |       0.184794 |          0.4          |          0.41         |          0.85          |             0.49 |
| Maestro Card Number                                      |  0.3   |     19 |   3.071 |       0.161624 |          0.37         |          0.38         |          0.87          |             0.85 |
| Visa Card Number                                         |  0.3   |     16 |   3.203 |       0.200176 |          0.43         |          0.43         |          0.86          |             0.25 |
| Insta Payment Card Number                                |  0.3   |     16 |   2.578 |       0.161114 |          0.41         |          0.41         |          0.81          |             0.43 |
| JCB Card Number                                          |  0.3   |     15 |    2.79 |       0.185993 |          0.38         |          0.38         |          0.89          |             0.69 |
| Korean Local Card Number                                 |  0.3   |     16 |   3.024 |       0.189025 |          0.43         |          0.43         |          0.84          |             0.23 |
| Laser Card Number                                        |  0.3   |     16 |   2.578 |       0.161114 |          0.42         |          0.42         |          0.78          |             0.83 |
| Solo Card Number                                         |  0.3   |     16 |   2.656 |       0.165977 |          0.47         |          0.47         |          0.72          |             0.82 |
| Switch Card Number                                       |  0.3   |     16 |   2.858 |       0.178654 |          0.48         |          0.48         |          0.78          |             1.09 |
| Ethereum (ETH) Wallet Address                            |  0.3   |     42 |   3.797 |        0.09041 |          0.38         |          0.38         |          0.9           |             1.25 |
| Slack Token                                              |  0.3   |     15 |    3.64 |       0.242682 |          0.4          |          0.46         |          0.6           |             3.03 |
| Amazon Web Services Organization Identifier              |  0.3   |     12 |   3.189 |       0.265727 |          0.3          |          0.3          |          0.83          |             1.05 |
| Google API Key                                           |  0.3   |     39 |   4.875 |       0.125004 |          0.24         |          0.24         |          0.88          |             4.04 |
| Google OAuth Token                                       |  0.3   |      7 |   2.807 |       0.401051 |          0.57         |          0.65         |          0.41          |             2.41 |
| Mailgun API Key                                          |  0.3   |     36 |   4.837 |        0.13435 |          0.25         |          0.25         |          0.88          |             3.05 |
| Twilio API Key                                           |  0.3   |     34 |   3.837 |       0.112841 |          0.34         |          0.34         |          0.89          |             2.04 |
| Twilio Account SID                                       |  0.3   |     34 |   4.595 |       0.135137 |          0.21         |          0.21         |          0.93          |             2.04 |
| Twilio Application SID                                   |  0.3   |     34 |   4.653 |       0.136868 |          0.22         |          0.23         |          0.92          |             2.04 |
| Google ReCaptcha API Key                                 |  0.3   |     40 |   4.784 |       0.119605 |          0.18         |          0.18         |          0.97          |             0.75 |
| Amazon Standard Identification Number (ASIN)             |  0.3   |     10 |   3.322 |       0.332193 |          0.22         |          0.22         |          0.88          |             1.02 |
| Facebook App Token                                       |  0.3   |     35 |   4.593 |       0.131241 |          0.25         |          0.25         |          0.95          |             0.06 |
| JSON Web Token (JWT)                                     |  0.2   |     28 |   4.423 |       0.157973 |          0.18         |          0.21         |          0.98          |             0.03 |
| Amazon Web Services Access Key                           |  0.2   |     20 |   3.684 |       0.184209 |          0.19         |          0.19         |          0.98          |             0.03 |
| Amazon Web Services Secret Access Key                    |  0.2   |     40 |   4.753 |       0.118826 |          0.16         |          0.17         |          0.99          |             0.05 |
| Amazon Web Services EC2 Instance ID                      |  0.2   |     19 |   4.037 |       0.212495 |          0.25         |          0.27         |          0.83          |             1.03 |
| Turkish Identification Number                            |  0.2   |     11 |   2.664 |       0.242139 |          0.33         |          0.34         |          0.92          |             0.04 |
| Facebook Access Token                                    |  0.2   |    192 |   5.729 |       0.029841 |          0.21         |          0.21         |          0.98          |              2.2 |
| ObjectID                                                 |   0    |     24 |    3.47 |       0.144591 |          0.28         |          0.28         |          0.96          |             0.03 |
| Recent Unix Timestamp                                    |   0    |     10 |   2.522 |       0.252193 |          0.41         |          0.41         |          0.91          |             0.02 |
| Recent Unix Millisecond Timestamp                        |   0    |     13 |   2.661 |        0.20471 |          0.38         |          0.38         |          0.91          |             0.02 |
| Unix Timestamp                                           |   0    |      8 |   2.156 |       0.269455 |          0.33         |          0.36         |          0.91          |             0.02 |
| Unix Millisecond Timestamp                               |   0    |     11 |   2.845 |       0.258668 |          0.39         |          0.39         |          0.91          |             0.02 |
| ULID                                                     |   0    |     26 |   4.162 |       0.160076 |          0.24         |          0.24         |          0.97          |             0.04 |
| YouTube Video ID                                         |   0    |     27 |   4.533 |       0.167876 |          0.12         |          0.12         |          1.0           |             0.03 |
| Turkish Tax Number                                       |   0    |     10 |   2.446 |       0.244644 |          0.34         |          0.35         |          0.88          |             0.02 |
| Key:Value Pair                                           |   0    |     13 |   3.547 |       0.272815 |          0.14         |          0.2          |          0.98          |             0.05 |
+----------------------------------------------------------+--------+--------+---------+----------------+-----------------------+-----------------------+------------------------+------------------+
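
For anyone who wants to reproduce the Entropy, Levenshtein, and Hamming columns, here is a minimal sketch. It is not the Colab code: the helper names are made up, and both the normalizations (1 - distance / longer length for Levenshtein, differing positions / longer length for Hamming) and the choice of which strings get compared are assumptions, picked only to match the 0..1 ranges given in the Note row.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, moving toward 0.0 as they diverge (assumed normalization)."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def hamming_ratio(a: str, b: str) -> float:
    """Fraction of differing positions: 0.0 = same, 1.0 = entirely different.

    Classic Hamming distance needs equal lengths; counting the length
    difference as mismatches is an assumption made for this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    differing = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return differing / longest


sample = "abcdef"
print(round(shannon_entropy(sample), 3), round(shannon_entropy(sample) / len(sample), 6))
print(round(levenshtein_similarity("kitten", "sitting"), 2))
print(hamming_ratio("abcd", "abcf"))
```

A 6-character string with no repeated characters has an entropy of log2(6) ≈ 2.585 bits, and dividing by the length gives about 0.4308, which is consistent with the Entropy and Entropy/Length values in the Capture The Flag row.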
@bee-san
Owner

bee-san commented Dec 2, 2021

This is cool! Is Regexp Score the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts?

If you do Regexp Score mod 10 it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D
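
A quick sketch of what those two transformations do to a few values (two scores from the table plus the 0.51 and 0.49 examples). The function names are hypothetical; only `%` and the built-in `round` are used.

```python
def to_ten_point_range(score: float) -> float:
    """Wrap a raw score into [0, 10) with modulo."""
    return score % 10


def rounded(score: float) -> int:
    """Round to the nearest integer (Python rounds exact .5 ties to even)."""
    return round(score)


for raw in (49.12, 51.12, 0.51, 0.49):
    print(raw, round(to_ten_point_range(raw), 2), rounded(raw))
```

Modulo wraps rather than scales, so 49.12 and 51.12 end up at roughly 9.12 and 1.12. Dividing by the maximum score might be closer to the intent, but that is a guess.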

thanks so much for this, this is absolutely great 🔥 A non-subjective formal way to define rarity would be absolutely amazing :)

@nodtem66
Contributor Author

nodtem66 commented Dec 2, 2021

It seems promising, but it's just a prototype and needs a lot of work.
For example:

  • What are the best parameters for RegExScore? (repeat_score, in_score, ascii_score, digit_score, etc.)
  • Some RegExps with high rarity produce a low score, such as
Google Cloud Platform API Key
Regexp = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
rarity = 0.8
score = 0.07
  • This means we need a more accurate metric for scoring RegExps.
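
One possible way to compare parameter sets is to check how well the score ordering agrees with the hand-assigned rarity ordering. The sketch below uses a Spearman rank correlation for that; this is my assumption about a reasonable yardstick, not something the prototype does. The function names are hypothetical, and the sample rows are copied from the table above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# (name, hand-assigned rarity, Regexp Score) taken from the table
rows = [
    ("PGP Public Key", 1.0, 49.12),
    ("Slack Webhook", 1.0, 28.14),
    ("Google Cloud Platform API Key", 0.8, 0.07),
    ("Uniform Resource Locator (URL)", 0.7, 7.04),
    ("Email Address", 0.5, 0.24),
    ("Visa Card Number", 0.3, 0.25),
    ("JSON Web Token (JWT)", 0.2, 0.03),
    ("Unix Timestamp", 0.0, 0.02),
]
rho = spearman([r[1] for r in rows], [r[2] for r in rows])
print(f"Spearman correlation on {len(rows)} sample rows: {rho:.2f}")
```

A value near 1 would mean the score ranks RegExps much like the manual rarity does. Running it over the full table for each candidate parameter set would give one number to compare.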

By the way, you can play with the script in Google Colab.
This is the RegExScore class source code:

import sre_parse
import string

# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
  def __init__(self, 
               repeat_score = 0.01, # score for quantifier `{0,}`
               in_score = 0.1,      # score for character set `[*]`
               ascii_score=1.0,     # score for a fixed ascii `a-zA-Z`
               digit_score=0.2,     # score for a fixed digit `0-9`
               literal_default_score=0.01, # score for whitespaces
               debug=False # print the debug message
    ):
    self.repeat_score = repeat_score
    self.in_score = in_score
    self.ascii_score = ascii_score
    self.digit_score = digit_score
    self.literal_default_score = literal_default_score
    self.debug = debug

  def calculate(self, regexp:str):
    return self.token_score(sre_parse.parse(regexp))

  def token_score(self, tokens:tuple):
    score = 0
    for _token in tokens:
      if self.debug:
        print("Loop: ", _token)

      # add the score from subpattern `()`
      if _token[0] == sre_parse.SUBPATTERN:
        _, _, _, child = _token[1]
        if self.debug:
          print(_token[0], len(child))
        score += self.token_score(child)
      # add score from quantifier `{min,max}`
      elif _token[0] == sre_parse.MAX_REPEAT:
        _min, _max, child = _token[1]
        # use `_min` when the upper bound is unbounded (e.g. `*`, `+`), otherwise `_max`
        _score = self.repeat_score * (_min if _max == sre_parse.MAXREPEAT else _max)
        if self.debug:
          print('\tscore:', _score)
        score += _score + self.token_score(child)
      # add score from mean of branch group `A|B|C|D`
      elif _token[0] == sre_parse.BRANCH:
        _, branch = _token[1]
        if self.debug:
          print('\tbranch:', len(branch))
        sub_score = 0
        for child in branch:
          sub_score += self.token_score(child)
        score += sub_score / float(len(branch))
      # add score from character set `[]`
      elif _token[0] == sre_parse.IN:
        if self.debug:
          print('\tscore:', self.in_score)
        score += self.in_score
      # add score from fixed literal
      elif _token[0] == sre_parse.LITERAL:
        literal = chr(_token[1])
        if self.debug:
          print('\tchr:', literal)
        if literal in string.ascii_letters:
          score += self.ascii_score
        elif literal in string.digits:
          score += self.digit_score
        else:
          score += self.literal_default_score
    return score
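
To see what `token_score` actually walks over, it can help to dump the parse tree first. The sketch below does that for the Google Cloud Platform API Key RegExp quoted above. The `dump` helper is hypothetical and not part of the class, and the `re._parser` fallback is there only because `sre_parse` is deprecated on newer Python versions.

```python
try:
    import re._parser as sre_parse  # Python 3.11+ (private module)
except ImportError:
    import sre_parse  # older Python versions

PATTERN = r"(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"


def dump(tokens, depth=0):
    """Return one indented line per token, recursing into nested sub-patterns."""
    lines = []
    pad = "  " * depth
    for op, arg in tokens:
        if op == sre_parse.SUBPATTERN:
            lines.append(f"{pad}SUBPATTERN")
            lines.extend(dump(arg[-1], depth + 1))  # last item is the child pattern
        elif op == sre_parse.MAX_REPEAT:
            low, high, child = arg
            lines.append(f"{pad}MAX_REPEAT {low}..{high}")
            lines.extend(dump(child, depth + 1))
        elif op == sre_parse.BRANCH:
            lines.append(f"{pad}BRANCH ({len(arg[1])} alternatives)")
            for alternative in arg[1]:
                lines.extend(dump(alternative, depth + 1))
        elif op == sre_parse.LITERAL:
            lines.append(f"{pad}LITERAL {chr(arg)!r}")
        else:
            # anchors, character sets, etc. are shown by opcode name only
            lines.append(f"{pad}{op!r}")
    return lines


lines = dump(sre_parse.parse(PATTERN))
print("\n".join(lines))
```

Apart from the two `-` literals, the pattern consists only of anchors and repeated character sets, so none of it reaches the high-weight `ascii_score` branch.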

Feel free to comment or suggest your thoughts.
I'm looking forward to discussing this with anyone.
