Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add feature flag to enable legacy md5 hash for anonymous user id #30832

Conversation

kaustavb12
Copy link
Contributor

@kaustavb12 kaustavb12 commented Aug 8, 2022

Description

PR #26198 changed the hash algorithm to generate an anonymous user id from MD5 to SHAKE-128.

However, there are Open edX operators who maintain backward compatibility of anonymous user IDs after past rotations of their Django secret key. For them, altering the hashing algorithm was a breaking change that made their analytics inconsistent.

This PR, introduces a feature flag, to enable Open edX operators to switch back to the legacy MD5 algorithm if they so require.

Supporting information

Link to other information about the change, such as Jira issues, GitHub issues, or Discourse discussions.
Be sure to check they are publicly readable, or if not, repeat the information here.

Testing instructions

  1. Checkout this branch in devstack
  2. Browse any course.
  3. Open LMS shell and retrieve the anonymous user id for the user/course using
from common.djangoapps.student.models import AnonymousUserId
anonymous_user_ids = AnonymousUserId.objects.filter(user=8).filter(course_id='course-v1:edX+DemoX+Demo_Course').order_by('-id')
  1. Note this anonymous id down
  2. Delete the anonymous user id
anonymous_user_ids[0].delete()
  1. Open /edx/etc/lms.yml
  2. Under Features add the new Flag
FEATURES:
    ENABLE_LEGACY_MD5_HASH_FOR_ANONYMOUS_USER_ID: true
  1. Rerun steps 2 to 4.
  2. Verify the anonymous user ids are different

Deadline

"None" if there's no rush, or provide a specific date or event (and reason) if there is one.

Other information

Private Ref: BB-6533

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Aug 8, 2022
@openedx-webhooks
Copy link

Thanks for the pull request, @kaustavb12! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

Copy link
Member

@Agrendalath Agrendalath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

  • I tested this: verified that new hashes are generated with MD5 when the feature flag is enabled
  • I read through the code
  • I checked for accessibility issues: n/a
  • Includes documentation
  • I made sure any change in configuration variables is reflected in the corresponding client's configuration-secure repository: n/a

@Agrendalath
Copy link
Member

@natabene, this is ready for your review.

@natabene
Copy link
Contributor

natabene commented Aug 9, 2022

@kaustavb12 Thank you for your contribution.

@Agrendalath Agrendalath force-pushed the kaustav/enable_legacy_anonymous_user_id_hash_algo branch from ba802b7 to b1b41bf Compare August 9, 2022 18:03
…to md5

The hashing algorithm has been changed in cd60646. However, there are Open edX
operators who maintain backward compatibility of anonymous user IDs after past
rotations of their Django secret key. For them, altering the hashing algorithm
was a breaking change that made their analytics inconsistent.
@bradenmacdonald
Copy link
Contributor

@kaustavb12 @Agrendalath I don't really understand the need for this change. As I understand, all past anonymous IDs are stored in a database table and won't be changed, even though new ones use a new algorithm. So why can't the operator here load the IDs from the database table for their analytics instead of depending on the details of the hash algorithm?

@Agrendalath
Copy link
Member

@bradenmacdonald, this operator is using a custom analytics system that is integrated with multiple external services. As far as I know, some of these services were using direct user IDs in the past. Therefore, the operator calculates MD5 hashes directly in their database and stores them to maintain the integrity of all existing data. There was a key rotation in the past that changed the anonymous user IDs, so they also have lots of legacy code to make these IDs backward compatible with all possible versions.

Hashes are calculated in the database, but database engines are not adapting new hashing functions too fast. For example, SHA-2 was accepted in 2002 (FIPS PUB 180-2), but it appeared in MySQL in 2010 (v5.5.5), and PostgreSQL didn't fully support its variants until 2018. As SHA-3 (FIPS PUB 202) was published in August 2015, it's not likely to see it fully supported soon. Therefore, it would be necessary to pre-calculate these hashes, instead of preparing them directly in the DB query. We don't have details about how exactly these analytics work, though, as we're not managing them, nor even have any access to the code.

@bradenmacdonald
Copy link
Contributor

@Agrendalath Thanks, though I am still still having trouble understanding why there's any need to calculate hashes within a DMBS at all. Can't there just be a table in every database/analytics system that needs one, which contains the map of real_id, anyonmous_id for all anonymous ids that have ever been used? i.e. use JOIN/lookups instead of hashing/calculations.

In other words, the root problem here is that the external systems were integrated in a way that was too highly coupled to the internal details of how Open edX calculates its anonymous IDs. So it seems like the right fix is to un-couple them from that, rather than add complexity to Open edX by supporting two hash mechanisms. However, I won't block the PR on it - I see it's only a couple lines of code in the end.

Comment on lines +1025 to +1026
# .. toggle_description: Whether to enable the legacy MD5 hashing algorithm to generate anonymous user id
# instead of the newer SHAKE128 hashing algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to give a brief explanation here of why that might be desirable for someone.

# instead of the newer SHAKE128 hashing algorithm
# .. toggle_use_cases: open_edx
# .. toggle_creation_date: 2022-08-08
# .. toggle_target_removal_date: None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removal date can be left off for non-temporary toggles.

@timmc-edx
Copy link
Contributor

I share Braden's concerns -- this seems like a really unusual use-case, and I'd want to have a clearer picture of why this can't be solved with table lookups or a sync job before I'd be comfortable adding a toggle that enables an old hash algorithm.

(It also sounds like the analytics system might need to have a copy of all current and former SECRET_KEY values in order to operate, which seems like it could be a security concern.)

@kaustavb12
Copy link
Contributor Author

Thanks @timmc-edx.

We are discussing this change internally as well as with our client to see if the approach can be modified based on suggestions from you and @bradenmacdonald. I'll keep you posted on the same.

@kaustavb12
Copy link
Contributor Author

@timmc-edx

Based on your and @bradenmacdonald's suggestions, we had a discussion with our client on the approach to follow here. And it was decided that it would be best for the analytics system to pull the required data from DB rather than calculating the hashes using the SECRET_KEY.

Therefore we are closing this PR. Thanks a lot to you and @bradenmacdonald for your reviews and suggestions.

cc. @Agrendalath

@kaustavb12 kaustavb12 closed this Nov 17, 2022
@openedx-webhooks
Copy link

@kaustavb12 Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future.

@Agrendalath Agrendalath deleted the kaustav/enable_legacy_anonymous_user_id_hash_algo branch November 18, 2022 09:14
@Agrendalath Agrendalath changed the title feat: add feature flag to enable legacy md5 hash for anonymous user id https://github.com/open-craft/edx-platform/pull/487 Jun 30, 2023
@Agrendalath Agrendalath changed the title https://github.com/open-craft/edx-platform/pull/487 https://github.com/open-craft/edx-platform/pull/48 Jun 30, 2023
@Agrendalath Agrendalath changed the title https://github.com/open-craft/edx-platform/pull/48 feat: add feature flag to enable legacy md5 hash for anonymous user id Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants