Bulk User Import into SuperTokens #912

Open · anku255 opened this issue Jan 24, 2024 · 7 comments
Labels
enhancement New feature or request

Comments


anku255 commented Jan 24, 2024

https://docs.google.com/document/d/1TUrcIPbdsHfqheIkB6CTkTNiA6YN-73Xz_pX2kjXhkc/edit

Open PRs:

TODO:

  • Allow developers to configure parallelism in ProcessBulkImportUsers cron job

    Currently, it takes on average 66 seconds to process 1,000 users, which is very slow when processing a large number of users. This happens because we loop through the users one by one in a for loop and use just one DB connection for BulkImportProxyStorage.

    The solution is to process users in parallel using threads, creating a BulkImportProxyStorage instance for each user being processed. The number of users we will process in parallel will depend on the BULK_MIGRATION_PARALLELISM config value set by the user. This will be a SaaS-protected prop and can be added to PROTECTED_CONFIGS in CoreConfig.java. It should have the @NotConflictingInApp annotation.
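    As a rough sketch of the intended scheduling: the core itself is Java, so this TypeScript snippet only illustrates the bounded-parallelism idea (a fixed number of workers pulling from a shared queue of users); the function name and shape are hypothetical, not the actual cron job code.

```typescript
// Sketch: process items with a bounded number of concurrent workers,
// analogous to processing one user per thread up to BULK_MIGRATION_PARALLELISM.
async function processInParallel<T>(
  items: T[],
  parallelism: number, // e.g. the BULK_MIGRATION_PARALLELISM value
  worker: (item: T) => Promise<void>
): Promise<void> {
  let next = 0;
  // Spawn `parallelism` runners that pull from a shared cursor.
  const runners = Array.from(
    { length: Math.min(parallelism, items.length) },
    async () => {
      while (next < items.length) {
        const item = items[next++];
        await worker(item); // in the real job: one BulkImportProxyStorage per user
      }
    }
  );
  await Promise.all(runners);
}
```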

  • PR changes in supertokens-core PR
    - All the PR changes are done but there may be more changes after review.

  • PR changes in supertokens-postgresql-plugin PR

  • Changes in Node Script to add users

    The script needs to be rewritten to optimise for the following user stories -

    • The user is not expected to monitor the script. The script should try to continue processing and retry failures wherever possible.

    • The user should be able to re-run the script using the same input file multiple times and the script should process the remaining users. This can be implemented by maintaining a state file per input file name.

    • The Core API calls should retry with exponential backoff (unlimited retries) in case we get an error from the Core. This ensures that we don't halt processing in case the Core is down for a few seconds.

    • The script should continue showing the Bulk Import Cron Job status after it has added all the users. Any users with status=FAILED will be added to the same usersHavingInvalidSchema.json file. This file could also be renamed to something like usersHavingErrors.json. Since we will be writing to the usersHavingErrors.json file, users are expected to wait until all the users have been processed, then fix the error file and add those users again.

    • The script should display progress logs while adding the users. This could include the total number of users, the number of users added, the number of users having errors, etc.

    • We also need to research the size limit of the JSON file. A JSON file with a million users would be about 880 MB. Standard JSON parsing cannot stream: the whole file needs to be read into memory. If this is an issue then we may need to switch to the ndjson file format, which allows streaming the file.
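    The retry behaviour described above could look roughly like this in the Node script; `withExponentialBackoff` and its defaults are illustrative, not the script's actual API:

```typescript
// Sketch: retry a Core API call forever with exponential backoff (capped),
// so a briefly-down core does not halt the whole import run.
async function withExponentialBackoff<T>(
  fn: () => Promise<T>,     // e.g. a hypothetical callCore() HTTP request
  baseDelayMs = 500,
  maxDelayMs = 30_000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Double the delay on each failure, up to the cap, then try again.
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```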

  • Documentation for Bulk Import

  • [ ] Update the CDI spec to include /bulk-import/import and /bulk-import/users/count APIs. Also update the BulkImportUser schema to include plainTextPassword field.

  • Bulk Import for Auth0 users
    After the Bulk Import task is complete, we plan to have a special guide for Auth0 users.

    Auth0 users need to request the exported file from their support team if they want the password hashes of their users. For any other type of login, they can export the users themselves using the API. However, this API doesn't include user roles.

    • We could write a script that takes the exported JSON and Auth0 credentials. The script would fetch all the roles and map them to the users. (We could also call the roles API for each user, but that would take more API calls.)

    • We can also add a separate page in our documentation for Auth0 users that guides them through requesting the export JSON file and running the above script.

anku255 added the enhancement label Jan 24, 2024

rishabhpoddar commented Jan 27, 2024

About method 1:

  • I don't think we need a generic template like id, type etc. We don't really have any examples of other similar jobs, and there is no point making it generic (to keep things simpler). Everything related to this will have its own table / cronjob / APIs anyway. So the APIs listed here can just be:
    • add more users for migration
    • get the status of the migration (how much is left, approximately how much time is remaining)
  • Could you add more details for what fields will the JSON have exactly?
  • If the dev has to add many millions of users, then will there be many such JSONs? Or just one big one?
  • What is the point of status on a job level exactly? If the job is there in the db, then it's pending, else it's finished. If it had partially failed, how will the status help know what parts of it are done?
  • There is not enough info for me to create a mental map of what you are thinking. Could you highlight the exact steps that need to be done by the user and a high level of what the core will do with the API input?
  • What kind of json verification will be done in the API?
  • What happens if there is a duplicate user in the JSON and one is already in the db?
  • We don't really have API paths like job/:job_id in the core. The core routing does not allow for placeholders in paths easily, so we just make it part of the query params instead.
  • If the user has multiple cores running and then starts the migration, how will we handle race conditions across the core instances, given that the cron will run in both cores?
  • How often will the cronjob run?

About method 2:

main advantage of this approach lies in avoiding the addition of a CRON job

This is a very tiny advantage IMO. What other advantages are there?

  • Mention that the disadvantage is the extra maintenance cost added to our SaaS to run the script and stream the result to the dashboard

Additional points:

  • Will we involve the user management dashboard here in any way? (for either of the two methods)
  • If the migration takes, let's say, a few hours, what can be done to add the new set of users that signed up in the older auth service while the migration to SuperTokens is going on?


anku255 commented Feb 2, 2024

@rishabhpoddar After discussions, we've opted for method 1. I've expanded the original to provide detailed elaboration on method 1.

rishabhpoddar commented:

Database Schema

Please give it in SQL format like CREATE TABLE ... and specify any keys.

Request Body of API to post json

The totp object needs skew, period, deviceName (optional). Please see the API to import a device.

Size Limit: 10,000 users

The test you ran, what are the specs of the setup?

  • RAM
  • Config for the core's connection pool size?
  • Did the insertion of users happen in parallel or sequentially in the API?
  • The diff in time between 10k and 20k users is not much, so why not 20k? Did you try more than 20k?

All possible errors for the /add-users endpoint

  • What if the user being imported already exists (either a user in ST, or as an entry in the bulk table)

thirdPartyId must be one of 'apple', 'discord', 'facebook', 'github', 'google', 'linkedin' or 'twitter'.

This is not true.

Get bulk import users

  • Why do we need importid? (seems like overkill)

Delete bulk import users by importId

Why do we need this API? Why can't we just auto clear the space in the db?

A cron job, ProcessBulkImportUsers, runs every minute to handle users added to the bulk_import_users table. It processes users (status = 'NEW') in batches of 1000, updating their status to 'PROCESSING' before processing and to 'PROCESSED' afterward. If processing fails, the error_msg column is updated with error messages.

  • Why do we need PROCESSED status?
  • What kind of error messages are we talking about here? What happens if there is a failure in updating the error message in the first place?

The query in getUsersToProcess function

  • Does it work for mysql and postgresql?
  • Is the FOR UPDATE necessary? If you are running one query, isn't it atomic anyway?
  • Is the SKIP LOCKED needed? This query is not being run as part of a bigger transaction, hence the acquiring and releasing of the lock is very fast anyway.. How did you decide to add this?

All possible errors for the Cron Job

  • I feel like there are lots of errors missing here.. for example:
    • what about the ones due to account linking, where you may have an existing primary user with the same email, but with a different login method?
    • what if an external user id is already being used by another user?
  • "Similar to the current import endpoint behavior, the cron job updates user data if the email / phoneNumber already exists." -> I don't think this is a good idea cause the entry being processed could have other information like a different set of roles, in which case, would you overwrite the roles for the older user or merge them?
  • How is the user supposed to know about these errors and handle them? They would have to modify / delete that specific user perhaps to fix this issue.. How do they do that?
  • What happens in the cron if an error is encountered for a user? Does it move on to the next user or stop there?

Pseudo code for the cron job

  • Please mention what the implementation of importUser is. I need to see that in a flow diagram clearly, in detail.
  • What will happen in one transaction? For example, adding a user and marking them as PROCESSED should be within one transaction, correct?

Utilizing the Bulk Import API

  • Why do users need to use just in time migration along with bulk migration? Sure, there might be new user sign ups during the bulk migration, but users can just add those later on to ST anyway?


anku255 commented Feb 6, 2024

@rishabhpoddar

Please give it in SQL format like CREATE TABLE ... and specify any keys.

Done.

The totp object needs skew, period, deviceName (optional). Please see the API to import a device.

Done.

The test you ran, what are the specs of the setup?

I ran the tests on my MacBook Pro, which has pretty good specs, but I was able to insert users up to 50K as well. 10K is a conservative number that should accommodate lower-end specs too.

What if the user being imported already exists (either a user in ST, or as an entry in the bulk table)

We'll update the data, overwriting fields wherever applicable for simplicity. If a user is performing a bulk import while their previous auth provider is active, it's possible that user data could be updated for some users after creating the export file. Allowing users to update and create users in the same API would be helpful.

thirdPartyId must be one of 'apple', 'discord', 'facebook', 'github', 'google', 'linkedin' or 'twitter'.
This is not true.

Can you please elaborate? Do we support every thirdparty?

Why do we need importid (seems like an overkill).

Upon some discussion we decided to remove it.

Does it work for mysql and postgresql? Is the FOR UPDATE necessary?

Yes, it works in both: MySQL (v8 and above) and PostgreSQL (v12 and above).

The information on the atomicity of Common Table Expressions (WITH statement) is ambiguous. While some sources suggest potential issues where another query might select the same rows before they are updated, incorporating FOR UPDATE provides an additional layer of safety.
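The query pattern under discussion (a CTE that claims a batch of rows, with FOR UPDATE SKIP LOCKED so concurrent cron runs skip rows already claimed instead of blocking) can be sketched roughly as below. The table and column names are illustrative, not the actual plugin schema, and the exact syntax differs slightly between MySQL and PostgreSQL:

```sql
-- Illustrative sketch only (PostgreSQL-flavoured): claim up to 1000 NEW
-- users and mark them PROCESSING in a single statement. SKIP LOCKED makes
-- a concurrent core instance skip rows this transaction has locked.
WITH to_process AS (
    SELECT id
    FROM bulk_import_users
    WHERE status = 'NEW'
    ORDER BY created_at
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
)
UPDATE bulk_import_users
SET status = 'PROCESSING'
WHERE id IN (SELECT id FROM to_process);
```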

I don't think this is a good idea cause the entry being processed could have other information like a different set of roles, in which case, would you overwrite the roles for the older user or merge them?

I have written more about this in the point above.

How is the user supposed to know about these errors and handle them? They would have to modify / delete that specific user perhaps to fix this issue.. How do they do that?

Good point. The API will provide the index of the user requiring attention. Our script can then handle the removal of users with errors, creating a separate file for them. Users with errors can be addressed manually later.

What happens in the cron if an error is encountered for a user? Does it move on to the next user or stop there?

Yeah, the cron will continue processing other users. The user with the error will be marked as FAILED and an error message will be set.

Why do users need to use just in time migration along with bulk migration? Sure, there might be new user sign ups during the bulk migration, but users can just add those later on to ST anyway?

Not only could there be new sign-ups, but existing user data can be updated as well. Users can call the API again with the updated data and make it work without "just in time" migration as well. This requires that the bulk import API update the data if the user already exists.

Please mention what the implementation of importUser is. I need to see that in a flow diagram clearly, in detail.

Please find pseudo code for the importUser function below. It is missing some things, like user roles, but it should be enough to give you an idea of how I am thinking about this.

function importUser(user) {
    try {
      Start Transaction;

      let EPUserId = null;
      let TPUserId = null;
      let PWlessUserId = null;
      let primaryUserId = null;

      // Set isPrimary to true for the first loginMethod if none of the methods have it set
      if (isPrimary is not true for any user.loginMethods) {
         user.loginMethods[0].isPrimary = true;
      }

      for (loginMethod in user.loginMethods) {
          if (loginMethod.recipeId === 'emailpassword') {
            EPUserId = CreateUserForEP(loginMethod, user.externalUserId);
            if (loginMethod.isPrimary) {
              primaryUserId = EPUserId;
            }
          }

          if (loginMethod.recipeId === 'passwordless') {
            PWlessUserId = CreateUserForPasswordless(loginMethod, user.externalUserId);
            if (loginMethod.isPrimary) {
              primaryUserId = PWlessUserId;
            }
          }

          if (loginMethod.recipeId === 'thirdparty') {
            TPUserId = CreateUserForThirdParty(loginMethod, user.externalUserId);
            if (loginMethod.isPrimary) {
              primaryUserId = TPUserId;
            }
          }
      }

      if (user.loginMethods.length > 1) {
        Call accountlinking.createPrimaryUser for primaryUserId;
        Call accountlinking.linkAccount for every userId other than primaryUserId;
      }

      if (user.usermetadata) {
        Update user metadata for the primaryUserId;
      }

      if (user.mfa) {
        Call mfa related APIs for the primaryUserId;
      }

      Delete user from bulk_import_users table;
      Commit Transaction;
    } catch (e) {
      Rollback Transaction;
      user.status = 'FAILED';
      user.error_msg = e.message; // Set a meaningful error message
    }
}


function CreateUserForEP(data, externalUserId) {
  Validate data fields;
  Call EmailPassword.importUserWithPasswordHash;
  if (data.isPrimary) {
    Link this user to externalUserId;
  }

  if (data.time_joined) {
    Update time_joined of the created user;
  }

  if (data.isVerified) {
    Update email verification status;
  }
  
  return newUserId;
}


function CreateUserForPasswordless(data, externalUserId) {
  Validate data fields;
  Call Passwordless.createCode;
  Call Passwordless.consumeCode; // Email will be verified automatically here

  if (data.isPrimary) {
    Link this user to externalUserId;
  }

  if (data.time_joined) {
    Update time_joined of the created user;
  }
  return newUserId;
}

function CreateUserForThirdParty(data, externalUserId) {
  Validate data fields;
  Call ThirdParty.SignInAndUp API; // Pass data.isVerified
  
  if (data.isPrimary) {
    Link this user to externalUserId;
  }

  if (data.time_joined) {
    Update time_joined of the created user; 
  }

  return newUserId;
}

rishabhpoddar commented:

We'll update the data, overwriting fields wherever applicable for simplicity. If a user is performing a bulk import while their previous auth provider is active, it's possible that user data could be updated for some users after creating the export file. Allowing users to update and create users in the same API would be helpful.

We decided that we will not be updating existing users in the bulk migration logic. If we see a repeated user in bulk migration (same external user id, or same account info in same tenant), then we will mark it as an error


anku255 commented May 3, 2024

Problem

When a user generates a bulk import JSON file from an existing authentication provider, ongoing changes to user data can make the import outdated by the time it's completed. For example, if a user updates their password after the JSON file is created, they will not be able to log in with their new password on SuperTokens.

Solution 1: Bulk Import with User Data Updates

A straightforward solution is to perform a second bulk import after switching to SuperTokens. If the Bulk Import API supports updates, the new data can replace or add to what's in the SuperTokens database, helping to correct any inconsistencies from the initial import.

Issues with Solution 1

  • Delayed Consistency: Data inconsistencies will persist until the second import completes. This delay can lead to various issues, including users being unable to log in, especially if significant time elapses between the first and second imports.

  • Data Overwrite Risks: If user data in SuperTokens is updated directly between the two imports, a second import might inadvertently overwrite these newer updates with older data. This risk can be mitigated by checking the updated_at timestamp to ensure data consistency.
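The updated_at check mentioned above could be sketched as a small guard in the import logic. The field names here are assumptions for illustration, not part of the actual bulk import schema:

```typescript
// Sketch: only let a second bulk import overwrite a user if the imported
// record is newer than what SuperTokens already has, so an older export
// cannot clobber direct updates made between the two imports.
interface TimestampedUser {
  externalUserId: string;
  updatedAt: number; // epoch millis, as reported by the exporting auth provider
}

function shouldOverwrite(
  existing: TimestampedUser,
  imported: TimestampedUser
): boolean {
  // Equal timestamps mean no newer data; skip the overwrite in that case too.
  return imported.updatedAt > existing.updatedAt;
}
```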

Solution 2: Implementing LazyMigration Recipe

Create a LazyMigration recipe to seamlessly migrate users to SuperTokens upon their first login or signup. Here's a potential implementation:

recipeList = [
  LazyMigration({
    getUser: (userId: string): BulkImportJSON => {
      // Fetch user data from the existing authentication provider
    },
  })
]

The process unfolds as follows:

  1. Transition to SuperTokens: Integrate SuperTokens into your application and configure the LazyMigration recipe. Start using SuperTokens for authentication.

  2. User Authentication: When an existing user logs in for the first time, the signUp flow in SuperTokens will retrieve their data from your existing authentication provider using the LazyMigration recipe's getUser function. The user is then imported into SuperTokens. New users can sign up directly with SuperTokens.

  3. Bulk Import: Use Bulk Import to migrate all existing users to SuperTokens at once. Users previously migrated via the LazyMigration recipe will be automatically skipped during this process.

  4. Deactivate the existing auth provider: Upon completion of the bulk import, you can safely remove the existing authentication provider from your system.

anku255 changed the title from "Strategies for Bulk User Import into SuperTokens" to "Bulk User Import into SuperTokens" May 29, 2024

anku255 commented May 29, 2024

Bulk Import Documentation Outline

Migration steps overview

Overall goals:

  • Migration without downtime of your application.
  • Users do not need to change their password.
  • Migration of users even if you do not have their password hash.
  • Users need not be logged out.
  • Existing MFA TOTP devices can be reused.
  • Existing auth UI can be reused.
  • Existing user IDs do not change.

Steps:

  • Integrate SuperTokens into your application:
  • Decide between what kind of migration you want to use for your users:
    • Option 1: Migration API
      • Use this if you have the password hash of your users.
    • Option 2: User Creation without password hashes:
      • If you don't have password hashes, use this method.
      • It requires you to run the older auth system in parallel
  • Session migration (optional)
  • Post production operations

Code migration

You want to completely replace your existing auth system with SuperTokens

You just follow our docs as is.

You want to keep both alive for a certain amount of time

You just follow our docs as is, except for session verification, you do the following:

app.post("/verify", verifySession({ sessionRequired: false }), async (req, res) => {
    let userId = undefined;
    if (req.session !== undefined) {
        // this means a supertokens session exists
        userId = req.session.getUserId();
    } else {
        // try doing auth with existing auth provider
        userId = await yourExistingAuthVerify(req);
    }

    if (userId === undefined) {
        // send a 401
    } else {
        // Your API logic.
    }
});

Frontend migration:

  • You can use our pre-built UI or use custom UI without changing your existing UI -> link to the relevant docs

Option 1: Migration API -> Overview

For bulk migration

  • First we need to create a JSON file containing all the users in your app.
  • Add the user JSON to the SuperTokens core
  • Monitoring SuperTokens core cronjob progress

For small number of users

Talk about the migration API here and link to the JSON schema explanation in the "1) Creating the user JSON" page
Also mention that if you call the API twice for the same user, it will not update the second time; if users want to update some information, they will have to do so by calling individual APIs. For example, to update the password hash of this user, call XYZ API, or to update roles, call the roles PUT API (link to the CDI spec).


1) Creating the user JSON

  • TODO: What are the limits?
  • TODO: how will we recommend people to do this? One file for millions of users, or break it down somehow?
  • Explain the JSON schema
  • Point to an example of how to make the JSON from Auth0

2) Add the user JSON to the SuperTokens core

  • Start the SuperTokens core service connected to your database

    • Only applicable if you are using self hosted, else we already have our managed service
  • Import the JSON users into SuperTokens:

    • Talk about node script and how to run it.
    • Talk about handling errors
    • About monitoring the progress
  • Time / ram estimations for 1M users.


3) Monitoring the core cronjob

  • Once you upload the JSON, the core will run a cron and this rate etc etc..

  • Monitor based on the GET API

  • How to know when the migration is done?

  • How to know about errors and fix them?

  • How to know how many users are pending?

  • How to delete entries from the migration table?

    • Why would you want to do this? For example, if a user errored out, maybe because a user with that email already exists, then you may want to ignore / delete this entry.
  • Time / ram estimations for 1M users:

    • For self hosted core:
      • Mention that by default, it will be 1 parallel job and the time is XYZ for 1M users. This means 1 connection is used.
      • How to change this, and the effects.
    • For managed service:
      • Mention that by default, it will be 1 parallel job and the time is XYZ for 1M users. This means 1 connection is used.
      • For changing this, get in touch with us.

Example: Create user JSON file from Auth0

  • Get the exported file from Auth0
  • Run this command / script, and see the output
  • What to do if you don't have password hashes?
    • Do not use bulk migration, and instead, use lazy migration.

Option 2: User Creation without password hashes

For now, let it be how it is in the docs


Session migration (optional)

For now, let it be how it is in the docs


Step 4: Post production operations

If you have done bulk migration:

- Mention the problem that users may have updated their data whilst the migration is happening.
- This is an unsolved problem for now, but we will solve it by introducing a lazy migration recipe (coming soon).
- If you really don't want this problem, you should stop your existing auth provider before you start creating the user JSON. This will of course add downtime for your application.

If you have done User Creation without password hashes:

- You have to keep your existing auth provider running for the amount of time you think enough users have signed in to your application.
- The rest of the users will have to be imported through bulk migration with random passwords, and then they will have to go through the password reset flow.
