Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have a single way to identify a project in our telemetry #507

Open
yetudada opened this issue Jan 8, 2024 · 14 comments
Open

Have a single way to identify a project in our telemetry #507

yetudada opened this issue Jan 8, 2024 · 14 comments
Assignees
Labels
bug Something isn't working telemetry

Comments

@yetudada
Copy link
Contributor

yetudada commented Jan 8, 2024

Description

Let me show you a subset of the data that comes into our database from kedro-telemetry in our CLI commands table.

USER_ID PROJECT_NAME PACKAGE_NAME USERNAME
1343655360016970 946346cad253d7b 59f1c4c71309c46f anonymous
1856054624437891 03e5c42f9fd8d0ab ec3759e2c570d30
8035489891863387 8a171405cf81ef53 d92d8dd3de0a094
4531280966080738 bb7f5d8666210fd5 ae46b8e451079db
6022834494838396 946346cad253d7b 59f1c4c71309c46f anonymous
8035489891863387 8a171405cf81ef53 d92d8dd3de0a094
1416039240284711 256cfde57fee6e63 99adc231b045331
6759436125356775 2e35e5f9d892407 2e35e5f9d8924072 anonymous
6092965114078963 0bc9045c71518347 b8308c21e8a66ad
8249949466136913 ef85afadc4586f38 57cbe0328faa502

Context

This task has two components:

  • Make a recommendation to the team about which is the most accurate identifier that we should use to ID a project
  • And detail why project_name sometimes has blank entries

Expected Result

We should only have a single project ID that we all trust and we should not have a blank entry for that ID.

@astrojuanlu
Copy link
Member

Why project_name sometimes has blank entries

Looks like kedro-telemetry 0.2.3 might have introduced that behavior.

WITH subset AS (SELECT
    (case when (PROJECT_NAME IS NULL) then 1 else 0 end) AS PROJECT_NAME_IS_NULL,
    TELEMETRY_VERSION
FROM HEAP_KEDRO_APP.HEAP.ANY_COMMAND_RUN
)

SELECT
    TELEMETRY_VERSION,
    SUM(PROJECT_NAME_IS_NULL) AS COUNT_NULL_PROJECT_NAME,
    COUNT(*) AS TOTAL_COMMANDS
FROM subset
GROUP BY TELEMETRY_VERSION
ORDER BY TELEMETRY_VERSION

image

In fact, from the changelog, it looks like it was intentional:

Project name is a leftover that is not important to store if we already know the package name and since there is currently no easy way of accessing this value, it is discarded.

#84

Mistery solved 😄

How to track unique projects

There are several ways to do this. For example, we could generate a UUID4 per project (similar to #333). Alternatively, we could get a hash of user UUID4 (as per #333) concatenated with the package_name (let's forget about project_name, see above).

I do think generating a UUID4 per project should be the way forward. Now, how do we store that, whether we want to track it in version control or not etc we still need to think about.

Using the hash of the package_name is nice (because it helps us identify things like "spaceflights" or "test") but would also be considered pseudonymised data so we have to stop doing it.

@astrojuanlu
Copy link
Member

This issue still needs some thought. Some options:

  • We reuse the existing .telemetry file, and record a UUID4 of the project there. But it won't be stored in version control.
  • We record a UUID4 upon project creation with kedro new or the starters in settings.py or a separate file. But then custom projects will not have it.

@astrojuanlu astrojuanlu self-assigned this Mar 4, 2024
@DimedS
Copy link
Contributor

DimedS commented Mar 4, 2024

This issue still needs some thought. Some options:

  • We reuse the existing .telemetry file, and record a UUID4 of the project there. But it won't be stored in version control.
  • We record a UUID4 upon project creation with kedro new or the starters in settings.py or a separate file. But then custom projects will not have it.

I agree that using a uniquely generated projectId, not linked to the project name, is better. This ensures distinct IDs for different projects even if they share the same name. Ideally, this ID should be generated at the time of project creation with kedro new or upon executing any Kedro command within an project without ID, rather than depending on the kedro-telemetry plugin for this generation process, aligning the generation process closely with project inception.

@yetudada
Copy link
Contributor Author

yetudada commented Mar 4, 2024

I have two questions:

  • A Kedro project often has multiple collaborators (multiple UUIDs on user_id assigned to a single package_name), how would we ensure that we would still see unique projects in our data? I would use this data to understand project-specific features points i.e. "100 projects are running in CI" or "5% of projects have more than 1000 nodes"
  • Do other OS projects also have the concept of a "project"? If yes, how do they track it?

@astrojuanlu
Copy link
Member

summary:

  • we now use the hash of the package_name as a way to uniquely track projects. this has similar problems than the user ID User identity for telemetry events is not unique #333 (might not be unique, e.g. there will be several projects named spaceflights, test, or demo) and it can be considered pseudonymous data. so we want to move away from it.
  • as @yetudada points out above, the main requirement is being able to uniquely track projects that can have multiple collaborators.
  • One possible solution, as I suggested above, is generating a UUID4 per project. But there are questions around how and when do we generate it and where do we store it (see my comment above)

Not all of the libraries we surveyed on #510 (comment) track projects uniquely. For example, neither of these projects do:

  • Prefect
  • DVC
  • Evidently

Two of the projects do track unique projects:

@astrojuanlu
Copy link
Member

astrojuanlu commented Mar 6, 2024

On today's Tech Design session we discussed this at length, without reaching an obvious conclusion. Long story short: this is hard ™️

  • There were questions on why do we want to track these projects uniquely in the first place. Some of the possible use cases are listed in Improving our understanding of our users with kedro-telemetry #510 (comment)
  • It was clear from the start that using .telemetry is not an appropriate solution since it's not committed to version control
  • It was pointed out that committing a project UUID in version control was not without flaws: for example, it could be accidentally copy pasted across projects
    • At what point a project stops being itself?12
  • @DimedS noted that we could use pyproject.toml to track this information, but then it would be lost for packaged projects Telemetry data for tools and example_pipeline is not collected in packaged Kedro projects #567
  • @idanov raised an interesting point about production projects: do we actually want that data? Wouldn't it be too noisy? Would users leave telemetry enabled when in production? Should we focus on local development and how humans use Kedro (rather than machines)?
    • Should we reframe the "how many projects in production" question to "how many projects have a lifecycle longer than 6 months"?
    • This would still require unique project tracking though
  • @lrcouto suggested that we use some intrinsic project properties to derive an identifier, but it was not clear how to do that without basically resorting back to hashing, which we want to avoid
  • @datajoely suggested to use the hash of the first commit as a project identifier (smart!) but @merelcht raised a question about having a dependency on git Investigate how we can remove Kedro's dependency on git (for starters) kedro#2051 (we could just call the subprocess) and also what happens when a baseline project is forked (how often would this happen?)
  • It was noted that the current method is not flawless either - one could perfectly change the package name, and the hash in the telemetry would change

Clearly no perfect solution exists, we will need to keep exploring.

Footnotes

  1. "After several hundreds of years of maintenance, if each individual piece of the Ship of Theseus was replaced, one after the other, was it still the same ship?"

  2. "It is not possible to step into the same river twice" -- Plutarch

@astrojuanlu
Copy link
Member

astrojuanlu commented Mar 7, 2024

Giving this a bit more thought. From the two examples I could find above,

  • great_expectations init doesn't seem to set the data_context_id the CLI is no longer supported, and the Python API does fill the data_context_id and then the user can optionally serialize it and store it in version control
  • reflex init sets a project hash but it's not committed to version control

@astrojuanlu
Copy link
Member

I think it's safe to say that

  • Unique project tracking, with all its flaws, is necessary to make certain assessments on project lifecycle duration.
  • No method is perfect, but impossibility to achieve perfection shouldn't be a blocker to doing something.
  • We should not rely on pyproject.toml for this, since it's not going to be present for packaged projects.
  • Most reasonable options seem to be adding this to settings.py or using a separate file that would need to be committed to version control.

@merelcht merelcht assigned noklam and ElenaKhaustova and unassigned astrojuanlu May 13, 2024
@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented May 21, 2024

After reviewing the discussion, references, and current implementation, I suggest the following improvements:

  1. Current Approach: I agree that moving away from hashing the package_name to uniquely track projects is necessary. This method not only creates uniqueness issues but also risks exposing original project names in case of a database breach.

  2. Project Definition: We should establish a clear definition of what constitutes a project and what differentiates one project from another. In my view, a project created from scratch should be treated as unique and assigned an ID. Collaborations on the same project, particularly modifications to the source code, do not change its identity, so the ID remains the same. However, altering certain attributes, such as the name (or name and version), should be considered a project change. Therefore, I suggest defining a project by its ID and name (or other attributes).

  3. Suggested Approach

  • Project ID: Based on the requirement to collect statistics by project, we need to define and store a project ID. Since we are eliminating hash storage, the ID should be randomly generated and stored in the project metadata, which is then committed to version control.
  • Checksum to Avoid ID Duplication: To prevent the same project ID when a project is copied as a template, we can store a checksum as H=hash(project ID+name). Before sending telemetry, we can verify if the current hash(project ID+name)==H. This helps us identify if the project has changed and whether the ID needs regeneration. We will store only the project ID on our side, avoiding exposure of hashed names.
  • Telemetry Handling: Generate the project ID and checksum H when the user opts in for telemetry, if they do not already exist. Compare the checksum H and hash(project ID+name) before sending telemetry and renew the project ID and checksum H if they differ. Use only a project ID as the project identifier on our side.
  • Security Measures: For added security, hash the project ID before sending telemetry, so only a hashed version of the project ID is stored in our database. In the event of a database breach, this prevents immediate lookup by project ID. Additionally, we could hash the received hashes with a non-exposed hash function to make them irreversible, though after discussion with @astrojuanlu, we decided to forego this step as heap.io does not support this layer and our current approach already provides adequate protection. Only open-source projects could potentially be reverted in case of a database leak.
  • Metadata Storage: Store the generated project ID and checksum H in the project metadata file. After discussions with @noklam and @astrojuanlu, we decided to use pyproject.toml as we already utilize it for telemetry data about tools - Telemetry data for tools and example_pipeline is not collected in packaged Kedro projects #567. Since we do not expect users to opt in for telemetry during production, we are focusing on the development stage. Creating a separate file would increase the number of project files unnecessarily, and modifying settings.py is not as straightforward as modifying configuration files.

This approach aims to address the uniqueness issue, enhance security, and integrate seamlessly with our current workflow.

@astrojuanlu, @noklam, @DimedS, @ankatiyar, @datajoely, @idanov, @merelcht, @yetudada - curious on your thoughts, concerns, suggestions 🙂

@astrojuanlu
Copy link
Member

I love the idea of using the Project UUID as a salt for the value we transmit and store in our systems. I'm wondering if we could simplify it even more and

  • We only save the Project UUID in pyproject.toml
  • We always transmit and log hash(Project UUID + project name)

That way, if the project name (or any other property we consider) changes, for us it will automatically be a different project.

Otherwise, your proposal @ElenaKhaustova as it stands in #507 (comment) sounds good to me!

@ElenaKhaustova
Copy link
Contributor

I love the idea of using the Project UUID as a salt for the value we transmit and store in our systems. I'm wondering if we could simplify it even more and

  • We only save the Project UUID in pyproject.toml
  • We always transmit and log hash(Project UUID + project name)

That way, if the project name (or any other property we consider) changes, for us it will automatically be a different project.

Otherwise, your proposal @ElenaKhaustova as it stands in #507 (comment) sounds good to me!

Thank you, @astrojuanlu. I like the idea! It provides the same security level and we don't have to savehash(Project UUID + project name). The only drawback that I see is that we will not be able to regenerate Project UUID. So we will get different hashes in case of a project name change, but if it changes back to the original, we will get duplications. I'm not sure if that's the probable case, though. We can avoid this case by following the original idea or sacrifice it to simplify the idea and store just Project UUID in pyproject.toml.

@astrojuanlu
Copy link
Member

So we will get different hashes in case of a project name change, but if it changes back to the original, we will get duplications.

If the name changes back to the original, then the stored hash would be the same as the original project - so we'd see a discontinuity, but perhaps it would make even more sense 😄

I'll let others chime in, otherwise if there are no strong objections I'd say we can proceed 💪🏼

@DimedS
Copy link
Contributor

DimedS commented May 22, 2024

After reviewing the discussion, references, and current implementation, I suggest the following improvements:

Thank you, @ElenaKhaustova. This is a strong solution with a high level of security! I support the idea of generating a unique project ID and storing it in the pyproject.yml file. Hashing the ID and name before sending to heap.io is also a good approach, aligning with @astrojuanlu's simplicity proposal. I have only one concern: changing a project name doesn't always signify a new project, but I believe such cases are quite rare.

@merelcht
Copy link
Member

Thank you for writing up this detailed solution @ElenaKhaustova! This sounds like a strong solution to me. The name changing thing is interesting, but I don't think we should spend too much time pondering about this. I'd expect people to change their project name mostly at the beginning when they're still setting up and getting started with a project. Switches during a project seem unlikely, but even if that does happen I don't think that would hugely skew our metrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working telemetry
Projects
Status: In Progress
Development

No branches or pull requests

6 participants