Have a single way to identify a project in our telemetry #507

yetudada · 2024-01-08T11:39:09Z

Description

Let me show you a subset of the data that comes into our database from kedro-telemetry in our CLI commands table.

USER_ID	PROJECT_NAME	PACKAGE_NAME	USERNAME
1343655360016970	946346cad253d7b	59f1c4c71309c46f	anonymous
1856054624437891		03e5c42f9fd8d0ab	ec3759e2c570d30
8035489891863387		8a171405cf81ef53	d92d8dd3de0a094
4531280966080738		bb7f5d8666210fd5	ae46b8e451079db
6022834494838396	946346cad253d7b	59f1c4c71309c46f	anonymous
8035489891863387		8a171405cf81ef53	d92d8dd3de0a094
1416039240284711		256cfde57fee6e63	99adc231b045331
6759436125356775	2e35e5f9d892407	2e35e5f9d8924072	anonymous
6092965114078963		0bc9045c71518347	b8308c21e8a66ad
8249949466136913		ef85afadc4586f38	57cbe0328faa502

Context

This task has two components:

Make a recommendation to the team about which identifier is the most accurate one to use to identify a project
Explain why project_name sometimes has blank entries

Expected Result

We should only have a single project ID that we all trust and we should not have a blank entry for that ID.


astrojuanlu · 2024-02-15T18:27:13Z

Why `project_name` sometimes has blank entries

Looks like kedro-telemetry 0.2.3 might have introduced that behavior.

WITH subset AS (SELECT
    (case when (PROJECT_NAME IS NULL) then 1 else 0 end) AS PROJECT_NAME_IS_NULL,
    TELEMETRY_VERSION
FROM HEAP_KEDRO_APP.HEAP.ANY_COMMAND_RUN
)

SELECT
    TELEMETRY_VERSION,
    SUM(PROJECT_NAME_IS_NULL) AS COUNT_NULL_PROJECT_NAME,
    COUNT(*) AS TOTAL_COMMANDS
FROM subset
GROUP BY TELEMETRY_VERSION
ORDER BY TELEMETRY_VERSION

In fact, from the changelog, it looks like it was intentional:

Project name is a leftover that is not important to store if we already know the package name and since there is currently no easy way of accessing this value, it is discarded.

#84

Mystery solved 😄

How to track unique projects

There are several ways to do this. For example, we could generate a UUID4 per project (similar to #333). Alternatively, we could get a hash of user UUID4 (as per #333) concatenated with the package_name (let's forget about project_name, see above).

I do think generating a UUID4 per project should be the way forward. We still need to think about how to store it, whether we want to track it in version control, and so on.

Using the hash of the package_name is nice (because it helps us identify things like "spaceflights" or "test") but would also be considered pseudonymised data so we have to stop doing it.
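The two options mentioned above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and SHA-512 is an arbitrary choice for the sketch, not necessarily what kedro-telemetry uses.

```python
import hashlib
import uuid


def new_project_id() -> str:
    """Option 1: a random UUID4 generated once per project (hypothetical helper)."""
    return uuid.uuid4().hex


def derived_project_id(user_uuid: str, package_name: str) -> str:
    """Option 2: hash of the user UUID concatenated with the package name.

    SHA-512 is an assumption made for this sketch.
    """
    payload = f"{user_uuid}{package_name}".encode("utf-8")
    return hashlib.sha512(payload).hexdigest()


user_a = uuid.uuid4().hex
user_b = uuid.uuid4().hex

# Option 2 is deterministic for a given user and package name...
print(derived_project_id(user_a, "spaceflights") == derived_project_id(user_a, "spaceflights"))
# ...but it depends on the user, so two users working on the same package
# would be seen as two different projects.
print(derived_project_id(user_a, "spaceflights") == derived_project_id(user_b, "spaceflights"))
```

With option 1, the identifier does not depend on any property of the project or the user, which is why it then has to be stored somewhere.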

astrojuanlu · 2024-03-04T13:55:48Z

This issue still needs some thought. Some options:

We reuse the existing .telemetry file, and record a UUID4 of the project there. But it won't be stored in version control.
We record a UUID4 upon project creation with kedro new or the starters in settings.py or a separate file. But then custom projects will not have it.
❓

DimedS · 2024-03-04T16:31:53Z

This issue still needs some thought. Some options:

We reuse the existing .telemetry file, and record a UUID4 of the project there. But it won't be stored in version control.

We record a UUID4 upon project creation with kedro new or the starters in settings.py or a separate file. But then custom projects will not have it.

❓

I agree that using a uniquely generated projectId, not linked to the project name, is better. This ensures distinct IDs for different projects even if they share the same name. Ideally, this ID should be generated at the time of project creation with kedro new, or upon executing any Kedro command within a project that has no ID, rather than depending on the kedro-telemetry plugin for this generation process. That would align the generation process closely with project inception.

yetudada · 2024-03-04T16:56:47Z

I have two questions:

A Kedro project often has multiple collaborators (multiple UUIDs on user_id assigned to a single package_name). How would we ensure that we would still see unique projects in our data? I would use this data to understand project-specific feature points, e.g. "100 projects are running in CI" or "5% of projects have more than 1000 nodes".
Do other OS projects also have the concept of a "project"? If yes, how do they track it?

astrojuanlu · 2024-03-06T06:20:44Z

summary:

we now use the hash of the package_name as a way to uniquely track projects. this has similar problems to the user ID (User identity for telemetry events is not unique #333): it might not be unique (e.g. there will be several projects named spaceflights, test, or demo) and it can be considered pseudonymous data. so we want to move away from it.
as @yetudada points out above, the main requirement is being able to uniquely track projects that can have multiple collaborators.
One possible solution, as I suggested above, is generating a UUID4 per project. But there are questions around how and when we generate it and where we store it (see my comment above).

Not all of the libraries we surveyed on #510 (comment) track projects uniquely. For example, none of these projects do:

Prefect
DVC
Evidently

Two of the projects do track unique projects:

Great Expectations has a data_context_id per project https://docs.greatexpectations.io/docs/reference/learn/usage_statistics/
Reflex has a distinct_app_id https://github.com/reflex-dev/reflex/blob/1b207eb444f73a8fef06ba6b13319cb0a7aa92b7/reflex/utils/telemetry.py#L90-L98 that is a random number generated upon first creation https://github.com/reflex-dev/reflex/blob/1b207eb444f73a8fef06ba6b13319cb0a7aa92b7/reflex/utils/prerequisites.py#L491-L492

astrojuanlu · 2024-03-06T19:15:14Z

On today's Tech Design session we discussed this at length, without reaching an obvious conclusion. Long story short: this is hard ™️

There were questions on why we want to track these projects uniquely in the first place. Some of the possible use cases are listed in Improving our understanding of our users with kedro-telemetry #510 (comment)
It was clear from the start that using .telemetry is not an appropriate solution since it's not committed to version control
It was pointed out that committing a project UUID in version control was not without flaws: for example, it could be accidentally copy pasted across projects
- At what point does a project stop being itself?¹²
@DimedS noted that we could use pyproject.toml to track this information, but then it would be lost for packaged projects (see Telemetry data for tools and example_pipeline is not collected in packaged Kedro projects #567)
@idanov raised an interesting point about production projects: do we actually want that data? Wouldn't it be too noisy? Would users leave telemetry enabled when in production? Should we focus on local development and how humans use Kedro (rather than machines)?
- Should we reframe the "how many projects in production" question to "how many projects have a lifecycle longer than 6 months"?
- This would still require unique project tracking though
@lrcouto suggested that we use some intrinsic project properties to derive an identifier, but it was not clear how to do that without basically resorting back to hashing, which we want to avoid
@datajoely suggested using the hash of the first commit as a project identifier (smart!) but @merelcht raised a question about having a dependency on git (see Investigate how we can remove Kedro's dependency on git (for starters) kedro#2051), although we could just call the subprocess, and also about what happens when a baseline project is forked (how often would this happen?)
It was noted that the current method is not flawless either - one could perfectly change the package name, and the hash in the telemetry would change
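As a rough illustration of @datajoely's suggestion, the first commit could be read by calling git as a subprocess. The function name and the injectable `run` parameter are hypothetical and exist only for this sketch; `git rev-list --max-parents=0 HEAD` lists the root commits reachable from HEAD.

```python
import subprocess
from typing import Callable, Optional


def first_commit_hash(repo_dir: str, run: Callable = subprocess.run) -> Optional[str]:
    """Return the hash of a root commit of the repository, or None.

    `run` is injectable (hypothetical design choice) so that the function
    can be exercised without a real git installation.
    """
    try:
        result = run(
            ["git", "rev-list", "--max-parents=0", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not a usable git repository.
        return None
    roots = result.stdout.split()
    # A history can have several root commits (for example after merging
    # unrelated histories); taking the minimum makes the choice deterministic.
    return min(roots) if roots else None


print(first_commit_hash("."))
```

Note that a fork shares its first commit with the baseline project, so both would report the same identifier, which is the concern raised above.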

Clearly no perfect solution exists, we will need to keep exploring.

¹ "After several hundreds of years of maintenance, if each individual piece of the Ship of Theseus was replaced, one after the other, was it still the same ship?"
² "It is not possible to step into the same river twice" -- Plutarch

astrojuanlu · 2024-03-07T23:42:50Z

Giving this a bit more thought. From the two examples I could find above,

~~great_expectations init doesn't seem to set the data_context_id~~ the CLI is no longer supported, and the Python API does fill the data_context_id and then the user can optionally serialize it and store it in version control
reflex init sets a project hash but it's not committed to version control

astrojuanlu · 2024-04-15T08:30:36Z

I think it's safe to say that

Unique project tracking, with all its flaws, is necessary to make certain assessments on project lifecycle duration.
No method is perfect, but the impossibility of achieving perfection shouldn't be a blocker to doing something.
We should not rely on pyproject.toml for this, since it's not going to be present for packaged projects.
The most reasonable options seem to be adding this to settings.py or using a separate file that would need to be committed to version control.

ElenaKhaustova · 2024-05-21T21:30:29Z

After reviewing the discussion, references, and current implementation, I suggest the following improvements:

Current Approach: I agree that moving away from hashing the package_name to uniquely track projects is necessary. This method not only creates uniqueness issues but also risks exposing original project names in case of a database breach.
Project Definition: We should establish a clear definition of what constitutes a project and what differentiates one project from another. In my view, a project created from scratch should be treated as unique and assigned an ID. Collaborations on the same project, particularly modifications to the source code, do not change its identity, so the ID remains the same. However, altering certain attributes, such as the name (or name and version), should be considered a project change. Therefore, I suggest defining a project by its ID and name (or other attributes).
Suggested Approach

Project ID: Based on the requirement to collect statistics by project, we need to define and store a project ID. Since we are eliminating hash storage, the ID should be randomly generated and stored in the project metadata, which is then committed to version control.
Checksum to Avoid ID Duplication: To prevent the same project ID when a project is copied as a template, we can store a checksum as H=hash(project ID+name). Before sending telemetry, we can verify if the current hash(project ID+name)==H. This helps us identify if the project has changed and whether the ID needs regeneration. We will store only the project ID on our side, avoiding exposure of hashed names.
Telemetry Handling: Generate the project ID and checksum H when the user opts in for telemetry, if they do not already exist. Compare the checksum H and hash(project ID+name) before sending telemetry and renew the project ID and checksum H if they differ. Use only a project ID as the project identifier on our side.
Security Measures: For added security, hash the project ID before sending telemetry, so only a hashed version of the project ID is stored in our database. In the event of a database breach, this prevents immediate lookup by project ID. Additionally, we could hash the received hashes with a non-exposed hash function to make them irreversible, though after discussion with @astrojuanlu, we decided to forgo this step as heap.io does not support this layer and our current approach already provides adequate protection. Only open-source projects could potentially be reversed in case of a database leak.
Metadata Storage: Store the generated project ID and checksum H in the project metadata file. After discussions with @noklam and @astrojuanlu, we decided to use pyproject.toml as we already utilize it for telemetry data about tools - Telemetry data for tools and example_pipeline is not collected in packaged Kedro projects #567. Since we do not expect users to opt in for telemetry during production, we are focusing on the development stage. Creating a separate file would increase the number of project files unnecessarily, and modifying settings.py is not as straightforward as modifying configuration files.

This approach aims to address the uniqueness issue, enhance security, and integrate seamlessly with our current workflow.

@astrojuanlu, @noklam, @DimedS, @ankatiyar, @datajoely, @idanov, @merelcht, @yetudada - curious on your thoughts, concerns, suggestions 🙂

astrojuanlu · 2024-05-22T08:08:49Z

I love the idea of using the Project UUID as a salt for the value we transmit and store in our systems. I'm wondering if we could simplify it even more and

We only save the Project UUID in pyproject.toml
We always transmit and log hash(Project UUID + project name)

That way, if the project name (or any other property we consider) changes, for us it will automatically be a different project.
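The simplified variant can be sketched as below, assuming SHA-256 and a hypothetical function name. Only the UUID is persisted; the transmitted value is recomputed from the UUID and the current project name on every send.

```python
import hashlib
import uuid


def transmitted_project_id(project_uuid: str, project_name: str) -> str:
    """hash(Project UUID + project name); SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(f"{project_uuid}{project_name}".encode("utf-8")).hexdigest()


# The only value that would be saved in pyproject.toml.
project_uuid = uuid.uuid4().hex

original = transmitted_project_id(project_uuid, "spaceflights")
renamed = transmitted_project_id(project_uuid, "spaceflights-v2")
reverted = transmitted_project_id(project_uuid, "spaceflights")

print(original != renamed)   # a rename yields a different transmitted ID
print(original == reverted)  # renaming back yields the original ID again
```

Because the UUID acts as a salt, two projects with the same name but different UUIDs produce unrelated transmitted values.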

Otherwise, your proposal @ElenaKhaustova as it stands in #507 (comment) sounds good to me!

ElenaKhaustova · 2024-05-22T11:08:58Z

I love the idea of using the Project UUID as a salt for the value we transmit and store in our systems. I'm wondering if we could simplify it even more and

We only save the Project UUID in pyproject.toml

We always transmit and log hash(Project UUID + project name)

That way, if the project name (or any other property we consider) changes, for us it will automatically be a different project.

Otherwise, your proposal @ElenaKhaustova as it stands in #507 (comment) sounds good to me!

Thank you, @astrojuanlu. I like the idea! It provides the same security level and we don't have to save hash(Project UUID + project name). The only drawback that I see is that we will not be able to regenerate the Project UUID. So we will get different hashes in case of a project name change, but if it changes back to the original, we will get duplications. I'm not sure how likely that case is, though. We can avoid it by following the original idea, or accept it in order to simplify the approach and store just the Project UUID in pyproject.toml.

astrojuanlu · 2024-05-22T12:15:11Z

So we will get different hashes in case of a project name change, but if it changes back to the original, we will get duplications.

If the name changes back to the original, then the stored hash would be the same as the original project - so we'd see a discontinuity, but perhaps it would make even more sense 😄

I'll let others chime in, otherwise if there are no strong objections I'd say we can proceed 💪🏼

DimedS · 2024-05-22T12:51:49Z

After reviewing the discussion, references, and current implementation, I suggest the following improvements:

Thank you, @ElenaKhaustova. This is a strong solution with a high level of security! I support the idea of generating a unique project ID and storing it in the pyproject.toml file. Hashing the ID and name before sending to heap.io is also a good approach, aligning with @astrojuanlu's simplicity proposal. I have only one concern: changing a project name doesn't always signify a new project, but I believe such cases are quite rare.

merelcht · 2024-05-23T11:20:28Z

Thank you for writing up this detailed solution @ElenaKhaustova! This sounds like a strong solution to me. The name changing thing is interesting, but I don't think we should spend too much time pondering it. I'd expect people to change their project name mostly at the beginning when they're still setting up and getting started with a project. Switches during a project seem unlikely, but even if that does happen I don't think that would hugely skew our metrics.

yetudada added bug Something isn't working telemetry labels Jan 8, 2024

yetudada mentioned this issue Jan 8, 2024

Improving our understanding of our users with kedro-telemetry #510

Open

astrojuanlu added this to the Address issues with kedro-telemetry milestone Jan 10, 2024

astrojuanlu self-assigned this Mar 4, 2024

merelcht assigned noklam and ElenaKhaustova and unassigned astrojuanlu May 13, 2024
