Generate deterministic project IDs correlated with creation date #6471

Open
tfmorris opened this issue Mar 19, 2024 · 6 comments · May be fixed by #6621
Labels
Status: Pending Review. Indicates that the issue or pull request is awaiting review by project maintainers or collaborators.
Type: Feature Request. Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@tfmorris
Member

tfmorris commented Mar 19, 2024

Currently project IDs are generated using the formula:

        return System.currentTimeMillis() + Math.round(Math.random() * 1000000000000L);

This suffers from the problem that there's no way to recover the original Unix timestamp to know when the project was created if the ID is the only information available. In data recovery and debugging scenarios, it can be useful to know when a project was created.

Proposed solution

Multiply the timestamp by a fixed amount and then add the random component in the lower order bits, e.g.

        return (System.currentTimeMillis() * 1000) + Math.round(Math.random() * 1000);
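
A minimal sketch of the decimal scheme and of how the creation time could be read back from an ID. The class and method names are hypothetical and are not part of OpenRefine. One detail differs from the one-liner above: `Math.round(Math.random() * 1000)` can return 1000, which would carry into the timestamp digits, so the sketch draws the random part from [0, 1000) instead.

```java
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not OpenRefine code.
final class DecimalProjectId {
    // Number of ID values reserved per millisecond, as in the proposal.
    static final long SLOTS_PER_MILLI = 1000L;

    // Builds an ID whose three low-order decimal digits are random.
    static long generate(long epochMillis) {
        // nextLong(bound) returns a value in [0, bound), so the random part
        // can never carry into the timestamp digits.
        long random = ThreadLocalRandom.current().nextLong(SLOTS_PER_MILLI);
        return epochMillis * SLOTS_PER_MILLI + random;
    }

    // Recovers the creation time by dropping the random digits.
    static Instant creationTime(long projectId) {
        return Instant.ofEpochMilli(projectId / SLOTS_PER_MILLI);
    }

    public static void main(String[] args) {
        long id = generate(System.currentTimeMillis());
        System.out.println(id + " -> " + creationTime(id));
    }
}
```

With this layout a person can also read the timestamp directly by ignoring the last three decimal digits of the ID.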

Alternatives considered

Powers of 2 are easier for shifting/masking bits, but harder for humans to process

        return (System.currentTimeMillis() << 10) + Math.round(Math.random() * 1024);
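
The power-of-2 variant can be sketched the same way, with the timestamp recovered by a shift and the random part by a mask. The names below are hypothetical. As with the decimal version, `Math.round(Math.random() * 1024)` can return 1024 and spill into the timestamp bits, so this sketch draws the random part from [0, 1024).

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not OpenRefine code.
final class ShiftedProjectId {
    // Low-order bits reserved for the random component.
    static final int RANDOM_BITS = 10;
    static final long RANDOM_MASK = (1L << RANDOM_BITS) - 1;

    // Places the timestamp above the random bits.
    static long generate(long epochMillis) {
        long random = ThreadLocalRandom.current().nextLong(1L << RANDOM_BITS);
        return (epochMillis << RANDOM_BITS) | random;
    }

    // Shifts the random bits away to recover the Unix timestamp in milliseconds.
    static long timestampOf(long projectId) {
        return projectId >>> RANDOM_BITS;
    }

    // Masks off the timestamp to recover the random component.
    static long randomPartOf(long projectId) {
        return projectId & RANDOM_MASK;
    }

    public static void main(String[] args) {
        long id = generate(System.currentTimeMillis());
        System.out.println(id + " -> millis " + timestampOf(id) + ", random " + randomPartOf(id));
    }
}
```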

Additional context

Throughout the system the identifier is treated as opaque, so changing how it's generated shouldn't affect anything.

The 64-bit value returned by System.currentTimeMillis() has the high order 23 bits clear, so there's plenty of headroom to shift it up without affecting longevity of the IDs.
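
The headroom claim can be checked with a few lines of Java. This is only a sketch with hypothetical names. It counts the clear high-order bits of a timestamp (the count includes the sign bit) and computes the largest timestamp that still yields a non-negative ID for a given shift.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper for illustration; not OpenRefine code.
final class IdHeadroom {
    // Counts the zero bits above the highest set bit (the sign bit is included).
    static int clearHighBits(long epochMillis) {
        return Long.numberOfLeadingZeros(epochMillis);
    }

    // Largest timestamp that can be shifted left by shiftBits and still
    // produce a non-negative long.
    static long maxMillis(int shiftBits) {
        return Long.MAX_VALUE >>> shiftBits;
    }

    // Calendar year (UTC) of the largest timestamp that still fits.
    static int lastYear(int shiftBits) {
        return Instant.ofEpochMilli(maxMillis(shiftBits)).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        System.out.println("clear high bits now: " + clearHighBits(System.currentTimeMillis()));
        System.out.println("10-bit shift lasts until year " + lastYear(10));
        System.out.println("20-bit shift lasts until year " + lastYear(20));
    }
}
```

For a timestamp from March 2024 the count is 23, matching the figure above. A 20-bit shift leaves only two usable bits of headroom below the sign bit, which runs out around the year 2248, while a 10-bit shift lasts for hundreds of thousands of years.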

@tfmorris added the Type: Feature Request and Status: Pending Review labels Mar 19, 2024
@tfmorris
Member Author

@wetneb What do you think of this proposal? Any preference for how the numeric range is divided? In addition to the two options above (10^3 and 2^10), we also have a PR which proposes 2^20.

@thadguidry
Member

thadguidry commented May 21, 2024

@tfmorris Hmm, a bit like Snowflake IDs, I like it. Incidentally, we use TSIDs in DB2Rest. There's a thread-safe Java library, https://github.com/vladmihalcea/hypersistence-tsid, which also adds an adjustable node part. The node part could hold an OpenRefine version, or simply a 3-bit node ID and a 2-bit version, for 5 bits in the node part. Having the OpenRefine version encoded would help migration and sharing, no? We could quickly detect that a project came from version 3.8 and uplift it when it is shared with someone who opens it in version 4.0.

@wetneb
Sponsor Member

wetneb commented May 21, 2024

I'm a bit torn on this… intuitively I'd rather prefer to go in the direction of using completely opaque ids which wouldn't carry any particular information. Users might not be aware that the project ids contain this information and it might constitute an unwelcome information leak in certain circumstances. But not a hill I would die on…

@xinluz6 linked a pull request May 21, 2024 that will close this issue
@tfmorris
Member Author

@wetneb The project metadata already includes both the creation time and the last updated time. This is intended to provide a hint to the user in case the metadata is gone/corrupted, i.e., "Your missing project is the one that you created on the afternoon of May 30." Currently we have no way of telling the user which project(s) are missing. (Of course, the best thing is not to lose the projects in the first place.) From a practical point of view, the current IDs are completely opaque.

Another option would be to use the newly defined UUID v7 from RFC 9562, but that would require increasing the field size from 64 to 128 bits, breaking compatibility, so it is a non-starter until we have protocol and metadata versioning. One useful hint from that spec is that dividing fields on nibble boundaries makes them more easily human-parseable in hex format. We could also place the timestamp in the high order 48 bits to match the UUID layout, for whatever that's worth.
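
A rough sketch of what a nibble-aligned 64-bit layout might look like, assuming the timestamp occupies the high order 48 bits and the remaining 16 bits are random. The split and all names are assumptions for illustration, not a layout anyone in this thread has settled on. Printed as 16 hex digits, the first 12 digits are the timestamp and the last 4 are the random part.

```java
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not OpenRefine code.
final class NibbleAlignedProjectId {
    // 16 random bits = 4 hex digits, leaving 48 bits (12 hex digits) for the timestamp.
    static final int RANDOM_BITS = 16;

    // Timestamps below 2^47 ms keep the resulting long non-negative.
    static long generate(long epochMillis) {
        long random = ThreadLocalRandom.current().nextLong(1L << RANDOM_BITS);
        return (epochMillis << RANDOM_BITS) | random;
    }

    // Recovers the creation time from the high order 48 bits.
    static Instant creationTime(long projectId) {
        return Instant.ofEpochMilli(projectId >>> RANDOM_BITS);
    }

    // Zero-padded hex form, so the field boundary always falls after digit 12.
    static String toHex(long projectId) {
        return String.format("%016x", projectId);
    }

    public static void main(String[] args) {
        long id = generate(System.currentTimeMillis());
        String hex = toHex(id);
        System.out.println(hex.substring(0, 12) + " | " + hex.substring(12) + " -> " + creationTime(id));
    }
}
```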

@thadguidry That repo looks like a rip-off of https://github.com/f4b6a3/tsid-creator/, but we don't need sortable IDs - just a rough idea of time that we can convey to the user. We should have the OpenRefine version encoded in the metadata, but I don't think the project ID is the correct place for it.

@thadguidry
Member

@tfmorris Gotcha, agree. Btw, the first paragraph of the README says that it's not a "ripoff" but a "fork", maintained because the original repo's creator no longer wants to maintain it.

@wetneb
Sponsor Member

wetneb commented May 22, 2024

> The project metadata already includes both the creation time and the last updated time.

Yes, I am aware that we store those times in the project metadata, but what I am saying is that it feels somewhat quirky to also encode that in the project id itself.

> This is intended to provide a hint to the user in case the metadata is gone/corrupted, ie "Your missing project is the one that you created on the afternoon of May 30."

To provide something like this, I'd rather store the entire project metadata in a more corruption-resilient way independent from project serialization, for instance in a SQLite database. That would have the advantage of also being able to provide the user with not just the creation date, but also the project name and other metadata fields.

3 participants