Generate deterministic project IDs correlated with creation date #6471
Comments
@wetneb What do you think of this proposal? Any preference for how the numeric range is divided? In addition to the two options above (10^3 and 2^5), we also have a PR which proposes 2^20.
@tfmorris Hmm, a bit like Snowflake ids, I like it. Incidentally, we use TSIDs in DB2Rest. There's a thread-safe Java library: https://github.com/vladmihalcea/hypersistence-tsid
I'm a bit torn on this… intuitively I'd prefer to go in the direction of completely opaque ids that carry no particular information. Users might not be aware that the project ids contain this information, and it might constitute an unwelcome information leak in certain circumstances. But not a hill I would die on…
@wetneb The project metadata already includes both the creation time and the last updated time. The proposal is intended to provide a hint to the user in case the metadata is gone or corrupted, i.e. "Your missing project is the one that you created on the afternoon of May 30." Currently we have no way of telling the user which projects are missing. (Of course, the best thing is not to lose the projects in the first place.) From a practical point of view, the current IDs are completely opaque.
Another option would be to use the newly defined UUID v7 from RFC 9562, but that would require increasing the field size from 64 to 128 bits, which breaks compatibility, so it is a non-starter until we have protocol & metadata versioning. One useful hint from that spec is that dividing fields on nibble boundaries makes them more easily human-parseable in hex format. We could also place the timestamp in the high order 48 bits to match the UUID layout, for whatever that's worth.
@thadguidry That repo looks like a rip-off of https://github.com/f4b6a3/tsid-creator/, but we don't need sortable IDs, just a rough idea of time that we can convey to the user.
We should have the OpenRefine version encoded in the metadata, but I don't think the project ID is the correct place for it.
@tfmorris Gotcha, agree. Btw, the first paragraph of the README says that it's not a "ripoff", it's a "fork" that's maintained because the creator of the original repo no longer wants to maintain it.
Yes, I am aware that we store those times in the project metadata, but what I am saying is that it feels somewhat quirky to also encode that in the project id itself.
To provide something like this, I'd rather store the entire project metadata in a more corruption-resilient way, independent of project serialization, for instance in a SQLite database. That would have the advantage of also being able to provide the user with not just the creation date, but also the project name and other metadata fields.
Currently project IDs are generated using the formula:
This suffers from the problem that there's no way to recover the original Unix timestamp to know when the project was created if the ID is the only information available. In data recovery and debugging scenarios, it can be useful to know when a project was created.
Proposed solution
Multiply the timestamp by a fixed amount and then add the random component in the lower order bits, e.g.
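A minimal sketch of the decimal variant, assuming a multiplier of 10^3 (one of the options discussed in this thread) and a random component in the range 0 to 999. The class and method names are hypothetical, and this is not OpenRefine's actual implementation:

```java
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of OpenRefine: illustrates the decimal (10^3) layout.
final class DecimalProjectId {
    // Assumption: 10^3 as the multiplier, so the last three decimal digits are random.
    static final long MULTIPLIER = 1_000L;

    // Combine a millisecond timestamp with a random component below MULTIPLIER.
    static long generate(long timestampMillis, long random) {
        if (random < 0 || random >= MULTIPLIER) {
            throw new IllegalArgumentException("random component out of range");
        }
        return timestampMillis * MULTIPLIER + random;
    }

    // Convenience overload using the current time and a fresh random component.
    static long generate() {
        return generate(System.currentTimeMillis(),
                ThreadLocalRandom.current().nextLong(MULTIPLIER));
    }

    // Recover the creation time (milliseconds since the Unix epoch) from an ID.
    static long creationMillis(long id) {
        return id / MULTIPLIER;
    }

    public static void main(String[] args) {
        long id = generate();
        System.out.println(id + " was created at " + Instant.ofEpochMilli(creationMillis(id)));
    }
}
```

With this layout, the ID read as a decimal number is the Unix timestamp in milliseconds followed by three random digits, so the creation time can be read off by eye.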
Alternatives considered
Powers of 2 are easier for shifting/masking bits, but harder for humans to process
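For comparison, a sketch of a power-of-two layout, assuming a 20-bit random field (the 2^20 option mentioned in the comments). The names are hypothetical. Decoding is a single shift or mask, but the timestamp is no longer visible in the decimal form of the ID:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of OpenRefine: illustrates a shift/mask layout.
final class ShiftedProjectId {
    // Assumption: 20 low-order bits of randomness, i.e. five hex digits.
    static final int RANDOM_BITS = 20;
    static final long RANDOM_MASK = (1L << RANDOM_BITS) - 1;

    // Place the timestamp in the high bits and the random component in the low bits.
    static long generate(long timestampMillis, long random) {
        return (timestampMillis << RANDOM_BITS) | (random & RANDOM_MASK);
    }

    // Convenience overload using the current time and a fresh random component.
    static long generate() {
        return generate(System.currentTimeMillis(), ThreadLocalRandom.current().nextLong());
    }

    // Recover the creation time (milliseconds since the Unix epoch) from an ID.
    static long creationMillis(long id) {
        return id >>> RANDOM_BITS;
    }

    // Extract the random component from an ID.
    static long randomPart(long id) {
        return id & RANDOM_MASK;
    }
}
```

Because 20 bits is a whole number of nibbles, the hex form of such an ID ends in five random hex digits and the leading digits are the timestamp in hex.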
Additional context
Throughout the system the identifier is treated as opaque, so changing how it's generated shouldn't affect anything.
The 64-bit value returned by System.currentTimeMillis() has the high order 23 bits clear, so there's plenty of headroom to shift it up without affecting longevity of the IDs.
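The headroom claim can be checked directly. The sketch below (hypothetical class name) counts the clear high-order bits of a millisecond timestamp and the largest shift that still leaves the sign bit clear, assuming IDs should stay positive:

```java
// Hypothetical helper for checking how far a millisecond timestamp can be shifted.
final class TimestampHeadroom {
    // Number of unused high-order bits in a 64-bit millisecond timestamp.
    static int clearHighBits(long timestampMillis) {
        return Long.numberOfLeadingZeros(timestampMillis);
    }

    // Largest left shift that keeps the sign bit clear, so the ID stays positive.
    static int maxShiftKeepingPositive(long timestampMillis) {
        return clearHighBits(timestampMillis) - 1;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("clear high bits: " + clearHighBits(now));
        System.out.println("max shift keeping the ID positive: " + maxShiftKeepingPositive(now));
    }
}
```

The millisecond count stays below 2^41 until 2039, so the count of clear bits remains 23 until then. A shift of 20 bits, or a multiplier of 10^3 (roughly 10 bits), fits well within that.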