Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Support Databases that don't have md5 or fast md5 (MSSQL) #51

Closed
pnelson-de opened this issue Jun 8, 2022 · 19 comments
Closed

Support Databases that don't have md5 or fast md5 (MSSQL) #51

pnelson-de opened this issue Jun 8, 2022 · 19 comments
Labels
new-db-driver Request to add a new database driver performance stale_immune Immunity to stale bot

Comments

@pnelson-de
Copy link

I was browsing through the list of supported databases on your readme and noticed MSSQL is not listed although it can be found in databases.py. Just wondering if this was intentional and also what the status of that connector is at the moment.

Cheers

@erezsh
Copy link
Contributor

erezsh commented Jun 9, 2022

Hello!

Yes, we originally intended to include MSSQL, and even wrote the code for it. But the md5 function ended up being way too slow (x100 slower than postgres iirc). We considered it unusable, so we decided to put it aside for now.

If we find a solution to the performance issue, I'll be happy to add MSSQL officially.

@sirupsen
Copy link
Contributor

sirupsen commented Jun 9, 2022

Hey @pnelson-de, thanks for writing in :)

In extension of Erez' reply, something he and I have discussed is that we will need to support a variety of databases where md5 hashing aggregation not supported. ElasticSearch is another example. For those, we may have to 'negotiate' a weaker form of hashing, such as simply a sum.

However, it's not a super safe default as it's not entirely unlikely you'll end up with the same number for segments even if the data is wrong...

However, there's already plenty of types in play that need to hash correctly to the same values across databases. 😅 Once the testing infrastructure is in place for that, we would look at databases that don't have proper support for checksumming.

@visch
Copy link

visch commented Jun 13, 2022

Guessing https://orderbyselectnull.com/2018/05/31/hashbytes-scalability/ is related / similar problem. Was looking to do an Oracle MSSQL check with this tool myself so I'll +1 this. I don't have a better idea for you all though at this moment.

Took a quick peak at https://docs.microsoft.com/en-us/sql/t-sql/functions/checksum-agg-transact-sql?view=sql-server-2017 but I'm not confident

@sirupsen
Copy link
Contributor

sirupsen commented Jun 13, 2022

@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

@sirupsen
Copy link
Contributor

@visch the problem with CHECKSUM_AGG and CHECKSUM is they don't specify the hashing algorithm. If you can find out if it's SHA1, CRC32, etc. by testing some strings against HASHBYTES on the same instance with some selects, then we should be able to use it and 'negotiate the algorithm' with the other databases. Similar to how we, as of, negotiate the precision of timestamp(n) fields in https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc .

If it's not a checksum we can reproduce https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc and add support for negotiating a common algorithm, in the case of MSSQL, it'll be sum() instead of md5(). It shouldn't be too hard.

@visch
Copy link

visch commented Jun 13, 2022

@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

Oracle DB to Microsoft SQL

@sirupsen
Copy link
Contributor

sirupsen commented Jun 13, 2022

@visch if you're up for implementing what I suggested above, let us know. You can join us in #tools-data-diff in the Locally Optimistic Slack..

If you have a supported database pair and wiling to test and provide feedback, we'd be eager to hear it :) Even just Oracle -> Oracle on the same table would be really useful for us.

@sirupsen
Copy link
Contributor

Looks like CHECKSUM uses their own fairly simple algorithm.. So not something the others would easily support. So I think we'd go with the sum() based approach for it to start.

@visch
Copy link

visch commented Jun 13, 2022

I don't have the time right now to look at this but in case someone wants to dive in

https://gitlab.com/vischous/oracle2mssql/-/blob/master/script.sh is a nice podman (replace that with docker if you want) script to spin up oracle / mssql dbs. Note there's a few extra commands you have to run with Oracle here https://gitlab.com/vischous/oracle2mssql/-/blob/master/oracle.sql

@sirupsen sirupsen changed the title Supported Databases - MS SQL Support Databases that don't have md5 or fast md5 (MSSQL) Jun 21, 2022
@joshcrichman
Copy link

Just adding my vote for SQL Server support! Best of luck solving the technical issues.

@rhysla
Copy link

rhysla commented Jul 27, 2022

+1 for MS SQL Server support

@erezsh
Copy link
Contributor

erezsh commented Jul 28, 2022

Please vote by "reacting" on the post. That makes it easier to aggregate.

@erezsh
Copy link
Contributor

erezsh commented Sep 24, 2022

Proposal for support for SQLServer

The hash: We take each column, turn its normalized form into a hex string, and then take the 16 left-most digits, 16 right-most digits, and possibly another 16 digits from its middle. We convert each of them into an int64, and together with the length of the string, we dot-product them with a constant vector of prime numbers. So length(col1) * prime1 + col1a * prime2 + col1b * prime3 + length(col2) * prime4 + col2a * prime5 + ... and so on.

Cons:

  • Might miss changes in long text values
  • Unclear what are the chances of a collision. They are probably pretty low, but not as low as md5.
  • The run speed will be linear to the amount of columns being compared.

Pros:

  • Should be very fast, and work for all databases (on sqlserver, 25m rows of 1 column should take around 10 seconds)
  • Should be good enough to catch any change in most fields, like floats and like dates.
  • If we want to reduce collisions even further, we can use a better "hashing function" at the expense of some performance. For example, something like col*(col+3) % prime , aka the Knuth variant. These are details we can easily change and test once the rest is implemented.

To summarize, I think it might be good enough for actual use, but it's hard to measure exactly how likely it is to break down.
If anyone has thoughts or ideas about this, I'd love to hear them.

@Elliot2718
Copy link

My team would be very interested in having SQL Server support. I don't have background in hashing functions, so can't really speak to the best approach there, but a colleague and I would be willing to spend some time working on this. @erezsh should we chat first? Or should we just give it a go?

@erezsh
Copy link
Contributor

erezsh commented Jan 12, 2023

Hi @Elliot2718 ,

If you have any questions about the proposal I wrote, or how to approach implementing it, we can set up a chat and get you up to speed.

Keep in mind that the implementation should be done by creating a new Dialect Mixin that is compatible with https://github.com/datafold/sqeleton/ .

For example, here is the MD5 mixin implementation for postgres: https://github.com/datafold/sqeleton/blob/master/sqeleton/databases/postgresql.py#L29

And here is how it's included in data-diff: https://github.com/datafold/data-diff/blob/master/data_diff/databases/postgresql.py#L5

The implementation itself can reside in either sqeleton or data-diff, though I think for consistency it's better to put it in Sqeleton.

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label Jun 8, 2023
@joshcrichman
Copy link

Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

@github-actions github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Jun 8, 2023
@dlawin
Copy link
Contributor

dlawin commented Jun 26, 2023

Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

Yeah this should definitely remain open. I'm unsure if/when we as maintainers would have the bandwidth to work on this, but very open to community contributions in the meantime.

@dlawin dlawin added stale_immune Immunity to stale bot and removed triage labels Jul 10, 2023
@glebmezh
Copy link
Contributor

Hi all!

I'm sorry for the delay in following up on this.

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package.

We also added full support for diffing MS SQL Server 2019+ with other databases.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
new-db-driver Request to add a new database driver performance stale_immune Immunity to stale bot
Projects
Status: Done
Development

No branches or pull requests

9 participants