Support Databases that don't have md5 or fast md5 (MSSQL) #51

pnelson-de · 2022-06-08T18:15:04Z

I was browsing through the list of supported databases on your readme and noticed MSSQL is not listed although it can be found in databases.py. Just wondering if this was intentional and also what the status of that connector is at the moment.

Cheers

erezsh · 2022-06-09T08:10:02Z

Hello!

Yes, we originally intended to include MSSQL, and even wrote the code for it. But the md5 function ended up being way too slow (x100 slower than postgres iirc). We considered it unusable, so we decided to put it aside for now.

If we find a solution to the performance issue, I'll be happy to add MSSQL officially.

sirupsen · 2022-06-09T09:18:37Z

Hey @pnelson-de, thanks for writing in :)

To extend Erez's reply: something he and I have discussed is that we will need to support a variety of databases where md5 hashing aggregation is not supported. ElasticSearch is another example. For those, we may have to 'negotiate' a weaker form of hashing, such as simply a sum.

It's not a super safe default, though: it's not unlikely that a segment ends up with the same sum on both sides even if the data is wrong...
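To illustrate the concern, here is a minimal sketch (not data-diff's actual code) comparing a plain sum with an md5-based checksum over two small segments. The helper names and the choice to keep the last 15 hex digits of each digest are assumptions made for the example.

```python
import hashlib

def sum_checksum(values):
    """Weak checksum: plain sum of the integer values in a segment."""
    return sum(values)

def md5_checksum(values):
    """Stronger checksum: sum of the last 15 hex digits (60 bits) of each value's md5."""
    total = 0
    for v in values:
        digest = hashlib.md5(str(v).encode("utf-8")).hexdigest()
        total += int(digest[-15:], 16)
    return total

# Two segments with different data but the same plain sum.
segment_a = [1, 4]
segment_b = [2, 3]

sums_collide = sum_checksum(segment_a) == sum_checksum(segment_b)
md5_collide = md5_checksum(segment_a) == md5_checksum(segment_b)
```

Here the plain sum cannot tell the two segments apart, while the md5-based checksum can.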

However, there are already plenty of types in play that need to hash correctly to the same values across databases. 😅 Once the testing infrastructure is in place for that, we would look at databases that don't have proper support for checksumming.

visch · 2022-06-13T14:40:09Z

Guessing https://orderbyselectnull.com/2018/05/31/hashbytes-scalability/ is a related / similar problem. Was looking to do an Oracle MSSQL check with this tool myself so I'll +1 this. I don't have a better idea for you all at this moment, though.

Took a quick peek at https://docs.microsoft.com/en-us/sql/t-sql/functions/checksum-agg-transact-sql?view=sql-server-2017 but I'm not confident

sirupsen · 2022-06-13T16:03:49Z

@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

sirupsen · 2022-06-13T16:14:16Z

@visch the problem with CHECKSUM_AGG and CHECKSUM is that they don't specify the hashing algorithm. If you can find out whether it's SHA1, CRC32, etc. by testing some strings against HASHBYTES on the same instance with some selects, then we should be able to use it and 'negotiate the algorithm' with the other databases. This is similar to how we now negotiate the precision of timestamp(n) fields in https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc .

If it's not a checksum we can reproduce, we can build on https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc and add support for negotiating a common algorithm. In the case of MSSQL, it'll be sum() instead of md5(). It shouldn't be too hard.
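A rough sketch of what 'negotiating the algorithm' could look like: each database declares the checksum algorithms it can run efficiently, and the differ picks the strongest one both sides share. The capability table, the preference order, and the function name below are hypothetical and not part of data-diff.

```python
# Hypothetical preference order, strongest algorithm first.
PREFERENCE = ["md5", "sum"]

# Hypothetical capability table. MSSQL is listed without md5 on the
# assumption that its md5 is too slow to be usable, as described above.
SUPPORTED = {
    "postgresql": {"md5", "sum"},
    "oracle": {"md5", "sum"},
    "mssql": {"sum"},
}

def negotiate_algorithm(db_a: str, db_b: str) -> str:
    """Return the strongest checksum algorithm supported by both databases."""
    common = SUPPORTED[db_a] & SUPPORTED[db_b]
    for algorithm in PREFERENCE:
        if algorithm in common:
            return algorithm
    raise ValueError(f"No shared checksum algorithm for {db_a} and {db_b}")

oracle_vs_mssql = negotiate_algorithm("oracle", "mssql")
oracle_vs_postgres = negotiate_algorithm("oracle", "postgresql")
```

With this table, an Oracle to MSSQL diff falls back to sum(), while Oracle to PostgreSQL keeps md5().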

visch · 2022-06-13T16:19:35Z

> @visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

Oracle DB to Microsoft SQL

sirupsen · 2022-06-13T16:24:45Z

@visch if you're up for implementing what I suggested above, let us know. You can join us in #tools-data-diff in the Locally Optimistic Slack.

If you have a supported database pair and are willing to test and provide feedback, we'd be eager to hear it :) Even just Oracle -> Oracle on the same table would be really useful for us.

sirupsen · 2022-06-13T16:40:44Z

Looks like CHECKSUM uses their own fairly simple algorithm, so it's not something the others would easily support. I think we'd go with the sum()-based approach for it to start.

visch · 2022-06-13T16:55:17Z

I don't have the time right now to look at this but in case someone wants to dive in

https://gitlab.com/vischous/oracle2mssql/-/blob/master/script.sh is a nice podman (replace that with docker if you want) script to spin up oracle / mssql dbs. Note there are a few extra commands you have to run with Oracle here: https://gitlab.com/vischous/oracle2mssql/-/blob/master/oracle.sql

joshcrichman · 2022-06-28T22:49:39Z

Just adding my vote for SQL Server support! Best of luck solving the technical issues.

rhysla · 2022-07-27T23:34:47Z

+1 for MS SQL Server support

erezsh · 2022-07-28T07:01:18Z

Please vote by "reacting" on the post. That makes it easier to aggregate.

erezsh · 2022-09-24T07:25:28Z

Proposal for support for SQLServer

The hash: We take each column, turn its normalized form into a hex string, and then take the 16 left-most digits, 16 right-most digits, and possibly another 16 digits from its middle. We convert each of them into an int64, and together with the length of the string, we dot-product them with a constant vector of prime numbers. So length(col1) * prime1 + col1a * prime2 + col1b * prime3 + length(col2) * prime4 + col2a * prime5 + ... and so on.
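A minimal Python sketch of the scheme described above, to make the arithmetic concrete. It assumes the normalized column value arrives as a string, uses only the left-most and right-most 16 hex digits (no middle chunk), and wraps the result to 64 bits. The prime vector and all function names are arbitrary choices for illustration, not an agreed design.

```python
from typing import List, Sequence

# Arbitrary constant vector of primes (assumption: any fixed primes would do).
PRIMES = [101, 103, 107, 109, 113, 127, 131, 137, 139]
MASK64 = (1 << 64) - 1

def column_terms(value: str) -> List[int]:
    """Return [length, left chunk, right chunk] for one normalized column value."""
    hex_str = value.encode("utf-8").hex()
    left = int(hex_str[:16] or "0", 16)    # 16 left-most hex digits
    right = int(hex_str[-16:] or "0", 16)  # 16 right-most hex digits
    return [len(hex_str), left, right]

def row_hash(columns: Sequence[str], primes: Sequence[int] = PRIMES) -> int:
    """Dot-product the per-column terms with the prime vector, wrapped to 64 bits."""
    terms = [t for col in columns for t in column_terms(col)]
    if len(terms) > len(primes):
        raise ValueError("Not enough primes for this many columns")
    return sum(t * p for t, p in zip(terms, primes)) & MASK64

# A change in a short value is detected.
short_change_detected = row_hash(["abc", "2022-09-24"]) != row_hash(["abd", "2022-09-24"])

# A change in the middle of a long value is missed, because only the
# first and last 8 bytes and the length contribute to the hash.
long_original = "a" * 40
long_modified = "a" * 20 + "X" + "a" * 19
long_change_missed = row_hash([long_original]) == row_hash([long_modified])
```

The second comparison shows the first drawback listed below: edits in the middle of a long text value do not change the hash unless a middle chunk is also sampled.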

Cons:

Might miss changes in long text values
It's unclear what the chances of a collision are. They are probably pretty low, but not as low as with md5.
The run time will be linear in the number of columns being compared.

Pros:

Should be very fast, and work for all databases (on sqlserver, 25m rows of 1 column should take around 10 seconds)
Should be good enough to catch any change in most fields, such as floats and dates.
If we want to reduce collisions even further, we can use a better "hashing function" at the expense of some performance. For example, something like col*(col+3) % prime, aka the Knuth variant. These are details we can easily change and test once the rest is implemented.

To summarize, I think it might be good enough for actual use, but it's hard to measure exactly how likely it is to break down.
If anyone has thoughts or ideas about this, I'd love to hear them.
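As a small illustration of the col*(col+3) % prime idea mentioned above, the sketch below compares a plain sum with a sum of mixed values. The prime (the Mersenne prime 2^61 - 1) and the function names are assumptions chosen for the example.

```python
PRIME = (1 << 61) - 1  # Mersenne prime, chosen arbitrarily for this sketch

def knuth_mix(x: int) -> int:
    """Mix a single value as x * (x + 3) mod PRIME."""
    return (x * (x + 3)) % PRIME

def plain_sum(values):
    """Aggregate a segment with a plain sum."""
    return sum(values)

def mixed_sum(values):
    """Aggregate a segment by summing the mixed values, modulo PRIME."""
    return sum(knuth_mix(v) for v in values) % PRIME

segment_a = [5, 9]
segment_b = [6, 8]

plain_collides = plain_sum(segment_a) == plain_sum(segment_b)
mixed_collides = mixed_sum(segment_a) == mixed_sum(segment_b)
```

Because the mixing step is non-linear, offsetting errors such as +1 in one row and -1 in another no longer cancel out in the aggregate.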

Elliot2718 · 2023-01-11T21:28:40Z

My team would be very interested in having SQL Server support. I don't have background in hashing functions, so can't really speak to the best approach there, but a colleague and I would be willing to spend some time working on this. @erezsh should we chat first? Or should we just give it a go?

erezsh · 2023-01-12T00:07:09Z

Hi @Elliot2718 ,

If you have any questions about the proposal I wrote, or how to approach implementing it, we can set up a chat and get you up to speed.

Keep in mind that the implementation should be done by creating a new Dialect Mixin that is compatible with https://github.com/datafold/sqeleton/ .

For example, here is the MD5 mixin implementation for postgres: https://github.com/datafold/sqeleton/blob/master/sqeleton/databases/postgresql.py#L29

And here is how it's included in data-diff: https://github.com/datafold/data-diff/blob/master/data_diff/databases/postgresql.py#L5

The implementation itself can reside in either sqeleton or data-diff, though I think for consistency it's better to put it in Sqeleton.
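For readers unfamiliar with the pattern, here is a self-contained sketch of how a checksum mixin can plug into a dialect. The class names are hypothetical stand-ins and are not the real sqeleton classes; check the linked files for the actual interfaces. The SQL shown is a common PostgreSQL idiom for turning hex digits into a bigint, and keeping 15 hex digits of the md5 is an assumption of this sketch.

```python
from abc import ABC, abstractmethod

MD5_HEXDIGITS = 32       # an md5 digest has 32 hex digits
CHECKSUM_HEXDIGITS = 15  # assumption: keep 15 hex digits (60 bits) so it fits a bigint

class AbstractMD5Mixin(ABC):
    """Hypothetical interface: a dialect that can render md5-as-integer SQL."""

    @abstractmethod
    def md5_as_int(self, s: str) -> str:
        """Return a SQL expression computing an integer checksum of expression `s`."""

class PostgresMD5Mixin(AbstractMD5Mixin):
    """Hypothetical PostgreSQL implementation of the mixin."""

    def md5_as_int(self, s: str) -> str:
        start = 1 + MD5_HEXDIGITS - CHECKSUM_HEXDIGITS  # 1-based substring offset
        bits = CHECKSUM_HEXDIGITS * 4
        return f"('x' || substring(md5({s}), {start}))::bit({bits})::bigint"

class PostgresDialect(PostgresMD5Mixin):
    """Hypothetical dialect that gains checksum support by including the mixin."""
    name = "PostgreSQL"

sql = PostgresDialect().md5_as_int("col1")
```

A SQL Server mixin following the proposal would expose a comparable method that renders the sum-based expression instead of md5.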

github-actions · 2023-06-08T06:31:49Z

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

joshcrichman · 2023-06-08T07:09:50Z

Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

dlawin · 2023-06-26T16:31:36Z

> Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

Yeah this should definitely remain open. I'm unsure if/when we as maintainers would have the bandwidth to work on this, but very open to community contributions in the meantime.

glebmezh · 2024-05-17T13:34:55Z

Hi all!

I'm sorry for the delay in following up on this.

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package.

We also added full support for diffing MS SQL Server 2019+ with other databases.

-Gleb

sirupsen changed the title ~~Supported Databases - MS SQL~~ Support Databases that don't have md5 or fast md5 (MSSQL) Jun 21, 2022

sirupsen added the performance label Jun 21, 2022

erezsh mentioned this issue Jun 23, 2022

Let the differ choose a shared hashing algorithm #103

Closed

sirupsen mentioned this issue Jun 25, 2022

How soon you will support Microsoft SQL Server & Azure SQL Server??? #108

Closed

nolar mentioned this issue Jul 6, 2022

Support for NoSQL/document-based databases? #152

Closed

nolar mentioned this issue Jul 18, 2022

Add support for Microsoft SQL Server 2016+ #168

Closed

erezsh added the new-db-driver Request to add a new database driver label Jul 20, 2022

github-actions bot added the stale Issues/PRs that have gone stale label Jun 8, 2023

github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Jun 8, 2023

dlawin added stale_immune Immunity to stale bot and removed triage labels Jul 10, 2023

dlawin mentioned this issue Aug 9, 2023

Add support for MS SQL #656

Closed

dlawin mentioned this issue Oct 9, 2023

Support MSSQL for cross-database diffs #696

Merged

3 tasks

glebmezh closed this as completed May 17, 2024
