Distributed Cloud Columnar Databases: Vertica Eon vs Snowflake

February 2021

Introduction

This post compares two successful databases, Vertica and Snowflake, focusing on their major offerings and internal implementations. The main comparisons are based on these technical materials: Vertica, Vertica Eon, Vertica Eon Talk 2020, and Snowflake, Snowflake Architecture Talk 2019, Snowflake Talk at CIDR 2021; many other related materials will be mentioned or cited throughout the writing. The content of this blog reflects my own understanding of the materials and may miss some key designs. All feedback to this contact is welcome and appreciated.

In short, as Distributed Cloud RDBMSs(1) able to scale their compute up (use more powerful nodes) and out (add more nodes) while maintaining strong consistency and availability, Vertica and Snowflake offer something quite unique in the market. Let us dig into the details.

Similarities

Vertica and Snowflake provide similar Cloud Databases that support separation of storage and compute in which their data and metadata are shared and stored persistently on durable and available storage such as Amazon S3, and their compute clusters are formed elastically on demand to write and read data to and from the shared storage.

Their Cloud databases are Distributed RDBMSs running SQL on an MPP (Massively Parallel Processing) platform that complies with ACID (Atomic, Consistent, Isolated, Durable) transactions and supports fault tolerance, self-recovery, and self-backup. Although both databases work well for writing and reading data, they are referred to as Data Warehouses because of their focus on OLAP (Online Analytical Processing), where reads happen far more often than writes; they therefore leverage Column Store (*1) technologies in their table storage designs.

The two databases are designed to support historical queries that can access deleted data (*2). This is possible because data files are immutable after they are first written. A DELETE is just a new write that stores the positions of the deleted tuples, and at SELECT time that position information is used to eliminate deleted rows before returning results to users. An UPDATE is a DELETE followed by an INSERT. Moreover, both provide zero-copy clones of tables in SQL (*3) that duplicate a table for users without copying anything. This is possible because the persistent data is immutable: both the old and the new table use the same data files from before the clone time, and data modifications after that are managed by each table's metadata separately.
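
As a small, hedged illustration of what a zero-copy clone looks like in each system's SQL (the table names are hypothetical, and the Vertica meta-function may vary by version):

```sql
-- Snowflake: clone a table without copying its data files
CREATE TABLE sales_dev CLONE sales;

-- Vertica: a similar lightweight copy that shares the existing storage containers
-- (hedged; COPY_TABLE is the meta-function I believe provides this, check your version)
SELECT COPY_TABLE('public.sales', 'public.sales_dev');
```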

Both companies see the popularity of semi-structured data (e.g. JSON, XML, and AVRO) that needs a non-traditional workload, ELT (Extract, Load, and Transform)(2), and offer features to load semi-structured data first and, when needed, discover and transform the structured part of that data automatically or through UDFs (User Defined Functions) for better query performance (*4).
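
For example, in Snowflake the usual ELT pattern is to land raw JSON in a VARIANT column and extract the structured part later (a hedged sketch with hypothetical stage, table, and field names; Vertica offers a comparable flow through its flex tables):

```sql
-- Load first: store raw JSON documents in a VARIANT column
CREATE TABLE raw_events (v VARIANT);
COPY INTO raw_events FROM @events_stage FILE_FORMAT = (TYPE = 'JSON');

-- Transform later: pull out the structured fields when they are needed
SELECT v:user.id::STRING      AS user_id,
       v:event_time::TIMESTAMP AS event_time
FROM raw_events
WHERE v:event_type::STRING = 'purchase';
```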

As Cloud DB providers, they continue building and improving their data security features such as authentication, authorization, and encryption in flight and at rest (*5) and ($1).

The ecosystems of both databases are very diverse, including integration with Kafka for data streaming (*6), Spark connectors for moving data between their databases and the Spark ecosystem (*7), SDKs for different programming languages (*8), driver connectors (*9), and BI integration (*10). I will not dive into their details, but these two talks, Snowflake's Best Practice for Data Engineers and Vertica's Coolest Features, summarize those offerings well.

Here are a few key technologies of their internal implementations:

  • ACID transactions are handled via snapshot isolation and MVCC (multi-version concurrency control), in which queries in a transaction only see a snapshot of the immutable data files as of the time the transaction started, and a copy of every change to metadata or data is preserved for some duration after commit.
  • Columns are not only organized separately but also sorted and encoded(3), which is key to fast scans and efficient query execution. Furthermore, by partitioning tables horizontally and keeping per-partition min and max values for each column, their execution engines know to scan only the partitions that may include the needed data (a small SQL illustration follows this list).
  • Query plans are built using many obvious heuristics, but the non-obvious factors are costed, compared, and pruned by a cost model using statistics of the data. Predicates and aggregation operators are pushed down under joins to prune data as early as possible; however, the execution engine evaluates those push-down operators while executing them and disables them as soon as it discovers they provide no benefit (and may even lead to performance regressions)(*11) and ($2).
  • Join SIP (Sideways Information Passing) and transitive predicate techniques are implemented to prune unnecessary join data at scan steps, way before the join actually happens(*12).
  • Even though most optimizations are performed in Query Optimizer, a few are decided on-the-fly at execution time.
  • Queries are executed in a vectorized and pipelined fashion to improve cache efficiency and avoid storing intermediate results. Column data is kept in its encoded format for as long as possible in the plan to reduce the amount of data processed.
  • Execution counters are logged for every executed operator, both for comparison with their corresponding estimated values and for further query-plan diagnosis. When needed, there are techniques to reproduce customer query plans in-house without their metadata/statistics in order to work on improvements.
  • Tests are critical in distributed and multi-tenant Databases and at a high level can be categorized into:
    • Unit Tests that are tightly coupled to specific features developers implement.
    • Functional tests that integrate different SQL clauses into all possible and sophisticated SQL queries.
    • Stress tests that simulate highly distributed and multi-tenant systems for running intensive workloads that conflict with each other (e.g. one user deletes data that is used by others)($3).
    • Scaling tests that repeatedly scale the clusters up and down, such as adding or removing nodes and growing or shrinking cache sizes.
    • Performance tests that measure the performance of specific workloads on various cluster configurations.
    • Regression tests for each of the above categories to ensure no regressions in functionality and performance when new features are added.
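
To make the min/max partition pruning point above concrete, here is a small, hedged illustration (the table, columns, and dates are hypothetical): because each partition's metadata records the minimum and maximum value of every column, a selective predicate lets the engine skip most partitions without reading them.

```sql
-- Data is written roughly in date order, so each partition covers a narrow date range.
-- With per-partition min/max metadata, this query only touches partitions whose
-- [min(sale_date), max(sale_date)] range overlaps March 2020; all others are skipped.
SELECT store_id, SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2020-03-01' AND '2020-03-31'
GROUP BY store_id;
```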

Differences

While Vertica offers both a Cloud DB (Eon mode) and an On-Premise DB (Enterprise mode), Snowflake is a DBaaS (Database as a Service) and only offers the Cloud option. Snowflake focuses on ease of use and makes everything simple for users: users pick their general needs and Snowflake provides suitable configurations for them. Vertica focuses on query performance and lets users optimize their clusters and storage as needed. Each detailed comparison below is prefixed with [S] or [V] to indicate who is more mature there, Snowflake or Vertica respectively.

[S] As a DBaaS, Snowflake provides a pretty and friendly UI for almost all actions, including setting up clusters, loading data, running queries, monitoring ongoing processes, and reporting results.(#1) The UI also extends to user collaboration, query plan debugging & diagnosing, providing feedback, and getting support. At the time of this writing, Vertica is not yet a DBaaS but does provide self-managed services such as ElasticDW and works well with different UIs such as DBVisualizer and its own Management Console to give users a nice experience(#2).

[S] Snowflake's architecture includes three layers: operation service (cloud service), execution compute (VW or Virtual Warehouse), and storage, which I think is a very smart design to divide and conquer the three major components of an elastic and multi-tenant cluster: operation, execution, and storage. The operation layer includes metadata & transaction management, cluster infrastructure management, the query optimizer, security, and collaboration & data sharing. Vertica's Cloud DB evolved from its On-Premise one and implements two layers: operation & execution, and storage. Even though coupling operation and execution works best for a distributed shared-nothing on-premise system, a center point to manage the whole cluster is needed for a cloud DB. Hence, Vertica made a very quick design decision to use one of its sub-clusters (equivalent to Snowflake's VWs) as a primary one that not only runs queries like other sub-clusters but is also responsible for whole-cluster operation. I think this is a great turn-around architecture that serves as an intermediate path for Vertica to completely split its operation & execution and support more cloud-oriented services. To the best of my knowledge of Vertica's internal implementation, the principal designs of its operation and execution are not tightly coupled, and it won't take much effort to split them.

Caching is quite interesting in both:

  • [S] At the operation service layer, Snowflake maintains a hot cache that captures query results for a configurable period, usually 24 hours. If the same or a different user runs that same query again and the Query Optimizer discovers the partitions participating in the query are the same (which means the data has not been modified since the last run), the cached results are returned immediately without running the query. This feature is useful in a collaborative workplace where many users look for the same reports(*13) and (#3).
  • [V] At the execution compute layer, both Snowflake and Vertica offer a warm cache that stores data on the compute nodes' local disks to avoid reading from their slow object stores such as S3. In Snowflake, this cache is hidden from users, only captures data of recently run queries, and disappears when that compute is suspended. In Vertica, this cache is called the depot, whose size, caching policies, and other configurations, such as whether it persists even after the compute is shut down, can be specified by users(*14). A node with an empty depot can accept queries, but the depot is usually filled to resemble the depots of its peers right after the node joins the cluster and before running queries. The depot enables heavy-I/O queries to run several times faster, which is important for customers who only have a few hours after midnight every day to produce heavy-duty reports. A small sketch of both caches follows this list.
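
As a hedged illustration of both caches (the table name is hypothetical, and the exact parameter and meta-function names may differ across versions):

```sql
-- Snowflake: the result cache is on by default; it can be toggled per session
-- (USE_CACHED_RESULT is the session parameter I believe controls it)
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Vertica: pin a hot table into the depot so it stays on local disk across queries
-- (hedged; SET_DEPOT_PIN_POLICY_TABLE is the meta-function I believe provides this)
SELECT SET_DEPOT_PIN_POLICY_TABLE('public.sales');
```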

[V] Even though Vertica and Snowflake both store a table as a set of separate files of columns in certain sort orders and encodings, table design in Vertica goes beyond that. A Vertica table is only logical. Its actual physical storage is represented by one or many projections, each of which can be sorted and encoded differently to support the needs of different queries. The projections are designed automatically by the Vertica DBDesigner(*15) & ($4) based on a specific customer's data, query workload, and storage budget. Because there can be many projection choices for a table, when a query is issued, the Vertica Query Optimizer selects the best set of projections for that query. This is a unique offering of Vertica that customizes the storage design for a specific customer workload to gain query performance and throughput.
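
A hedged sketch of what a projection definition looks like (the table and columns are hypothetical): each projection chooses its own column set, sort order, and per-column encodings, and a table can have several projections tuned for different queries.

```sql
-- One physical design of the logical table 'sales', tuned for date-range scans:
-- columns sorted by sale_date then store_id, with RLE encoding on low-cardinality columns.
CREATE PROJECTION sales_by_date
(
    sale_date ENCODING RLE,
    store_id  ENCODING RLE,
    amount
)
AS SELECT sale_date, store_id, amount
   FROM sales
   ORDER BY sale_date, store_id;
```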

Compute is a key component of the cloud DB architecture, and both Snowflake and Vertica have their own advantages. A compute consists of a set of nodes that run many tasks of a query in parallel, run many queries in parallel, or both. A compute in Snowflake is called a VW (Virtual Warehouse) and in Vertica a sub-cluster. Both can handle isolated workloads without talking to their peer VWs or sub-clusters, and hence VWs or sub-clusters can be added or removed as needed in Snowflake or Vertica respectively to achieve linear throughput scaling. This is why both Vertica and Snowflake are famous for their elastic throughput scaling, which scales the number of queries running in a defined time window such as a second or a minute(*16). So what are the differences between them?

  • [S] Since Snowflake splits its operation and execution layers, a node can be added to a VW in a blink while a query is running and can immediately share the workload of that running query to have it finish sooner. This feature is called elastic crunch scaling, which scales the runtime of a query by adding/removing nodes. Adding a node in Vertica is a bit slower due to some operation overhead, and the node cannot join already-running queries, either because of the way the query plan is built or the way a sub-cluster is designed. Vertica, however, does offer elastic crunch scaling automatically when the number of nodes in the sub-cluster is a multiple of the shard count(*17).
  • [V] While the size of a Vertica sub-cluster can be flexible for elastic crunch scaling, the common practice is to make its size equal to or a multiple of its number of shards, which is the mechanism that defines the degree of inter-node parallelism in Vertica. Data of all large projections is horizontally segmented on a specified column (or set of columns) into the number of shards. Rows of different tables that land on the same shard have the same range of values in their segmentation columns. If these columns are joined or grouped in a query, the execution will be fast because their data is already co-located on the same node, with no need to transfer data between nodes(*18) (see the sketch after this list). This is another unique design of Vertica that clearly pushes its query performance.
  • [S or V?] In summary, both Snowflake and Vertica offer almost linear elastic throughput scaling. Vertica has segment shard setup to push query performance but its elastic crunch scaling has some restrictions. Snowflake has no design of segment shards to push performance for complicated queries but does benefit from immediate crunch scaling. I am curious to see whether the crunch scaling in Snowflake reaches and then passes the performance of Vertica's segment shard setup or it saturates even before hitting that level of performance.
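
The hedged sketch below (hypothetical tables and columns) shows how segmenting two projections on the same join key co-locates their matching rows on the same shard, so the join needs no inter-node data transfer:

```sql
-- Both projections are segmented on customer_id, so rows with the same
-- customer_id land on the same shard (and therefore the same node).
CREATE PROJECTION customers_p AS
    SELECT * FROM customers
    SEGMENTED BY HASH(customer_id) ALL NODES;

CREATE PROJECTION orders_p AS
    SELECT * FROM orders
    SEGMENTED BY HASH(customer_id) ALL NODES;

-- This join can run entirely locally on each node, with no data shuffle.
SELECT c.customer_id, COUNT(*)
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
```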

[V] As mentioned earlier, Snowflake offers a single mode, a pure Cloud DB implementing the major technology of separation of storage and compute. Its cloud providers include Amazon AWS, MS Azure, and Google Cloud. Vertica, on the other hand, offers three different DB modes:

  1. Vertica Eon mode on Cloud: this offering is most like Snowflake's, and the comparisons so far with Snowflake are about this mode. At the time of this writing, this mode works on Amazon AWS and Google Cloud Platform (GCP).
  2. Vertica Eon mode On-Premise: This mode also offers the separation of storage and compute like the one above, but its compute and storage do not have to be on public clouds. The compute can be any cluster customers choose, and the storage can be Pure Storage FlashBlade, MinIO, or HDFS(*19).
  3. Vertica On-Premise: This is the original Shared-Nothing Distributed Data Warehouse of Vertica, which can be installed on a customer's own cluster or any cluster on clouds such as Amazon AWS, MS Azure, and Google Cloud. Note that because this is a shared-nothing cluster, each node has to include both compute (CPU & RAM) and enough disk space for persistent data.

[S] Collaboration between different accounts is one of the key offerings of Snowflake, conveyed by its Secure Data Sharing feature, through a producer-consumer model without copying data. The producer creates sharing metadata that specifies what to share and who can consume it. Then consumers create corresponding metadata to let Snowflake know they want to access the shared objects. When the protocol is established, the consumers can access the shared data and join it with their own. To my knowledge, this feature is not available in Vertica at this time.
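
A hedged sketch of the producer/consumer flow in Snowflake SQL (the account, database, and share names are hypothetical):

```sql
-- Producer: create a share, grant access to the objects to publish, and add a consumer account
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct;

-- Consumer: mount the shared objects as a read-only database and join them with local data
CREATE DATABASE shared_sales FROM SHARE producer_acct.sales_share;
SELECT c.region, COUNT(*)
FROM shared_sales.public.orders o
JOIN my_db.public.customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```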

[V] Besides integrating with AI (Artificial Intelligence) and ML (Machine Learning) tools, Vertica offers in-database ML, in which the ML functions are written in SQL and executed just like a Vertica query, with all ML calculations included in the query data pipeline(*21) & (#4). There is no need to export and re-import data anywhere. Snowflake does not have this capability yet, but from their CIDR 2021 talk linked above, it seems they are working on delivering similar capabilities.
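
A hedged sketch of what in-database ML looks like in Vertica SQL (the table, column, and model names are hypothetical, and the exact function signatures may differ by version):

```sql
-- Train a linear regression model directly on table data, inside the database
SELECT LINEAR_REG('house_price_model', 'houses', 'price', 'sqft, num_rooms');

-- Score new rows in the same query pipeline, with no export/import step
SELECT house_id,
       PREDICT_LINEAR_REG(sqft, num_rooms
                          USING PARAMETERS model_name = 'house_price_model') AS predicted_price
FROM new_houses;
```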

[V] Vertica offers more mature data types such as Complex Types and Time Series (TS), and diverse materialized views such as Flattened Tables (FT) and Live Aggregate Projections (LAP) (*22). Snowflake, however, lets you UNDROP a table within 24 hours of it being dropped (by accident).
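
For instance, recovering from an accidental drop in Snowflake is a one-liner (a hedged sketch; the table name is hypothetical and the retention window depends on the account's Time Travel settings):

```sql
DROP TABLE orders;     -- oops
UNDROP TABLE orders;   -- restore it, as long as it is still within the retention period

-- Time Travel also allows querying the table as it was an hour ago
SELECT COUNT(*) FROM orders AT (OFFSET => -3600);
```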

[S] Snowflake continues scaling its Operation/Cloud Service layer to manage and support a single cluster of unlimited VWs that it calls the Global Data Mesh, which supports global communication and parallel data movement for critical tasks such as cross-region data replication.

Here are a few other differences of their internal implementations:

  • [S or V?] Snowflake's physical storage is hybrid columnar. A micro-partition is a file containing many columns, and the file header includes pointers to the start of each column's data. Vertica's physical storage is purely columnar: each column is stored in its own file and virtually grouped into a container, which is equivalent to Snowflake's micro-partition. Throughout this writing, partition is used when talking about Snowflake's micro-partition and Vertica's container respectively.

  • Snowflake stores partition metadata (min, max, count, null count, distinct count) at its operation service layer, and the query optimizer is responsible for using that metadata to prune the corresponding partitions. Vertica's equivalent partition metadata is named ValIndex, and the query optimizer knows nothing about it. The query plan does not need to say which partitions to execute but only keeps the predicate expressions; at execution time, the execution engine prunes partitions based on those expressions. This difference results from the design difference mentioned earlier on splitting operation from execution in Snowflake, and can be further evaluated as follows:

    • [V] If partition pruning happens in the query optimizer, its plan has to carry the partition IDs to be executed. Furthermore, physical plans at execution time are always better parallelized and multi-threaded than the preliminary logical plans in the optimizer; hence partition pruning at execution is a lot more efficient.
    • [S] If the optimizer knows the partition metadata, the statistics for building the plans will be more accurate. Moreover, since the optimizer can access partition metadata, it knows early whether the query results are cached and there is no need to rerun the query.
    • I would say that combining the two choices above, doing partition pruning in both the query optimizer and the execution engine, would achieve optimal performance. There is one more major difference between the two that could also be combined. It seems Snowflake's optimizer only uses partition metadata as its statistics, while Vertica's optimizer does not use partition metadata but builds a statistical histogram of 100 buckets using random samples of the actual data, where each bucket includes the same metadata as the partition metadata. If the number of partitions is larger than the number of histogram buckets, the actual partition metadata is more accurate; but if there are only a few partitions over a large set of well-encoded data, the histogram provides more fine-grained statistics. So I would suggest building the histogram from partition metadata first and, if the number of partitions is small, further improving it using random samples of the data.
  • [V] In an execution compute of many nodes, non-co-located data of a running query with joins and group-bys often needs to be redistributed between nodes in order to produce the right results. Snowflake lets its execution compute decide when to do that, while Vertica decides it at planning time, before sending the plan to the execution engine. This is a result of the design differences mentioned earlier, in which Vertica sets up segment shards but Snowflake does not.

  • Query profiles(*23) for diagnosing are important and worth comparing:

    • [S] Snowflake provides a pretty WebUI for displaying query profiles.
    • [V] Vertica has a robust monitoring schema for every profile detail (see the example after the end of this list).
    • Similarly, I would suggest combining the two above to display profile information in an easy-to-read and navigable way.
  • [S or V?] A well-known issue with any query optimizer is that its "optimal" generated plan is sometimes far from optimal. It seems Snowflake provides a drag-and-drop way in its WebUI to reconstruct a bad customer query plan for diagnosing, for which I have not found any related documentation or demos yet. Vertica provides a Directed Query feature(*24) that enables one to fully rebuild the details of any query plan, which not only lets customers retain their query plans after upgrading if preferred but also lets them build their own plans. Although I cannot compare the two because of the missing Snowflake information, I would suggest that Vertica provide a UI for its rich Directed Query feature. Note that this feature exists to enrich their offerings and usability as needed; I believe both Snowflake and Vertica do not recommend that customers build their own query plans, but rather use the ones generated by the query optimizer.

  • [S or V?] Query plans are built bottom-up in Vertica but top-down in Snowflake. Both are well-known research techniques with their own advantages.

  • [V] Snowflake uses a push strategy to pipeline data through a query plan. Vertica implemented a push technique at first but then converted to a pull strategy to get better performance and resource sharing.

  • [S or V?] Snowflake stores its metadata in a key-value store, FoundationDB. Vertica implements its own metadata format and storage.
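
As an example of Vertica's monitoring schema mentioned above (hedged; the V_MONITOR table and column names below are the ones I believe apply, but they may vary by version), a profiled query exposes per-operator counters that can be compared against the optimizer's estimates:

```sql
-- Run the query with profiling enabled; Vertica reports the transaction_id and
-- statement_id of the profiled run, which are used to look it up below.
PROFILE SELECT store_id, SUM(amount) FROM sales GROUP BY store_id;

-- Compare actual per-operator counters (rows produced, time, memory, ...) against estimates
SELECT operator_name, counter_name, SUM(counter_value) AS total_value
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = 45035996273705915   -- placeholder: use the id PROFILE reported
  AND statement_id = 1
GROUP BY operator_name, counter_name
ORDER BY operator_name, counter_name;
```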

Rumors

As a technical person, my intention in this writing is to focus only on their offerings and internal implementations. However, many readers have asked for comparisons of cost, pricing, and performance per dollar spent. I have not spent much time in this regard, but I have heard from a lot of reliable sources that Vertica out-performs Snowflake given equivalent amounts of compute, and that, for a given solution, Snowflake turns out to be far more expensive. However, Snowflake is much easier to get started with and far more turn-key, with less need for on-site expertise. I would encourage you to do more research for yourself on these topics, and even do POCs with both Snowflake and Vertica on your specific workloads and get exact quotes.

Afterword

Now that you know the two DBs' similarities and differences, if you are a customer who is searching for a Cloud DB, which one would best fit your needs? If you are a DB builder, which technologies and offerings will you pick for your new DB designs?

Footnotes & References