Automatic Clustering, Materialized Views & Automatic Maintenance in Snowflake

November 13, 2018 Marta Bright

Performance and scale are core to Snowflake. From day one, the founding team has focused on building a fast and zero-maintenance cloud-built data warehouse. Snowflake users focus on analytical tasks instead of continuously tuning the data warehouse or investing in complicated workload management due to concurrency limits. From today, two new automation and performance capabilities are available in Snowflake. They are the latest advancements to Snowflake’s ability to deliver the best performance and scalability without requiring administration or maintenance from our customers. They also lay the foundation for a series of new features to come in the months ahead.  

Automatic Clustering

Snowflake’s automatic clustering feature is now available for all regions and clouds. Automatic clustering is a standard feature customers can enable by contacting Snowflake Support.

With legacy on-premises and cloud data warehouses, it’s the user’s burden to constantly optimize the underlying data storage. This includes updating indexes and statistics, post-load vacuuming procedures, choosing the right distribution keys, dealing with slow partitions due to growing skews, and the need to manually reorder data as new data arrives or gets modified.

With Snowflake, a user can define clustered tables if the natural ingestion order is not sufficient in the presence of varying data access patterns. Now, we are introducing Snowflake’s automatic clustering, which constantly maintains optimal clustering for tables defined as clustered tables without any impact to production workloads. The benefits include:  

  • No need to run manual operations to re-cluster data.
  • Incremental clustering as new data arrives or a larger amount of data is modified.
  • ETL pipelines consisting of DML operations (INSERT, DELETE, UPDATE, MERGE) can run concurrently and are not blocked.
  • Users can resume and suspend automatic clustering on a per table basis, and are billed by the second for only the compute resources used.

Materialized Views & Automatic Maintenance

Snowflake’s materialized views (MVs) are public preview on a per request basis and offered in our enterprise edition. As we looked at an age-old database and optimization challenge, we focused on a few crucial design principles we wanted to achieve to avoid the pain points of traditional approaches:  

  1. Ensure fast DML operations in the presence of MVs – no slow downs when issuing DML statements against base tables, which is common among traditional data warehouses systems.
  2. Provide always, up-to-date and consistent query results when accessing MVs.
  3. Ensure ease-of-use by providing a maintenance service that continuously runs and updates the MVs in the background, and is entirely transparent to the end user.

Main uses cases we’ve enabled with this initial version of Snowflake’s MVs are:

  1. Defining numerous MVs with different clustering keys, leading to improved query performance for different access patterns. 
  2. Materializing single table aggregations.  
  3. Materializing query results and analysis against semi-structured data, using flatten or approximate queries (sampling).  

Scalable compute service: no infrastructure to manage  

Both features are built on top of our scalable, multi-cluster virtual warehouse technology and entirely managed by Snowflake. This means Snowflake efficiently and automatically executes the re-clustering or MV refresh statements in the background. No need to create, size or resize a virtual warehouse.

In case of the background clustering, the compute service continuously monitors the clustering quality of all registered clustered tables. It starts with the most unclustered micro-partitions, and iteratively performs the clustering until an optimal clustering depth is achieved. In the case of MVs, the compute service monitors the base tables and kicks off refresh statements for the corresponding MVs if significant changes are detected. This maintenance process of all depending MVs is asynchronous. In scenarios when a user is accessing an MV, which has yet to be updated, Snowflake’s query engine will perform a combined execution with the base table to always ensure consistent query results. Similar to Snowflake’s automatic clustering with the ability to resume or suspend per table, a user can resume and suspend the automatic maintenance on per MV basis.

Conclusions and Acknowledgements

We thank our numerous preview participants for providing us valuable feedback early on throughout the private beta. The early adoption in our customer base is encouraging and we will continue listening to their feedback. This is just the beginning of a number of additional features related to performance, automation and manageability we’ll be announcing in the next weeks and months ahead. Learn more about materialized views, automatic clustering, and automatic maintenance here.

The post Automatic Clustering, Materialized Views & Automatic Maintenance in Snowflake appeared first on Snowflake.

Previous Article
The Future of Data Sharing Starts Today
The Future of Data Sharing Starts Today

Today’s enterprises generate massive amounts of data internally and through customers, stakeholders, and bu...

Next Article
Share and Save: Introducing the Data Sharing Rebate Program
Share and Save: Introducing the Data Sharing Rebate Program

Last year, we launched Snowflake Data Sharing, known as The Data Sharehouse™, to enable secure data sharing...

×

Subscribe to email updates from the Snowflake Blog

You're subscribed!
Error - something went wrong!