Snowpark for Python: Bringing Efficiency and Governance to Polyglot ML Pipelines

PLEASE NOTE: This post was originally published in May. It has been updated in July to reflect currently available features and functionality.

Machine learning (ML) has imposed more stress on modern data architectures than any other workflow. Its success is often contingent on polyglot data teams collaborating to stitch together SQL- and Python-based pipelines that execute the many steps between data ingestion and ML model inference.

This polyglot nature of data teams is one of the largest impediments to an organization’s ability to operationalize the machine learning workflow and create sustainable return on investment from its data.

The polyglot impediment

For over a decade, data professionals have been touting, building, and striving for the utopia of data democratization: a future state where anyone, regardless of role or skills, can leverage the power of data in their daily work.

Yet, as more people from diverse backgrounds join the conversation, it is unrealistic to expect them all to work with data in the same programming language. Over time, different languages have emerged to meet the needs of different communities. While SQL has long been the mainstay of large-scale data transformation and management, languages like Python have emerged with more flexible functional constructs that offer greater expressiveness and extensibility. Today, a massive number of Python frameworks simplify everything from application development to quantitative analysis and ML.

Specific to ML, many of the challenges of machine learning operations (MLOps) stem directly from this polyglot impediment. Often, the most effective tool for a particular task in a complex training or inference pipeline may be written in either SQL or Python. The multitude of frameworks (e.g., TensorFlow, PyTorch), along with the specialized compute infrastructure required to support them, exacerbates this complexity. MLOps and DevOps teams are left with the unenviable job of building and maintaining efficient, scalable pipelines across multiple platforms supporting different languages and frameworks.

The multi-platform approach

Different platforms have emerged to support these different languages as a way to overcome the polyglot impediment. Data platforms, for example, have traditionally been the domain of data engineers and analysts. Because these platforms don't always meet the needs of data scientists, who often require different languages and frameworks, some data scientists opt to build their own separate platforms. On top of that, ML engineers often build their own MLOps platforms to support capabilities such as monitoring, orchestration, and version control.

Platforms used to develop and deliver ML-powered applications

To bridge processing steps across these platforms, frameworks such as Apache Airflow and dbt have emerged to simplify orchestration by acting as integration hubs. Still, each additional platform adds technical debt and risk as data is copied and moved between them, and until now these platforms have not evolved to overcome the polyglot impediment at the data layer.

While technical teams struggle to maintain a data infrastructure that is fragile, overly complex, and language- and workload-specific, CIOs and CDOs are constantly dealing with the rising costs and security risks of duplicate pipelines and the massive amounts of data moving across these platforms.

The polyglot platform approach

As the polyglot nature of the data world is not likely to change (think of the emergence of newer languages such as Julia), and data teams continue to run into the challenges and risks associated with data movement across multi-platform architectures, it becomes increasingly apparent that multi-language platforms will play a vital role. Rather than moving data across various single-language platforms, multi-language platforms can support the processing needs of multiple teams and languages, reducing the need to move data outside of its governed boundaries.

To streamline architectures, enhance collaboration between teams, and provide consistent governance across all data, the world needs more polyglot platforms with seamless integration with best-of-breed orchestration frameworks.

Snowpark: The polyglot answer for modern data teams

Snowflake introduced Snowpark as an extensibility framework to create a polyglot platform that bridges the gaps between data engineers, data scientists, ML engineers, application developers, and the MLOps and DevOps teams that support them.

Launched first with support for popular languages such as Java, Scala, and JavaScript, Snowpark makes it possible to simplify architectures while reducing the costs and governance risks associated with data duplication. It allows users to work with their data in the language of their choice while leveraging the performance, scalability, simplicity, security, and governance they have come to expect from Snowflake. Best of all, Snowpark was designed to make it easy to integrate custom functions written in other languages into a SQL query or processing step.

And now, Snowpark for Python (in public preview) takes it to a whole new level by embracing a massive community of developers, data engineers, data scientists, and ML engineers. Snowpark for Python also exposes much-needed surface area for integration with orchestration frameworks, and Snowflake's partnership with Anaconda makes it possible to tap into a huge ecosystem of packages, including TensorFlow, PyTorch, Keras, and many more.
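As a rough illustration of how this works, the sketch below registers a plain Python function as a Snowpark UDF and applies it in a DataFrame transformation that executes inside Snowflake. The table, column names, and pipeline shape (`ORDERS`, `AMOUNT`, `FX_RATE`) are hypothetical placeholders, not from this post, and the Snowpark calls only run against a live session; the pure function itself works anywhere.

```python
def normalize_amount(amount: float, rate: float) -> float:
    # Pure transformation logic; once registered as a UDF, this is
    # what executes inside Snowflake's Python runtime.
    return round(amount * rate, 2)


def build_pipeline(session):
    # Sketch of the Snowpark side; requires a live Session created
    # elsewhere with your account's connection parameters.
    from snowflake.snowpark.functions import col
    from snowflake.snowpark.types import FloatType

    normalize_udf = session.udf.register(
        normalize_amount,
        return_type=FloatType(),
        input_types=[FloatType(), FloatType()],
    )
    # The transformation is expressed in Python but pushed down and
    # executed inside Snowflake -- no data leaves the platform.
    return session.table("ORDERS").with_column(
        "AMOUNT_USD", normalize_udf(col("AMOUNT"), col("FX_RATE"))
    )
```

The key point is that the same Python function a data scientist tests locally becomes a governed, server-side transformation without any data export step.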

Apache Airflow: An orchestration framework for a multilingual workflow

Simultaneously, Astronomer and the Airflow community continue to strengthen support for Python, including the TaskFlow API introduced in Airflow 2.0. TaskFlow provides a comfortable, Pythonic interface for data teams while encouraging good software development practices. In conjunction with Snowpark, TaskFlow makes it possible not only to define complex data transformations in Python but also to integrate non-SQL tasks such as ML into a DAG, with no data movement.
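To make the TaskFlow pattern concrete, here is a minimal, hypothetical sketch of a two-step ML DAG. The task bodies and table name are placeholders (in practice each task would invoke Snowpark so the heavy lifting stays inside Snowflake), and stub decorators keep the task logic importable in environments without Airflow installed.

```python
import datetime


def extract_features() -> dict:
    # Placeholder: a real task would run a Snowpark transformation in
    # Snowflake and return only lightweight metadata, not the data itself.
    return {"feature_table": "ML.FEATURES_V1"}


def train_model(meta: dict) -> str:
    # Placeholder: training would read the feature table in place;
    # here we just derive a model identifier from the metadata.
    return "model_for_" + meta["feature_table"]


try:
    from airflow.decorators import dag, task

    extract_task = task(extract_features)
    train_task = task(train_model)

    @dag(
        schedule_interval=None,
        start_date=datetime.datetime(2022, 7, 1),
        catchup=False,
    )
    def ml_pipeline():
        # TaskFlow infers the dependency from the call graph: training
        # runs after extraction, with metadata passed via XCom.
        train_task(extract_task())

    pipeline = ml_pipeline()
except ImportError:
    pass  # Airflow not installed; the plain functions above still run.
```

Because tasks exchange only small metadata values, the DAG orchestrates the pipeline without ever pulling the underlying data out of the platform.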

Together, Snowpark and Airflow allow data teams to execute entire pipelines spanning ELT, feature engineering, experimentation, model training, inference, monitoring, and even powerful visual applications in Streamlit, all without moving or copying data. Complex, scalable pipelines can be managed with confidence because this approach eliminates complexity, reduces governance risk, and supports openness with best-of-breed frameworks.

Empowering polyglot teams

Thanks to the democratization of data, today’s teams require platforms that support many different languages and frameworks. Snowpark empowers these polyglot teams with one platform supporting open integrations with the world’s leading orchestration frameworks, enabling operational simplicity while reinforcing good data governance practices. 

Snowpark is already in general availability for Java/Scala and currently in public preview for Python. To build your own ML workflow using Snowpark, Anaconda, and Apache Airflow, check out this step-by-step code guide.
