Snowflake is a SaaS platform, and our customers rely on us to provide a highly available service. In the past couple of weeks, we had two disruptions in which customers in some regions were unable to connect to Snowflake because of Online Certificate Status Protocol (OCSP) validation of SSL certificates. We’ve written to our customers, apologizing for any inconvenience the disruptions may have caused them. We’re also working to improve our systems and processes to help prevent such occurrences in the future. This blog post lays out the challenges and tradeoffs we faced and what the next steps are.
Securing the Web: Background on OCSP
SSL/TLS certificates, along with the certificate authorities (CAs) that issue them, form the backbone of how we establish secure and trusted communications. Clients use these certificates to verify the identity of the party they are communicating with and to make sure malicious actors are not intercepting their important information.
Certificate revocation checking is an important part of identity verification. The ability to revoke a previously valid certificate allows us to respond quickly to a breach or a theft of the private key associated with the certificate and to prevent fraudulent use of that certificate. This may sound theoretical, but we saw one such widespread incident with the Heartbleed vulnerability, where millions of certificates were potentially compromised and had to be revoked. OCSP is the protocol clients use to check whether the certificate they have been presented is still valid and has not been revoked. Of the different methods of checking for revocation, it’s the most secure and allows for the quickest response once a revocation happens.
Challenges with OCSP
While OCSP provides the most secure way of checking certificate revocation status, it comes with several challenges. Because OCSP relies on CAs responding whenever a client connects, it puts them in the position of having to run a highly reliable online service.
CAs are organizations optimized to provide trust and accountability, but not necessarily resilient infrastructure. Even temporary unavailability of responses from CAs makes it quite difficult to check for revocation. Because of these availability challenges, most clients that perform OCSP checks have chosen to silently ignore revocation check failures (known as “soft-fail”). Unfortunately, this significantly diminishes the value of OCSP checking, because the check is skipped in exactly the situations it is meant to guard against. As Google software engineer Adam Langley said in a blog post: “soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it’s worthless because it only works when you don’t need it.”
Snowflake’s Goal: Building a Stronger OCSP
Snowflake has chosen to take a stronger position with OCSP. Our clients fail a connection when they can’t check the revocation status of a certificate or of any of the intermediate certificates used to verify it (a “hard-fail” strategy). Our customers place a lot of trust in Snowflake to secure their data, and it’s incumbent upon us to protect that data to the best of our abilities. We do this for all connections our client drivers make, not just to Snowflake but also to third-party services such as AWS S3, Azure Blob Storage, Okta, etc.
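To make the difference between the two strategies concrete, here is a minimal sketch of the connection decision. It is only an illustration: the names RevocationStatus, FailMode, and allow_connection are hypothetical and are not taken from any Snowflake driver. It assumes the caller has already collected one revocation status per certificate in the chain.

```python
from enum import Enum
from typing import Iterable


class RevocationStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNAVAILABLE = "unavailable"  # no valid OCSP response could be obtained


class FailMode(Enum):
    HARD_FAIL = "hard-fail"
    SOFT_FAIL = "soft-fail"


def allow_connection(chain_statuses: Iterable[RevocationStatus],
                     mode: FailMode) -> bool:
    """Decide whether to connect, given one status per certificate in the chain.

    Hypothetical helper for illustration only.
    """
    for status in chain_statuses:
        # A known-revoked certificate is rejected in both modes.
        if status is RevocationStatus.REVOKED:
            return False
        # The modes differ only when the status could not be determined.
        if status is RevocationStatus.UNAVAILABLE and mode is FailMode.HARD_FAIL:
            return False
    return True
```

Under this sketch, a chain with one unverifiable intermediate certificate is refused in hard-fail mode and accepted in soft-fail mode, which is exactly the gap the seat-belt analogy describes.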
Over the last year, we have been mitigating the availability challenges normally associated with the “hard-fail” strategy. The commonly cited problem with OCSP is that the certificate authority fails to respond because its responder is temporarily unavailable. To mitigate this, we built a solution in which the Snowflake service queries and caches the OCSP responses from certificate authorities. This allows our client drivers to establish a connection even when the CA’s OCSP responder is temporarily unavailable. This is possible, and secure, because the CAs sign the OCSP responses, which are valid for one or more days. As long as the CA’s OCSP responder can occasionally serve a valid response for revocation checking, we can continue to maintain a highly available service for our customers.
What we saw in the last couple of weeks was an entirely different failure mode, one we had not seen before and that, to our knowledge, had not been reported in the security community. Instead of being temporarily offline, two different CAs served OCSP responses that had already expired. Since the CAs are the ultimate authority of trust, the fact that they could not confirm the certificates were not revoked meant that connections to Snowflake that relied on those certificates failed.
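The caching idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Snowflake’s implementation: the class and field names are hypothetical, signature verification is omitted entirely, and every response is assumed to carry both a start and an end of validity (corresponding to the thisUpdate and nextUpdate fields of an OCSP response).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class OcspResponse:
    cert_id: str            # hypothetical identifier for the certificate
    status: str             # "good" or "revoked"
    this_update: datetime   # start of the validity window
    next_update: datetime   # end of the validity window

    def is_valid_at(self, now: datetime) -> bool:
        return self.this_update <= now < self.next_update


class OcspResponseCache:
    """Keeps the freshest in-window response per certificate (sketch only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, OcspResponse] = {}

    def store(self, response: OcspResponse, now: datetime) -> bool:
        """Cache a response; return False if it is outside its validity window."""
        if not response.is_valid_at(now):
            return False
        current = self._entries.get(response.cert_id)
        # Only replace the cached entry with a more recently issued response.
        if current is None or response.this_update > current.this_update:
            self._entries[response.cert_id] = response
        return True

    def lookup(self, cert_id: str, now: datetime) -> Optional[OcspResponse]:
        """Return the cached response only while it is still within its window."""
        entry = self._entries.get(cert_id)
        if entry is not None and entry.is_valid_at(now):
            return entry
        return None
```

A cached entry can stand in for the CA’s responder only until its validity window closes. After that, lookup returns nothing, and a hard-fail client would refuse the connection unless a fresh response has been stored in the meantime.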
Plans Going Forward
We do not think it is acceptable to weaken security for enterprise services running in the cloud in response to these challenges. We’re taking several steps to mitigate availability issues while we continue to keep a high bar for certificate acceptance and revocation checking.
We’re building an early warning system that will alert us when CAs aren’t providing the freshest OCSP responses. This can give us an indication of a potential problem at the CA that may lead to expired responses in the future. We can use this signal to start working with the CA to resolve the situation before customers actually see expired responses, eliminating or shortening the period of impact. In addition, we’re building a centralized control designed to securely suspend revocation checking for affected customers for brief periods of time while we work with the CAs to remediate problems. While we implement these mechanisms, we currently provide our customers a way to temporarily configure their client drivers in a soft-fail mode for revocation checking to work around OCSP-induced connectivity issues.
Snowflake is committed to working with our customers and to being as transparent as possible throughout this journey. We deeply appreciate your patience and your partnership in this effort.
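One way such a freshness check could work is sketched below. This is an assumption about the general approach rather than a description of the system we are building: the function name, the returned labels, and the 50% threshold are all hypothetical. The idea is to flag a response once it has used up more than a chosen fraction of its validity window, well before it actually expires.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_alert(this_update: datetime,
                    next_update: datetime,
                    now: datetime,
                    max_age_fraction: float = 0.5) -> Optional[str]:
    """Return an alert label, or None if the response looks fresh.

    Hypothetical helper; max_age_fraction is an illustrative threshold.
    """
    lifetime = next_update - this_update
    if lifetime <= timedelta(0):
        return "invalid"        # window is empty or inverted
    if now < this_update:
        return "not-yet-valid"  # possible clock skew or a bad response
    if now >= next_update:
        return "expired"        # too late: clients already see failures
    age = now - this_update
    # Dividing two timedeltas yields a float: the fraction of the window used.
    if age / lifetime > max_age_fraction:
        return "stale"          # early warning: no fresher response seen yet
    return None
```

For a response valid for four days, this sketch stays quiet after one day, reports "stale" after three days, and reports "expired" once the window has closed. The "stale" state is the one that leaves time to contact the CA.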