Scaling CT Logs: Temporal Sharding

Our industry is moving toward universal support for Certificate Transparency (CT), one of the largest improvements to trust and security for the Web PKI system and SSL certificates in years. Later this month, CT will effectively become an industry-wide mandate when Google Chrome starts requiring it for all new publicly trusted SSL certificates.

Already, hundreds of millions of certificates have been logged into CT, directly by Certificate Authorities (CAs) and by third-party web crawlers. This is an important first step in security improvements and real-world testing of the CT system, which is relatively new and still maturing. Engineers working at Google designed CT and had it standardized as an IETF RFC in 2013. Since then, Google’s Chrome team has been working toward being the first browser to fully implement CT, and that project will take a major step forward in just a few days.

Certificate Transparency is a complex system that involves publicly available logs that store issued certificates and act as a cryptographically verifiable record of those certificates. Many logs exist to handle the sum total of the Web PKI and more are expected to be created as the CT ecosystem matures.

One of the challenges of maintaining a CT log is supporting the growing scale of the Web PKI. As more websites adopt SSL, and certificate lifetimes become shorter (eventually being measured in days instead of years), we’ll likely see the yearly total number of trusted certificates reach a billion.

With Chrome’s new requirement (other major browsers are expected to follow), all of those certificates will need to be logged. That’s an incredible strain on these logs, which have to meet strict performance requirements and indirectly serve hundreds of millions of users.

For those curious about how Certificate Transparency works on the backend, this post will discuss some of the operational challenges involved in running a CT log, and explain temporal sharding—a new strategy for coping with the growing size of the Web PKI. First, we’ll give some background on Certificate Transparency for those unfamiliar with the system.

DigiCert & CT

For the most part, CT is a system that enhances the security of the Web PKI as a whole, while requiring little effort from website operators. It’s a CA’s responsibility to log its certificates to CT properly, and many CAs have been preparing for more than a year for this upcoming deadline in Chrome.

DigiCert has been a proponent of CT for many years. In fact, we were the first CA to start operating a CT log, and earlier this year we began logging all of our new certificates ahead of the industry deadline. If you’re one of our customers, this means your certificates are already compliant with this requirement and no action is needed on your end.

Some Background on Certificate Transparency

Certificate Transparency, as its name suggests, brings more transparency to the Web PKI by allowing the ecosystem to know what certificates are being issued by CAs.

Certificate Transparency has been widely used since 2015 when Google Chrome began requiring it for EV certificates. However, at the end of this month (April 2018), all newly issued publicly trusted SSL certificates must be logged in order to be accepted by Google Chrome. This is a major step for CT adoption as it effectively mandates logging for all certificates if you want to be trusted by the world’s most popular browser.

From a practical standpoint, this involves an additional step in the issuance process. CAs will now submit certificates to public databases, known as logs, which can be reviewed by researchers and website operators to detect if certificates are being mis-issued. When a certificate has been made publicly available in a Certificate Transparency log we refer to that certificate as being “logged.”

Prior to CT’s existence, researchers and some major providers tried to accomplish the goals of CT by scraping and crawling the web, similar to a search engine, and recording every certificate they found. This works to some degree, but will always be less accurate than getting the data directly from the source. By requiring CAs to log their new certificates, CT does exactly that. When fully deployed, CT can give perfect transparency to all trusted certificates that exist, which is an unprecedented resource for security researchers.

Why do we need CT?

The need for Certificate Transparency has been demonstrated by a number of security failures in the Web PKI ecosystem. In 2011, the Dutch CA DigiNotar was compromised, and an attacker was able to fraudulently issue trusted SSL certificates for high-profile sites including Google.com and Facebook.com.

Since then, other errors and failures could have been discovered sooner, and understood better, if CT had existed. One of the greatest values of CT is that it improves the health and security of the Web PKI across the board. From the most catastrophic scenarios, such as a CA being compromised, to more moderate problems, like a CA violating encoding standards, all can be detected, investigated, and addressed with CT.

Certificate Transparency makes it significantly easier to detect violations and compromises of CAs, and, in the rare situation where a large-scale mis-issuance occurs, CT makes it possible to determine the scope of the problem by allowing all of the improper certificates to be found.

CT works even in a malicious scenario, where a CA is purposefully violating industry standards or has been hacked. That’s because the malicious actor has to log the certificates if they want them to be trusted by browsers, and a fraudulent certificate needs that trust to be useful in most attacks. It’ll always be possible to not log a certificate, but then it will not be trusted (working no differently than a self-signed certificate), which is hardly of any use in a spoofing or man-in-the-middle attack.

Certificate Transparency will also serve individual website operators by giving them an option to monitor what certificates exist for their domains. This will be useful in two main scenarios: to become aware of a fraudulently issued certificate (either due to a CA’s mis-issuance or compromise of part of the website’s infrastructure such as a webserver or DNS), and to ensure that their organizational certificate policy is being followed. For instance, you could do regular audits to make sure departments within your organization are all ordering from your preferred vendors.
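As a rough illustration of the monitoring described above, here is a minimal Python sketch. It assumes the log entries have already been fetched and parsed into simple dictionaries; the field names ("dns_names", "issuer"), the CA names, and the example.com entries are all made up for illustration and do not come from any real log or tool.

```python
from typing import Dict, List, Set

# Hypothetical, already-parsed log entries (made-up data for illustration).
ENTRIES: List[Dict] = [
    {"dns_names": ["www.example.com"], "issuer": "Preferred CA"},
    {"dns_names": ["shop.example.com"], "issuer": "Unknown CA"},
    {"dns_names": ["notexample.com"], "issuer": "Unknown CA"},
    {"dns_names": ["example.org"], "issuer": "Unknown CA"},
]


def covers_domain(name: str, domain: str) -> bool:
    """True if `name` is the domain itself or one of its subdomains."""
    return name == domain or name.endswith("." + domain)


def certs_for_domain(entries: List[Dict], domain: str) -> List[Dict]:
    """Return every logged certificate that names the domain."""
    return [
        e for e in entries
        if any(covers_domain(n, domain) for n in e["dns_names"])
    ]


def unexpected_issuers(entries: List[Dict], domain: str,
                       approved: Set[str]) -> List[Dict]:
    """Return certificates for the domain issued by a non-approved CA."""
    return [e for e in certs_for_domain(entries, domain)
            if e["issuer"] not in approved]


flagged = unexpected_issuers(ENTRIES, "example.com", {"Preferred CA"})
for cert in flagged:
    print("Review:", cert["dns_names"], "issued by", cert["issuer"])
```

A real monitor would read entries from the logs themselves and would likely also compare public keys or serial numbers; the point here is only that a complete record makes this kind of audit a simple filter.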

CT Log Operation

CT logs are databases of issued certificates that are publicly available online. Their purpose is to act as an auditable and unmodifiable record of issued certificates. When a certificate is issued, it is logged, and the log provides “proof” of its inclusion in the form of a signed certificate timestamp, or “SCT.” Browsers (and in the future, other software) that wish to enforce CT logging will receive the SCT during the initial SSL handshake and verify it as part of the certificate evaluation.
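To make the “auditable and unmodifiable” property more concrete, here is a minimal Python sketch of the Merkle tree hash that the CT specification (RFC 6962) uses to summarize a log’s contents in a single value. The function names and the placeholder byte strings standing in for certificates are assumptions for illustration; this is not code from any real log implementation.

```python
import hashlib
from typing import Sequence


def leaf_hash(entry: bytes) -> bytes:
    # Leaves are prefixed with 0x00 so they can never collide with
    # interior nodes, which are prefixed with 0x01.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: Sequence[bytes]) -> bytes:
    """Merkle tree hash over an ordered list of log entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Placeholder entries standing in for logged certificates.
log = [b"cert-A", b"cert-B", b"cert-C"]
root_before = merkle_root(log)

# Changing any existing entry produces a different root hash,
# so anyone holding the old root can detect the modification.
root_tampered = merkle_root([b"cert-A", b"cert-X", b"cert-C"])
print(root_before != root_tampered)
```

Because the log signs and publishes this root, auditors who remember an earlier root can check that later versions of the log only append entries and never rewrite history.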

Logs can be operated by any organization that wants to take on the effort. Right now, major logs are only operated by a few companies: Google, DigiCert, Comodo, and Cloudflare.

Operating a log is not easy. In principle, there are a lot of similarities between operating a CT log and operating a CA. You have to meet certain operational standards, go through an initial auditing period, and then become “trusted” by a browser. In addition, running a CT log requires meeting high-availability requirements, similar to operating a cloud service.

If a log violates these requirements, it can be distrusted and no longer accepted by browsers. This could have severe consequences for the certificates in that log, as they may no longer meet the CT requirement and the certificates themselves may become distrusted (in practice, there are ways to mitigate the effects of this on individual certificates by logging them redundantly).
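The value of redundant logging can be sketched in a few lines of Python. The log names and the minimum of two SCTs below are hypothetical values chosen for illustration; they are not taken from Chrome’s actual policy, which has its own rules about how many SCTs a certificate needs and from which logs.

```python
from typing import Iterable


def still_compliant(sct_logs: Iterable[str],
                    distrusted: Iterable[str],
                    min_scts: int = 2) -> bool:
    """Check whether enough SCTs remain after some logs are distrusted.

    `min_scts` is an illustrative threshold, not a real policy value.
    """
    remaining = set(sct_logs) - set(distrusted)
    return len(remaining) >= min_scts


# A certificate logged to only two logs loses compliance if one fails.
print(still_compliant(["log-a", "log-b"], distrusted=["log-a"]))

# A certificate logged to three logs survives the same failure.
print(still_compliant(["log-a", "log-b", "log-c"], distrusted=["log-a"]))
```

Under these assumptions, the first check fails and the second passes, which is why submitting each certificate to more logs than strictly required limits the damage of a single log’s distrust.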

Because of this risk, log operators want to use every best practice available to them to make operating their logs easier, and mitigate the damage in the event of a failure.

Some early logs have already become quite large—storing 300 million certificates. As a log grows, two things happen: it becomes more expensive to operate, and the impacts of its failure become larger.

How to Combat Challenges with Growing Logs: Temporal Sharding

One challenge facing log operators is log scale. Google Chrome’s policy assumes that logs will accept certificates indefinitely. But this can pose an operational challenge as a log grows to hundreds of millions of certificates. As logs grow it becomes more difficult to perform maintenance on them and keep them running smoothly. It also increases the impact of that log’s failure.

Temporal sharding addresses this problem by allowing a log to limit the certificates it accepts to those that expire within a set date range. Every certificate contains two dates (“notBefore” and “notAfter”) that make up its validity period, and a sharded log compares the expiration date (“notAfter”) against its range. Any date range could be chosen, but all logs that are currently sharded have chosen one-year segments.

This is in contrast to previous practices that only allowed a log to control what certificates it could accept based on a few criteria: whether the certificate was expired or revoked, and what root certificate issued it. Most major logs accept certificates from any trusted root, which makes them the most useful to the ecosystem but also puts them at risk of becoming very large if too many CAs or third parties begin submitting certificates to them.

With temporal sharding, if a certificate expires after the specified date, it can be rejected by the log. Those submissions would be accepted by the next log in the series. With this practice, instead of running one log, an operator actually runs multiple logs: for example, one log each for 2018, 2019, and 2020. Each of these is referred to as a physical log, and together they make up one logical log. Google runs five physical logs that make up the logical “Argon” log.
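The routing rule described above can be sketched in Python as follows. This assumes one-year shards keyed on the certificate’s expiration date; the shard names (“example2018” and so on) and the helper functions are hypothetical and do not describe how any particular log is implemented.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

Shards = Dict[str, Tuple[datetime, datetime]]


def make_yearly_shards(prefix: str, first: int, last: int) -> Shards:
    """Build one shard per calendar year, each covering [Jan 1, next Jan 1)."""
    return {
        f"{prefix}{year}": (
            datetime(year, 1, 1, tzinfo=timezone.utc),
            datetime(year + 1, 1, 1, tzinfo=timezone.utc),
        )
        for year in range(first, last + 1)
    }


def shard_for(not_after: datetime, shards: Shards) -> Optional[str]:
    """Return the shard whose range contains the expiry date, if any."""
    for name, (start, end) in shards.items():
        if start <= not_after < end:
            return name
    return None  # Outside every shard: the submission is rejected.


shards = make_yearly_shards("example", 2018, 2022)

expires_2019 = datetime(2019, 6, 1, tzinfo=timezone.utc)
expires_2023 = datetime(2023, 3, 1, tzinfo=timezone.utc)

print(shard_for(expires_2019, shards))  # example2019
print(shard_for(expires_2023, shards))  # None
```

Each physical log in this sketch only ever grows with certificates expiring in its own year, so once that year has passed the shard stops receiving new entries.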

Prior to this method, starting a second log did not necessarily help a log operator solve their scaling issues. This is because their second log could just be filled with the same certificates already included in the first log (due to Chrome policy which does not allow a log to reject a certificate for this reason)—or new certificates could be submitted to both. By having to incorporate these duplicate certificates, a log operator may see little to no real increase in capacity by starting an additional log.
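A small Python sketch can illustrate why a second unsharded log adds little capacity while sharded logs do. The expiry years below are made-up sample data, and the worst case modeled for unsharded logs (every certificate submitted to both logs) is an assumption for illustration.

```python
from collections import Counter
from typing import Dict, List

# Made-up expiry years for six submitted certificates.
expiry_years: List[int] = [2018, 2018, 2019, 2019, 2019, 2020]


def unsharded_sizes(years: List[int], log_names: List[str]) -> Dict[str, int]:
    # Worst case: no log may reject a certificate that another log
    # already holds, so every log ends up storing every certificate.
    return {name: len(years) for name in log_names}


def sharded_sizes(years: List[int]) -> Dict[str, int]:
    # Each certificate is accepted only by the shard for its expiry year.
    return dict(Counter(f"shard-{year}" for year in years))


print(unsharded_sizes(expiry_years, ["log-1", "log-2"]))
print(sharded_sizes(expiry_years))
```

With this sample data, both unsharded logs hold all six certificates, while the largest shard holds only three.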

Cloudflare’s Nimbus log is the first to use temporal sharding and was added to Chrome in early 2018. DigiCert has also started two logs—nicknamed “Yeti” and “Nessie”—which will each be sharded into five (5) one-year periods, from 2018 to 2022. These will be added to Chrome later this year and will accept certificates from all CAs for free.