Sep 12, 2024

Migrating apps in DigitalOcean with 20x less downtime

We recently had to migrate part of our app hosted with DigitalOcean to make it more resilient and scalable. As a backup client, Duplicati collects backup report data, with events happening throughout the day. This means there were no particular off-peak hours we could migrate in. Since the clients do not retry submitting the reports, any downtime would mean lost reports. Here’s how we worked around it to bring downtime down from 10 minutes to just 30 seconds.

Background

In DigitalOcean, the app infrastructure works by assigning a hostname to each application. That hostname is then associated with all the details around deployment, restarts, migrations, etc. To set up another hostname for the server, you simply add a CNAME record that points to the app and everything works from there. Behind the scenes, DigitalOcean handles switching IP addresses and setting up TLS certificates, so you don’t need to manage it.

For our setup, we had already prepared for the split, so we were already using two different CNAME records pointing to the same app. Once the new app was developed, it would just be a matter of switching the CNAME record to the new app—but unfortunately, it was not as simple as that.

The direct approach

Since DigitalOcean needs to know which app to route traffic to, you cannot assign one hostname to multiple apps. Also, for security reasons, you cannot obtain a TLS certificate for an app before the CNAME record points to it. The best you can do is the following steps:

Set a short time-to-live on the CNAME record
Wait for the short TTL to be used by all clients
Delete the old CNAME record and create the new one
Delete the hostname on the old app
Assign the hostname on the new app

Unfortunately, there are a few tripping points here.

First, the moment you delete the hostname on the old app, it stops responding to new requests. Second, our domain provider has a minimum DNS TTL of 10 minutes. And third, the new app will not respond to requests before it discovers the new CNAME value and obtains the TLS certificate.

Best-case scenario: up to 10 minutes of lost requests

In the best-case scenario, we’d click everything at the same time. This would create a window in which we would see no requests, because they wouldn’t be routed to anything we control. Since the TTL is 10 minutes, it’s likely that one or more clients would have the old CNAME value cached, and be sending requests to the old server until the DNS entry expires. The new app wouldn’t be able to serve any requests until its TLS certificate is issued. Since we have some clients sending reports every 10 minutes, a loss of requests would almost be guaranteed.
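To put rough numbers on that window, here is a minimal Python sketch of the reasoning above. It is a simplified model rather than a measurement: the function names are made up for illustration, and the 60-second certificate delay is an assumed value.

```python
def worst_case_outage(ttl_s: int, cert_delay_s: int) -> int:
    """Longest time a single client may be unable to deliver a report.

    A client that cached the old CNAME just before the switch keeps hitting
    the old app for up to one TTL; a client that resolves the new record
    right away has to wait for the certificate. The worse of the two wins.
    """
    return max(ttl_s, cert_delay_s)


def guaranteed_lost_reports(outage_s: int, interval_s: int) -> int:
    """Minimum number of reports a periodic client sends during the outage.

    A client reporting every interval_s seconds sends at least
    floor(outage / interval) reports in any window of that length,
    and none of them are retried.
    """
    return outage_s // interval_s


# 10-minute TTL, assumed 60 s certificate delay, client reporting every 10 minutes.
outage = worst_case_outage(ttl_s=600, cert_delay_s=60)
lost = guaranteed_lost_reports(outage, interval_s=600)
print(outage, lost)  # prints: 600 1
```

Under these assumptions a client on a 10-minute schedule loses at least one report, while an outage shorter than the reporting interval may lose none.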

An elaborate workaround

We were really not satisfied even with the minimal downtime, so we experimented with various setups in our test environment. In the end we concluded that having the app served with DigitalOcean’s hosting structure has many benefits in normal operations, but there was no good way to do the switch within DigitalOcean’s constraints. With this conclusion we opted to manually host parts of the setup in a droplet where we would have more control. To avoid introducing any untested code, we set up a temporary intermediary server in our new droplet serving the same Docker image as the DigitalOcean app with docker-compose. To make the switch away from the old DigitalOcean app, we added the hostname to the temporary server using LetsEncrypt.

Since the correct hostname was at the time pointing to DigitalOcean, LetsEncrypt naturally refused to issue a TLS certificate. Expecting this, we then removed the CNAME in DNS, but not the hostname mapping in DigitalOcean, and created an A record pointing to the new server. Then we triggered a re-issue of LetsEncrypt certificates on the temporary server.

The re-issue takes around five seconds to complete, so the gap is much shorter and more predictable than with the direct approach. During this time, any requests using the old CNAME record are routed to DigitalOcean and served correctly, as the setup inside DigitalOcean has not changed. Only unlucky requests that obtained the new DNS record in the short gap before the certificate was issued would fail to establish the TLS connection in this period.
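Before triggering a re-issue like this, it can help to confirm that the new A record is actually visible. The following is a minimal sketch using only the Python standard library; the function name, the hostname, and the IP address are placeholders, not values from our setup. Note that it only reflects what the local resolver sees, and the certificate authority performs its own lookups.

```python
import socket


def resolves_to(hostname: str, expected_ip: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the local resolver maps hostname to expected_ip.

    The resolver argument exists so the lookup can be replaced in tests;
    by default it is the standard socket.getaddrinfo.
    """
    try:
        infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name does not resolve at all (yet).
        return False
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    addresses = {info[4][0] for info in infos}
    return expected_ip in addresses
```

For example, `resolves_to("reports.example.com", "203.0.113.10")` would return `True` once the local resolver hands out the (hypothetical) droplet address for that (hypothetical) hostname.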

A word of caution: don’t request too many new certificates. If you try to replicate this strategy, make sure you can obtain a certificate, but beware that requesting too many certificates while testing can get your IP or hostname blocked for several days by LetsEncrypt. You need to test that the setup works and then wait a while to ensure you are not hitting any limits.

Once the certificate was issued, the temporary server could handle requests. We combed our network logs and found a slot of two minutes where there were usually no requests, and did the switchover in that slot. Then we waited for 30 minutes until all clients would be using the updated DNS records, before removing the hostname from the old DigitalOcean app.
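Finding such a slot can be scripted. The sketch below is one possible approach, not the tooling we used: it assumes the request timestamps have already been parsed from the logs into `datetime` objects, folds all days onto a single 24-hour clock, and reports every gap of at least two minutes. The function name and the return format (seconds since midnight) are choices made for this example.

```python
from datetime import datetime, timedelta


def quiet_slots(timestamps: list[datetime],
                min_gap: timedelta = timedelta(minutes=2)) -> list[tuple[int, int]]:
    """Return (start, end) pairs, in seconds since midnight, with no requests.

    Requests from all logged days are collapsed onto one 24-hour clock, so a
    slot is only reported if it was quiet on every day in the input.
    """
    seconds = sorted({t.hour * 3600 + t.minute * 60 + t.second for t in timestamps})
    # Bracket the day with midnight on both ends so leading and trailing
    # gaps are considered too.
    edges = [0] + seconds + [86400]
    slots = []
    for start, end in zip(edges, edges[1:]):
        if end - start >= min_gap.total_seconds():
            slots.append((start, end))
    return slots
```

The more days of logs go in, the more trustworthy a reported slot is, since a gap that only appears on a single day may just be luck.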

Making the switch back to a hosted App

Now that the temporary server was running, we picked a new time slot where we did not expect requests for two minutes, deleted the A record, created a CNAME record, and added the new hostname to the new DigitalOcean app.

This part of the switchover was quite tricky, because DigitalOcean will queue the request to create the certificate until it sees the DNS record being changed. This forced a period where the DNS record pointed to DigitalOcean but there was no route to the app. By experimenting with the setup, we found that the most reliable method was to ensure that the DNS record was set before adding the hostname. The initial DNS check would then usually be done in less than 30 seconds and the certificate would be issued in less than 60 seconds, allowing the new app to serve requests.
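To see when the new app actually starts serving, one option is to poll the hostname until a TLS handshake with a valid certificate succeeds. This is a minimal sketch with the Python standard library, not the procedure we followed; `tls_ready` and `wait_until` are illustrative names, and the timeout and polling interval are arbitrary defaults.

```python
import socket
import ssl
import time


def tls_ready(hostname: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a verified TLS handshake with hostname succeeds."""
    context = ssl.create_default_context()  # verifies certificate and hostname
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts, and TLS errors
        # (ssl.SSLError is a subclass of OSError).
        return False


def wait_until(check, timeout_s: float = 120.0, interval_s: float = 2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly; return elapsed seconds on success, None on timeout.

    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Something like `wait_until(lambda: tls_ready("reports.example.com"))`, with a placeholder hostname, would then return roughly how long the app was unreachable from the machine running the check.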

Finally, we waited until the DNS records pointing to the temporary server expired, before purging the temporary server. In all we had around 30 seconds of downtime and no reported missed requests.

Final thoughts

You could make the point that we should just stay out of a managed service and avoid an issue like this altogether. But for day-to-day operations there is a lot of value in the managed service, even with our challenges for this particular operation. We could also hope that DigitalOcean would make a feature that would make this switchover seamless, but it’s likely to be such an infrequent operation that it wouldn’t make sense for them to build and maintain. In our case the workaround worked with no harm done.

The switchover described here was part of our effort to provide a great monitoring experience for Duplicati. If you want to add reliable monitoring to your Duplicati backups, sign up for the Duplicati console.
