4) Logging: If it wasn’t logged, it never happened
We need to be deliberate about what we log, and we should log everything that can help us figure out what went wrong.
When working with the cloud it’s normal to experience failure every so often. Never build an application without thinking about how you will recover from a fault and how long it will take.
Sometimes, when you start on a new project, you only have time to plan for the straightforward cases, so a brand-new application with real users has a lot to teach you. To support that learning process properly, you need logging in place.
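As a rough illustration of logging enough context to reconstruct a fault, here is a minimal sketch using Python’s standard logging module. The logger name, the charge_customer function and its parameters are hypothetical and exist only for this example; they are not taken from our systems.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def charge_customer(order_id: str, amount_pence: int) -> bool:
    """Hypothetical operation that logs the context needed to diagnose a fault."""
    logger.info("charge started order_id=%s amount_pence=%d", order_id, amount_pence)
    try:
        if amount_pence <= 0:
            raise ValueError("amount must be positive")
        # ... a call to the payment provider would go here ...
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("charge failed order_id=%s", order_id)
        return False
    logger.info("charge succeeded order_id=%s", order_id)
    return True
```

The point is that every entry carries an identifier (here the order ID), so a failure can be traced back through the events that led up to it.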
5) Prepare for failure
“The cloud never fails”: that’s what the cloud provider wants us to believe! The reality is different (and has been proven several times): even well-managed clouds will fail.
The problem is not that they fail, but that most people are unprepared for such failures, because they believe the cloud is an indestructible silver bullet.
Cloud providers do not explicitly plan for the failover of your services; they just provide the platform and the tools, and it’s your job to plan and implement your own failover system.
Cloud services are known for their accessibility, but they are still bound by Murphy’s law: anything that can go wrong will go wrong. Amazon’s AWS, Microsoft’s Azure and Google Mail, amongst others, have all failed in the past and most of them will fail again in the future.
An important step for us was to lessen our reliance on our 20+ third-party integrations. Our working assumption is that any third-party system will fail and that, if handled badly, a third-party slowdown could quickly escalate, become our slowdown and impact our systems.
To mitigate this, the vast majority of our services run through an out-of-band message bus. Messages are sent to the bus in a fire-and-forget fashion. Sending a message to the bus takes an average of 2ms, regardless of the state of the third party. This mechanism allows us to handle requests in a fashion that does not impact the user’s experience.
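The fire-and-forget idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-process queue.Queue stands in for the real message bus, and publish, worker and slow_third_party are hypothetical names made up for the example.

```python
import queue
import threading
import time

bus = queue.Queue()  # assumption: in-process stand-in for a real message bus


def publish(message: dict) -> None:
    """Fire and forget: enqueue and return without waiting for the third party."""
    bus.put_nowait(message)


def slow_third_party(message: dict) -> None:
    """Hypothetical integration that is responding slowly."""
    time.sleep(0.2)


def worker(handler, processed: list) -> None:
    """Drain the bus in the background, out of band from user requests."""
    while True:
        message = bus.get()
        if message is None:  # sentinel used to stop the worker
            break
        handler(message)
        processed.append(message)
```

A request handler would call publish(...) and return straight away, while a background thread started with threading.Thread(target=worker, args=(slow_third_party, []), daemon=True) absorbs the third party’s delay. In this sketch the time taken by publish does not depend on how slow the handler is.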
All of our emails are handled by a specialist email provider, which provides an API that we use to send emails. This service has proven to be highly reliable; however, if the provider has performance issues or its service becomes unavailable, the user experience is not affected because failed messages are stored and placed in the queue to be resent later.
This mechanism allows us to handle a complete outage from a range of providers using the same principle without having to worry about our users being impacted. Once a provider resumes service, we simply pick up the previously failed messages.
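The store-and-resend principle can be sketched roughly as follows in Python. This is an illustrative assumption rather than a description of our implementation: Outbox, ProviderUnavailable and the send callable are hypothetical names, and a real system would keep failed messages in durable storage rather than in memory.

```python
from collections import deque


class ProviderUnavailable(Exception):
    """Raised by the hypothetical provider client when a send fails."""


class Outbox:
    def __init__(self, send):
        self._send = send        # callable that delivers one message; may raise
        self._failed = deque()   # messages waiting to be resent (in memory only)

    def deliver(self, message) -> bool:
        """Try to send; on failure, store the message for a later retry."""
        try:
            self._send(message)
            return True
        except ProviderUnavailable:
            self._failed.append(message)
            return False

    def resend_failed(self) -> int:
        """Retry stored messages in order, stopping at the first failure."""
        sent = 0
        while self._failed:
            try:
                self._send(self._failed[0])
            except ProviderUnavailable:
                break
            self._failed.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._failed)
```

Calling resend_failed() periodically, or once the provider reports that it is healthy again, picks up the previously failed messages in their original order.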
David Kavanagh is technical director at hybrid estate agency Purplebricks.com.