What additional steps can you take, beyond those that are obvious, to keep your ecommerce site running? Let’s dispel some myths, focus on what is possible and practical, and offer some advice.
It’s a lie: No one offers 99.99% uptime
First, we need a definition of ‘running’. We call this ‘uptime’. But there are varying degrees of ‘up’ and ‘down’. Someone who offers cloud infrastructure storage can say that they can deliver 99.xx% uptime with confidence; it is easier to keep one product running (storage) than several. Keeping an entire ecommerce site up is more complicated, because there are so many moving parts.
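To see why more moving parts make uptime harder, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that every component must be working for the site to count as up and that components fail independently of one another. The component names and availability figures are hypothetical, not measurements.

```python
from math import prod

# Hypothetical availability figures, for illustration only.
components = {
    "web servers": 0.999,
    "database": 0.999,
    "payment processor": 0.999,
    "shipping service": 0.999,
    "load balancer": 0.999,
}


def combined_availability(availabilities):
    """Availability of a chain in which every component must be up,
    assuming the components fail independently of one another."""
    return prod(availabilities)


site = combined_availability(components.values())
print(f"Each part: 99.9%  Whole site: {site * 100:.2f}%")
```

Under these assumptions, five parts at 99.9% each give roughly 99.5% for the site as a whole, which is why a figure that is realistic for a single product is much harder to reach for a complete ecommerce application.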
Would you consider the system up or down in each of these cases?
- Certain applications do not work, but others do. For example, you can add items to the shopping cart and arrange shipping. However, the payment mechanism does not work, because there is something wrong with the credit card processor that you work with.
- The load balancing feature is not working, so some users see an HTTP 500 error. The load balancer is supposed to mark failing web servers as offline, but that does not always work.
- There is extreme latency.
- The site is offline for planned maintenance. Obviously that is downtime.
Insist on a hosting provider who offers 99.9% uptime
We just said no application can be up 99.99% of the time, but your hosting provider should be able to keep their infrastructure running 99.9% of the time. They are responsible for keeping virtual machines, networking, and storage running. Of course something can still go wrong with that. In any storage and virtualization operation, with hundreds or thousands of disks and servers, at least one of them will be down at any given time. But failover mechanisms should ensure that this does not affect your storage and servers. If the company manages that properly, it can probably deliver on what it promises.
Google Cloud Storage says that it offers an average monthly uptime of ≥ 99.0% or 99.9%, depending on which products you select. If it does not meet that target, it issues you a refund in the form of credits. Over a 30-day month, 99.0% uptime allows up to about 7.2 hours of downtime, while 99.9% allows no more than about 43 minutes.
The 99.9% level is one that you can live with. It is not likely to affect your customers too much, as they are likely to lose that much time to problems with their own internet connection (ISP) alone.
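The downtime that an uptime percentage permits is simple arithmetic, sketched below. The function name is made up for this example, and the 30-day month is an assumption; a provider may define the measurement period differently in its own agreement.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period of `days` days
    at the given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime per 30 days")
```

Running the numbers yourself is a useful check whenever a provider quotes a percentage, because each extra nine cuts the allowance by a factor of ten.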
The best way to make sure your entire application is working is to use synthetic monitoring.
Synthetic monitoring means creating a fake user account and using that account to exercise all, or at least the most important, functions in your application. This is different from tracking real users for issues related to latency. It reveals errors that the logs might not.
A synthetic user mimics a real user by working through the whole flow: logging in; browsing products; adding items to the shopping cart; paying with credit cards, gift cards, and loyalty cards; processing returns; and arranging shipping.
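A synthetic journey of this kind can be sketched as an ordered list of steps that are run and timed one after another. This is only an outline under stated assumptions: the step functions below are hypothetical stubs, and in a real monitor each one would drive your site (for example through HTTP requests or a browser automation tool) and raise an exception when something is wrong.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_synthetic_journey(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each step in order and time it. Stop at the first failure,
    because a real user could not continue past it either."""
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            results.append(StepResult(name, True, time.monotonic() - start))
        except Exception as exc:
            results.append(StepResult(name, False, time.monotonic() - start, str(exc)))
            break
    return results


# Hypothetical stub steps; replace each with a real check against your site.
def login(): pass
def browse_products(): pass
def add_to_cart(): pass
def pay():
    raise RuntimeError("card processor unavailable")
def arrange_shipping(): pass


steps = [
    ("login", login),
    ("browse products", browse_products),
    ("add to cart", add_to_cart),
    ("pay", pay),
    ("arrange shipping", arrange_shipping),
]

results = run_synthetic_journey(steps)
for r in results:
    status = "OK" if r.ok else f"FAILED ({r.error})"
    print(f"{r.name}: {status} in {r.seconds:.3f}s")
```

Timing every step as well as checking it means the same run can flag extreme latency, not only outright failures. Running it on a schedule and alerting on the first failed step is one way to learn that payments are broken before your customers tell you.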
There is one issue. If you do that, it is going to upset your inventory management, accounting systems, and bank: you are generating cash flow and creating replenishment orders when no inventory has actually gone out the door. So take care in designing these transactions, so that every downstream system and team knows that they are not actual orders.
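One way to keep synthetic orders out of downstream systems is to make them recognizable and filter them out before anything is passed to inventory or accounting. The sketch below assumes the synthetic account uses a dedicated email domain; the domain, the `Order` fields, and the function names are all hypothetical, and your own order system may offer a better marker, such as an explicit test flag.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical domain reserved for synthetic monitoring accounts.
SYNTHETIC_EMAIL_DOMAIN = "synthetic.example.com"


@dataclass
class Order:
    order_id: str
    customer_email: str
    total_cents: int


def is_synthetic(order: Order) -> bool:
    """True if the order was placed by a synthetic monitoring account."""
    return order.customer_email.lower().endswith("@" + SYNTHETIC_EMAIL_DOMAIN)


def real_orders(orders: Iterable[Order]) -> List[Order]:
    """Orders that should reach inventory, accounting, and fulfilment."""
    return [o for o in orders if not is_synthetic(o)]


orders = [
    Order("A-1001", "jane@example.org", 4599),
    Order("A-1002", "monitor@synthetic.example.com", 100),
    Order("A-1003", "sam@example.net", 12900),
]

for order in real_orders(orders):
    print(order.order_id, order.total_cents)
```

Whatever marker you choose, apply the filter in one shared place so that replenishment, accounting, and reporting all treat synthetic orders the same way.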
The biggest threat to uptime is security issues. No one wants to lose 110 million credit cards like Target did. Among the companies that have lost customer data in the last two years are Neiman Marcus, Twitter, LinkedIn, and Forbes magazine. Probably the best way to prevent this is to contract with white-hat hackers, like the Knowledge Consulting Group, to probe your system for security weaknesses.
You should re-engage them each time you make a major upgrade, so they can check your system again, because upgrades can open up new security issues. You could also monitor security alerts from Microsoft and other vendors. These reports describe newly discovered security issues in applications that hackers can exploit.
You can also sign up for a service like StopTheHacker.com (recently acquired by CloudFlare). They probe your site using hacking techniques they know about and let you know of weaknesses. They check for things like wrong permissions set on a directory, vulnerability to SQL injection, and malware in your environment (if you give them access to your environment). They also let you know if someone is using your domain for phishing attacks.
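To give a feel for one of those checks, here is a minimal sketch that looks for directories any user on the server can write to, a common example of a wrong permission setting. It is an illustration only and not how any particular scanning service works; the function name and the example path are assumptions, and it applies to Unix-style file permissions.

```python
import os
import stat
from typing import List


def world_writable_dirs(root: str) -> List[str]:
    """Return the subdirectories under `root` that are writable by any user.
    Symbolic links are skipped; `root` itself is not checked."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISDIR(mode) and mode & stat.S_IWOTH:
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical web root; point this at your own document root.
    for path in world_writable_dirs("/var/www"):
        print("world-writable:", path)
```

A check like this covers only one narrow class of problem, so treat it as a supplement to an outside probe and not a replacement for one.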
Companies like Dyn and Prolexic offer the ability to greatly scale up your infrastructure when you are under a DDoS (distributed denial-of-service) attack. They do this by redirecting your web traffic to their massive server farms so that your site keeps running while you are under attack. If you have high visibility, are in a country where there is some kind of civil unrest, or sell something not everyone approves of, you might need this kind of threat mitigation service.
You can also use CloudFlare’s service. You update your DNS records to route traffic through their infrastructure around the world, and they route the traffic in a manner that results in the lowest possible latency. They also keep track of who is operating botnets and who is spamming sites, and they take measures to keep that traffic away from your site. Like Prolexic and Dyn, they also help mitigate DDoS attacks.
Communicate scheduled downtime clearly
Some sites never go down for scheduled maintenance. But if you are not as sophisticated and large as, say, eBay, you probably cannot avoid it. You can announce prominently on your page when you will be down for maintenance, so that your best customers are informed in advance. That keeps customer relations good.
These are just some of the ways you can focus on keeping the lights on and engines running, so that your application is available for your customers when they need it. There will always be outages, security issues, and problems with the network. The best defense is to plan for that: improve monitoring to head off these problems early, before your customers drop your service because of frustration with your site.