Excellent paper from 2007, by James Hamilton, who was formerly at Microsoft and has been at Amazon since 2008, on how to design and deploy web services at scale. The paper is fairly technical, but some of the key principles are worth knowing even for completely non-technical readers.
First, he outlines three simple tenets:
- Expect failures. A component may crash or be stopped at any time. Dependent components might fail or be stopped at any time. There will be network failures. Disks will run out of space. Handle all failures gracefully.
- Keep things simple. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. Failures on one server should have no impact on the rest of the data center.
- Automate everything. People make mistakes. People need sleep. People forget things. Automated processes are testable, fixable, and therefore ultimately much more reliable. Automate wherever possible.
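To make the first tenet concrete, here is a minimal sketch of handling a failing dependency gracefully. It is not taken from the paper: the helper name `call_with_retries` and its parameters are invented for illustration, and it assumes the operation is safe to repeat. It retries a few times with exponential backoff, then degrades to a fallback value instead of crashing the caller.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, fallback=None, sleep=time.sleep):
    """Run operation(); on failure, back off and retry, then degrade to fallback.

    Hypothetical helper for illustration only. `operation` is any zero-argument
    callable that may raise (for example, a call to a dependent service).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # A real service would catch narrower exception types and log the failure.
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: return a degraded result rather than propagating the error.
    return fallback
```

A caller might wrap a request to a recommendations service this way and pass an empty list as `fallback`, so the page still renders when that dependency is down.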
From these tenets, more specific design principles emerge. For operations-friendly design, for example:
- Design for failure
- Build in redundancy and fault recovery
- Use commodity hardware
- Have a single version of the software
- Host all customers on the same shared service, rather than in isolated per-customer deployments (aka multi-tenancy)
The paper is quite friendly to non-technical readers (at least at the beginning), a breezy read for technical ones, and should probably be required reading for anyone building services they intend to scale.
Even as a non-technical founder, you should be able to understand this paper, if only so that you can communicate with your CTO. It's probably a good idea to grab a printout and spend a couple of hours walking through it with your CTO, asking questions about the bits you don't understand. You'll come out of it knowing a lot more about the key technology concerns for a growing startup.
As a final note, it's worth adding that while early stage startups with few or no users should be aware of these principles, actually implementing them all from day one would be gross over-engineering. Get users first - scale once you have the users.