Thursday, October 28, 2004

Database DesignOver at Kottke, there is a great set of comments (most of which say the same thing over and over) about normalisation during database design and the use of denormalisation.

Jason asserted that normalisation/denormalisation is a tradeoff and references a good presentation on Flickr and their use of PHP and MySQL.

To summarise, most (90%) of the comments are saying that normalisation is the way to go and that denormalisation is a step you take AFTER you have normalised and then identified performance issues, rather than what I think Jason was hinting at which is that normalisation is an option which you can use if you want to but starting with a denormalised structure is ok. Quite rightly, most of the comments also say that problems/issues etc are database design have been well solved for many years and while creatively is (at least in my mind) totally needed for database design, the process itself of normalisation is well and truely defined and has been for years.

Now, I love database design, and over the last 13 years I have spent a lot of my time in both massive IBM DB2 and IMS systems, MS SQL server database driven financial systems and plenty of time with web systems running MySQL, Oracle or MS SQL Server data mining 10s of millions of rows.

Only last weekend, Phil and I were discussing database design while driving to get some take away food and we covered this very issue.

This leads me to my point (finally). Which is, that with the lowering of barriers for people to develop web applications using tools such as PHP and MySQL, the overall quality of database design has dropped substantially as has database query performance, and this is mainly due to the people developing these applications not having an understanding of good database and query design.

Now this is all and well for what Clay Shirky refers to as Situated Software, which is an emerging pattern of software creation for a specific, generally small scale, community or solution. e.g. Meeting room booking, class note support system etc.

But the problem arises when you don't have a good database design and then try to scale a system. What worked for a few users in testing and development, or even thousands of rows in a larger test, just doesn't work well with a significant (depends on each app) level of use. At this point, the system is well used and changes to retrospectively normalise the database design (and then denormalise poor performing parts) is very very expensive and time consuming not to mention a pain in the ass.

Web based systems which should be high performing given a) their user base b) the query structure and c) the content in the db, when running on one box, run slowly, even on a massive database box with multiple application servers.

in addition to poorly design databases (which I generally consider the root of the problem), developers with a lack of knowledge about how databases work, often end up designing poorly performing queries which, even with a great db design, would cause the system to grind to a halt.

For example, searching for a movie in a db, e.g. The Terminator, should be a simple select statement getting back a handful of rows depending upon the text search you want to do. What it should not be, is a query returning all movies in the database to the application server and then having the application server loop through each and every row from A to Z finding "The terminator".

Over the last 5 years I have developed a way of designing database driven web systems which seems to always start with doing a data model in conjunction with the functionally task definition. This approach seems to work well in most cases, where the web pages are driven by select, update, insert and delete statements. Its almost an databasey object modeling approach which works well for me.

So I believe in finding a balance with ease of access to tools and the responsibility of the developers to spend some time learning (what is not a difficult task really) database design methods and basic query practice. Then using this knowledge to at least create a system which, while not perfect, can start to be scaled or just maintained in the future without major rework.

Design you database reasonably well (it doesn't have to be 3rd normal form), make sure your queries are optimised (indexes used etc) and then develop to your hearts content, the extra time spent up front will save you heaps in maintenance, bug fixing due to data inconsistencies, and scaling issues.