Database optimisation and benchmarking

A project I have been working on has a database table of boxes (1 row for each box), a table of files in each box (anything between 10 and 50 rows for each box), and a table of file sections for each file (usually 3 sections per file).


Database diagram

At one point in my code I get a list of all boxes added within a given date range and display it along with a file count (how many files in the box). When I implemented this more than a year ago, I used a SQL join to get the file count. This was fine at the time - it worked, it was fast, and it was always correct (of course, like all code you wrote a long time ago, I look at it now and cringe).

A year on, and this expensive join is prooving to be an expensive bottleneck because of the large amount of data now in those tables - each time, full table scans are required, and this is amounting to an execution time of several seconds. So its time to take action. I have come up with 4 options so far:

  • Add a ‘file count’ column to my box table and read that instead of doing a join. Advantages: fairly easy to implement. Disadvantages: having to change access code in all apps that use this table so it writes the file count to the box table when saving and reads the column when loading (could use db triggers but they will cause more problems than they fix). Need a routine to add file counts to all existing boxes in the database.
  • Horizontal partitioning - the database is MySql, which has recently acquired this ability. I can partition the box table, file table and section table into date ranges, so retrieving boxes and files by date will not require full table scans. Advantages: not much to change in the code, I can keep using the join method. Disadvantages: there are currently limitations to MySql partitioning, e.g. I have to add a date column to the file and section tables so they can be partitioned, and will have to include that date column in teh table’s primary key
  • More aggressive database cacheing (thanks Dan for suggesting this). Advantages: quick, easy, to changes to be made.
  • Data archiving. Move old rows to an archive table. Advantages: No changes needed to existing code. Disadvantages: Need some stored procedure called regularly to archive data. Will not be able to view old data in apps unless I add features to code for ‘switching on’ archived data when required (could use a view which unions current and archived data).

To help me decide, I am going to do some benchmarking. I got this tip from a discussion on Jeff and Joel’s Stack Overflow podcast when they referred to this question, and pointed out how important it is to test how much you are actually gaining (or losing!) in speed.

posted 2 years ago