Recent posts / Archive

Categories

Hierarchical database in MySQL...

Originally posted by on 06:00 Mon 6 November 2006, last modified 12:34 Mon 7 May 2007.

File under: covariates databases MySQL phd

Okay, so I need to investigate the effect of covariates on influenza infections across a country. Covariates are basically cofounding variables, that may be predictive of the outcome under study. I'm investigating influenza - so lets look at population density. To analyse such data, we need a system to store and retrieve it; a database.

Designing your database is fairly important, obviously in terms of optimisation, and scalability etc etc, also once you start populating the thing, it can be tricky to change the design. I suggest you read about database normalization if you’ve not done so already. Anyway, so I decided that for this project "Europe" is an area of land, or a territory, in exactly the same way that the French départment "Bas-Rhin" is. Therefore they should go in the same table. I also wanted to know that "Bas-Rhin" is a départment in the région Alsace, which itself is in France. All this boils down to a hierarchical structure, which I want to model in a relational database. Therefore, my current database has the following entity relationship….

Entity Relationship Diagram

The only problem with this design is the circular relationship, where a foreign key in the territory table references a primary key in the same table. From a practical point of view, (as far as I'm aware) this means that you can't query the database to return say the name of a territory, and the name of its parent. To do so, means accessing the same column name twice, but for different rows, which I think is impossible. Therefore you have to use two queries. If anyone knows what SQL lets me achieve that - I'd love to hear from you!

Anyway, in all other aspects this design is looking pretty good, it lets me expand the number of variables associated with each territory as and when I'm ready - i.e. it's scalable. It's also general - not tied to the French data that I happen to be working with at the moment. It does require some interesting SQL statements - for example, to calculate the population of the région "Centre", from the populations of its départments - I can do this…

SELECT  SUM(y1990), SUM(y1999) FROM population, territory
WHERE id_population = territory_population_id
AND id_territory = ANY (SELECT id_territory FROM territory
WHERE territory_parent_id = (SELECT id_territory
FROM territory
WHERE territory_name = "Centre"))

| SUM(y1990) | SUM(y1999) |
|    2371036 |    2440329 |
1 row in set (0.01 sec)

So, while it looks a little complex, moving down the hierarchy just means nesting another sumbquery in there, and using the ANY keyword. Awesome!

comment

Comment on article