First-time database design: am I overengineering? [closed]

Background

I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.

My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.

My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.

We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)

The schema I have come up with is the following:

  1. From the client perspective:
    • Clients is the main table
    • Clients are linked to the department they work for
      • Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
      • Departments are linked to the division of a company
    • Divisions are linked to the parent company
  2. From the classes perspective:
    • Sessions is the main table
      • A teacher is linked to each session
      • A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
      • Sessions are grouped into "packs" of an arbitrary size
    • Each packs is assigned to a client

I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd form. I then plugged it into MySQL Workbench and it made it all pretty for me:
(Click here for full-sized graphic)

alt text
(source: maian.org)

Example queries I'll be running

  • Which clients with credit still left are inactive (those without a class scheduled in the future)
  • What is the attendance rate per client/department/division (measured by the status id in each session)
  • How many classes has a teacher had in a month
  • Flag clients who have low attendance rate
  • Custom reports for HR departments with attendance rates of people in their division

Question(s)

  • Is this overengineered or am I headed the right way?
  • Will the need to join multiple tables for most queries result in a big performance hit?
  • I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?

Thanks for your time


Solution 1:

Some more answers to your questions:

1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!

2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.

That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.

For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.

All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.

Solution 2:

You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.

What you can do is in the Departments table, add CityId and DivisionId.

Besides that, I think everything is fine...

Solution 3:

The only changes I would make are:
1- Change your VARCHAR to NVARCHAR, if you might be going international, you may want unicode.

2- Change your int id's to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Have GUID Ids makes this significantly easier.

3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy such that you can support n-levels of depth. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client that has more layers may be easily "stuffable" into this model.

4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.