0

MySQL Performance Surprise

My day job has me working on a project that measures the historical display of online advertising impressions for the purpose of planning future campaigns.  To do this, I have to:

  1. Retrieve a set of flat files which store data related to a 1/1000 sample of impressions.
  2. Parse those flat files into a MySQL database that stores impression data in two tables.
  3. Run on-demand queries to predict future performance based on that historical data.

The data extracted the flat file includes properties of the impression itself and also a set of "elements" that are, in effect, (non-personal) data about the user to whom the impression was served.  This data is stored in tables called impression and element, respectively.  One impression can have many elements.  In some cases, I do mean many.

In the prediction-by-analysis query we're looking for the number of impressions served to unique users in selected geographic locations and/or selected industries.  The query looks something like this:

 

To be clear, we're talking about millions or records in the impression table and hundreds of millions of records in the element table.  Roughly 2.5 million impressions are stored for each day.  Nonetheless, this query was performing badly due to the number of records in the element table and, more importantly filled a 28G hard drive within 20 days or so.

Against every database best practice I'm aware of, I decided to denormalize and try an experiment (just to see what would happen).  I dropped the element table and added two new fields to the impression table: geo_elements and industry_elements.  These fields store a comma-delimited list of element IDs that are geo- and industry-specific, respectively.

I then modified my query to look like this:

 

Surprise!  The second query outperformed the original query against a normalized database.  In fact, the second solution is the better performer on both INSERT and SELECT operations.  This was tested against 3 days of data.

I suspect that the second solution will scale, um, poorly but I was pretty shocked, quite honestly, that it performed as well as it did this far.  Remember, even in a three day test, we're talking about a lot of records (I don't have the exact number handy at the moment).

I'm not a DBA so if anyone reads this and can offer any insight as to why the second query might perform better and what I should expect as I attempt to scale this out to 15-30 days of data, I'd love to hear it. 

tags:
Database

Search

Rob  Wilkerson