The Wikistats Queries
Posted 2010-03-26 at 05:30 AM by rhann
Marcin asked how complex my big table is. I responded that I've been working on a posting about it but it's not ready to publish in full. Thinking about it though, there's no reason I can't talk about the database design or the queries themselves, so here goes.
Because I can't take away any real customer data to use in my VectorWise testing, and because I didn't want to spend a lot of time sitting in a customer's office running potentially unstable alpha-release software, and 'cos I'd already exhausted the limited challenges of my training database, I needed a substantial publicly available database on which to run realistic queries. A few minutes on Google turned up the Wikipedia statistics and the work being done with them by Percona.com. They have proposed some interesting queries on the data, and they've helpfully named them (though they don't always use the names consistently, e.g. Q_TOP_PROJECTS is also referred to as Q_TOP20_PROJECTS).
The database design is a simple star schema like you find in data warehousing applications.
There is a rapidly growing mountain of data available from Wikipedia so I grabbed just 28-days worth to play with. When I loaded it into the pagestat table I ended up with over three billion rows. Unfortunately my desktop machine has too little memory to run all the interesting queries with the full dataset, but I can create smaller subsets to make a start on testing how response time grows with data volume and maybe also attempt to extrapolate what is possible in principle if I had enough RAM. Small datasets also allowed me to fully test how much of the Ingres SQL dialect is usable with VectorWise.
Just to tantalize you a little more, here's the running times for the Q_TOP50 query on 6.25M rows (0.2487 secs), 12.5M rows (0.4051 secs), 25M rows (0.7008 secs) and 50M rows (1.2959 secs). I'll leave the extrapolation as an exercise for the reader.
More soon. I promise.
Because I can't take away any real customer data to use in my VectorWise testing, and because I didn't want to spend a lot of time sitting in a customer's office running potentially unstable alpha-release software, and 'cos I'd already exhausted the limited challenges of my training database, I needed a substantial publicly available database on which to run realistic queries. A few minutes on Google turned up the Wikipedia statistics and the work being done with them by Percona.com. They have proposed some interesting queries on the data, and they've helpfully named them (though they don't always use the names consistently, e.g. Q_TOP_PROJECTS is also referred to as Q_TOP20_PROJECTS).
The database design is a simple star schema like you find in data warehousing applications.
There is a rapidly growing mountain of data available from Wikipedia so I grabbed just 28-days worth to play with. When I loaded it into the pagestat table I ended up with over three billion rows. Unfortunately my desktop machine has too little memory to run all the interesting queries with the full dataset, but I can create smaller subsets to make a start on testing how response time grows with data volume and maybe also attempt to extrapolate what is possible in principle if I had enough RAM. Small datasets also allowed me to fully test how much of the Ingres SQL dialect is usable with VectorWise.
Just to tantalize you a little more, here's the running times for the Q_TOP50 query on 6.25M rows (0.2487 secs), 12.5M rows (0.4051 secs), 25M rows (0.7008 secs) and 50M rows (1.2959 secs). I'll leave the extrapolation as an exercise for the reader.

More soon. I promise.
Total Comments 4
Comments
-
If you've measured time for 100M, 200M, 400M etc can you please post it? I can extrapolate as well as other person, but I'm more interested in actual measured numbers. Thanks!Posted 2010-03-26 at 12:22 PM by imend
-
Posted 2010-03-27 at 03:27 AM by rhann
-
Well, not exactly - I'm interested to see if/when you've seen breakdown of linear scaling (due to insufficient RAM or IO bw); so I'll have something to compare with. If your measured time increased almost exactly linearly (as it did for earlier numbers) then exact numbers aren't that important.
Thanks for your posts btw.Posted 2010-03-27 at 07:07 PM by imend
-
Ah, I now understand. I can give you some kind of answer to that question. I can run the entire set of queries on up to 50M rows and get proportionate scaling, but above that some of the queries run out of RAM and performance degrades. In fact the alpha release will often just announce that it's run out of memoery and quit. As I wanted to run the entire set I stopped scaling up the size of the dataset once any of the queries failed.
I have to admit to a childish delight in running big queries fast, on a desktop, and I have to force myself to refocus on my original objectives which were never about VLDB processing.Posted 2010-03-28 at 02:55 AM by rhann








