Login Register Actian.com  

Actian Community Blogs

Rate this Entry

The Wikistats Queries

Submit "The Wikistats Queries" to Digg Submit "The Wikistats Queries" to del.icio.us Submit "The Wikistats Queries" to StumbleUpon Submit "The Wikistats Queries" to Google
Posted 2010-03-26 at 05:30 AM by rhann

Marcin asked how complex my big table is. I responded that I've been working on a posting about it but it's not ready to publish in full. Thinking about it though, there's no reason I can't talk about the database design or the queries themselves, so here goes.

Because I can't take away any real customer data to use in my VectorWise testing, and because I didn't want to spend a lot of time sitting in a customer's office running potentially unstable alpha-release software, and 'cos I'd already exhausted the limited challenges of my training database, I needed a substantial publicly available database on which to run realistic queries. A few minutes on Google turned up the Wikipedia statistics and the work being done with them by Percona.com. They have proposed some interesting queries on the data, and they've helpfully named them (though they don't always use the names consistently, e.g. Q_TOP_PROJECTS is also referred to as Q_TOP20_PROJECTS).

The database design is a simple star schema like you find in data warehousing applications.

There is a rapidly growing mountain of data available from Wikipedia so I grabbed just 28-days worth to play with. When I loaded it into the pagestat table I ended up with over three billion rows. Unfortunately my desktop machine has too little memory to run all the interesting queries with the full dataset, but I can create smaller subsets to make a start on testing how response time grows with data volume and maybe also attempt to extrapolate what is possible in principle if I had enough RAM. Small datasets also allowed me to fully test how much of the Ingres SQL dialect is usable with VectorWise.

Just to tantalize you a little more, here's the running times for the Q_TOP50 query on 6.25M rows (0.2487 secs), 12.5M rows (0.4051 secs), 25M rows (0.7008 secs) and 50M rows (1.2959 secs). I'll leave the extrapolation as an exercise for the reader.

More soon. I promise.
Posted in VectorWise
Views 17314 Comments 4 Edit Tags

« Prev     Main     Next »

Total Comments 4

Comments

  1. Old Comment
    If you've measured time for 100M, 200M, 400M etc can you please post it? I can extrapolate as well as other person, but I'm more interested in actual measured numbers. Thanks!
    permalink
    Posted 2010-03-26 at 12:22 PM by imend imend is offline
  2. Old Comment
    rhann's Avatar
    Quote:
    Originally Posted by imend View Comment
    [...] I'm more interested in actual measured numbers.
    Looking at my results won't tell you anything about what your hardware will do no matter how many rows I use. I think your best bet will be to get your hands on the beta release when it becomes available.
    permalink
    Posted 2010-03-27 at 03:27 AM by rhann rhann is offline
  3. Old Comment
    Well, not exactly - I'm interested to see if/when you've seen breakdown of linear scaling (due to insufficient RAM or IO bw); so I'll have something to compare with. If your measured time increased almost exactly linearly (as it did for earlier numbers) then exact numbers aren't that important.
    Thanks for your posts btw.
    permalink
    Posted 2010-03-27 at 07:07 PM by imend imend is offline
  4. Old Comment
    rhann's Avatar
    Ah, I now understand. I can give you some kind of answer to that question. I can run the entire set of queries on up to 50M rows and get proportionate scaling, but above that some of the queries run out of RAM and performance degrades. In fact the alpha release will often just announce that it's run out of memoery and quit. As I wanted to run the entire set I stopped scaling up the size of the dataset once any of the queries failed.

    I have to admit to a childish delight in running big queries fast, on a desktop, and I have to force myself to refocus on my original objectives which were never about VLDB processing.
    permalink
    Posted 2010-03-28 at 02:55 AM by rhann rhann is offline
 
© 2011 Actian Corporation. All Rights Reserved