Login Register Actian.com  

Actian Community Blogs

Rate this Entry

Upgrade Day, and Creating Sample Datasets

Submit "Upgrade Day, and Creating Sample Datasets" to Digg Submit "Upgrade Day, and Creating Sample Datasets" to del.icio.us Submit "Upgrade Day, and Creating Sample Datasets" to StumbleUpon Submit "Upgrade Day, and Creating Sample Datasets" to Google
Posted 2010-03-26 at 02:24 AM by rhann
Updated 2010-03-26 at 04:10 AM by rhann (improved clarity)

I've not posted anything for a few days, and it looks like today is going to be spent unloading my database so I can install VectorWise 1.0. As much as I've enjoyed playing with the 0.8 release, the extra features in 1.0 are too inviting to ignore, and besides, what would be the point of familiarizing myself with the features (and limitations) of a stale version?

I'm using unloaddb in the usual way. Most of the most frequently used utilities work with VectorWise, with only a few documented exceptions. In this case unloaddb is unloading a mix of VectorWise tables and classic Ingres tables from the same database. I did hit a snag with the script generated by unloaddb, but I reckon it's alpha software and besides it was easily fixed. I logged the problem anyway.

To kill some time while I am waiting for the unload to finish I thought I'd show how I created subsets of my largest table to produce quick test cases. As I have mentioned repeatedly, one of my tables is really enormous. Although I can run some queries on it, others are infeasible owing to the lack of physical resources on the machine I'm using. Also, no matter how fast VectorWise is, there are times when I just want to prove that a sequence of steps produces a correct result so I want nearly instant response time and a dataset I can check by other means. I'm not going to go through a 3.3 billlion row table by hand to verify VectorWise's arithmetic.

I decided to take my big table and produce a series of subsets of it, each half the size of the previous one. The idea is that I can start any testing with the smallest subset (6.25 million rows). It is small enough to load into classic Ingres to run verification queries in a reasonable amount of time, and it gives nearly instant response times using VectorWise. I can then use a set twice the size to see if that works within my limited resources, and if it does, I can double the size again, and so on until I find the limit.

To preserve a realistic distribution of values I created the first subset by randomly choosing half the rows from 3.3 billion row table. Every row had an equal probability of being selected. It took a bit less than 55 minutes to create a 1.6 billion row subset like so:
Code:
create table subset as
select * from  rawdata
where randomf() < .5
with structure=vectorwise;
I repeated the procedure on that subset to get an 800 million row subset, then again on that one to get 400 million, and so on, finishing up with a 6.25 million row subset.

The subsets are subsets of my rawdata. Each subset was used to populate the tables of its corresponding star schema. It was easy to verify that this procedure had successfully chosen a representative sample preserving the relative frequency of key values so that the only difference was the size of the dataset.
Posted in VectorWise
Views 18184 Comments 4 Edit Tags

« Prev     Main     Next »

Total Comments 4

Comments

  1. Old Comment
    marcin's Avatar
    Another interesting post, thanks!

    I'm curious, how complex is your 'big table'?
    permalink
    Posted 2010-03-26 at 04:07 AM by marcin marcin is offline
  2. Old Comment
    rhann's Avatar
    Quote:
    Originally Posted by marcin View Comment
    I'm curious, how complex is your 'big table'?
    Not very.

    It is a stream of page impression statistics for Wikipedia. I posted the description a few days ago in response to igm's query. Each row represents how many times a specific page from a particular project got served in a particular hour of a particular day. I am preparing a detailed post on the whole project which I had hoped to post before now. I have high hopes that I will post it in the next few days.
    permalink
    Posted 2010-03-26 at 04:29 AM by rhann rhann is offline
  3. Old Comment
    Hi Ray,
    Can you tell me how can I download the VectorWise?
    permalink
    Posted 2010-03-28 at 06:21 PM by bicealyh bicealyh is offline
  4. Old Comment
    rhann's Avatar
    Quote:
    Originally Posted by bicealyh View Comment
    Can you tell me how can I download the VectorWise?
    I got access to the alpha-release on the Technology Preview Program.

    The first beta-release is imminent but I you will have to apply in a similar way to have access to it.
    permalink
    Posted 2010-03-29 at 12:10 AM by rhann rhann is offline
    Updated 2010-03-29 at 12:31 AM by rhann (added new and better information)
 
© 2011 Actian Corporation. All Rights Reserved