Upgrade Day, and Creating Sample Datasets
Tags ingres, vectorwise
I've not posted anything for a few days, and it looks like today is going to be spent unloading my database so I can install VectorWise 1.0. As much as I've enjoyed playing with the 0.8 release, the extra features in 1.0 are too inviting to ignore, and besides, what would be the point of familiarizing myself with the features (and limitations) of a stale version?
I'm using unloaddb in the usual way. Most of the most frequently used utilities work with VectorWise, with only a few documented exceptions. In this case unloaddb is unloading a mix of VectorWise tables and classic Ingres tables from the same database. I did hit a snag with the script generated by unloaddb, but I reckon it's alpha software and besides it was easily fixed. I logged the problem anyway.
To kill some time while I am waiting for the unload to finish I thought I'd show how I created subsets of my largest table to produce quick test cases. As I have mentioned repeatedly, one of my tables is really enormous. Although I can run some queries on it, others are infeasible owing to the lack of physical resources on the machine I'm using. Also, no matter how fast VectorWise is, there are times when I just want to prove that a sequence of steps produces a correct result so I want nearly instant response time and a dataset I can check by other means. I'm not going to go through a 3.3 billlion row table by hand to verify VectorWise's arithmetic.
I decided to take my big table and produce a series of subsets of it, each half the size of the previous one. The idea is that I can start any testing with the smallest subset (6.25 million rows). It is small enough to load into classic Ingres to run verification queries in a reasonable amount of time, and it gives nearly instant response times using VectorWise. I can then use a set twice the size to see if that works within my limited resources, and if it does, I can double the size again, and so on until I find the limit.
To preserve a realistic distribution of values I created the first subset by randomly choosing half the rows from 3.3 billion row table. Every row had an equal probability of being selected. It took a bit less than 55 minutes to create a 1.6 billion row subset like so:
I repeated the procedure on that subset to get an 800 million row subset, then again on that one to get 400 million, and so on, finishing up with a 6.25 million row subset.
The subsets are subsets of my rawdata. Each subset was used to populate the tables of its corresponding star schema. It was easy to verify that this procedure had successfully chosen a representative sample preserving the relative frequency of key values so that the only difference was the size of the dataset.
I'm using unloaddb in the usual way. Most of the most frequently used utilities work with VectorWise, with only a few documented exceptions. In this case unloaddb is unloading a mix of VectorWise tables and classic Ingres tables from the same database. I did hit a snag with the script generated by unloaddb, but I reckon it's alpha software and besides it was easily fixed. I logged the problem anyway.
To kill some time while I am waiting for the unload to finish I thought I'd show how I created subsets of my largest table to produce quick test cases. As I have mentioned repeatedly, one of my tables is really enormous. Although I can run some queries on it, others are infeasible owing to the lack of physical resources on the machine I'm using. Also, no matter how fast VectorWise is, there are times when I just want to prove that a sequence of steps produces a correct result so I want nearly instant response time and a dataset I can check by other means. I'm not going to go through a 3.3 billlion row table by hand to verify VectorWise's arithmetic.
I decided to take my big table and produce a series of subsets of it, each half the size of the previous one. The idea is that I can start any testing with the smallest subset (6.25 million rows). It is small enough to load into classic Ingres to run verification queries in a reasonable amount of time, and it gives nearly instant response times using VectorWise. I can then use a set twice the size to see if that works within my limited resources, and if it does, I can double the size again, and so on until I find the limit.
To preserve a realistic distribution of values I created the first subset by randomly choosing half the rows from 3.3 billion row table. Every row had an equal probability of being selected. It took a bit less than 55 minutes to create a 1.6 billion row subset like so:
Code:
create table subset as select * from rawdata where randomf() < .5 with structure=vectorwise;
The subsets are subsets of my rawdata. Each subset was used to populate the tables of its corresponding star schema. It was easy to verify that this procedure had successfully chosen a representative sample preserving the relative frequency of key values so that the only difference was the size of the dataset.
Total Comments 4
Comments
-
Posted 2010-03-26 at 04:07 AM by marcin
-
Not very.
It is a stream of page impression statistics for Wikipedia. I posted the description a few days ago in response to igm's query. Each row represents how many times a specific page from a particular project got served in a particular hour of a particular day. I am preparing a detailed post on the whole project which I had hoped to post before now. I have high hopes that I will post it in the next few days.Posted 2010-03-26 at 04:29 AM by rhann
-
Hi Ray,
Can you tell me how can I download the VectorWise?Posted 2010-03-28 at 06:21 PM by bicealyh
-
I got access to the alpha-release on the Technology Preview Program.
The first beta-release is imminent but I you will have to apply in a similar way to have access to it.Posted 2010-03-29 at 12:10 AM by rhann
Updated 2010-03-29 at 12:31 AM by rhann (added new and better information)








