How Full-Featured is the VectorWise SELECT?
Posted 2010-03-18 at 09:19 AM by rhann
Tags ingres, vectorwise
I have been tied up with other things for the last couple of days so I haven't got around to capturing the results of querying my big database yet. For now, I'm going to report some more very early experiences from a couple of weeks ago.
The database I'm writing about today implements a simple star-schema for analysing Wikipedia page statistics. The Wikipedia statistics are freely available like everything else on Wikipedia, and they are ideal for my purpose here because it's not synthetic data and the dataset is immense (growing at about 5 million rows/hour).
To start with I loaded three hours' data, which is about 15 million rows. Because the raw data is not in a form suitable for loading directly into the star schema I needed to do some preliminary processing on it (e.g. attaching surrogate keys to the rows in the dimension tables, and also working out what day of the week each date is). VectorWise doesn't currently allow updating so I had to pre-process the data using a mix of Linux tools and classic Ingres before loading it. (I should have used a proper ETL tool but my ETL expert wasn't available, and I was impatient.) Once I'd loaded the pre-processed data I was able to construct my fact table using VectorWise with a CREATE TABLE pagestat AS SELECT...WITH STRUCTURE=VECTORWISE statement. It could hardly be easier. (Unfortunately I don't have a record of how long it took, but nothing took more than a few seconds.)
Let's take a look at some queries. One of my tables is called pages. How many rows in it?
That took much less than a second to count almost 15 million rows. That is going to be handy in future. I frequently want to use a row count in a more complex calculation and now instead of computing it ahead of time and storing it in a session global temporary table and writing some potentially obscure SQL every time I need the count, I can just do a count.
Let's try something more complex. For each day in a specific month, how many page impressions were delivered?
So far I have data for just three hours so there is only one date. This was a less trivial query though, and it took about 2 seconds to query the pagestat table's ~14 million rows.
I'm not going to bother reporting any more times using this little database running on a puny desktop machine. (I'll come back to the speed reports when I look at my big database.) As I stated in my first posting about my objectives, I am more interested in functionality. VectorWise is currently more or less read-only, but it supports nearly all the Ingres SQL SELECT syntax. Here are some examples of queries that it successfully executes:
The SQL need not be particularly elegant:
VectorWise is not entirely read-only:
There is one caveat that probably should be borne in mind. If your query contains a subquery, as many of these do, VectorWise can execute it only if the subquery can be flattened. Ingres always tries to flatten queries and is very good at it; it automatically flattened all the ones above. But not all queries can be flattened automatically. If you run into one of those cases—which so far I haven't—you will have to re-write it so that it can be.
The database I'm writing about today implements a simple star-schema for analysing Wikipedia page statistics. The Wikipedia statistics are freely available like everything else on Wikipedia, and they are ideal for my purpose here because it's not synthetic data and the dataset is immense (growing at about 5 million rows/hour).
To start with I loaded three hours' data, which is about 15 million rows. Because the raw data is not in a form suitable for loading directly into the star schema I needed to do some preliminary processing on it (e.g. attaching surrogate keys to the rows in the dimension tables, and also working out what day of the week each date is). VectorWise doesn't currently allow updating so I had to pre-process the data using a mix of Linux tools and classic Ingres before loading it. (I should have used a proper ETL tool but my ETL expert wasn't available, and I was impatient.) Once I'd loaded the pre-processed data I was able to construct my fact table using VectorWise with a CREATE TABLE pagestat AS SELECT...WITH STRUCTURE=VECTORWISE statement. It could hardly be easier. (Unfortunately I don't have a record of how long it took, but nothing took more than a few seconds.)
Let's take a look at some queries. One of my tables is called pages. How many rows in it?
Code:
-- report database size SELECT count(*) from pages; Fri Mar 5 15:00:41 2010 Executing . . . +----------------------+ |col1 | +----------------------+ | 14994911| +----------------------+ (1 row) Fri Mar 5 15:00:41 2010
Let's try something more complex. For each day in a specific month, how many page impressions were delivered?
Code:
-- Q_DAYSTAT (year,month) SELECT di.caldate, sum(ps.page_count) FROM pagestat ps JOIN datesinfo di ON di.id=ps.date_id WHERE di.calmonth=2 and di.calyear=2010 GROUP BY di.caldate ORDER BY di.caldate; Fri Mar 5 15:00:55 2010 Executing . . . +----------+----------------------+ |caldate |col2 | +----------+----------------------+ |2010-02-16| 87899109| +----------+----------------------+ (1 row) Fri Mar 5 15:00:57 2010
I'm not going to bother reporting any more times using this little database running on a puny desktop machine. (I'll come back to the speed reports when I look at my big database.) As I stated in my first posting about my objectives, I am more interested in functionality. VectorWise is currently more or less read-only, but it supports nearly all the Ingres SQL SELECT syntax. Here are some examples of queries that it successfully executes:
Code:
-- Q_TOP_PROJECTS ( N,year,month )
SELECT FIRST 20 pj.project, sum(ps.page_count) AS sm
FROM pagestat ps JOIN datesinfo di ON di.id=ps.date_id
JOIN projects pj ON pj.id=ps.project_id
WHERE di.calmonth=2
AND di.calyear=2010
GROUP BY pj.project
ORDER BY sm DESC;
Code:
-- Q_NOPROJECTS(Year,Month,Day) SELECT pj.project FROM projects pj LEFT OUTER JOIN pagestat ps ON ps.project_id = pj.id WHERE ps.project_id IS NULL ORDER BY project;
Code:
-- Q_10000HITS_2(Year,Month,Day)
SELECT FIRST 50 pg.page, t.mx
FROM pages pg JOIN (SELECT ps.page_id, max(ps.page_count) AS mx
FROM pagestat ps JOIN datesinfo di ON di.id=ps.date_id
WHERE di.calmonth=2
AND di.calyear=2010
AND di.calday=16
GROUP BY ps.page_id
HAVING max(page_count) > 10000) t ON t.page_id=pg.id;
Code:
-- Q_NOPROJECTS_1(Year,Month,Day)
SELECT pj.project
FROM projects pj LEFT OUTER JOIN (SELECT distinct ps.project_id AS pid
FROM pagestat ps
WHERE date_id IN (SELECT di.id
FROM datesinfo di
WHERE di.calmonth=2
AND di.calyear=2010
AND di.calday=16)) t1
ON pj.id=t1.pid
WHERE t1.pid IS NULL
ORDER BY project;
Code:
CREATE TABLE pagestat_daily ( page_id int NOT NULL, date_id_from smallint NOT NULL, date_id_to smallint NOT NULL, project_id smallint NOT NULL, page_count int NOT NULL ) WITH STRUCTURE=VECTORWISE; INSERT INTO pagestat_daily (project_id, page_id, date_id_from, date_id_to, page_count ) SELECT ps.project_id, ps.page_id, min(ps.date_id), max(ps.date_id), sum(ps.page_count) FROM pagestat ps JOIN datesinfo di ON ps.date_id=di.id WHERE calmonth=2 AND calyear=2010 GROUP BY ps.project_id, ps.page_id, di.caldate;
Total Comments 2
Comments
-
Posted 2010-03-19 at 02:48 AM by jpzuate
-
If you liked that...
Jean-Pierre, you might like this one I did just for you:
That's counting rows at the rate of 239 million per second, on my crummy desktop.Code:* select local_time; * select bigint(count(*)) from rawdata; * select local_time; * \g Executing . . . +-------------------------------+ |col1 | +-------------------------------+ |10:20:54.892488000 | +-------------------------------+ (1 row) +----------------------+ |col1 | +----------------------+ | 3355878457| +----------------------+ (1 row) +-------------------------------+ |col1 | +-------------------------------+ |10:21:08.907818000 | +-------------------------------+ (1 row) continue
Incidentally, it also just slow enough to show that the rows really are being counted. Someone had objected that the query I showed in my blog entry could have been satisfied by looking in a catalogue.Posted 2010-03-19 at 03:28 AM by rhann








