Login Register Actian.com  

Actian Community Blogs

Rate this Entry

How Full-Featured is the VectorWise SELECT?

Submit "How Full-Featured is the VectorWise SELECT?" to Digg Submit "How Full-Featured is the VectorWise SELECT?" to del.icio.us Submit "How Full-Featured is the VectorWise SELECT?" to StumbleUpon Submit "How Full-Featured is the VectorWise SELECT?" to Google
Posted 2010-03-18 at 09:19 AM by rhann

I have been tied up with other things for the last couple of days so I haven't got around to capturing the results of querying my big database yet. For now, I'm going to report some more very early experiences from a couple of weeks ago.

The database I'm writing about today implements a simple star-schema for analysing Wikipedia page statistics. The Wikipedia statistics are freely available like everything else on Wikipedia, and they are ideal for my purpose here because it's not synthetic data and the dataset is immense (growing at about 5 million rows/hour).

To start with I loaded three hours' data, which is about 15 million rows. Because the raw data is not in a form suitable for loading directly into the star schema I needed to do some preliminary processing on it (e.g. attaching surrogate keys to the rows in the dimension tables, and also working out what day of the week each date is). VectorWise doesn't currently allow updating so I had to pre-process the data using a mix of Linux tools and classic Ingres before loading it. (I should have used a proper ETL tool but my ETL expert wasn't available, and I was impatient.) Once I'd loaded the pre-processed data I was able to construct my fact table using VectorWise with a CREATE TABLE pagestat AS SELECT...WITH STRUCTURE=VECTORWISE statement. It could hardly be easier. (Unfortunately I don't have a record of how long it took, but nothing took more than a few seconds.)

Let's take a look at some queries. One of my tables is called pages. How many rows in it?
Code:
-- report database size
SELECT count(*) from pages;
Fri Mar  5 15:00:41 2010
Executing . . .


+----------------------+
|col1                  |
+----------------------+
|              14994911|
+----------------------+
(1 row)
Fri Mar  5 15:00:41 2010
That took much less than a second to count almost 15 million rows. That is going to be handy in future. I frequently want to use a row count in a more complex calculation and now instead of computing it ahead of time and storing it in a session global temporary table and writing some potentially obscure SQL every time I need the count, I can just do a count.

Let's try something more complex. For each day in a specific month, how many page impressions were delivered?
Code:
-- Q_DAYSTAT (year,month) 
SELECT di.caldate, sum(ps.page_count) 
FROM pagestat ps JOIN datesinfo di 
ON di.id=ps.date_id  
WHERE di.calmonth=2 and di.calyear=2010 
GROUP BY di.caldate 
ORDER BY di.caldate;
Fri Mar  5 15:00:55 2010
Executing . . .


+----------+----------------------+
|caldate   |col2                  |
+----------+----------------------+
|2010-02-16|              87899109|
+----------+----------------------+
(1 row)
Fri Mar  5 15:00:57 2010
So far I have data for just three hours so there is only one date. This was a less trivial query though, and it took about 2 seconds to query the pagestat table's ~14 million rows.

I'm not going to bother reporting any more times using this little database running on a puny desktop machine. (I'll come back to the speed reports when I look at my big database.) As I stated in my first posting about my objectives, I am more interested in functionality. VectorWise is currently more or less read-only, but it supports nearly all the Ingres SQL SELECT syntax. Here are some examples of queries that it successfully executes:
Code:
-- Q_TOP_PROJECTS ( N,year,month )
SELECT FIRST 20 pj.project, sum(ps.page_count) AS sm 
FROM pagestat ps JOIN datesinfo di ON di.id=ps.date_id 
                 JOIN projects pj ON pj.id=ps.project_id  
WHERE di.calmonth=2 
  AND di.calyear=2010 
GROUP BY pj.project 
ORDER BY sm DESC;
Code:
-- Q_NOPROJECTS(Year,Month,Day) 
SELECT pj.project 
FROM projects pj LEFT OUTER JOIN pagestat ps ON ps.project_id = pj.id 
WHERE ps.project_id IS NULL 
ORDER BY project;
Code:
-- Q_10000HITS_2(Year,Month,Day)
SELECT FIRST 50  pg.page, t.mx 
FROM pages pg JOIN (SELECT ps.page_id, max(ps.page_count) AS mx  
                    FROM pagestat ps JOIN datesinfo di ON di.id=ps.date_id 
                    WHERE di.calmonth=2 
                      AND di.calyear=2010 
                      AND di.calday=16   
                    GROUP BY ps.page_id  
                    HAVING max(page_count) > 10000) t ON t.page_id=pg.id;
The SQL need not be particularly elegant:
Code:
-- Q_NOPROJECTS_1(Year,Month,Day)
SELECT pj.project 
FROM projects pj LEFT OUTER JOIN (SELECT distinct ps.project_id AS pid 
      FROM pagestat ps
      WHERE date_id IN (SELECT di.id 
             FROM datesinfo di 
             WHERE di.calmonth=2 
                  AND di.calyear=2010 
                  AND di.calday=16)) t1 
ON pj.id=t1.pid 
WHERE t1.pid IS NULL 
ORDER BY project;
VectorWise is not entirely read-only:
Code:
CREATE TABLE pagestat_daily 
(
   page_id int NOT NULL,
   date_id_from smallint  NOT NULL,
   date_id_to smallint  NOT NULL,
   project_id smallint  NOT NULL,
   page_count int  NOT NULL 
) WITH STRUCTURE=VECTORWISE;

INSERT INTO pagestat_daily (project_id, page_id, date_id_from, date_id_to, page_count ) 
SELECT ps.project_id, ps.page_id, min(ps.date_id), max(ps.date_id), sum(ps.page_count) 
FROM pagestat ps JOIN datesinfo di ON ps.date_id=di.id 
WHERE calmonth=2 
  AND calyear=2010 
GROUP BY ps.project_id, ps.page_id, di.caldate;
There is one caveat that probably should be borne in mind. If your query contains a subquery, as many of these do, VectorWise can execute it only if the subquery can be flattened. Ingres always tries to flatten queries and is very good at it; it automatically flattened all the ones above. But not all queries can be flattened automatically. If you run into one of those cases—which so far I haven't—you will have to re-write it so that it can be.
Posted in VectorWise
Views 18253 Comments 2 Edit Tags

« Prev     Main     Next »

Total Comments 2

Comments

  1. Old Comment
    jpzuate's Avatar
    1 second for 15 millions rows remember me a query I made in the past on a terradata database : 2 seconds to count 32 millions rows ...

    Thank you Roy for this work ...
    permalink
    Posted 2010-03-19 at 02:48 AM by jpzuate jpzuate is offline
  2. Old Comment
    rhann's Avatar

    If you liked that...

    Jean-Pierre, you might like this one I did just for you:
    Code:
    * select local_time;
    * select bigint(count(*)) from rawdata;
    * select local_time;
    * \g
    Executing . . .
    
    
    +-------------------------------+
    |col1                           |
    +-------------------------------+
    |10:20:54.892488000             |
    +-------------------------------+
    (1 row)
    
    +----------------------+
    |col1                  |
    +----------------------+
    |            3355878457|
    +----------------------+
    (1 row)
    
    +-------------------------------+
    |col1                           |
    +-------------------------------+
    |10:21:08.907818000             |
    +-------------------------------+
    (1 row)
    continue
    That's counting rows at the rate of 239 million per second, on my crummy desktop.

    Incidentally, it also just slow enough to show that the rows really are being counted. Someone had objected that the query I showed in my blog entry could have been satisfied by looking in a catalogue.
    permalink
    Posted 2010-03-19 at 03:28 AM by rhann rhann is offline
 
© 2011 Actian Corporation. All Rights Reserved