<?xml version="1.0" encoding="utf-8"?>

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
	<channel>
		<title>Actian Community Forums - Blogs - kuonirat</title>
		<link>http://community.actian.com/forum/blogs/kuonirat/</link>
		<description><![CDATA[Actian Corporation is a leading provider of open source database management software and support services. [Toll Free] +1 (888) 446-4737]]></description>
		<language>en</language>
		<lastBuildDate>Wed, 22 Feb 2012 22:25:17 GMT</lastBuildDate>
		<generator>vBulletin</generator>
		<ttl>60</ttl>
		<image>
			<url>http://community.actian.com/forum/ingres4/misc/rss.jpg</url>
			<title>Actian Community Forums - Blogs - kuonirat</title>
			<link>http://community.actian.com/forum/blogs/kuonirat/</link>
		</image>
		<item>
			<title>Designing warehouse data flow</title>
			<link>http://community.actian.com/forum/blogs/kuonirat/89-designing-warehouse-data-flow.html</link>
			<pubDate>Mon, 06 Feb 2012 11:25:35 GMT</pubDate>
			<description>*Making people happy* 
 
So, the first thing to know is, data warehouse is supposed to make the end users happy. One could feel bad about being man...</description>
			<content:encoded><![CDATA[<div><font size="6"><b>Making people happy</b></font><br />
<br />
So, the first thing to know is that a data warehouse is supposed to make the end users happy. One could feel bad about being the man in the middle or, even worse, the warehouse technical support, but this doesn't matter. It's the end user who has to be happy. So, happy about what? Well, if he gets the answers he needs, we assume he's happy. Nevertheless, I would really like to raise an objection here. For example, &quot;can I have some data about users that satisfy these conditions?&quot; is not a good question. &quot;Some data&quot; is not concrete, and it is not the kind of question you can get an answer to.<br />
<br />
<font size="6"><b>In summary, why is querying the production servers a bad idea?</b></font><br />
<ul><li>analytical query performance is very poor</li>
<li>analytical queries could slow down the entire production system</li>
<li>they are an additional potential cause of crashes</li>
<li>joins between databases on different servers are costly in OLTP solutions</li>
<li>data is usually sharded</li>
<li>data is not cleaned</li>
<li>data is cryptic</li>
<li>it is a security risk</li>
</ul><br />
<font size="6"><b>Filling analytical databases with data</b></font><br />
<br />
To give an answer, we have to have knowledge. Knowledge comes from data you can analyze in a reasonable time with reasonable effort. That's why it should be stored in some database. Whether it will be a YesSQL or a NoSQL database is a matter of data type, not a matter of which side of this holy war you want to be on. So another purpose of this system is to fill the appropriate databases with the appropriate data in a reasonable time.<br />
<br />
<font size="6"><b>High throughput</b></font><br />
<br />
If you've got a lot of data to analyze, it means that a lot of data has to be loaded into the warehouse (Thanks, Captain Obvious!). So this has to work fast.<br />
<br />
<font size="5"><b>Pipe or stage?</b></font><br />
<br />
<b>In a sequential process, speed is at odds with fast fault recovery</b>, because when you push a data stream through a mechanism that can only succeed entirely or fail entirely, you cannot take advantage of the part of the process that is already done. You have to retry the whole process. You can divide the process into smaller, persistable stages (using storage areas), but this costs performance. Thus you have to reach a compromise.<br />
<br />
A general rule:<br />
<ul><li>use piping where you have to stream a lot of data but failure is rare, because the data is narrow and homogeneous. An example of this may be a very long fact table (billions of rows) that doesn't have many columns, and whose columns are of basic types (e.g. integers)</li>
<li>otherwise, store your data at every stage, because this improves the ability to recover quickly from a failure</li>
</ul><br />
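The staging side of this trade-off can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name <b>run_staged</b>, the JSON files used as the storage area and the stage names are all hypothetical, not part of any real ETL tool. Each stage persists its output, so a retry skips the stages that already finished:<br />
<pre>
```python
import json
import tempfile
from pathlib import Path


def run_staged(stages, source_rows, stage_dir):
    """Run stages in order, persisting each stage's output to disk.

    On a retry, a stage whose output file already exists is skipped,
    so only the work after the last persisted stage is repeated.
    Returns (rows, names_of_stages_actually_executed).
    """
    stage_dir = Path(stage_dir)
    rows = source_rows
    executed = []
    for name, func in stages:
        out_file = stage_dir / (name + ".json")
        if out_file.exists():
            # Recovery path: reuse the persisted result instead of recomputing.
            rows = json.loads(out_file.read_text())
            continue
        rows = [func(row) for row in rows]
        out_file.write_text(json.dumps(rows))  # the price of staging: extra I/O
        executed.append(name)
    return rows, executed


def demo():
    rows = [{"user_id": 1, "type_id": 2}, {"user_id": 2, "type_id": 1}]
    failures = {"left": 1}

    def decode(row):
        return {"user_id": row["user_id"], "paid": row["type_id"] == 2}

    def flaky(row):
        # Hypothetical stage that fails once, then works.
        if failures["left"]:
            failures["left"] -= 1
            raise RuntimeError("simulated failure")
        return row

    stages = [("decode", decode), ("clean", flaky)]
    with tempfile.TemporaryDirectory() as stage_dir:
        try:
            run_staged(stages, rows, stage_dir)
        except RuntimeError:
            pass  # the first attempt fails in the second stage
        # The retry reuses the persisted "decode" output and reruns only "clean".
        return run_staged(stages, rows, stage_dir)
```
</pre>
A pure pipe would drop the files and chain the functions directly: faster, but every failure restarts from the source.<br />
<br />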
<font size="5"><b>Use an ETL tool?</b></font><br />
<br />
I recommend using dedicated ETL tools for small amounts of data (up to several GB) and complicated transformations. If you want to reach high throughput, get rid of them. Code your ETL manually, use pure SQL. You can use a high-level language for the ETL (like Java, or even PHP if this is a requirement), but this will be OK only if you can run it in a clustered environment (like Hadoop), or if it is a rare case, so that one of your server cores can handle it entirely without consuming many of the resources needed by other transformations running in parallel.<br />
<br />
<font size="5"><b>MySQL as one of the storage areas</b></font><br />
<br />
It's OK, but use an easily loadable storage engine (like MyISAM). Avoid using indexes. The processes should be designed in a &quot;one-time full scan&quot; pattern; avoid per-row lookups.<br />
<br />
<font size="5"><b>Loading from various sources</b></font><br />
<br />
The system has to be able to load from various sources. But believe me, maintaining many connectors may be very inconvenient. So what to do? Create only one connector for each data source and use them as gates for your internal system mechanisms. Don't connect to external sources directly from various points (task processing stages) of your system. The better idea is to plug those points into your very few connectors (through some well-defined interface), which in turn do the external communication. A communication failure will be easier to handle if you have, simplifying, only one place in the code to look at. But remember that every new external connection needed during the workflow is a very risky thing that you really want to avoid, because it increases the dependence on some external system actually working. And hope is not what an engineer should rely on in the first place.<br />
<br />
<b>This is actually the main disadvantage of ready-to-use ETL systems. Almost every time you create a transformation or a job, you define your connection. This is bad.</b> I know there are systems that handle centralized data sources (Pentaho BI Server may be an example), but, ironically, they don't integrate easily with the data integration design tools. They're more dedicated to charting and reporting (but maybe you can find something better?). Anyway, I think <b>for bigger systems, multipoint connections to the same data source are an antipattern</b>.<br />
<br />
<font size="6"><b>High query performance</b></font><br />
<br />
You have to use the most responsive and fastest analytical database you can afford. Use VectorWise for less than 1 TB of data to reload every day (data from production databases). Seriously, don't believe that any single-machine solution can be faster. At least not for now (the beginning of 2012). Use some distributed, unstructured solution for more behavioral data (such as logged HTTP requests) that requires more than several dozen TB to store.<br />
<br />
<font size="5"><b>Load tables entirely or incrementally? Use SCD?</b></font><br />
<br />
If your business users know what they want and they tell you, you're in heaven. Why? Because you can design your architecture to meet those needs. You count the costs and decide whether it is worth implementing <a href="http://en.wikipedia.org/wiki/Slowly_changing_dimension" target="_blank">SCD</a> principles. If you've got a lot of data, millions of users, this cost may be too high. You can select the most needed data, design facts and dimensions and build beautiful incremental loading solutions... OK, now back to earth. If your business users don't have a clue what they want and they work in a more &quot;on demand&quot; way, just load the production tables entirely, doing only basic ETL beforehand. Decode mysterious &quot;type_id&quot; columns, get rid of meaningless columns, maybe clean some data and build some helper snapshot tables that will improve query performance. Consider incremental loading mechanisms only for the biggest tables (or those extracted, e.g. daily, from some exotic data source). Don't be afraid to break the rules, because after all, it's the end users' happiness that matters, not yours (the engineer's). This is not as bad as it sounds.<br />
<br />
Moreover, using SCDs assumes that people want to analyze the consequences of their decisions long after they're made. This assumption should be correct, but in practice most of the questions are about &quot;What's going on now? I want a number to show we're doing well!&quot;. To analyze the past, one has to be really self-aware and able to admit that he may have made some mistakes (surely, he hasn't seen <a href="http://www.youtube.com/watch?v=HhxcFGuKOys" target="_blank">this</a>). This is not convenient in many cases, so the cost of implementing SCDs may be too high compared to the actual requirements (those that come from practice, not those specified). This is not only a criticism, it's just how people function.<br />
<br />
<font size="6"><b>Availability</b></font><br />
<br />
<font size="5"><b>Tasks</b></font><br />
<br />
To increase availability, use the concept of tasks. If you break your one big process into smaller pieces, you increase availability. Moreover, you increase performance, because many of the tasks can run in parallel, taking advantage of multicore machines.<br />
<br />
<font size="4"><b>Mutual independence of tasks</b></font><br />
<br />
To achieve good availability, the failure of one task should not cause the failure of another. As long as two tasks do not depend on each other, a hang in one task should also not prevent the other from eventually starting. This pushes you to <b>design your system with a task-oriented rather than a batch-oriented architecture</b>. Should you maybe use some existing task management system? Well, if you find one that fits your needs... but beware of limitations. It's really unfortunate if one day you come to the conclusion that, because of those limitations, you cannot build a much-needed feature or optimization... especially when it's far too late to change your mind.<br />
<br />
<font size="4"><b>Task ordering</b></font><br />
<br />
Tasks should be ordered. Ordered as in a restaurant, not as in sorting. This means you should divide your system into the part that places the orders and the part that does the dirty work and updates the task statuses accordingly during the workflow. You should make a new task for every job, not reuse old, already finished tasks. So create a list of tasks every day or so, pick them, execute them and mark them as finished with a success or an error. If you don't follow this pattern, you'll create a batch-processing-oriented architecture, which is bad, because it is not failure resistant and it is slower (it does not take much advantage of parallel processing).<br />
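<br />
A minimal sketch of this split, using Python with an in-memory SQLite database as a stand-in for whatever store you keep your tasks in. The <b>tasks</b> table, its columns and the status names are assumptions made up for the illustration:<br />
<pre>
```python
import sqlite3


def create_daily_tasks(conn, day, table_names):
    """The 'ordering' part: place one fresh task per table for the given day."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, day TEXT, target TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (day, target, status) VALUES (?, ?, 'new')",
        [(day, name) for name in table_names],
    )
    conn.commit()


def run_pending(conn, worker):
    """The 'dirty work' part: pick new tasks, run them, record the outcome."""
    pending = conn.execute(
        "SELECT id, target FROM tasks WHERE status = 'new' ORDER BY id"
    ).fetchall()
    for task_id, target in pending:
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (task_id,))
        conn.commit()
        try:
            worker(target)
            status = "done"
        except Exception:
            # One failing task must not stop the others.
            status = "error"
        conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))
        conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    create_daily_tasks(conn, "2012-02-06", ["users", "orders", "clicks"])

    def worker(target):
        # Hypothetical loader; "orders" simulates a failing source.
        if target == "orders":
            raise RuntimeError("simulated load failure")

    run_pending(conn, worker)
    return conn.execute("SELECT target, status FROM tasks ORDER BY id").fetchall()
```
</pre>
This sketch runs the tasks one after another to stay short; in a real system several workers would pick tasks in parallel, which is where locking starts to matter.<br />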
<br />
<font size="4"><b>Task granularity</b></font><br />
<br />
Choose your task granularity wisely. Should it be per table? Per shard? Maybe even per partition? Maybe you should use multiple granularity levels? Think it over. The finer the granularity, the more failure resistant and parallelized the system you create. But this almost always comes at the cost of development and administration, and you can easily overdo it, so that throughput suffers too much. Experiment.<br />
<br />
<font size="4"><b>Task parallelization</b></font><br />
<br />
Beware of deadlocks. Remember that you are creating a parallel system. If you don't know what a mutex, a semaphore, a critical section and so on are, learn that first. For advanced programming I recommend reading &quot;The Art of Multiprocessor Programming&quot; by Maurice Herlihy and Nir Shavit.<br />
<br />
<font size="4"><b>Task statuses</b></font><br />
<br />
Your tasks should have meaningful statuses that can indicate both progress and potential errors. The finer the granularity, the more monitoring you can apply, but at the cost of development, and you sacrifice the generic nature of the statuses. So choose wisely according to your needs. Perhaps you have to handle different types of tasks with different workflows (and thus maybe different sets of statuses).<br />
<br />
<font size="5"><b>Double buffering</b></font><br />
<br />
If you can afford Vertica, then at least in theory (I have never actually tested it) you don't have to worry about the core hours when your warehouse has to work, or reserve some night hours for data loading. Otherwise consider the double buffering solution:<br />
<ul><li>create views that are the only access path for your queries to the underlying tables</li>
<li>load data into tables that are not being queried at the time you're loading</li>
<li>switch the views to the new tables instantly, so no one will notice</li>
<li>repeat this every hour, day or week, depending on how often you reload your data</li>
</ul><br />
VectorWise can handle this double (or more) buffering solution quite well, but transaction isolation does not always work as expected. This is not an issue if you can just repeat a query in a new transaction (connection). Otherwise you may be quite astonished that you don't see any rows in your current view anymore ;-)<br />
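<br />
The view switch can be sketched as follows. The example uses Python with SQLite only as a stand-in, so the exact DDL and the isolation behavior in VectorWise may differ; the view name, the buffer table names and the columns are hypothetical:<br />
<pre>
```python
import sqlite3


def reload_and_switch(conn, view, rows, generation):
    """Load rows into the buffer table that is not in use, then repoint the view."""
    new_table = f"{view}_buf{generation % 2}"
    conn.execute(f"DROP TABLE IF EXISTS {new_table}")
    conn.execute(f"CREATE TABLE {new_table} (user_id INTEGER, visits INTEGER)")
    conn.executemany(f"INSERT INTO {new_table} VALUES (?, ?)", rows)
    # Switch: readers only ever query the view, never the buffer tables.
    conn.execute(f"DROP VIEW IF EXISTS {view}")
    conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {new_table}")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    reload_and_switch(conn, "daily_visits", [(1, 10), (2, 5)], generation=0)
    first = conn.execute("SELECT SUM(visits) FROM daily_visits").fetchone()[0]
    reload_and_switch(conn, "daily_visits", [(1, 12), (2, 7), (3, 1)], generation=1)
    second = conn.execute("SELECT SUM(visits) FROM daily_visits").fetchone()[0]
    return first, second
```
</pre>
With two buffers, the table being reloaded is always the one the view stopped pointing at during the previous switch.<br />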
<br />
<font size="6"><b>Maintainability</b></font><br />
<br />
The system should be at least somewhat maintainable, so that adding new processes, transformations, etc. is as harmless as possible. One way to achieve this is to code the system in a way that... doesn't require much coding. The less source code, the fewer opportunities to make a mistake or to commit a crime such as the copy-paste antipattern. So remember one very important paradigm: use <b>convention over configuration</b>. Don't reinvent the wheel every time a new transformation appears. This also has another upside: if you <b>design your system with some layering in mind</b>, you can plug various new transformations into various points of your system, creating good external interfaces and reusable code.<br />
<br />
For example, let's say you load your data from some SQL dumps that arrive at your data warehouse backup server from production servers. Now, let's say someone says: &quot;We also want to load some data from a couple of additional servers.&quot; This is the time when you say: &quot;OK, let's do that, but make your administrators upload the dumps in the same way I receive the old ones. This way I don't have to change much in my system and it can follow a highly tested workflow.&quot; Argue if anyone disagrees, because after all it's the end user's happiness that matters, and he won't be happy if he sees too many errors too often.<br />
<br />
<font size="6"><b>Monitoring</b></font><br />
<br />
I can confirm Kimball's words, which I heard from him with my own ears: <b>you have to monitor your ETL</b>. You have to know if anything went wrong and fix it. ETL is not something that just works, because the input changes constantly, and because you alter the ETL constantly to meet new needs.<br />
<br />
<font size="5"><b>Task status monitoring</b></font><br />
<br />
<b>Your tasks must provide an easy way to read their current statuses</b>. If they're stored in a database table, it should have a column that indicates the status, and there should be a limited number of possible status values (use an enum, for example). The status should be easily interpretable and human readable (for example, it should contain an &quot;Error&quot; suffix if it indicates any error). Doing this will allow you to write simple queries that summarize the data flow processing in a nice, instant report. The goal is to quickly know whether the whole processing is going fine and, if it is not, which tasks have failed. <b>It's very important</b> (hence the verbosity of this simple paragraph).<br />
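<br />
As an illustration, here is what such a report could look like in Python over SQLite. The <b>tasks</b> table, the status values and the CHECK constraint standing in for an enum are assumptions chosen for the example:<br />
<pre>
```python
import sqlite3

# A closed set of statuses; anything that signals a failure ends with "Error".
STATUSES = ("new", "running", "done", "loadError", "transformError")


def summarize(conn, day):
    """Return ({status: count}, [targets whose status ends with 'Error'])."""
    counts = dict(
        conn.execute(
            "SELECT status, COUNT(*) FROM tasks WHERE day = ? GROUP BY status", (day,)
        ).fetchall()
    )
    failed = [
        row[0]
        for row in conn.execute(
            "SELECT target FROM tasks WHERE day = ? AND status LIKE '%Error' "
            "ORDER BY target",
            (day,),
        )
    ]
    return counts, failed


def demo():
    conn = sqlite3.connect(":memory:")
    allowed = ", ".join(f"'{s}'" for s in STATUSES)
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, day TEXT, target TEXT, "
        f"status TEXT CHECK (status IN ({allowed})))"
    )
    conn.executemany(
        "INSERT INTO tasks (day, target, status) VALUES ('2012-02-06', ?, ?)",
        [("users", "done"), ("orders", "loadError"), ("clicks", "done"), ("logs", "running")],
    )
    return summarize(conn, "2012-02-06")
```
</pre>
Because every error status shares the same suffix, one LIKE pattern finds all failed tasks, whatever the task type.<br />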
<br />
<font size="5"><b>Logging</b></font><br />
<br />
Make your workflow verbose and do heavy logging, especially when you encounter an error. Check the exit statuses of processes, don't swallow exceptions. Believe me, it will pay off! At the cost of an initial waterfall of errors, you get some serious peace later...<br />
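<br />
A small sketch of this habit in Python, assuming the ETL steps are external commands. The helper name <b>run_step</b> and the step names are made up; the commands in the demo are just the Python interpreter standing in for real loaders:<br />
<pre>
```python
import logging
import subprocess
import sys

log = logging.getLogger("etl")


def run_step(name, argv):
    """Run one external ETL step, log what happened, fail loudly on a bad exit code."""
    log.info("starting %s: %s", name, argv)
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        log.error(
            "%s exited with %d, stderr: %s",
            name, result.returncode, result.stderr.strip(),
        )
        # Raise instead of swallowing, so the caller can mark the task as failed.
        raise RuntimeError(f"{name} failed with exit code {result.returncode}")
    log.info("%s finished", name)
    return result.stdout


def demo():
    ok = run_step("ok_step", [sys.executable, "-c", "print('loaded 3 rows')"])
    try:
        run_step("bad_step", [sys.executable, "-c", "import sys; sys.exit(3)"])
    except RuntimeError as exc:
        return ok.strip(), str(exc)
    return ok.strip(), None
```
</pre>
The caller decides what to do with the exception, for example setting the task status to an error value.<br />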
<br />
<font size="6"><b>Implementing a buzzword</b></font><br />
<br />
For business users a data warehouse is a black box they get answers from. They may even treat you as a part of this black box... Nevertheless, they don't understand the limits and possibilities. Even now I'm not sure whether that's an issue or not. If you have to &quot;make a data warehouse&quot;, this means not only that you've got no idea what exactly you have to create. It is also an opportunity to experiment, to adapt to the needs on demand. Am I talking about agile? I don't know, I really don't like buzzwords... You can educate, if you're a believer; the world needs such people. For others I suggest not asking too many questions. They confuse. Just do what you think will be best and try to know on behalf of others.</div>

]]></content:encoded>
			<dc:creator>kuonirat</dc:creator>
			<dc:publisher>42516</dc:publisher>
			<guid isPermaLink="true">http://community.actian.com/forum/blogs/kuonirat/89-designing-warehouse-data-flow.html</guid>
		</item>
		<item>
			<title><![CDATA[How to deal with "Out of memory" problem in VectorWise]]></title>
			<link>http://community.actian.com/forum/blogs/kuonirat/77-how-deal-out-memory-problem-vectorwise.html</link>
			<pubDate>Mon, 11 Oct 2010 07:38:32 GMT</pubDate>
			<description><![CDATA[Until now, VectorWise's temporary queries results have to fit memory. Not always there is a possibility to rewrite the query in a "better" way...]]></description>
			<content:encoded><![CDATA[<div>So far, VectorWise's temporary query results have to fit in memory. It is not always possible to rewrite the query in a &quot;better&quot; way (otherwise one could optimize infinitely to reach zero query time and no memory usage ;) ). Consider an example where one would like to count the distinct users who visit your website, but per day. Following the example, let's say your website has been running for more than four years and you've got approximately 7 to 13 million unique users per day... But you want the exact numbers. I don't know how much RAM you have, but believe me, hundreds of gigabytes would be needed.<br />
<br />
One approach is to write a script that generates many SQL queries with highly narrowing constraints. It's not hard to code (for example in Ruby ;) ) a simple script that generates, in a loop, the aforementioned queries with per-day constraints. Now, to get the result in exactly the same form as it would come from the original query, just put these statements at the end of the generated queries:<br />
<br />
<div style="margin:20px; margin-top:5px">
	<div class="smallfont" style="margin-bottom:2px">Code:</div>
	<pre class="alt2" dir="ltr" style="
		margin: 0px;
		padding: 6px;
		border: 1px inset;
		width: 640px;
		height: 114px;
		text-align: left;
		overflow: auto">\silent
\script result.csv
\notitles
\vdelim TAB
\trim
\g</pre>
</div>Of course there should be only one '\g' command, preceded by the queries separated only by semicolons.<br />
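<br />
Such a generator could look like this in Python (instead of Ruby). The table and column names (<b>visits</b>, <b>visit_day</b>, <b>user_id</b>) are hypothetical; the footer is the list of terminal monitor commands quoted above:<br />
<pre>
```python
from datetime import date, timedelta

# Terminal monitor commands quoted from the post, in the same order.
FOOTER = ["\\silent", "\\script result.csv", "\\notitles", "\\vdelim TAB", "\\trim", "\\g"]


def per_day_script(first_day, last_day, table="visits", day_col="visit_day", user_col="user_id"):
    """Build one script: a narrow COUNT(DISTINCT) query per day, then the footer."""
    queries = []
    for offset in range((last_day - first_day).days + 1):
        day = (first_day + timedelta(days=offset)).isoformat()
        queries.append(
            f"SELECT '{day}', COUNT(DISTINCT {user_col}) "
            f"FROM {table} WHERE {day_col} = '{day}'"
        )
    # Queries are separated by semicolons only; the single \g comes from the footer.
    return ";\n".join(queries) + "\n" + "\n".join(FOOTER) + "\n"


if __name__ == "__main__":
    print(per_day_script(date(2010, 10, 1), date(2010, 10, 3)), end="")
```
</pre>
Each generated query only has to hold one day's worth of distinct users in memory.<br />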
<br />
But what if we need some temporary aggregate that doesn't fit in memory? Could we use the &quot;create table as&quot; syntax to create such a table? The answer is &quot;no&quot;, because before being written to the temporary table, the result also has to fit in memory. But there is another solution. Assuming that rows are generated independently, one could create a temporary table and run many &quot;insert select&quot; statements into this table, each of them fitting in memory. Performing the reducing query on such a table will be vector-wise-fast.</div>
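<div><br />
The chunked &quot;insert select&quot; idea from the last paragraph can be sketched like this. Python with SQLite is used only to make the example runnable; it does not reproduce VectorWise's memory behavior, and the <b>visits</b> and <b>day_users</b> tables are hypothetical:<br />
<pre>
```python
import sqlite3


def distinct_users_per_day(conn, days):
    """Fill a temporary aggregate one day at a time, then reduce it."""
    conn.execute("CREATE TEMP TABLE day_users (day TEXT, user_id INTEGER)")
    for day in days:
        # Each INSERT ... SELECT touches one day only, so its intermediate
        # result stays small.
        conn.execute(
            "INSERT INTO day_users "
            "SELECT DISTINCT day, user_id FROM visits WHERE day = ?",
            (day,),
        )
    # The reducing query runs over the already deduplicated rows.
    return conn.execute(
        "SELECT day, COUNT(*) FROM day_users GROUP BY day ORDER BY day"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (day TEXT, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [("2010-10-01", 1), ("2010-10-01", 1), ("2010-10-01", 2), ("2010-10-02", 2)],
    )
    return distinct_users_per_day(conn, ["2010-10-01", "2010-10-02"])
```
</pre>
The chunks are independent because a user counted on one day never affects the count for another day.</div>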

]]></content:encoded>
			<dc:creator>kuonirat</dc:creator>
			<dc:publisher>42516</dc:publisher>
			<guid isPermaLink="true">http://community.actian.com/forum/blogs/kuonirat/77-how-deal-out-memory-problem-vectorwise.html</guid>
		</item>
		<item>
			<title>Migrating large tables from MySQL to VectorWise</title>
			<link>http://community.actian.com/forum/blogs/kuonirat/70-migrating-large-tables-mysql-vectorwise.html</link>
			<pubDate>Fri, 02 Jul 2010 10:08:16 GMT</pubDate>
			<description>One can face a problem when migrating from MySQL to VectorWise, when a fast and appropriate method is required. Using multiple inserts is not the way...</description>
			<content:encoded><![CDATA[<div>One can face a problem when migrating from MySQL to VectorWise, when a fast and appropriate method is required. Using multiple inserts is not the way VectorWise likes to see it, but using additional dumps can have a high storage cost.<br />
<br />
Here is a hint on how to migrate large tables from MySQL to VectorWise with low disk space and memory usage. We'll use Linux named pipes. First we have to create a pipe:<br />
<br />
<font face="Courier New">$ mknod pipe p</font><br />
<br />
Now we make a query to mysql:<br />
<br />
<font face="Courier New">$ mysql --quick -h my_host -umy_user my_dbname -B --default-character-set=utf8 -e &quot;select col1, col2, col3 from \`my_table\`;&quot; &gt; pipe &amp;</font><br />
<br />
This will make mysql block on the pipe in the background (notice the ampersand) while collecting the results that are pushed down by the MySQL server.<br />
<br />
Notice the --quick option, which is the key to the whole memory saving. It makes mysql not buffer the whole result from the server; instead, it prints each row to the pipe as it is received. Someone has to immediately start reading from the other side of the pipe, so we use a COPY TABLE ... FROM statement to do that:<br />
<br />
<font face="Courier New">$ echo &quot;COPY TABLE my_table (<br />
  col1 = 'c0tab',<br />
  col2 = 'c0tab',<br />
  col3 = 'c0nl'<br />
) FROM 'pipe' WITH ON_ERROR = CONTINUE \g&quot; | sql vw_dbname</font><br />
<br />
This method is OK as long as you have very simple data (not necessarily small, of course). However, we'll need some more conversion if we have NULLs in MySQL columns or timestamps like '0000-00-00 00:00', which are not supported by VectorWise. We could use a more complicated query to the MySQL server (which is probably the better choice) or use sed to alter some data. For example:<br />
<br />
<font face="Courier New">$ mysql --quick -h my_host -umy_user my_dbname -B --default-character-set=utf8 -e &quot;select col1, col2, col3 from \`my_table\`;&quot; | sed 's/NULL//g;s/\\n//g;s/0000-00-00/0001-01-01/g' &gt; pipe &amp;</font><br />
<br />
This will, on the fly, remove &quot;NULL&quot; strings and newlines from within character fields, and replace the problematic timestamp with a value more &quot;edible&quot; for VectorWise. Note that sed will, surprisingly, be the bottleneck of this procedure, so it should be avoided in favor of a more sophisticated query to the MySQL server. This method would also cause problems when the character fields contain some &quot;dangerous&quot; characters, for example tabs.<br />
<br />
Konrad Procak</div>

]]></content:encoded>
			<dc:creator>kuonirat</dc:creator>
			<dc:publisher>42516</dc:publisher>
			<guid isPermaLink="true">http://community.actian.com/forum/blogs/kuonirat/70-migrating-large-tables-mysql-vectorwise.html</guid>
		</item>
	</channel>
</rss>

