From maze at strahlungsfrei.de Wed Jan 2 10:56:31 2008
From: maze at strahlungsfrei.de (Martin Honermeyer)
Date: Wed, 2 Jan 2008 16:56:31 +0100
Subject: [Activewarehouse-discuss] SCD type 2 handling
Message-ID: <200801021656.33425.maze@strahlungsfrei.de>

Hello guys,

I can't get my head around the handling of type 2 SCD records in
ActiveWarehouse.

In theory (from "The Data Warehouse Toolkit"), a dimension can have a
mixture of type 1 and type 2 fields, defined separately on a
field-by-field basis. When there is a change, the ETL process should
check whether any type 2 fields are affected. If so, it should create a
new dimension record. Otherwise, the existing dimension record must be
updated (or created, if the record is new).

As far as I can see, ActiveWarehouse allows an SCD setting at the row
level only. If scd_type is set to 2, I can define some scd_fields which
will be handled as type 2 fields. If a record is processed (in
destination.rb => process_change) and scd_fields are defined, _only_
those fields are CRC-checked for changes. If none of those are changed,
no record is updated or created.

This seems odd behaviour to me. Fields in a type 2 dimension not
defined as scd_fields should be treated as type 1 fields, so an existing
record has to be updated (or created). Otherwise changes in those other
fields won't be reflected in the data warehouse.

Please tell me if I am getting that wrong.

Sincerely,
Martin

From maze at strahlungsfrei.de Wed Jan 2 13:33:00 2008
From: maze at strahlungsfrei.de (Martin Honermeyer)
Date: Wed, 2 Jan 2008 19:33:00 +0100
Subject: [Activewarehouse-discuss] How to preserve the surrogate key with SCDs?
Message-ID: <200801021933.02163.maze@strahlungsfrei.de>

Concerning SCDs, another question comes to mind.

I have a large fact table having foreign keys to some big dimensions. I
want to load this fact table incrementally (no truncation of the fact
table) in order to keep the ETL process short.
So the dimensions have to be loaded incrementally, too. Changed
dimension records (type 1 ATM) have to keep their surrogate keys, so
that only new fact records have to be loaded in each batch.

With the current design of AW-ETL, this seems impossible. SCDs always
delete the old record and insert a new record using the bulk loader. I
can use the surrogate key generator on a field, but it doesn't seem to
honor surrogate keys on existing dimension records. This pertains to
type 2 mode - the type 1 mode only works when truncating and reloading
the complete dimension table.

I would migrate to type 2 SCDs here if that helped. I really don't
understand how type 2 SCDs are currently supposed to work with AW-ETL.
The only possibility I can imagine is to have the natural key in the
fact table. Then you could constrain on the natural key and the
effective dates in the dimension table using a date in the fact table.
Really complicated!

Some hints would be greatly appreciated.

Regards
Martin

From mghaught at gmail.com Mon Jan 7 16:03:01 2008
From: mghaught at gmail.com (Marty Haught)
Date: Mon, 7 Jan 2008 14:03:01 -0700
Subject: [Activewarehouse-discuss] Render Report Changes Coming
Message-ID: <57f29e620801071303m7a468ec2g36cd0041a90df3c1@mail.gmail.com>

Hey Everyone,

I'm in the process of checking in some new view-related code to
ActiveWarehouse's trunk. These new models will make it easier to create
new views of reports. If you're using the trunk of the plugin, you'll
want to prepare for changing how you render your table reports. I'll
document how to use the new view code in the next week.
Cheers,
Marty Haught

From bdimchef at wieldim.com Mon Jan 7 16:22:07 2008
From: bdimchef at wieldim.com (Brandon Dimcheff)
Date: Mon, 7 Jan 2008 16:22:07 -0500
Subject: [Activewarehouse-discuss] [PATCH] REPLACE support for MySQL bulk load in adapter_extensions
Message-ID: <0A44B4DA-E094-4414-97B7-E1748C19E89D@wieldim.com>

I modified adapter_extensions so I can use the bulk loader to replace
existing records in my database rather than just ignoring records that
already exist in the database. I've attached a patch with tests for
this functionality.

- Brandon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: replace.patch
Type: application/octet-stream
Size: 3302 bytes
Desc: not available
Url : http://rubyforge.org/pipermail/activewarehouse-discuss/attachments/20080107/79a52fee/attachment.obj

From me at twifkak.com Tue Jan 15 01:13:52 2008
From: me at twifkak.com (Devin Mullins)
Date: Tue, 15 Jan 2008 01:13:52 -0500
Subject: [Activewarehouse-discuss] Fact/Dimension Many-to-Many Relationship (i.e. multivalued dimensions)?
Message-ID: <20080115061352.GA13026@twifkak.com>

Hi all,

I'm brand new (as in, this morning) to AW and OLAP, and have come up
against a wall pretty quick. I'm working on a newspaper site, where an
Article has_many :regions and has_many :topics. I see from pages like
http://www.dbmsmag.com/9808d05.html and
http://technet.microsoft.com/en-us/library/ms345139.aspx that the star
schema can be extended with bridges/helpers/snowflakery to support
multivalued dimensions, but I have questions:
1. A couple of places seemed to argue that the bridge tables defeated
   the performance advantages of the star schema, as bridge tables
   would have even more rows than fact tables. Agree/disagree?
2.
I suppose I could just make three fact tables, Article,
   ArticleRegion, and ArticleTopic, but that's a bunch of added
   complexity, and I lose the ability to slice by both topic and region
   at once (or by topic grouping, without double-counting).
   Agree/disagree?

And most importantly:
3. How would I implement such a beast in ActiveWarehouse? I got as far
   as `script/generate bridge`. :P Is there cube support for this type
   of thing?

Thanks,
Devin

From anthonyeden at gmail.com Tue Jan 15 08:03:18 2008
From: anthonyeden at gmail.com (Anthony Eden)
Date: Tue, 15 Jan 2008 08:03:18 -0500
Subject: [Activewarehouse-discuss] Fact/Dimension Many-to-Many Relationship (i.e. multivalued dimensions)?
In-Reply-To: <20080115061352.GA13026@twifkak.com>
References: <20080115061352.GA13026@twifkak.com>
Message-ID:

Regions and topics should probably be dimensions and article a fact.
What I can't figure out is what the measurements are in this (aside
from counts perhaps) since there is very little to go on. Keep in mind
that you are developing a database structure that is designed
specifically for analytical queries and thus you want to minimize or
completely remove joins to large tables, hence your dimensions should
be small.

V/r
Anthony

On Jan 15, 2008 1:13 AM, Devin Mullins wrote:
> Hi all,
>
> I'm brand new (as in, this morning) to AW and OLAP, and have come up
> against a wall pretty quick. I'm working on a newspaper site, where an
> Article has_many :regions and has_many :topics. I see from pages like
> http://www.dbmsmag.com/9808d05.html and
> http://technet.microsoft.com/en-us/library/ms345139.aspx that the star
> schema can be extended with bridges/helpers/snowflakery to support
> multivalued dimensions, but I have questions:
> 1. A couple of places seemed to argue that the bridge tables defeated
>    the performance advantages of the star schema, as bridge tables
>    would have even more rows than fact tables. Agree/disagree?
> 2.
I suppose I could just make three fact tables, Article,
>    ArticleRegion, and ArticleTopic, but that's a bunch of added
>    complexity, and I lose the ability to slice by both topic and region
>    at once (or by topic grouping, without double-counting).
>    Agree/disagree?
>
> And most importantly:
> 3. How would I implement such a beast in ActiveWarehouse? I got as far
>    as `script/generate bridge`. :P Is there cube support for this type
>    of thing?
>
> Thanks,
> Devin
> _______________________________________________
> Activewarehouse-discuss mailing list
> Activewarehouse-discuss at rubyforge.org
> http://rubyforge.org/mailman/listinfo/activewarehouse-discuss
>

From me at twifkak.com Tue Jan 15 08:41:14 2008
From: me at twifkak.com (Devin Mullins)
Date: Tue, 15 Jan 2008 08:41:14 -0500
Subject: [Activewarehouse-discuss] Fact/Dimension Many-to-Many Relationship (i.e. multivalued dimensions)?
In-Reply-To:
References: <20080115061352.GA13026@twifkak.com>
Message-ID: <20080115134114.GA13986@twifkak.com>

Anthony, thanks for responding so promptly.

On Tue, Jan 15, 2008 at 08:03:18AM -0500, Anthony Eden wrote:
> Regions and topics should probably be dimensions and article a fact.
> What I can't figure out is what the measurements are in this (aside
> from counts perhaps) since there is very little to go on.
Yes, counts is it, at least for now. Maybe aggregate hits some day.
Assuming the idea of three fact tables is out, then, the question
remains: How do I implement multivalued dimensions in ActiveWarehouse?

> you want to minimize or
> completely remove joins to large tables, hence your dimensions should
> be small.
Well, that's a problem. The bridge table would be huge -- more rows than
the fact table. (Though, granted, only two columns -- it's essentially a
:through table.) But saying "no" to the customer isn't really an option
I'd like to take right now...
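
A minimal plain-Ruby sketch of such a bridge, for illustration only. It
does not use ActiveWarehouse, and every name in it (ARTICLE_FACTS,
ARTICLE_TOPIC_BRIDGE, hits_by_topic, the weight column) is an assumption
made up for the example. The weight column is a third column on top of
the two mentioned above; it assumes each article's weights sum to 1,
which is one common way to keep totals from double-counting.

```ruby
# Hypothetical data; none of these names come from ActiveWarehouse.
# One fact row per article.
ARTICLE_FACTS = [
  { :article_id => 1, :hit => 1 },
  { :article_id => 2, :hit => 1 }
]

# Bridge: one row per (article, topic) pair. The weights for a given
# article sum to 1 (an assumption of this sketch).
ARTICLE_TOPIC_BRIDGE = [
  { :article_id => 1, :topic => "politics", :weight => 0.5 },
  { :article_id => 1, :topic => "economy",  :weight => 0.5 },
  { :article_id => 2, :topic => "economy",  :weight => 1.0 }
]

# Join facts to the bridge and sum the hit measure per topic.
# With weighted = false, an article counts fully under each of its topics.
def hits_by_topic(facts, bridge, weighted)
  totals = Hash.new(0.0)
  facts.each do |fact|
    bridge.each do |row|
      next unless row[:article_id] == fact[:article_id]
      factor = weighted ? row[:weight] : 1.0
      totals[row[:topic]] += fact[:hit] * factor
    end
  end
  totals
end

raw      = hits_by_topic(ARTICLE_FACTS, ARTICLE_TOPIC_BRIDGE, false)
weighted = hits_by_topic(ARTICLE_FACTS, ARTICLE_TOPIC_BRIDGE, true)

puts raw.inspect       # unweighted per-topic counts
puts weighted.inspect  # weighted per-topic shares
```

With the sample rows, the unweighted per-topic counts add up to more
than the number of articles, while the weighted ones add up to exactly
the number of articles.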
Thanks,
Devin

From anthonyeden at gmail.com Tue Jan 15 08:48:49 2008
From: anthonyeden at gmail.com (Anthony Eden)
Date: Tue, 15 Jan 2008 08:48:49 -0500
Subject: [Activewarehouse-discuss] Fact/Dimension Many-to-Many Relationship (i.e. multivalued dimensions)?
In-Reply-To: <20080115134114.GA13986@twifkak.com>
References: <20080115061352.GA13026@twifkak.com> <20080115134114.GA13986@twifkak.com>
Message-ID:

I believe you said:
Article has_many Topics
Article has_many Regions

Is this correct? If it is then you might have something like this:

topic_dimension
* id
* name
* [other attributes]

region_dimension
* id
* name
* [other attributes]

article_facts
* topic_id
* region_id
* hit (always 1)

With this structure you'd be able to aggregate your facts by the
various attributes in both topic and dimension. Is this starting to
look like what you are trying to do or am I still missing something
here?

V/r
Anthony

On Jan 15, 2008 8:41 AM, Devin Mullins wrote:
> Anthony, thanks for responding so promptly.
>
> On Tue, Jan 15, 2008 at 08:03:18AM -0500, Anthony Eden wrote:
> > Regions and topics should probably be dimensions and article a fact.
> > What I can't figure out is what the measurements are in this (aside
> > from counts perhaps) since there is very little to go on.
> Yes, counts is it, at least for now. Maybe aggregate hits some day.
> Assuming the idea of three fact tables is out, then, the question
> remains: How do I implement multivalued dimensions in ActiveWarehouse?
>
> > you want to minimize or
> > completely remove joins to large tables, hence your dimensions should
> > be small.
> Well, that's a problem. The bridge table would be huge -- more rows than
> the fact table. (Though, granted, only two columns -- it's essentially a
> :through table.) But saying "no" to the customer isn't really an option
> I'd like to take right now...
>
> Thanks,
> Devin
>

From twifkak at gmail.com Tue Jan 15 13:04:44 2008
From: twifkak at gmail.com (Devin Mullins)
Date: Tue, 15 Jan 2008 13:04:44 -0500
Subject: [Activewarehouse-discuss] Fact/Dimension Many-to-Many Relationship (i.e. multivalued dimensions)?
Message-ID:

(Meta: Sorry for not replying in-thread -- I'm posting this from another
email account.)

> I believe you said:
> Article has_many Topics
> Article has_many Regions
>
> Is this correct? If it is then you might have something like this:
>
> ...
> article_facts
> * topic_id
> * region_id
> * hit (always 1)
>
> With this structure you'd be able to aggregate your facts by the
> various attributes in both topic and dimension. Is this starting to
> look like what you are trying to do or am I still missing something
> here?

I think you're still missing something. If I do the above, I'm stuck
with one of two alternatives:
1. I have to choose a "primary" topic and region for each article, and
   discard the others, or
2. For each article, I have m*n article_facts, where m is the number of
   regions and n is the number of topics.

The former leads to artificially low numbers (where an article is not
counted for all but one of its topics/regions). The latter leads to
artificially high numbers (where an article is n-tuply counted for every
region, and m-tuply counted for every topic). I suppose I could fudge
the numbers a bit, by creating a "weighting" column which is
1/sqrt(m*n) (inverse of geometric mean, so the over/undercounting would
be right "on average", assuming the number of regions and number of
topics are independent variables), but then I'd feel dirty.

Perhaps I scared you by referencing an MSDN technet article. The other
article (http://www.dbmsmag.com/9808d05.html) was written by Ralph
Kimball. :)

Looking at the source code more, it looks like it's hard-coded to
recognize HierarchicalBridges, and there's no real support for pluggable
bridging. (Please correct me if I'm wrong.)
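
The arithmetic behind the second alternative can be made concrete with
a small plain-Ruby sketch. The article, its regions and its topics are
invented for the example (m = 2, n = 3), the helper names are
hypothetical, and nothing here uses ActiveWarehouse.

```ruby
# Hypothetical article with m = 2 regions and n = 3 topics.
REGIONS = ["north", "south"]
TOPICS  = ["politics", "economy", "sports"]

# Alternative 2: one fact row per (region, topic) combination, m * n rows,
# each carrying the given weight as its hit measure.
def cross_product_facts(regions, topics, weight)
  rows = []
  regions.each do |region|
    topics.each do |topic|
      rows << { :region => region, :topic => topic, :hit => weight }
    end
  end
  rows
end

# Sum the hit measure grouped by one attribute (:region or :topic).
def sum_by(rows, key)
  totals = Hash.new(0.0)
  rows.each { |row| totals[row[key]] += row[:hit] }
  totals
end

m = REGIONS.size
n = TOPICS.size

unweighted = cross_product_facts(REGIONS, TOPICS, 1.0)
puts sum_by(unweighted, :region).inspect  # each region sees the article n times
puts sum_by(unweighted, :topic).inspect   # each topic sees the article m times

# The 1/sqrt(m*n) weighting: per-region sums become sqrt(n/m) and
# per-topic sums become sqrt(m/n), so neither equals 1 unless m == n.
fudged = cross_product_facts(REGIONS, TOPICS, 1.0 / Math.sqrt(m * n))
puts sum_by(fudged, :region).inspect
puts sum_by(fudged, :topic).inspect
```

For a single article the weighted sums land above 1 in one grouping and
below 1 in the other, which is the sense in which the weighting is only
right on average.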
If I decide to follow the approach laid out in the above article, I'd
have to modify the source code of whatever Aggregate I'm using. Does
this sound correct? Feasible? Stupid?

I'd rather use ActiveWarehouse if I can, because it looks like there's a
lot of knowledge here I could benefit from, and a lot of boilerplate I
can save, but I suppose lacking a solution to the above, I'll just start
hard coding some aggregate tables. Not that I blame the authors -- it's
no fault of yours if none of your data has multivalued dimensions.

If there's another way I can granulate the data that gets rid of them
but still lets me slice/count the way I'd like, I'm open to that, as
well.

From phylae at gmail.com Sat Jan 19 03:37:48 2008
From: phylae at gmail.com (Paul Cortens)
Date: Sat, 19 Jan 2008 00:37:48 -0800
Subject: [Activewarehouse-discuss] rake warehouse:build_date_dimension problems
Message-ID:

Hi,

I have been having trouble with
rake warehouse:build_date_dimension

Some dates have been missing and some have been duplicated. It had to
do with daylight savings. Here is how I fixed it:

activewarehouse/tasks/active_warehouse_tasks.rake
< # start_date = (ENV['START_DATE'] ? Time.parse(ENV['START_DATE']) : Time.now.years_ago(5))
< # end_date = (ENV['END_DATE'] ? Time.parse(ENV['END_DATE']) : Time.now)
<
< start_date = (ENV['START_DATE'] ? ENV['START_DATE'].to_time : Date.today.to_s.to_time.years_ago(5))
< end_date = (ENV['END_DATE'] ? ENV['END_DATE'].to_time : Date.today.to_s.to_time )
---
> start_date = (ENV['START_DATE'] ? Time.parse(ENV['START_DATE']) : Time.now.years_ago(5))
> end_date = (ENV['END_DATE'] ? Time.parse(ENV['END_DATE']) : Time.now )

This works for now, but what is the "right" way to handle this?

I am on Ubuntu 7.10 with Rails 2.0.2; Ruby 1.8.6; and my timezone is
Pacific -0800

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/activewarehouse-discuss/attachments/20080119/ff7870eb/attachment.html

From thibaut.barrere at gmail.com Sat Jan 19 13:54:35 2008
From: thibaut.barrere at gmail.com (Thibaut Barrère)
Date: Sat, 19 Jan 2008 19:54:35 +0100
Subject: [Activewarehouse-discuss] A little beginner's guide to datawarehouse
Message-ID: <4a68b8cf0801191054x5c629812x1ec631f185558cbe@mail.gmail.com>

Hi!

After a presentation I gave a while back about datawarehousing with
Ruby, I got a bunch of questions about how to get started with
datawarehousing in general.

I've compiled the answers (mostly pointers, book references and a
couple of tips) in an article here:

http://blog.logeek.fr/2008/1/19/a-beginner-s-guide-to-datawarehouse

Nothing really new for most of you I guess, but I hope it will be
helpful to some!

cheers,

-- Thibaut

From obrien.andrew at gmail.com Mon Jan 28 16:22:46 2008
From: obrien.andrew at gmail.com (Andrew O'Brien)
Date: Mon, 28 Jan 2008 16:22:46 -0500
Subject: [Activewarehouse-discuss] SQLite & ETL problem
Message-ID:

I was wondering if anyone else is having problems using SQLite and
AW-ETL together. The problem I'm running into is that the ActiveRecord
SQLite adapter doesn't symbolize keys on the config hash, so it fails
when it tries to look up config[:database] (since the hash still has
config["database"] keys). MySQL (and indeed most of the other adapters)
symbolize the keys beforehand, so I don't know if this is just an
omission or intentional.

The weird thing is that my Rails app that uses the SQLite3 adapter does
fine, but I can't find where the keys are being symbolized.

So should AW-ETL call symbolize_keys? Or should ActiveRecord's SQLite
adapter? Or am I just doing something wrong?

Thanks,
Andrew

From mwlang at mdlogix.com Wed Jan 30 20:08:54 2008
From: mwlang at mdlogix.com (Michael Lang)
Date: Wed, 30 Jan 2008 20:08:54 -0500
Subject: [Activewarehouse-discuss] Is the demo or online article up-to-date?
Message-ID: <47A11FA6.6010601@mdlogix.com>

I am trying to run the demo by following this article:
http://anthonyeden.com/2006/12/20/activewarehouse-example-with-rails-svn-logs

I cheated like an impatient reader jumping to the last page of a mystery
novel and pulled down the code:

svn checkout svn://rubyforge.org/var/svn/activewarehouse/rails_warehouse/trunk

I then pulled down the Rails logs with:
~/rails_warehouse/db/etl/download_rails_log.rb (since the
download_aw_log.rb resulted in an empty input/aw_log.xml file)

I then changed the *.ctl files to reflect input/rails_log.xml and
successfully loaded the data into the tables, so I'm pretty sure I have
that much working fine.

However, I noticed a couple of issues with the app itself...

The checked out code differs from the article in that the Revisions
Controller has an empty index and a by_author function that has the same
code as the index def in the article, whereas the article just has the
index method.

There are no rake warehouse:xxxx tasks. "rake -T | grep warehouse"
returns an empty set.

When I follow the article, generating new controllers, views, and
models, and run the code, I get the following for
http://localhost:3000/revision_reports:

NameError in Revision reportsController#index

uninitialized constant RevisionReportsController::ActiveWarehouse

RAILS_ROOT: script/../config/..

/opt/local/lib/ruby/gems/1.8/gems/activesupport-1.4.2/lib/active_support/dependencies.rb:477:in `const_missing'
app/controllers/revision_reports_controller.rb:4:in `index'

I have the following (subset) of gems installed:
activewarehouse (0.3.0)
activewarehouse-etl (0.9.0)
adapter_extensions (0.4.0)
mongrel (1.0.1)
rails (1.2.3, 1.1.6)
rails_sql_views (0.6.1)

...and am running the following Ruby version:
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.10.1]
(on a MacBook Pro, Tiger)

Any ideas what might be the problem?

Regards,
Michael