Data Warehousing For Mere Mortals: Redefining Requirements Gathering

Prelude

In the past decade the business community has discovered that it is entirely possible to take delivery of a Data Warehouse in time and within budget. In fact it will tolerate none other than successful and timely implementation of phased deliverables normally in the form of Subject Area Marts. Each phase must be completed in no longer than 12 weeks. And each deliverable will be deemed successful only if accompanied by a meaningful presentation layer. Henceforth the Data Warehouse professional must take every precaution to ensure each task and milestone within the Development Life Cycle is met and no time is wasted at any point in the EDW project.

Requirements Gathering in the traditional sense is neither effective nor necessary in large-scale EDW projects. More EDW projects fail due to over-analysis than any other reason. As “lack of adequate requirements” is used as justification for a failed EDW attempt, this paper invalidates such position for the sole reason that business requirements do not entirely define a Data Warehouse.

In this paper we will study the Analysis phase of Enterprise Data Warehouse projects and will compare and contrast rigid traditional Requirements Gathering activities with those of less rigid ones; those that put much emphasis on clarifying Business Processes and Business Rules with ones that favor exploring Data and Data Sources. I should also note that this paper does not propose a purely data driven analysis, rather analysis by heuristics distributed throughout the Development Life Cycle.

Traditionally both OLTP and OLAP System Development Life Cycles put much emphasis on completing one phase before starting another as in a Waterfall. Rigid interdependencies between phases of a system development life cycle often resulted in the downfall of the project. It is needless to say this strategy is most impractical for an Enterprise Data Warehouse. As new methodologies are explored and experimented such as Modified Waterfall, JAD, SCRUM, Agile, Extreme, RAD, etc., EDW projects are still lacking a well-defined and well-understood Development Life Cycle methodology. Perhaps this is the paradox in EDW environments where the strict rules of old (and new) Development Life Cycle models go against the very grain and nature of Enterprise Data Warehouses. EDW development cycles cannot be boxed into any one specific methodology; rather it requires high level of learned heuristics within a fluid and adaptable development environment.

The problem with requirements gathering

If we agree that a bottom-up design methodology is suitable for a large-scale Enterprise Data Warehouse and we have embarked on creating a BUS Architecture that allows us to create one federated Data Mart at a time, we should then agree we have embarked on an iterative development cycle. If this development methodology has been proven highly effective and is now widely practiced, why are we then insistent on having EDW requirements defined at the outset?

Aside from the architecture the most significant distinction that separates a Data Warehouse from an OLTP is the lack of well-defined behavior of the system prior to deployment. In an OLTP System Development Life Cycle requirements are precise, understood and documented down to the module level and prior to the design phase. This is feasible because the expected behavior of the application is often mandated by the business processes hence requirements may be clearly defined. This sort of precision does not exist in the early stages of an EDW project and may never reach that level of maturity or granularity. Furthermore business is often ambiguous when it comes to OLAP and it truly does not know what is expected of the Data Warehouse. What defines a Data Warehouse at the outset is therefore not business requirements, but the data that feeds it.

Some traditional requirements gathering activities still practiced today include group and one-on-one sessions with heterogeneous staff including executives, unit managers, analysts, and subject area experts from both IT and the business community. I am not entirely against this rather am questioning the weight, the time and effort, and the resources we dedicate to it and our expectations of return on investment from this activity. In the past 15 years of my involvement with Data Warehouse projects, I have seen many hours, days, and months being spent on similar JAD (Joint Application Design) type of activity where large groups gather in conference rooms day after day, week after week and month after month to hash out business rules, business processes, review internal and external systems; their history, their inputs/outputs; discuss systems relationships, systems data; their idiosyncrasies, their myriad data issues; propose cleansing and scrub methods; review old reports and propose new ones, propose KPIs, mock-up dashboards, whiteboard data models, network diagrams, and data flowcharts, draw stick figures, circles, rectangles, pyramids and other varieties of geometric shapes – and then document everything.

By the time the EDW project team is finished with the above analysis it finds itself six months well into the project with many dollars spent and nothing to show for it except for myriad of not so useful files and documentation that no one wants to ever revisit.

The art of prototyping

The remedy to the above seemingly impossible problem is but a single concept; Prototype. To some a Prototype means a scaled down version of the larger system developed for proof of concept. Prototyping can also be viewed as an integration of multiple phases of the System Development Life Cycle performed within the confines of limited number of iterations. In this paper, we will view Prototyping as set of heuristics that consolidates the Analysis phase with the Design and Development phase in an Enterprise Data Warehouse project. Although it is important to point out prototyping is not a random experiment. It has structure. It is methodical. It is learned heuristics that brings high level of expertise and experience within the respective areas of both business and technology.

To illustrate the point let us consider the well-known composer Johan Sebastian Bach. Many of his pieces seem difficult, almost impossible to deconstruct. They are often structured with mathematical precision; perfectly logical and highly complex motifs making up haunting melodies that intentionally chase each other, expressing ideas in the most beautiful way imaginable. Bach did not use the big bang theory nor did he top-down design on paper to write his music. He was often under strict time constraints and did not over-analyze his ideas before placing his masterful hands on the black and white keys. He had the deep knowledge and the extensive experience and the technical knowhow to experiment well.

Analyze while you stage

In the early stages of the EDW Development Life Cycle, focus must be solely on understanding source systems and source systems’ data. The 80/20 rule applies well here. Most of the EDW requirements may be deduced from the data sources and their relationships with each other. This understanding enables requirements, along with other components of the Data Warehouse to solidify during iterations. Multiple source data analysis is in itself a type of Requirements Gathering with the distinction that it can be performed relatively quickly and almost entirely at your desks rather than in conference rooms during long winded group sessions. Additionally, much of the data analysis work can be automated through Data Profiling. Data profiling tools can be written with relative ease or acquired from third party vendors with little cost. But the most interesting point to this unconventional method of requirements gathering is that it can be performed in parallel with each phase of the EDW project including Staging.

Staging must be the number one task on any EDW project list since it is fundamental to the existence of a Data Warehouse. Quick and effective study of company’s data can be performed only if disparate source systems’ data have been offloaded in a staging environment. It would be difficult to effectively profile large volumes of data from multiple sources and understand commonalities and relationships between each system in their production settings. It is highly recommended to always stage all disparate source system’s data at the outset. Although this may not always be possible due to time constraints, but if it can be done, it will provide a much more fluid development environment and a fast- paced Development Life Cycle. It will be especially useful as you integrate dimensional data elements from disparate source systems into conformed and shared dimensions. So in the spirit of data integration and facilitating federated Data Marts it is much more suitable to have most, if not all data sources staged at the outset.

History Retention – An often missed requirement

Prototyping in an EDW setting must not be confused with short-term or temporary solutions. On the contrary, the tools along with prototyping methodologies described in this paper are presented with long-term strategy in mind. As Bach’s music passes the test of time, the Data Warehouse crafted in this fashion will prove stable for years to come.

In the context of long-term strategy and sustainable systems, one cannot speak about staging and not mention historical data retention. Surprisingly enough this is often missed by the experts. And many IT departments shy away from archiving data or have no clear strategy or policies around historical data retention. It is the responsibility of the Data Warehouse professional to look into such policies or the lack thereof and ensure comprehensive data retention processes are in place. Without adequate history an EDW effort will undoubtedly fail. Although the topic of comprehensive historical data retention or Operational Data Stores (ODS) are outside of the scope of this paper, the need for historical data must be considered an absolute “requirement”, if it has not already been identified as such, and a brief overview is warranted due to its critical importance.

Normally while staging databases are being created, archival ones get created in parallel or at least immediately after. The archival database usually has a different structure than the staged database. It may not include all data elements of all data sources, however, if in doubt it is better to archive than not.

It is the belief of some that comprehensive archiving is unnecessary since the Data Warehouse itself stores history. They will argue a separate archival environment is both redundant and excessive. It is the responsibility of the BI professional to make a strong case against this for very good reasons. Since data is highly transformed in Data Marts; the ability to refer to its original state at any point in time is critical for both audit purposes and scalability. A Data Warehouse development cycle is phased and highly iterative, it is fluid and adapts to constantly changing business; it expands and matures through time. Each time data elements are changed or added to facts and dimensions it must be reloaded in its entirety. This will not be possible without historical source data readily available in its raw form. If we’re unable to reload facts and dimensions in their entirety, Data Marts will quickly become ragged and incomplete limiting the EDW considerably. To recap, without comprehensive and adequate staging and archival environments, a Data Warehouse will not be sustainable.

About interviewing subject area experts

When interviewing subject area experts, sessions should be kept short and casual. EDW teams can easily get bogged down in long winded formal meetings producing little. Greater results are achieved from one-on-one sessions. More importantly ask for access to all existing reports from every person that is interviewed. I’ve never run into an organization, no matter how young or mature, that did not have reports in some shape or form. Reports provide much of the requirements. They can be used as tools to extrapolate or reverse engineer facts and dimensional elements if necessary. One must not expect high ROI from interview sessions – they should be considered as venues to help fill in the blanks as you diligently perform much of the analysis using all the tools mentioned in this paper.

Using heuristics

Similarity Heuristics is a powerful concept and an effective tool to perfect the art of Prototyping. It puts high reliance on past experience and comparison to the current experience. In other words you gain the cognitive maturity, ability and expertise through learning by past mistakes and improving each time you repeat a similar task.

For example, your very first Loan Origination Data Mart assignment probably went much less smoothly than your 10^th similar assignment. Today you have perfected the heuristics in implementing Loan Origination Data Marts and can efficiently and quickly iterate through the Development Life Cycle. You have overcome not only the technical challenges, but are also more adept in the business of loan origination in general. You are well acquainted with its universal needs including data, analytics, and reporting. You are familiar with the idiosyncrasies of loan origination systems and can anticipate common issues and how to solve them as they arise. Although not every Loan Origination Data Mart will look identical, the heuristics in which they are designed, developed and implemented are very similar.

A simplified and high level project flow for a Loan Origination Data Mart will look much like the following. Notice there is high level of concurrency within tasks and phases.

Week 1 – Week 3 (Stage)

1. Stage source data using metadata driven processes

2. Profile source data using stored procedures and tables to store profiling information

3. Identify history inherent in the database and identify the lack thereof

Week 2 – Week 4 (Archive)

4. Set up archival processes to begin retaining history (from Stage) into an existing archival database in forms of snapshots and transactional change data capture.

5. Integrate address information from the new source into existing address validation and scrub database/processes

Week 3 – Week 6 (Dimensions)

6. List all potential dimensional elements categorizing each into their respective areas as well as determining if they will be conformed or non-conformed

7. Identify all other data cleansing candidates and set up scrub and conform processes

8. Integrate dimensional elements into existing global and conformed dimensions (i.e., Properties, Agents, Investors, Vendors, Employees, Channels, etc.)

9. Draft/create new dimensions with their hierarchies - those that are only specific to loan origination (i.e., Applications, Program, Campaigns, Branches/Teams, Status, Events, Fees, Denials, etc.)

Week 5 – Week 10 (Facts)

10. Review existing reports relative to loan origination

11. Create a comprehensive list of all event dates and measures (including calculated measures) and categorize them into their respective subject area facts

12. Draft/create transactional and snapshot fact tables (I.e., Applications, Credit, Funding, Events, Process SLAs, Fees, etc.)

Week 7 – Week 12 (Reports)

13. With dimensions in place and first fact table drafted, begin reports development

14. As more facts complete begin development of additional reports

15. Begin Dashboard development

16. Get signoff from business

Two very important points not listed but implied in the above project flow are one; the ability to communicate with key business members at any point in time, and two; the ability to iterate to and from any given task or phase. These are very much interwoven throughout an EDW project flow. Well balanced interaction with the key business members throughout an EDW project ensures you stay on the right path at all times. Freedom to iterate nearly at any given point within an EDW project enables a rapid prototype Development Life Cycle as illustrated in the following graph.

Measuring success

In a Data Warehouse project business groups measure success by what they can see. You might be most proud of what you’ve been able to produce under the hood but without an outstanding presentation layer, your Data Warehouse will not get the recognition it deserves. Therefore it is imperative each completed subject area mart include visibility and access to the data in some shape or form. Whether deliverables are reports, dashboards, or BI query tools, the information produced must be accurate. Reports and dashboards must be intuitive, flexible and visually appealing. History must be available and reflected in your reports. Aggregates must be presented in time series where today’s numbers are compared with yesterdays’ or with same day last month, as well as month over month comparisons and distributions, etc. Reports must be parameter driven improving their functionality and flexibility. They must be drillable enabling users drill through each hierarchical level and down to the transactional records. Last but not least, data must be refreshed in a timely manner and SLAs consistently met.

Postlude

At the turn of the 20^th century pilots in their open-cockpit biplanes navigated by means of dead reckoning. A method that estimates one's current position based upon a previously determined position and advancing based upon estimated speeds using rough aerial charts and known land marks. The first flight along a new route was tricky and challenging. But the more flights the pilot completed around the same route, the more accurate his/her dead reckoning process became. The pilot in this situation is like the experienced Architect and the passenger the key business person to help navigate along the way. When flying on uncharted territories, we the pilot, have the skills to fly and navigate the plane from point A to point B. The passenger is there to assist at any point in time, and all along the route when needed, ensuring we do not steer too far off course.

Just as the passenger rides alongside the pilot throughout the entire route, the key business group’s involvement in EDW projects should be distributed evenly throughout the entire Development Life Cycle. However, the extent and the method of their involvement should be reevaluated. Too much backseat driving will cause problems during flight. Equally, when implementing Data Warehouses, too much dependency on business groups and/or on requirements gathering becomes problematic. Just as a lack of a good navigator is not an excuse to crash a plane, lack of requirements is no longer a valid excuse for failed EDW attempts. Lastly, Business is often ambiguous and vague and is increasingly relying on the BI professional for guidance. We must rely more on our own expertise and experiences and perfect those heuristics that help us succeed in today’s rapid and iterative system implementation model.

Data Warehousing For Mere Mortals

Friday, May 27, 2011

Redefining Requirements Gathering

No comments:

Post a Comment