Sarka Dejan

Applied Microsoft Business Intelligence


Скачать книгу

process with as much involvement from any stakeholders that are willing to assist. You may host these sources within or outside an organization.

      Regardless of whether the data is internal or external, you should carefully perform discovery to ensure that you have included all relevant data in the project. This process often requires several iterations. During report development, someone may recognize that data is missing due to an oversight. As a result, you'll need to modify the ETL process to include the new source.

      Using Internal Data Sources

      While you source the majority of the data from traditional IT-managed relational databases, you'll always have some data managed and maintained outside the IT department. This data is likely stored in spreadsheets, Access databases, comma-delimited files, text files, or other file types, and they may reside on someone's desktop or laptop. Often vital, these datasets contain small nuggets of information that can cripple the project if they are not included.

      End users may also manage other internal data sources, such as SharePoint lists or third-party applications that came with database backed during installation. In the case of the latter, the hosting department does not even realize what they have installed. In some cases these are common back ends, and others may require custom drivers to access the data.

      Using External Data Sources

      The data may also come from an external source via a web service, an OData feed, or even a hosted RDBMS (SQL Server or Oracle). If the source is a web service or OData feed, the data is typically accessed via a web URL. The consumption, on the other hand, may require some custom interface that parses and displays the data in a fashion meaningful to end users. Developers may overlook this data because no one on the team knows it exists; the same may apply to the hosted databases. As a result, this further heightens the need to involve end users because they may be the only people who know it's there and needed.

      Using a Data Warehouse (or Not)

      For the experienced business intelligence developer, developing a business intelligence solution without a data warehouse may seem absurd. However, with today's savvy end users, readily accessible data is no longer an option; it is a requirement. Therefore, nightly refreshes of data is becoming a thing of the past. And as a result, including a data warehouse as part of a business intelligence project is now optional.

      Traditionally, a data warehouse is loaded at some time interval – daily, weekly, and some even monthly. Depending on the organization the time period may be longer. For example, some colleges or universities load data into a data warehouse only at the end of the semester. As data needs become more stringent, the periods of latency between live data and analytical data have become smaller and smaller, presenting challenges that are often difficult to overcome. The primary challenge is moving data from the source systems to the data warehouse, which leads to the question: Is a data warehouse required?

Think back to Chapter 1, specifically to Figure 1-1, which we're showing again in this chapter (see Figure 2.1). This figure depicts an illustration of reporting from multiple data sources. How can you create a single report to reference multiple desperate sources? Which tool would you use? A few may accomplish this task, and because this is a Microsoft-focused book, the tool that comes to mind is Power Pivot. Power Pivot is an Excel add-in that creates an in-memory semantic model based on a plethora of data sources. Figure 2.2 displays an abbreviated list of data sources that possibly source the data warehouse.

image

Figure 2.1 Reporting against disparate data sources

image

Figure 2.2 Data sources that source a data warehouse

      With that in mind, and because this feature is available, you might ask yourself: Why should I develop a data warehouse? Couldn't data be accessed directly from the source, modeled, and then reported against? The answer is: absolutely. But in most cases the data has not been validated so that it could meet the needs of every aspect of the business, or possibly two systems exist that store similar data. Which set should you use for reporting? Another possibility is that a value is calculated one way by one set of users and another way by another set. Which means of calculation is the correct one? This is where the ETL process becomes a significant part of your business intelligence solution.

      NOTE

      Although this book does not focus on loading the data warehouse using SQL Server Integration Services (SSIS), please do not discount the importance of an ETL solution. The Microsoft Business Intelligence 24-Hour Trainer provides an SSIS section that provides a good starting point and overview.

      During the ETL process, the data is extracted from the original data sources, transformed into a format or formats that meet the business requirements, and finally loaded into a central repository (data warehouse). You can use the repository as a direct source for reporting or as a semantic model. Regardless of the approach, leveraging a single source for either makes the process of obtaining and visualizing data simpler for any individual or group that needs to access the data.

      Implementing and Enforcing Data Governance

The section “What Are the Data Sources,” reiterated the importance of a group of end users over and over again. You have a similar case when deciding how to implement and enforce data governance. From a technical perspective, this step is typically performed prior to loading the data in the data warehouse. So, why is it located after the data warehouse section? When planning a business intelligence solution, time is valuable. Why invest time and resources on something that you may not need? If data is accessed directly from the source, the want or need for data governance is reduced. However, because you may have multiple sources of data for a single data warehouse, you may require a data governance strategy to ensure that you deliver accurate, consistent, and trustworthy data to end users. Figure 2.3 illustrates where data governance would fit in a Microsoft business intelligence solution.

      DATA GOVERNANCE

      So, what is data governance? In short, data governance is a process or set of processes that include:

      ● Data quality

      ● Data management

      ● Data monitoring

      ● Data maintenance

      ● Data security

image

Figure 2.3 Data governance in a business intelligence project

      These tasks are accomplished by putting people familiar with the data together with a technology solution or set of solutions. In the case of Microsoft, it would be a set of technologies: Data Quality Services (DQS) and Master Data Services (MDS).

      NOTE

      A discussion about these technologies is well beyond the scope of this book; however, the exclusion of a data quality process could be detrimental to the entire business intelligence solution. The Tutorial: Enterprise Information Management using SSIS, MDS and DQS Together, teaches developers how to implement a sample Enterprise Information Management (EIM) solution. It can be downloaded from this link: http://www.microsoft.com/en-us/download/details.aspx?id=35462.

      For example, assume that during the data discovery process you realize that two very similar datasets are obtained from two completely different sources; for the sake of brevity and simplicity, assume that both datasets reference gender data. The catch is that in one system gender is stored as Male and Female, and in the other the data is stored as M and F. Prior to importing the data into the data warehouse, a decision must be made regarding how to store gender; this is part of the data governance process.