Tuesday, December 29, 2009

Building Great Cubes: Tip 1

Less is More
By Peter Sprague


I have seen a lot of SSAS 2005 cubes that look very similar despite being created independently by different customers and partners across the US and Canada. They always look something like this:

- 5 or 6 dimensions with few or no natural hierarchies
- 70 or more (often many more...) attribute hierarchies, including the ever popular attribute [Customer Fax]
- 4 or more user defined time hierarchies
- Time attributes that contain all dates between 1901 and 2060
- 35+ measures
- Multiple measure groups
- Little relation to a specific user problem

Why is this a concern? These cubes make it more difficult for the business user to navigate the data and learn the tools. The complexity also makes it more difficult to understand the data that is returned. The biggest concern is that users may misunderstand the data and then base decisions on that misunderstanding. One reason this complexity matters is that analytic and monitoring tools (as opposed to reporting tools) such as Excel, PerformancePoint, and ProClarity surface the cube metadata directly as part of the user interface. Consequently, features like multiple measure groups can be dangerous. Let me be clear... ALL of the features of SSAS 2005 are useful, they just aren't useful all the time.

These problem cubes rarely help users support a series of decisions. Seldom do they help users analyze their data to get to an actionable step.

Let me introduce Vilfredo Pareto, an Italian economist who in 1906 observed that 20% of the population in Italy owned 80% of the country's wealth. He also noted that this ratio held true for other scientific and economic distributions. A hundred years and hundreds of self-help books later, we have the Pareto Principle: 80% of the value comes from 20% of the resources. This principle holds true for cube design, with the additional observation that the complexity added by the other 80% of the resources costs far more than the 20% of the value those resources provide. The next time you create a cube, strongly consider what design will deliver 80% of the functionality and stop there. In my experience, these cubes look similar to this:

- 6 to 10 dimensions all with 1 strong natural hierarchy
- 6 or 7 exposed attribute hierarchies
- 1 time hierarchy that only contains dates relevant to the period covered by the cube data
- All metadata expressed in business friendly terms
- Strong relation to a business problem

Remember our goal: to provide a cube that helps our users understand the data and supports further business action or decision. Consider this the first rule of cube design: Less is More.



Wednesday, November 25, 2009

Analysis Services 2008 Metadata Report Pack


Reporting on Analysis Services Metadata using DMV Schema Rowsets

By Dan Meyers


Analysis Services 2008 Metadata Report Pack Download

I often get asked by clients about the best way to get metadata about the various SQL Server Analysis Services (SSAS) objects on a particular server. As usual, there are a few ways to go about getting this information. Unfortunately for DBAs and report writers, most of them are a headache to implement and often have you writing code rather than queries. In recent versions of SSAS we have a fairly straightforward way to accomplish this: SSAS exposes a lot of good information via built-in schema rowsets for OLAP and data mining objects. This means that all you have to do is write some queries, slap the results into a Reporting Services (SSRS) report, and you are done.
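For example, the following query, run directly against the SSAS instance (an MDX query window in Management Studio works fine), returns basic information about the dimensions on the server. This is only a sketch of the kind of query involved; the columns shown are a small subset of what the MDSCHEMA_DIMENSIONS rowset exposes:

SELECT [CATALOG_NAME], [DIMENSION_NAME], [DIMENSION_CARDINALITY]
FROM $SYSTEM.MDSCHEMA_DIMENSIONS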

If you think this sounds too good to be true, you are half correct. Although we finally have a nice, simple solution for reporting on this information, we are left wanting more. Not everything about the objects is exposed in these rowsets. If you want every detail about your objects, then the best approaches are to buy a 3rd party tool, or to put your programming hat on, since you will need AMO (Analysis Management Objects) to get at the things the schema rowsets leave out.

In a previous blog post, I discussed a method where you can use a linked server to run MDX queries against your cubes. For these reports, I am going to use the same approach, executing queries through a linked server and the OLE DB provider for Analysis Services against SSAS 2008 to retrieve metadata about our cubes, dimensions, measures, etc. from the schema rowsets.

This download provides you with a relatively comprehensive set of pre-built reports that you can easily deploy to your Reporting Services 2008 instance.

Before the reports will run, you have to create a linked server in SQL Server that points to your SSAS 2008 instance. Below is the SQL code needed to create a linked server named [SSAS_METADATA].


EXEC master.dbo.sp_addlinkedserver
@server=N'SSAS_METADATA',
@srvproduct=N'MSOLAP',
@provider=N'MSOLAP',
@datasrc=N'localhost'


EXEC master.dbo.sp_addlinkedsrvlogin
@rmtsrvname=N'SSAS_METADATA',
@useself=N'False',
@locallogin=NULL,
@rmtuser=NULL,
@rmtpassword=NULL
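
Once the linked server is in place, a quick sanity check is to pass a query through it to one of the schema rowsets. Here is a minimal sketch that lists the cubes on the server via the MDSCHEMA_CUBES rowset (the columns selected are just a small subset of what the rowset returns):

SELECT *
FROM OPENQUERY(SSAS_METADATA,
    'SELECT [CATALOG_NAME], [CUBE_NAME], [LAST_SCHEMA_UPDATE], [LAST_DATA_UPDATE]
     FROM $SYSTEM.MDSCHEMA_CUBES')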


The download contains the entire reporting project, pre-configured to use localhost as the data source for the SQL Server, Reporting Services, and Analysis Services instances.

The project contains the following 26 reports:
- Cube Details
- Cube Dimension Details
- Cube Dimension Relationships
- Cube Dimensions
- Cube KPIs
- Cube Measures
- Cube Perspectives
- Cubes
- Data Source Details
- Data Sources
- Database
- Dimension Attribute Hierarchies
- Dimension Attribute Hierarchy Details
- Dimension Details
- Dimension User Hierarchies
- Dimension User Hierarchy Details
- Dimensions
- KPI Details
- Measure Details
- Mining Model Column Details
- Mining Model Details
- Mining Models
- Mining Structure Column Details
- Mining Structure Details
- Mining Structures
- Perspective Details

The reports in the report pack were modeled after the output of my favorite 3rd party documentation tool for SQL Server BI, BI Documenter. All of the reports contain navigational links in the header as well as click-through navigation on the objects listed, which you can use to drill into child objects for more detail.

As you can see from the names of the reports listed above, they touch all of the major objects and should satisfy the majority of inquiries about them. One report that I find to be quite useful is the Cube Dimension Relationships report. It is an attempt to reproduce the grid on the Dimension Usage tab of the cube designer in BIDS, which tells you at what granularity the dimensions relate to the measure groups.
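
If you want to look at the underlying relationship and granularity information yourself, the MDSCHEMA_MEASUREGROUP_DIMENSIONS rowset exposes it. A minimal sketch using the [SSAS_METADATA] linked server created above (the catalog name is just an example; substitute your own database):

SELECT *
FROM OPENQUERY(SSAS_METADATA,
    'SELECT [CUBE_NAME], [MEASUREGROUP_NAME], [DIMENSION_UNIQUE_NAME], [DIMENSION_GRANULARITY]
     FROM $SYSTEM.MDSCHEMA_MEASUREGROUP_DIMENSIONS
     WHERE [CATALOG_NAME] = ''Adventure Works DW 2008''')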





Hopefully this report pack will save you some time and provide you with some good samples that you can use to create your own customized reports that might work better for you.



Monday, October 12, 2009

Empowering the Data Warehouse with External data

By Rick Durham

To build a powerful data warehouse you must include as much relevant data from internal and external sources as possible, to support the decisions that managers and “C-level” executives are called upon to make each day.

As an example, retailers have current and historic sales data along with pricing information, but this provides only partial insight into the determinants that are driving sales. Factors such as weather, income tax distribution periods, regional or local population growth, and household demographics may also play a key role in driving sales and must be taken into consideration.

Where do you go to get data that will complement your internal data sources to provide a much richer data warehouse and BI experience?

The government and many other organizations capture this data and distribute it for free or for a nominal fee. Here are some examples:

Weather: Yahoo offers an RSS feed that can be called using an HTTP request as follows:
http://weather.yahooapis.com/forecastrss?p=48161

The request accepts the following parameters:

- p: US zip code or Location ID (e.g., p=95089 or p=USCA1116)
- u: Units for temperature, case sensitive (f = Fahrenheit, c = Celsius)
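
For example, combining both parameters, a request for zip code 95089 with temperatures in Celsius would look like this:
http://weather.yahooapis.com/forecastrss?p=95089&u=c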


The RSS response from this request includes the following information:
- Geographic latitude/longitude
- Weather Conditions (48 distinct codes)
- Temperature (F, C)
- Forecast (Condition, High Temperature, Low Temperature)


Importing this data daily or even hourly using SSIS packages, alongside sales data, goes a long way toward understanding whether weather is a factor in why certain items were purchased at a particular time and date, and toward identifying longer-term sales trends.
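
As a rough illustration, here is a minimal sketch of a staging table that an SSIS package could land the parsed RSS values in before they are merged into the warehouse (the table and column names are hypothetical, not part of the Yahoo feed):

-- Hypothetical staging table for parsed Yahoo weather RSS values
CREATE TABLE dbo.StageWeatherObservation
(
    ZipCode         VARCHAR(10)  NOT NULL,  -- the p= parameter used in the request
    RetrievedAt     DATETIME     NOT NULL,  -- when the feed was pulled
    Latitude        DECIMAL(9,6) NULL,
    Longitude       DECIMAL(9,6) NULL,
    ConditionCode   INT          NULL,      -- one of the 48 distinct condition codes
    TemperatureF    INT          NULL,
    ForecastHighF   INT          NULL,
    ForecastLowF    INT          NULL
)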

Demographic data used alongside internally generated data can also go a long way toward enhancing the data warehouse. The following are examples of where this data can be obtained:
http://www.geolytics.com/?gclid=CMeliJqrqJ0CFU1M5QodekqHkA

With limited input (an address or latitude/longitude), you can get 60 demographic attributes for that address, including factors such as income, average number of people per home, average age, and education…

Likewise another good site for demographic data and data validation is:
http://www.melissadata.com/dqt/index.htm

This site offers validation of addresses, phone numbers, and email, and performs name parsing via web service calls, which can help accelerate ETL development since you do not have to develop the code or maintain large demographic databases onsite. Additionally, the site offers demographic data on income, media locations, reverse phone lookups, and mailing lists.

Finally, the Federal government maintains thousands of databases, gathered from various agencies, that contain information that can be coupled with internal data to make your data warehouse far more powerful. For example:
http://research.stlouisfed.org/fred2/
http://www.data.gov/catalog
http://www.census.gov/

These sites contain historic economic and demographic data the government has collected regarding income, population, interest rates, commodity prices, and housing sales, and the downloads are free.

The goal of data warehouse development should be to provide the tools and data for optimal decision making. To ensure this goal is achieved, make sure external sources are also included in the initial and ongoing data warehouse implementation.


Monday, October 5, 2009

Formatting Linked Server MDX Query Results Using CTEs

By Dan Meyers

In my previous post I described how to export MDX query results in SSIS without having to use a linked server. Even though a linked server is not required and probably not advised the majority of the time, there might be times when it is actually the best solution to your problem.

One of the biggest headaches with this approach is that the data types of all of the fields returned are either ntext or nvarchar. This means that you have to deal with converting and formatting the results before you can use them in a report. There are a couple of ways to go about this.

CTEs (Common Table Expressions) are the most elegant technique to employ here. Plus, they are a bit more readable than their SQL Server 2000 equivalent, derived tables. CTEs are only available in SQL Server 2005 and later; if you are using SQL Server 2000, you have to use a derived table in place of the CTE.

Below is an example of using a CTE to rename and format the results of an MDX query executed against an Analysis Services cube via a linked server:

WITH MDXResults (FiscalYear, FiscalSemester, FiscalQuarter, SalesAmount, GrossProfit)
AS
(
SELECT *
FROM OPENQUERY(SSAS_AW2008,
'SELECT
{[Measures].[Sales Amount], [Measures].[Gross Profit]} ON COLUMNS,
{[Date].[Fiscal].[Fiscal Quarter].Members} ON ROWS
FROM
[Adventure Works]'
)
)
SELECT
CONVERT(VARCHAR(255),FiscalYear) AS FiscalYear,
CONVERT(VARCHAR(255),FiscalSemester) AS FiscalSemester,
CONVERT(VARCHAR(255),FiscalQuarter) AS FiscalQuarter,
CONVERT(FLOAT,SalesAmount) AS SalesAmount,
CONVERT(FLOAT, GrossProfit) AS GrossProfit
FROM
MDXResults




If you are using SQL Server 2000 then you have to use a derived table instead. Below is an example using the same MDX query that is used in the example above:

SELECT
CONVERT(VARCHAR(255),FiscalYear) AS FiscalYear,
CONVERT(VARCHAR(255),FiscalSemester) AS FiscalSemester,
CONVERT(VARCHAR(255),FiscalQuarter) AS FiscalQuarter,
CONVERT(FLOAT,SalesAmount) AS SalesAmount,
CONVERT(FLOAT, GrossProfit) AS GrossProfit
FROM
(
SELECT *
FROM OPENQUERY(SSAS_AW2008,
'SELECT
{[Measures].[Sales Amount], [Measures].[Gross Profit]} ON COLUMNS,
{[Date].[Fiscal].[Fiscal Quarter].Members} ON ROWS
FROM
[Adventure Works]'
)
) AS MDXResults (FiscalYear, FiscalSemester, FiscalQuarter, SalesAmount, GrossProfit)




Monday, September 21, 2009

SSAS Association Algorithm and Itemset Sparsity


Checking Data Size and Parameters while working with a Market Basket Analysis model

By Rick Durham

Recently, I was developing a Market Basket Analysis model for one of my clients using SQL Server Analysis Services 2005 (SSAS). The reason we built this model was to understand which items are purchased together, both for physical product placement and for positioning items in advertisements.

After spending several hours on the model and increasing the number of records in the data set used to feed it, I was still not seeing many itemsets containing more than two items. This perplexed me given the nature of the business and what I knew to be true: there should literally have been dozens of itemsets of size greater than two.


Size Matters

It turns out that not all data is equal. While many retail organizations have purchasing data in which a few thousand receipts will yield many itemsets of size greater than two, you might be surprised how often this is not the case. In the Market Basket Analysis work I’ve done, I’ve found that many organizations have receipts where only occasionally does the customer purchase more than two items, making larger itemsets in the raw data sparse. As a result, if you are working with this type of data rather than, say, grocery store data, the data set needs to be 10 to 100 times larger. In my case, I finally pulled in three million rows of data to produce a meaningful model.
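
One way to gauge how sparse the discovered itemsets really are is to query the model content directly rather than eyeballing the viewer. Below is a sketch that passes a content query through a linked server pointing at the SSAS instance (the SSAS_METADATA linked server name and the [Market Basket Model] model name are placeholders for your own; in Microsoft Association Rules content, NODE_TYPE 7 identifies an itemset, and NODE_DESCRIPTION lists the items it contains):

SELECT *
FROM OPENQUERY(SSAS_METADATA,
    'SELECT NODE_DESCRIPTION, NODE_SUPPORT
     FROM [Market Basket Model].CONTENT
     WHERE NODE_TYPE = 7')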


Parameters Matter

Using the correct SSAS algorithm parameters matters. When dealing with data sets where itemsets of size greater than two are sparsely distributed, it’s important to set the parameters correctly. In my case, I had to lower the MINIMUM_PROBABILITY value from its default of 0.4 to 0.2 and adjust the MINIMUM_SUPPORT value from its default of 0.03 to 10 in order to get larger itemsets in the model. The reasons I chose these values are below; a query for confirming the parameters on a deployed model follows the list.

- The MINIMUM_SUPPORT value is expressed by default as a percentage of the total cases; values greater than 1 are treated as an absolute number of cases. Given what I knew about the data, I felt we needed at least ten cases of an itemset to be able to identify it when the itemsets in the data were so sparse. A lower value approaching one would yield too many itemsets, while going above 10 started limiting the number returned.

- The MINIMUM_PROBABILITY value sets the minimum probability a rule must have to be accepted. By adjusting this value lower we are willing to accept rules that have a lower probability of being true. Again, this was necessary given how sparse the itemsets were in the data.
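
For reference, on SSAS 2008 (where the schema rowsets are queryable, as described in the report pack post above) you can confirm which parameter values a deployed model is actually using: the MINING_PARAMETERS column of the DMSCHEMA_MINING_MODELS rowset returns the full parameter string. A minimal sketch, again assuming a linked server named SSAS_METADATA pointing at the SSAS instance:

SELECT *
FROM OPENQUERY(SSAS_METADATA,
    'SELECT MODEL_NAME, MINING_PARAMETERS
     FROM $SYSTEM.DMSCHEMA_MINING_MODELS')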

Conclusion
Follow your intuition. If, while building your data mining model, you do not at first get the results you expect, it may be that you do not have enough data, that you do not have the right kind of data, or that the parameters are set incorrectly. Expect the process to take time, because data mining can be a highly iterative process and you cannot simply look at the raw data to gauge what the model will produce.
BUT, the potential benefits of taking the time to use the Association algorithm are extensive. In the Market Basket Analysis I did using this particular model, I discovered several itemsets that I would never have predicted would be purchased together.





Monday, August 31, 2009

Exporting MDX Query Data with SSIS

Query from the data flow without having to use a linked server or a script task

By Dan Meyers

I was recently working on a scorecard application that displays KPI information from multiple data sources, including a few Analysis Services cubes. This required me to extract some KPI values from the cubes and place them into a SQL Server table (they did not have PerformancePoint up and running yet). I wanted to do this using a data flow in an SSIS package but ran into an issue that I thought had been fixed.

I was attempting to export the results of my MDX queries to a SQL Server table using a normal data flow task when I encountered one of Microsoft’s very unhelpful error messages. The OLE DB source adapter that runs the MDX query against the cube was failing and reporting the error message below.

SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80040E21.

After doing some searching, I was surprised to find out how many people have had issues trying to accomplish this. Even more surprising was the fact that Microsoft has not been very proactive in providing developers with a more direct method that does not produce errors out of the box. This drove me crazy for hours before I figured it out. The annoying part was that in the SSIS UI you can successfully preview the results of your query and even get the column definitions back, but the package or data flow task fails when you execute it.

Now, there are a couple of alternative approaches that you could use here. The two most popular ones that I have seen mentioned on the web are: to use a SQL Server Linked Server and a SQL Server data source to pass through your MDX query to the SSAS server and return a tabular dataset; or to substitute the OLE DB data source adapter with a Script Component and ADOMD.NET to create a dataset from the results returned.

The linked server approach seems to work just fine but is still a “work around” and not advisable. The Script Component requires a bit of scripting and is more cumbersome to implement.

Instead of using either of those alternatives, I decided to try a little trick that I remembered seeing in BOL when reading about the specific properties supported by XMLA.

Open up your Analysis Services data source and click the 'All' button on the left to go into the properties.



In the text field provided for the 'Extended Properties' property enter 'Format=Tabular' and save your changes.



You will get a warning for each column: SSIS will complain that it cannot map the data type and will use the DT_WSTR data type instead. This is fine; it appears SSIS is unable to automatically derive data types from an MDX query.

To fix this, all you have to do is convert the data types using a Derived Column transformation. You may have to get a little creative when converting some of the numeric values coming out of your MDX queries from text to the appropriate SQL Server data type, for example if they are expressed using scientific notation. Maybe I will blog about that another time.

Hopefully this will save you some headaches and allow you to query a cube from the data flow without having to use a linked server or a script task.