Monday, September 21, 2009

SSAS Association Algorithm and ItemSet Sparsity

Checking Data Size and Parameters while working with a Market Basket Analysis model

By Rick Durham

Recently, I was developing a Market Basket Analysis model at one of my clients, using SQL Server Analysis Services 2005 (SSAS). We built this model to understand which items are purchased together, both for physical product placement and for positioning items in advertisements.

After spending several hours on the model and increasing the number of records in the data set that feeds it, I was still not seeing many itemsets containing more than two items (itemsets > 2). This perplexed me given the nature of the business and what I knew to be true, i.e. there should have been literally dozens of itemsets > 2.

Size Matters

It turns out that not all data is equal. While many retail organizations have purchasing data in which a few thousand receipts will yield many itemsets > 2, you might be surprised how often this is not the case. In the Market Basket Analysis I’ve done, I’ve found that many organizations have receipts where the customer only occasionally purchases more than two items, making itemsets > 2 sparse in the raw data. As a result, if you are working with this type of data rather than grocery store data, the data set needs to be 10 to 100 times larger. In my case, I finally pulled in three million rows of data to produce a meaningful model.
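One way to check for this kind of sparsity before building the model is to profile how many distinct items appear on each receipt. The following is only a minimal sketch in Python, not something taken from the SSAS project itself: the `basket_size_profile` function, the `(receipt_id, item)` row layout, and the sample rows are all hypothetical and made up for illustration.

```python
from collections import Counter

def basket_size_profile(rows):
    """Profile basket sizes from (receipt_id, item) rows, one row per item sold.

    Returns a Counter mapping basket size -> number of receipts, plus the
    share of receipts that contain more than two distinct items.
    """
    baskets = {}
    for receipt_id, item in rows:
        # Group distinct items by receipt (hypothetical column layout).
        baskets.setdefault(receipt_id, set()).add(item)

    sizes = Counter(len(items) for items in baskets.values())
    total = len(baskets)
    large = sum(count for size, count in sizes.items() if size > 2)
    share_large = large / total if total else 0.0
    return sizes, share_large

# Made-up sample data: only receipt 1 has more than two items.
rows = [
    (1, "hammer"), (1, "nails"), (1, "gloves"),
    (2, "paint"),
    (3, "paint"), (3, "brush"),
    (4, "hammer"),
]

sizes, share_large = basket_size_profile(rows)
print("Receipts by basket size:", dict(sizes))
print("Share of receipts with more than two items:", share_large)
```

If the share of receipts with more than two items is very small, that is a hint that far more rows will be needed before itemsets > 2 show up in the model.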

Parameters Matter

Using the correct SSAS algorithm parameters matters. When dealing with data sets where itemsets > 2 are sparsely distributed, it’s important to set the parameters correctly. In my case, I had to lower the Minimum_Probability value from its default of .4 to .2 and change Minimum_Support from its default of .03 to a value of 10 in order to get itemsets > 2 in the model. The reasons I chose these parameters:

- The Minimum_Support value by default is set to a percentage of the total number of cases; setting it to a whole number instead makes it an absolute count of cases. Given what I knew about the data, I felt that we needed at least ten cases of an itemset to be able to identify it when the itemsets in the data were so sparse. A lower value approaching one yielded too many itemsets, while going above 10 started limiting the number returned.

- The Minimum_Probability value sets the minimum probability that a rule is true for it to be kept. By lowering this value, we are willing to accept that we may generate rules that have a lower probability of being true. Again, this was necessary given how sparse the itemsets were in the data.
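To make the effect of these two thresholds concrete, here is a small Python sketch. It is not the SSAS implementation; it only assumes that a support value below 1 is read as a fraction of all cases and a value of 1 or more as an absolute case count, and that a rule's probability is the share of baskets containing the left-hand side that also contain the right-hand side. All function names and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def support_threshold(minimum_support, case_count):
    # Assumption: values below 1 are a fraction of all cases,
    # values of 1 or more are an absolute number of cases.
    if minimum_support < 1:
        return minimum_support * case_count
    return minimum_support

def frequent_itemsets(baskets, size, minimum_support):
    """Return itemsets of the given size that meet the support threshold."""
    threshold = support_threshold(minimum_support, len(baskets))
    counts = Counter()
    for basket in baskets:
        # Count every combination of `size` distinct items in this basket.
        for combo in combinations(sorted(set(basket)), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= threshold}

def rule_probability(baskets, lhs, rhs):
    """Estimate P(rhs | lhs) as (baskets with lhs and rhs) / (baskets with lhs)."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_count = sum(1 for b in baskets if lhs <= set(b))
    both_count = sum(1 for b in baskets if (lhs | rhs) <= set(b))
    return both_count / lhs_count if lhs_count else 0.0

# Made-up baskets for illustration only.
baskets = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"], ["a"], ["d"]]

by_count = frequent_itemsets(baskets, 3, 2)       # absolute count of 2 cases
by_fraction = frequent_itemsets(baskets, 3, 0.5)  # 50% of 5 cases = 2.5 cases
probability = rule_probability(baskets, ["a", "b"], ["c"])

print(by_count, by_fraction, probability)
```

Under the same assumption, a fractional setting of .03 on three million cases would require roughly 90,000 cases per itemset, while an absolute setting of 10 requires only ten, which is why the choice matters so much for sparse data.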

Follow your intuition. If you do not at first get the results you expect while building your data mining model, it may be that you do not have enough data, that you do not have the right kind of data, or that the parameters are set incorrectly. Expect the process to take time, because data mining can be a highly iterative process and you cannot simply look at the raw data to gauge the output of the model.

But the potential benefits of taking the time to use the Association Algorithm are extensive. In the Market Basket Analysis I did using this particular model, I discovered several itemsets that I would not have predicted would be purchased together.