**Market Basket Analysis using Association Rule Mining**

Hello everyone. In this post I am going to explain my project, Market Basket Analysis using Association Rule Mining. Market basket analysis helps identify which items people purchase and how often they buy them, so by utilizing a few techniques from ML we can analyze what goes into a shopper's basket.

Recently, I got an opportunity to work on a retail project where the objective was to improve store performance in terms of sales and reduce the overall store inventory. As the timelines were tight and I had never worked on this kind of project before, it was quite challenging to understand the concept and then implement it in a short time.

This experience encouraged me to write a post which would help beginners like me to understand the concept with an example along with the code.

**What is Market Basket Analysis?**

Market Basket Analysis is one of the fundamental techniques used by large retailers to uncover the association between items. In other words, it allows retailers to identify the relationship between items which are more frequently bought together.

**Let’s understand the concept with an example:**

Assume we have a data set of 20 customers who visited the grocery store, of whom 11 made a purchase:

Customer 1: Bread, egg, papaya and oat packet

Customer 2: Papaya, bread, oat packet and milk

Customer 3: Egg, bread, and butter

Customer 4: Oat packet, egg, and milk

Customer 5: Milk, bread, and butter

Customer 6: Papaya and milk

Customer 7: Butter, papaya, and bread

Customer 8: Egg and bread

Customer 9: Papaya and oat packet

Customer 10: Milk, papaya, and bread

Customer 11: Egg and milk

Here we observe that 3 customers have bought bread and butter together. The outcome of this technique can be understood simply as “**if this, then that**” (if a customer buys bread, there is a chance the customer will also buy butter).
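To make the counting concrete, here is a minimal sketch in base R that encodes the 11 baskets above and counts how many contain a given set of items. The helper name `count_baskets` is my own, introduced only for illustration; it is not part of any package.

```r
# The 11 baskets from the example, one character vector per customer.
baskets <- list(
  c("bread", "egg", "papaya", "oat packet"),
  c("papaya", "bread", "oat packet", "milk"),
  c("egg", "bread", "butter"),
  c("oat packet", "egg", "milk"),
  c("milk", "bread", "butter"),
  c("papaya", "milk"),
  c("butter", "papaya", "bread"),
  c("egg", "bread"),
  c("papaya", "oat packet"),
  c("milk", "papaya", "bread"),
  c("egg", "milk")
)

# Hypothetical helper: number of baskets that contain every item in `items`.
count_baskets <- function(items, baskets) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

count_baskets(c("bread", "butter"), baskets)  # 3
count_baskets("bread", baskets)               # 7
count_baskets("butter", baskets)              # 3
```

These three counts (3, 7 and 3) are the only numbers needed for the metrics that follow.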

In general, the analysis is run on data sets of millions of transactions to identify associations between items. Analyzing such an enormous data set was the **main challenge** I faced during market basket analysis. The **main benefit** of conducting market basket analysis is that it uncovers hidden purchasing patterns (which products sell well together), so retailers can run specific campaigns or promotions to cross-sell items (for example, bundling two items).

You can see **day-to-day implementations** of market basket analysis in grocery stores, retail outlets, and product recommendations on e-commerce sites (Amazon’s “customers who bought this product also bought these products”).

**Key metrics for association rules:**

There are 3 key metrics to consider when evaluating association rules:

**1. Support:** Percentage of orders that contain the item set. In the example above, there are 11 orders in total, and {bread, butter} occurs in 3 of them.

Support = Freq(X,Y)/N

Support = 3/11 = 0.27
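As a quick check, the same arithmetic can be written in a couple of lines of base R. The variable names here are my own and the counts are taken straight from the example above.

```r
# Counts taken from the 11-order example.
n_orders <- 11           # total number of orders (N)
freq_bread_butter <- 3   # orders containing both bread and butter

# Support = Freq(X,Y) / N
support_bread_butter <- freq_bread_butter / n_orders
round(support_bread_butter, 2)  # 0.27
```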

**2. Confidence:** Given two items, X and Y, confidence measures the percentage of times that item Y is purchased, given that item X was purchased. This is expressed as:

Confidence (X -> Y) = Freq(X,Y)/Freq(X)

Looking back at the example, the percentage of times that bread (Y) is purchased, given that butter (X) was bought:

Confidence (butter -> bread) = 3/3 = 1

Confidence values range from 0 to 1, where 0 indicates that Y is never purchased when X is purchased, and 1 indicates that Y is always purchased whenever X is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that butter is purchased, given that bread was purchased:

Confidence (bread -> butter) = 3/7 = 0.429
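The directional nature of confidence is easy to see in a small base R sketch. The function name `confidence` is my own, not a package function, and the counts come from the example: butter appears in 3 orders, bread in 7, and both together in 3.

```r
# Hypothetical helper: Confidence(X -> Y) = Freq(X,Y) / Freq(X)
confidence <- function(freq_xy, freq_x) {
  freq_xy / freq_x
}

freq_both   <- 3  # orders with both butter and bread
freq_butter <- 3  # orders with butter
freq_bread  <- 7  # orders with bread

conf_butter_to_bread <- confidence(freq_both, freq_butter)
conf_bread_to_butter <- confidence(freq_both, freq_bread)

conf_butter_to_bread            # 1
round(conf_bread_to_butter, 3)  # 0.429
```

The numerator is the same in both calls; only the denominator changes, which is why the two directions give different values.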

Here we see that all of the orders that contain butter also contain bread. However, does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure which takes into account the popularity of both items.

**3. Lift:** Unlike the confidence metric, whose value may vary depending on direction (e.g., confidence{X -> Y} may be different from confidence{Y -> X}), **lift has no direction**. This means that lift{X,Y} is always equal to lift{Y,X}:

lift{X,Y} = lift{Y,X} = support{X,Y} / (support{X} * support{Y})

lift{butter, bread} = lift{bread, butter} = support{butter, bread} / (support{butter} * support{bread})

lift{butter, bread} = lift{bread, butter} =(3/11)/((3/11)*(7/11))

lift{butter, bread} = lift{bread, butter} =1.571

In the example above, butter occurred in 27.3% (= 3/11) of the orders and bread occurred in 63.6% (= 7/11) of the orders. If there were no relationship between them, we would expect both to show up together in the same order about 17.4% of the time (i.e., 27.3% * 63.6%). This is the denominator. The numerator, on the other hand, represents how often butter and bread actually appear together in the same order (27.3%). Dividing the numerator by the denominator tells us how many times more often butter and bread appear in the same order than they would if there were no relationship between them (i.e., if they occurred together simply at random).

In summary, lift can take the following values:

- Lift = 1; implies no relationship between X and Y (i.e., X and Y occur together only by chance)
- Lift > 1; implies that there is a positive relationship between X and Y (i.e., X and Y occur together more often than random)
- Lift < 1; implies that there is a negative relationship between X and Y (i.e., X and Y occur together less often than random)

In our example, butter and bread occur together 1.57 times *more* than random, so we conclude that there exists a positive relationship between them.
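The lift calculation can be sketched the same way in base R. The function name `lift` and its argument names are my own choices for this illustration; the counts are the ones from the example.

```r
# Hypothetical helper: lift{X,Y} = support{X,Y} / (support{X} * support{Y})
lift <- function(freq_xy, freq_x, freq_y, n) {
  support_xy <- freq_xy / n
  support_x  <- freq_x / n
  support_y  <- freq_y / n
  support_xy / (support_x * support_y)
}

# butter: 3 orders, bread: 7 orders, both: 3 orders, total: 11 orders
lift_butter_bread <- lift(freq_xy = 3, freq_x = 3, freq_y = 7, n = 11)
lift_bread_butter <- lift(freq_xy = 3, freq_x = 7, freq_y = 3, n = 11)

round(lift_butter_bread, 3)  # 1.571
```

Swapping X and Y only swaps the two factors in the denominator, so `lift_bread_butter` comes out the same as `lift_butter_bread`.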

Let’s look at the market basket analysis using R:

```r
# Load the libraries
library(arules)
library(arulesViz)
library(datasets)

# Load the data set
setwd("/Users/niharikagoel/Downloads")
tr <- read.transactions('MBA_blog.csv', format = 'basket', sep = ',')
summary(tr)
View(tr)

# Create an item frequency plot for the top 20 items
itemFrequencyPlot(tr, topN = 20, type = "absolute")

# Get the rules
rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.001, target = "rules"))
rules <- sort(rules, by = "confidence", decreasing = TRUE)
summary(rules)

# Number of rules
print(length(rules))

# Show the top 5 rules, but only 2 digits
options(digits = 2)
# inspect(rules[1:5])

write(rules, file = "MBA_blog_result.csv", sep = ",", quote = TRUE, row.names = FALSE)
```