
Finding Relationships: Association versus Correlation

  • dekerr
  • Jun 7, 2014
  • 4 min read

Association and correlation are common techniques for identifying meaningful relationships between different aspects of human behavior. Use of these two techniques differs based on the analyst’s field of study. If the analyst has a marketing, business, or computer science background, they probably use association to identify behavioral patterns (e.g., which items are commonly purchased together?). If the analyst has a social science or psychology background, they probably use correlation to identify patterns in variables (e.g., which predictors covary?).

Though the two techniques are very similar in intent, there are differences in their underlying mathematical processes that lead to important distinctions in how the results of each technique should be interpreted.

What is Association?

The math behind association is fairly simple. Association consists of two statistics: support and confidence. Support is an indicator of the frequency of a given behavior, and is calculated by determining the proportion of entries in the dataset that contain the behavior of interest (i.e., P(A) ). Confidence is an indicator of how often the given behavior co-occurs with an additional behavior (or behaviors), and is calculated by determining the proportion of entries containing the behavior of interest that also contain the additional behavior (i.e., P(AB)/P(A) ).

While support and confidence are good indicators of a relationship between behaviors in the data, they can produce very misleading results when B occurs more frequently than A. This stems largely from the fact that association is unidirectional (e.g., the probability of B given A is not the same as the probability of A given B). For example, if 10 people bought Gooey Fun Snacks and 8 of those 10 bought milk, the confidence that Gooey Fun Snacks and milk purchases are related is 80%. However, if 1,000 people bought milk (only 8 of whom bought Gooey Fun Snacks), the confidence that milk and Gooey Fun Snacks purchases are related is only 0.8%.
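To make the asymmetry concrete, here is a minimal Python sketch using the purchase counts from the example above. The total number of transactions (5,000) is not given in the example and is assumed here only so that support can be computed; the function and variable names are illustrative.

```python
def support(n_item, n_total):
    """Support: proportion of all entries that contain the item, P(A)."""
    return n_item / n_total

def confidence(n_both, n_antecedent):
    """Confidence of the rule A -> B: P(AB) / P(A), computed from counts."""
    return n_both / n_antecedent

n_total = 5000        # ASSUMPTION: total transactions, not stated in the text
snack_buyers = 10     # bought Gooey Fun Snacks
milk_buyers = 1000    # bought milk
both = 8              # bought both

support_snacks = support(snack_buyers, n_total)      # 0.002
conf_snacks_to_milk = confidence(both, snack_buyers)  # 0.8, i.e. 80%
conf_milk_to_snacks = confidence(both, milk_buyers)   # 0.008, i.e. 0.8%

print(f"snacks -> milk: {conf_snacks_to_milk:.1%}")
print(f"milk -> snacks: {conf_milk_to_snacks:.1%}")
```

The same 8 shared purchases produce very different confidence values depending on which behavior is treated as the starting point.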

Therefore, analysts using association frequently calculate lift as well as support and confidence. Lift is an indicator of how much more often the behaviors co-occur than would be expected if they were unrelated, and it is calculated by multiplying the denominator of the confidence calculation by the probability of B (i.e., P(AB)/(P(A)P(B)) ). A lift of 1 means A and B co-occur exactly as often as independent behaviors would, and, unlike confidence, lift is the same whether it is calculated from A to B or from B to A. While the use of lift increases the interpretability of the results, association still refers only to the data that was analyzed and does not indicate that the identified relationship is true in the population (e.g., it is possible that there will not be a relationship between A and B in next month's dataset, even if the relationship between them is strong in this month's dataset).
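A short Python sketch of the lift calculation, reusing the snack and milk counts from the earlier example. The total of 5,000 transactions is an assumption made for illustration (the example does not state it), and the function name is hypothetical.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift: P(AB) / (P(A) * P(B)), computed from raw counts."""
    p_ab = n_both / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_ab / (p_a * p_b)

n_total = 5000        # ASSUMPTION: total transactions, not stated in the text
snack_buyers = 10
milk_buyers = 1000
both = 8

lift_snacks_milk = lift(both, snack_buyers, milk_buyers, n_total)
lift_milk_snacks = lift(both, milk_buyers, snack_buyers, n_total)

# Under the assumed total, snack buyers purchase milk about 4 times as
# often as shoppers in general, and the value does not depend on direction.
print(round(lift_snacks_milk, 2), round(lift_milk_snacks, 2))
```

Note that the result depends on the assumed total: a different number of transactions would change P(A) and P(B), and therefore the lift.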

What is Correlation?

The math behind correlation is only slightly more complicated than the math behind association. Correlation, like association, consists of two statistics: the correlation coefficient (r) and significance (p). The correlation coefficient is a measure of how much the two variables change together (e.g., as education level increases income also increases, so education level and income are correlated). Significance is a measure of how likely it is that a correlation at least this strong would show up by chance in a dataset of this size if the two variables were actually unrelated (so a small p suggests the relationship is not just a fluke of this particular dataset).

The formula for calculating the correlation coefficient can be written in a form similar to the formula for calculating lift. Rather than calculating P(AB)/(P(A)P(B)), as done in association, correlation uses the calculation SS(AB)/SQRT(SS(A)SS(B)), where SS stands for Sum of Squares and SS(AB) is calculated by sum(AB)-(sum(A)sum(B)/n), where n is the sample size of the dataset. SS(A) and SS(B) are calculated the same way, with the variable paired with itself (e.g., SS(A) is sum(A^2)-(sum(A)^2/n) ). Values close to 1 indicate a positive relationship (e.g., as A increases, B increases), values close to -1 indicate a negative relationship (e.g., as A increases, B decreases), and values close to 0 indicate no linear relationship at all.
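The sum-of-squares form of the correlation coefficient can be sketched in a few lines of Python. The data below is made up purely for illustration, and the function names are hypothetical; SS(A) is computed by pairing A with itself, which is the standard definition rather than something spelled out in the formula above.

```python
from math import sqrt

def sum_of_squares(a, b):
    """SS(AB) = sum(AB) - (sum(A) * sum(B) / n)."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b) / n

def correlation(a, b):
    """r = SS(AB) / SQRT(SS(A) * SS(B))."""
    return sum_of_squares(a, b) / sqrt(sum_of_squares(a, a) * sum_of_squares(b, b))

# Made-up illustrative data (not from any real dataset)
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 5]
b_reversed = [10, 8, 6, 4, 2]

r_ab = correlation(a, b)            # roughly 0.77: a positive relationship
r_ba = correlation(b, a)            # same value: correlation is bidirectional
r_neg = correlation(a, b_reversed)  # -1.0: a perfect negative relationship

print(round(r_ab, 3), round(r_ba, 3), round(r_neg, 3))
```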

Significance is determined by calculating r/SQRT[(1-r^2)/(n-2)] and comparing the resulting value to a t-table with n-2 degrees of freedom to determine the probability of seeing a correlation at least this strong if no relationship actually existed. Depending on the field, significance values smaller than .05 or .01 are generally considered good, since that indicates less than a 5% (or 1%) chance that a correlation this strong would arise from chance alone.
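As a rough sketch of the significance calculation, the Python below computes the t statistic for a hypothetical correlation of r = 0.5 in a sample of n = 30 (both values are invented for illustration). The critical value 2.048 is the two-tailed .05 entry for 28 degrees of freedom from a standard t-table, hard-coded here in place of a table lookup.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r / SQRT((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))

r = 0.5    # hypothetical correlation coefficient
n = 30     # hypothetical sample size

t = t_statistic(r, n)
degrees_of_freedom = n - 2

# Two-tailed critical value at the .05 level for 28 degrees of freedom,
# taken from a standard t-table (hard-coded, not computed).
T_CRITICAL_05_DF28 = 2.048

significant = abs(t) > T_CRITICAL_05_DF28
print(f"t = {t:.3f}, df = {degrees_of_freedom}, significant at .05: {significant}")
```

Because the t statistic grows with the square root of n-2, the same r becomes easier to call significant as the sample gets larger.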

Correlation vs. Association

Correlation has some benefits over association. First, unlike association, correlation is bidirectional (e.g., the correlation between A and B is the same as the correlation between B and A). Second, correlation provides a test for the significance of the observed relationship. This makes it easier to identify spurious relationships using correlation than using association (provided that the number of correlations run is fairly small).

However, when the sample size is very large (e.g., in the millions or hundreds of millions) or there are hundreds or even thousands of relationships that need to be examined, spurious relationships are almost as likely to be found using correlation as when using association. In such cases, association has a practical advantage over correlation. While the correlation calculation requires every cell of the dataset to be examined, the association calculation has been optimized in a number of ways to drastically reduce the search space.

It is for these reasons that analysts with a marketing, business, or computer science background often prefer association (which can handle the size of their data much more gracefully than correlation without being much less reliable), while analysts with a social science or psychology background (whose data is often orders of magnitude smaller, and for whom accuracy is usually paramount) often prefer correlation.

 
 
 

© 2015 by Deirdre Kerr. Created with Wix.
