

DeTistics
Techniques
An index of statistical and data mining techniques with brief definitions of each technique, sample research questions each technique can address, and links to related DeTistics content.
C
Classification
Classification is a data mining technique for determining which of a predefined set of groups a given observation/action/person belongs to. Classification techniques require an existing set of preclassified data, which can either be obtained by hand-coding a subset of the data or by using other data mining techniques to identify the number of groups and assign group membership to a subset of the data. Classification techniques then compare the features of each unclassified observation/action/person to the features of the existing preclassified data points to determine which preexisting group each new observation/action/person is most similar to.
Research Questions Addressed: Which player/item type does each new player/item belong to? Which strategy is each new player/student using?
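As a rough illustration, the sketch below uses a k-nearest-neighbors classifier from scikit-learn (one of many possible classification algorithms, and assumed to be installed); the player features, hand-coded labels, and numbers are all invented.
# Minimal classification sketch: assign new players to predefined groups
# based on their similarity to hand-coded examples. Data are invented.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical player features: [sessions_per_week, avg_session_minutes]
X = np.array([[1, 10], [2, 15], [7, 90], [6, 120], [3, 20], [8, 100]])
y = np.array(["casual", "casual", "hardcore", "hardcore", "casual", "hardcore"])  # preclassified labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(model.predict(X_test))  # predicted group for each unclassified player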
Cluster Analysis
Cluster Analysis is a data mining technique for identifying groups of people who perform similar actions or groups of actions performed by the same people. Cluster analysis techniques generally work by plotting the data in multidimensional space and then calculating the distances between the points in that space. Points that are close to each other are assigned to the same cluster, and regions of the space with few or no points separate the clusters from each other. Hard clustering techniques require each point to belong to a single cluster, while fuzzy clustering techniques assign a probability of each point belonging to each cluster.
Research Questions Addressed: What are the different player/item types in this data set? What are the different strategies players/students are using in this context?
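A minimal hard-clustering sketch using scikit-learn's KMeans (assumed available); the two-dimensional player features below are invented so that two clear groups exist.
# Hard clustering sketch: each point is assigned to exactly one cluster.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a hypothetical player plotted in 2-D feature space.
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # hard cluster assignment for each point
print(kmeans.cluster_centers_)  # centroid of each cluster in feature space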
A
Association Rule Mining
Association Rule Mining is a data mining technique for identifying actions that frequently co-occur. Association rules indicate the probability that an action (B) occurs, given the occurrence of another action (A). Each association rule has a corresponding support, confidence, and lift. Support for a given rule indicates the proportion of observations containing both A and B. Confidence for a given rule indicates the percentage of observations containing A that also contain B. Lift for a given rule indicates the ratio of the observed support to the support expected if A and B were independent. Note that association rule mining does not contain any information about the relative order of A and B. If the relative order is important, sequence mining should be used instead.
Research Questions Addressed: Players who equipped a given item also usually equipped which other items? Students who answered a given question correctly also generally answered which other questions correctly?
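The support, confidence, and lift calculations can be sketched by hand for a single rule A -> B; the transactions and item names below are made up purely for illustration.
# Hand-rolled support/confidence/lift for the rule "sword -> shield".
transactions = [
    {"sword", "shield"}, {"sword", "shield", "potion"},
    {"potion"}, {"sword"}, {"sword", "shield"},
]
n = len(transactions)
a, b = "sword", "shield"

support_a  = sum(a in t for t in transactions) / n            # P(A)
support_b  = sum(b in t for t in transactions) / n            # P(B)
support_ab = sum(a in t and b in t for t in transactions) / n # P(A and B)

confidence = support_ab / support_a   # P(B | A)
lift = confidence / support_b         # observed vs. expected if A and B were independent

print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")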
F
Factor Analysis
Factor Analysis is a statistical technique for combining a number of observed variables into a smaller number of unobserved latent variables called factors. Factor analysis techniques seek to explain the correlations between observed variables by identifying the combination of vectors that best fit the observed data points. The resulting factor loadings indicate the degree to which values of each variable are influenced by each of the underlying vectors.
Research Questions Addressed: Which factors explain the observed responses to a survey or test, and which survey or test items load on which factors? Which factors explain player performance across game levels, and which game levels load on which factors?
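A brief sketch using scikit-learn's FactorAnalysis (assumed available); the survey responses are randomly generated from two known factors so the recovered loadings can be compared against the structure used to simulate them.
# Factor analysis sketch: recover loadings of 4 observed items on 2 latent factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                                # two unobserved factors
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.7], [0.0, 0.9]])
X = latent @ loadings.T + rng.normal(scale=0.3, size=(200, 4))    # observed item responses

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_.round(2))  # estimated loadings: how strongly each item loads on each factor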
G
Generalizability Theory
Generalizability Theory is a statistical technique for separating the different sources of variance in a given measure. A generalizability analysis consists of two components: a G-study and a D-study. The G-study determines the amount of variance accounted for by each facet of the measure (e.g., students, schools, raters, items). The D-study determines the minimum size of each facet required for a reliable score (e.g., how many items, raters, etc.).
Research Questions Addressed: How many items, levels, raters, observations, etc. do I need to get a good measurement? How much of a person's performance is based on the group they're in?
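A hand-rolled sketch of a one-facet (persons x raters) G-study and D-study, estimating variance components from ANOVA mean squares; the rating data are invented, and dedicated software would normally be used, so treat this only as an outline of the variance-component logic.
# One-facet G-study (persons x raters, fully crossed) plus a simple D-study.
import numpy as np

scores = np.array([            # rows = persons, columns = raters (hypothetical ratings)
    [4, 5, 3], [2, 3, 1], [5, 5, 4], [3, 2, 2], [4, 4, 3],
])
n_p, n_r = scores.shape
grand = scores.mean()
p_means = scores.mean(axis=1)
r_means = scores.mean(axis=0)

ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((r_means - grand) ** 2).sum() / (n_r - 1)
resid = scores - p_means[:, None] - r_means[None, :] + grand
ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

var_p = (ms_p - ms_res) / n_r   # person (true-score) variance
var_r = (ms_r - ms_res) / n_p   # rater variance
var_res = ms_res                # person x rater interaction plus error

# D-study: generalizability coefficient for hypothetical numbers of raters.
for k in (1, 2, 3, 5):
    print(k, round(var_p / (var_p + var_res / k), 3))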
H
Hierarchical Linear Modeling
Hierarchical Linear Modeling (HLM) is a statistical technique based on linear regression that allows the slope and intercept of the linear function to differ for different groups of players/students. Situations where hierarchical groupings exist (e.g., students nested in schools, players nested in countries/guilds) violate the independence of observations assumption of standard linear regression because people in a given group are often more similar to each other than they are to people in other groups, and that similarity often increases over time. Hierarchical Linear Modeling allows for the explicit modeling of these dependencies.
Research Questions Addressed: What is the effect of a given intervention on students from different schools? What is the effect of a given incentive on players from different countries or guilds?
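A minimal random-intercept and random-slope sketch using statsmodels' MixedLM (assumed available); the school, study-hours, and score variables are fabricated so that schools genuinely differ.
# Mixed-effects (hierarchical) regression: students nested in schools.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
schools = np.repeat(np.arange(10), 30)                # 10 schools, 30 students each
school_effect = rng.normal(0, 2, 10)[schools]         # school-level intercept differences
hours = rng.uniform(0, 10, size=schools.size)         # hypothetical predictor
score = 50 + 3 * hours + school_effect + rng.normal(0, 5, schools.size)
data = pd.DataFrame({"score": score, "hours": hours, "school": schools})

# re_formula lets both the intercept and the slope for hours vary across schools.
model = smf.mixedlm("score ~ hours", data, groups=data["school"], re_formula="~hours")
print(model.fit().summary())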
I
Item Response Theory
Item Response Theory (IRT) is a statistical technique that simultaneously identifies the difficulty of each item on a test and the ability of each individual who took that test. Items are initially placed on a difficulty scale based on the number of individuals answering each item correctly, and individuals are initially placed on the same scale based on the difficulty of the items they answered correctly. After initial placement, item and individual locations are adjusted to account for slips and guesses. For example, if an individual answered several hard items correctly but missed a single easy item, the algorithm can assume they knew the answer to the easy item; the individual's location is then adjusted to the right (they know more than originally thought) and the item is moved to the left (it is easier than originally thought).
Research Questions Addressed: What is the relative difficulty of each item/level/character in a given set of items/levels/characters? What is the ability of each student/test taker/player measured across those items/levels/characters?
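A hand-rolled sketch of a simple Rasch-style IRT model fit by gradient ascent, estimating one ability per person and one difficulty per item from simulated responses; dedicated IRT software handles guessing and slipping parameters and estimation far more carefully, so this is only an outline of the core idea.
# Rasch-style sketch: P(correct) = sigmoid(ability - difficulty).
import numpy as np

rng = np.random.default_rng(0)
true_theta = rng.normal(size=50)             # person abilities used to simulate data
true_b = np.linspace(-2, 2, 8)               # item difficulties used to simulate data
prob = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.uniform(size=prob.shape) < prob).astype(float)   # 0/1 answer matrix

theta = np.zeros(50)
b = np.zeros(8)
for _ in range(500):                          # joint maximum-likelihood updates
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    theta += 0.01 * (responses - p).sum(axis=1)   # push abilities toward observed totals
    b -= 0.01 * (responses - p).sum(axis=0)       # push difficulties the other way
    theta -= theta.mean()                         # anchor the scale

print(np.round(b, 2))   # estimated item difficulties on the same scale as abilities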
L
Linear Regression
Linear Regression is a statistical technique that predicts a continuous variable (y) based on a linear function of another variable (x). The intercept of the linear function represents the predicted value of y when x is 0. The slope of the linear function represents the change in y for every one-unit change in x.
Research Questions Addressed: What is the relationship between the number of books in a child's home and his/her reading ability? How does the number of levels completed relate to the amount of money a player will spend in the game?
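A short sketch with scikit-learn's LinearRegression (assumed available); the books-in-the-home and reading-score numbers are invented.
# Linear regression sketch: fit a line and read off intercept and slope.
import numpy as np
from sklearn.linear_model import LinearRegression

books = np.array([[0], [5], [10], [20], [50], [100]])   # x: books in the home
reading = np.array([40, 45, 52, 60, 75, 90])            # y: reading score

model = LinearRegression().fit(books, reading)
print(model.intercept_)   # predicted reading score when books = 0
print(model.coef_[0])     # change in reading score per additional book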
Logistic Regression
Logistic Regression is a statistical technique that predicts the probability of a given value of a binary or categorical variable (y) based on a logistic function of another variable (x). Logistic regression (also known as logit regression) is a basic probabilistic classification model that determines the odds of y given different values of x.
Research Questions Addressed: What is the probability that a child will pass third grade, given the number of books in his/her home? What is the probability that a player will purchase extended content for a game, given the number of levels completed in the game?
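A short sketch with scikit-learn's LogisticRegression (assumed available); the levels-completed and purchase data are invented.
# Logistic regression sketch: probability of a binary outcome from one predictor.
import numpy as np
from sklearn.linear_model import LogisticRegression

levels = np.array([[1], [2], [3], [5], [8], [10], [12], [15]])   # x: levels completed
purchased = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # y: bought extra content?

model = LogisticRegression().fit(levels, purchased)
print(model.predict_proba([[6]])[0, 1])   # predicted probability of a purchase after 6 levels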
N
Natural Language Processing
Natural Language Processing (NLP) is a family of data mining techniques for automatically identifying the content or quality of a given corpus of unstructured text. NLP techniques are generally either algorithm-based techniques that rely on proximity to provide meaning (and length and feature counts to determine quality), or rule-based grammatical or linguistic techniques that rely on the underlying structure of the text to provide meaning (and adherence to those rules to determine quality).
Research Questions Addressed: Which web sites/essays mention a given topic? What does a given web site/essay say about a given topic? What is the quality of the writing in a given web site/essay?
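A minimal bag-of-words sketch using scikit-learn's TfidfVectorizer and cosine similarity (both assumed available); the documents and the topic query are invented, and this covers only the first research question above.
# Which documents mention a given topic? Score each document against a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The boss fight in level three is far too difficult.",
    "I love the art style and the soundtrack of this game.",
    "Level three difficulty spikes made me quit.",
]
query = ["level difficulty"]

vec = TfidfVectorizer(stop_words="english")
doc_vectors = vec.fit_transform(docs)
query_vector = vec.transform(query)
print(cosine_similarity(query_vector, doc_vectors))   # higher score = closer to the topic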
P
Principal Component Analysis
Principal Component Analysis (PCA) is a statistical technique that is similar to factor analysis, but with some additional constraints. Rather than identifying the combination of vectors that best fit the observed data points, PCA first identifies the single vector that explains the most variance in the data. Subsequent vectors are constrained to be orthogonal to all preceding vectors.
Research Questions Addressed: What other factors besides general math ability explain the observed responses to a math test, and which survey or test items load on these additional factors? Which other factors besides general gaming ability explain player performance across game levels, and which game levels load on these additional factors?
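A brief sketch with scikit-learn's PCA (assumed available); the per-level performance scores are randomly generated purely to show the mechanics.
# PCA sketch: orthogonal components ordered by the variance they explain.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 6))      # hypothetical per-level performance scores

pca = PCA(n_components=3).fit(scores)
print(pca.explained_variance_ratio_)    # variance explained by each orthogonal component
print(pca.components_.round(2))         # how each level loads on each component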
S
Sequence Mining
Sequence Mining is a data mining technique for identifying actions that frequently occur after other actions. Sequence mining is association rule mining that takes the order of actions into account. In this case, the rules produced indicate the probability that an action (B) occurs, given the previous occurrence of another action (A). Sequence mining algorithms can be parameterized to find action sets wherein B occurs immediately after A or within a given window of time following A. Though it is most common to search for action pairs, sequence mining algorithms also allow the number of actions in each action set to be parameterized so that chains of actions of any given length can be identified.
Research Questions Addressed: What page of the website or portion of the task do people most frequently navigate to after the one in question? What is the game most people play next? In what order do people most frequently navigate through the help menu or hint menu?
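A hand-rolled sketch that counts how often one action immediately follows another; the navigation sequences are invented, and restricting the window to the very next action is an arbitrary choice for illustration.
# Ordered pair counting: P(B occurs immediately after A).
from collections import Counter

sequences = [
    ["home", "levels", "shop", "play"],
    ["home", "shop", "play"],
    ["home", "levels", "play", "shop"],
]

pair_counts = Counter()
first_counts = Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):    # B occurring immediately after A
        pair_counts[(a, b)] += 1
        first_counts[a] += 1

for (a, b), count in pair_counts.items():
    print(f"P({b} right after {a}) = {count / first_counts[a]:.2f}")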
Structural Equation Modeling
Structural Equation Modeling (SEM) is a family of statistical techniques, including path analysis and confirmatory factor analysis, for determining how well a given latent variable model fits observed data, or for determining the strength of the relationship between each observed variable and the latent variable it provides evidence of. In structural equation models, latent variables are represented as ovals, observed variables are represented as squares, and arrows drawn from each latent variable to the appropriate observed variables represent the relationships between the observed and latent variables.
Research Questions Addressed: Does a given model fit equally well for different groups (e.g., men and women, children and adults, etc.)? Is there a significant relationship between a given observed variable and the latent variable it is assumed to be related to?
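A sketch that assumes the third-party semopy package and its lavaan-style model syntax (both are assumptions here, not endorsements); the item columns and the latent variable name are invented, and the data are simulated so the single-factor measurement model should fit.
# Confirmatory factor analysis sketch: one latent variable measured by three items.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(0)
latent = rng.normal(size=300)
data = pd.DataFrame({
    "x1": latent + rng.normal(0, 0.5, 300),
    "x2": latent + rng.normal(0, 0.5, 300),
    "x3": latent + rng.normal(0, 0.5, 300),
})

desc = "engagement =~ x1 + x2 + x3"   # hypothetical latent variable and its indicators
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())                # estimated loadings and their significance tests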
Survival Analysis
Survival Analysis is a statistical technique for determining the amount of time until the occurrence of a given event (death, in the original medical research in this area). Survival analysis takes into account information from uncensored observations (those who experienced the event in the time period being studied) and censored observations (those who did not experience the event during the study) in order to produce a survival function and a hazard function. The survival function indicates the probability of surviving (not experiencing the event) up to each time point. The hazard function indicates the likelihood of the event occurring at a given time point, given survival up to that time point.
Research Questions Addressed: How long will different types of people play a given game? Which game levels have the most drop-off? How many items will different types of people solve in a timed test?
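A short Kaplan-Meier sketch assuming the lifelines package is available; the days-played values and the quit/censored indicators are invented.
# Survival analysis sketch: estimate the probability of still playing over time.
from lifelines import KaplanMeierFitter

days_played = [3, 7, 7, 10, 14, 21, 30, 30, 45, 60]
quit_game   = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]   # 0 = censored (still playing at study end)

kmf = KaplanMeierFitter()
kmf.fit(days_played, event_observed=quit_game)
print(kmf.survival_function_)   # estimated probability of still playing at each time point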