The Teachable Agents Group at Vanderbilt University
 Vanderbilt University | School of Engineering: EECS | ISIS

Educational Data Mining

The learning and problem-solving tasks in our learning environments (e.g., Betty's Brain and CTSiM) are complex, open-ended, and choice-rich, so learners must employ a variety of cognitive, metacognitive, and self-regulative skills to achieve success. For example, at the cognitive level in Betty's Brain, they need to identify and understand relevant information from the resources in the system, represent that information in the causal map format to teach their agent, and use questions and quizzes to explore Betty's understanding and assess her overall progress. At the metacognitive level, they need to set goals and choose strategies related to their knowledge construction and monitoring tasks. To self-regulate, they have to make choices about how best to use the learning environments they are working in, and how to allocate their time between the different activities. In other words, they must decided when and how to acquire information, build and modify the causal map, check Betty's progress, and reflect on their own understanding of both the science knowledge and the evolving causal map structure. At the same time, they have to keep track of their progress and regulate & modify their behaviors or seek help when they are unable to make meaningful progress.

Many details of students' learning activities, as well as additional information like performance on learning and problem-solving tasks, can be tracked by computer-based learning environments (CBLEs). Because the learning activities logged by the system result from a variety of internal cognitive and metacognitive states and processes used by the student, this wealth of data provides opportunities to assess, model, and understand student learning behaviors and strategies more accurately. Further, better understanding and more effective assessment of student learning behaviors can be used to inform and improve learning strategy feedback and support in these systems. However, realizing these opportunities requires effective methods for identifying important learning behavior patterns in the activity trace data.

Our work in educational data mining aims to identify these learning behaviors and strategies through the development of new techniques for modeling and mining student learning activity traces. Our current work includes two primary areas of focus: generation of hidden Markov models to provide an overview of student learning stratgies and sequence mining techniques to identify specific activity patterns corresponding to important student learning behaviors.

Hidden Markov Models

Hidden Markov models (HMMs) can provide an aggregate probabilistic repesentation of students' internal states and strategies during learning in CBLEs. Analyses that do not take into account the sequential nature of student interactions with the system, such as action frequency counts, capture limited and indirect information about learning strategies and students' cognitive state. Employing a representation of these internal states and strategies with a probabilistic automaton, such as an HMM, has the potential for facilitating identification, interpretation, and comparison of student learning behaviors. In particular, HMMs can provide a concise, aggregated representation of student learning strategies and behaviors (e.g., strategies and their relationships, as opposed to simple action or action-sequence frequencies) derived from learning activity sequences.

Like a student's mental processes, the states of an HMM are hidden, meaning that they cannot be directly observed, but they produce observable emissions (in this case, actions in a learning environment). Together, three sets of probabilities form a complete model: (1) transition probabilities, which determine the likelihood of going from one state to another at each step; (2) state output probabilities, which define the likelihood of observing different actions in each state; and (3) state initial probabilities, which define the likelihood that a state will be the starting state for an activity sequence. Our methodology for generating these models from student learning activity traces has facilitated identification, interpretation, and comparison of learning behaviors at an aggregate level.


HMMs generated for two experimental conditions in a recent Betty's Brain study

Differential Sequence Mining

While hidden Markov models provide an important overview of student learning strategies, their aggregated descriptions of learning behaviors can also mask the exact manifestation of some learning behaviors that are important for a deeper understanding of student behavior and improving feedback and support in the system. To identify and analyze these specific learning behaviors from trace data, we have developed an exploratory data mining methodology, including a differential sequence mining (DSM) technique that provides a finer-grained analysis and more precise definitions of learning behavior differences between groups of students or phases of student activity. The DSM technique uses sequence mining on multiple measures of pattern frquency to identify differentially frequent patterns between two sets of action sequences. In other words, DSM autonomously identifies and ranks patterns that are used more frequently in one group of action sequences than in another. This comparison between groups allows us to identify potentially important patterns. For better identification of important learning behaviors, student activity sequences are also broken up into productive and unproductive phases based on the changes in a performance metric (e.g., how closely the current causal map matches the expert map in a Betty's Brain unit) using a linear segmentation algorithm. Differential sequence mining is then applied to identify differentially frequent patterns between productive and unproductive phases in a group of students or between corresponding phases in two different groups of students. We interpret identified patterns in terms of student learning behaviors with respect to our cognitive and metacognitive model of student learning in the system. Finally, we investigate the specific instances of promising activity patterns to determine their potential for informing pedagogical feedback and support in the learning environment.


Map score and phase segmentation for example high- and low-performing students in Betty's Brain

Applying this exploratory data mining methodology to data from Betty's Brain studies has allowed us to identify a variety of patterns relevant to student learning behaviors that may prove useful for improving feedback from the pedagogical agents. For example, a comparison of high-performing versus low-performing students illustrated that a distinguishing characteristic of the high performers was a tendency to employ reading in a monitoring context (e.g., following a quiz by reading relevant resource pages). Further, other differentially frequent patterns, such as taking a quiz following the addition of a questionable link (e.g., one unrelated to recently read pages in the resources) and then following the quiz with the removal of the questionable link, suggested that students in the high group were also able to effectively use quizzes for monitoring their agent's (and their own) understanding. In general, the identified patterns from this analysis suggested that high-performing students made more frequent and more effective attempts to monitor their agents' and their own understanding with both quizzes and reading. In constrast, the low-performing students tended to be most productive when reading resource pages for the first time (as opposed to a variety of re-reading activities observed in the high students' productive periods) and when they systematically combined reading with editing. However, the low students more often performed extended sequences of only reading or only editing and generally took a less systematic approach to reading and editing in comparison with the high students. Overall, the analysis identified behavioral trends and specific patterns of activity that may be important in triggering and informing strategy feedback for both low-performing and high-performing students.

Our educational data mining tools are implmented in Java and are available by emailing John Kinnebrew [john.s.kinnebrew (AT) vanderbilt.edu] or Gautam Biswas [gautam.biswas (AT) vanderbilt.edu].