KT Learning Lab 4: A Conceptual Overview
Deep Knowledge Tracing (DKT): based on long short-term memory (LSTM) networks
Fit to sequences of student performance across skills
Can fit very complex functions (a minimal sketch follows below)
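To make this concrete, here is a minimal sketch of a DKT-style model in PyTorch: each interaction is encoded as a one-hot over (skill, correctness) pairs, fed through an LSTM, and mapped to a per-skill probability of correctness at each step. The layer sizes and names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class DKTSketch(nn.Module):
    """Minimal DKT-style model: LSTM over (skill, correctness) one-hots."""
    def __init__(self, n_skills: int, hidden_size: int = 100):
        super().__init__()
        # Input: one-hot of size 2 * n_skills (skill id x correct/incorrect)
        self.lstm = nn.LSTM(input_size=2 * n_skills,
                            hidden_size=hidden_size, batch_first=True)
        # Output: predicted probability of correctness for every skill
        self.out = nn.Linear(hidden_size, n_skills)

    def forward(self, x):                  # x: (batch, seq_len, 2 * n_skills)
        h, _ = self.lstm(x)                # (batch, seq_len, hidden_size)
        return torch.sigmoid(self.out(h))  # (batch, seq_len, n_skills)

# Usage: predictions at step t are compared to the observed correctness
# of the skill actually attempted at step t + 1.
model = DKTSketch(n_skills=50)
x = torch.zeros(1, 10, 100)   # one fake student, 10 interactions
p_correct = model(x)          # (1, 10, 50)
```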
A range of knowledge tracing algorithms based on different variants of Deep Learning
Now literally hundreds of published variants
Most of them tiny tweaks to get tiny gains in performance
But in aggregate, there appear to be some real improvements to predictive performance (see the comparison in Gervet et al. (2020), for example)
We will discuss some of the key issues that researchers have tried to address, and what their approaches were.
Yeung and Yeung (2018) reported degenerate behavior for DKT
Getting answers right leads to lower knowledge
Wild swings in probability estimates in short periods of time
They proposed adding two types of regularization to moderate these swings (sketched in the loss below)
Increasing the weight placed on correctly predicting the current interaction
Reducing how much the model is allowed to change its estimates from one time step to the next
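A schematic of what such a regularized loss can look like, assuming predictions have already been gathered to the attempted skills; the weight values are placeholders, not the ones from Yeung and Yeung (2018).

```python
import torch
import torch.nn.functional as F

def dkt_plus_style_loss(pred_seq, pred_next, y_next, pred_curr, y_curr,
                        lambda_r=0.1, lambda_w=0.03):
    """Schematic DKT+-style loss for one student (weights are placeholders).

    pred_seq:  (seq_len, n_skills) full per-skill prediction vectors
    pred_next: (seq_len,) prediction for the skill attempted at the next step
    y_next:    (seq_len,) observed correctness at the next step
    pred_curr: (seq_len,) prediction for the skill attempted at this step
    y_curr:    (seq_len,) observed correctness at this step
    """
    loss_next = F.binary_cross_entropy(pred_next, y_next)  # original DKT objective
    loss_curr = F.binary_cross_entropy(pred_curr, y_curr)  # reconstruction term
    # Waviness penalty: discourage large step-to-step swings
    # in the full per-skill prediction vector
    waviness = (pred_seq[1:] - pred_seq[:-1]).abs().mean()
    return loss_next + lambda_r * loss_curr + lambda_w * waviness
```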
The DKT family generally predicts individual item correctness, not skill mastery.
Some variants estimate skill ability, but those skills are not always the same as the pre-defined skills, and the estimates may not be accurate.
Jiani Zhang et al. (2017) proposed an extension to DKT, called DKVMN, that fits an item-skill mapping too
Based on a Memory-Augmented Neural Network, which keeps an external memory matrix that the network updates and reads from (see the read-step sketch below)
Latent skills are “discovered” by the algorithm and difficult to interpret.
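A rough sketch of the DKVMN read step, with illustrative dimensions and random stand-ins for learned embeddings; the real model also performs an erase/add write to the value matrix after each response.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the DKVMN "read" step (dimensions are illustrative).
n_concepts, key_dim, value_dim = 20, 50, 100
key_matrix = torch.randn(n_concepts, key_dim)      # static latent concepts
value_matrix = torch.randn(n_concepts, value_dim)  # dynamic mastery memory

item_embedding = torch.randn(key_dim)              # embedding of the attempted item
# Correlation weight: how strongly this item loads on each latent concept
w = F.softmax(key_matrix @ item_embedding, dim=0)  # (n_concepts,)
# Read: a mastery summary used (with the item embedding) to predict correctness
read_content = w @ value_matrix                    # (value_dim,)
# The write step (not shown) uses the same weights w to erase/add to value_matrix.
```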
Lee and Yeung (2019) proposed an alternative to DKT, called KQN, that attempts to output more interpretable latent skill estimates
Again, uses an external memory network to represent skills
Also attempts to estimate the amount of information transfer between skills
Still not that interpretable
Yeung (2019) proposed an alternative to DKT, called Deep-IRT, that attempts to output more interpretable latent skill estimates
Again, uses an external memory network to represent skills
Fits separate networks to estimate student ability and item difficulty
Uses the estimated ability and difficulty to predict correctness with an item response theory model (sketched below).
Somewhat more interpretable (the IRT half, at least)
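The IRT half can be sketched in a few lines: ability and difficulty come from the two sub-networks, and correctness is predicted with a Rasch-style model. The fixed ability multiplier below follows the published description, but treat the exact value as an assumption to check against the paper.

```python
import torch

def deep_irt_style_prediction(student_ability, item_difficulty, scale=3.0):
    """Sketch of the IRT half of Deep-IRT: ability and difficulty come from
    separate small networks; correctness is predicted Rasch-style.
    The scale constant is an assumption; verify against Yeung (2019)."""
    return torch.sigmoid(scale * student_ability - item_difficulty)

# Usage: ability and difficulty would come from the two sub-networks
p = deep_irt_style_prediction(torch.tensor(0.4), torch.tensor(-0.2))
```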
Some deep learning-based algorithms attempt to estimate skill level.
Their skill estimates are rarely, if ever, compared to post-tests or other estimates of skill level.
(Most large datasets don’t have that data available)
Therefore, we don’t really know if the estimates are any good.
Scruggs, Baker, and McLaren (2019) proposed AOA, an extension to any knowledge tracing algorithm
Uses a human-derived skill-item mapping
Averages predicted performance on all items in a skill to produce a skill estimate (see the sketch below)
Led to successful prediction of post-tests outside the learning system
In unpublished work, I used DKVMN’s internal concept estimates to predict a post-test, but they were less predictive than skill estimates generated by AOA.
In Scruggs et al. (2023) internal skill estimates from Elo and BKT were outperformed by AOA skill estimates generated from those algorithms’ correctness predictions.
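A minimal sketch of the AOA idea with pandas; the column names and mapping here are hypothetical, and in practice predictions are generated for every item mapped to a skill, not only the attempted ones.

```python
import pandas as pd

def aoa_skill_estimates(item_predictions: pd.DataFrame,
                        item_to_skill: dict) -> pd.DataFrame:
    """Sketch of AOA: average a model's predicted correctness over all items
    mapped to a skill to get a per-student skill estimate.
    `item_predictions` is assumed to have columns
    ['student_id', 'item_id', 'p_correct'] (illustrative names)."""
    df = item_predictions.copy()
    df["skill"] = df["item_id"].map(item_to_skill)   # human-derived mapping
    return (df.groupby(["student_id", "skill"])["p_correct"]
              .mean()
              .rename("skill_estimate")
              .reset_index())
```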
Automated skill discovery would make it a lot easier to use knowledge tracing on data without skill tags.
It could also show relationships between skills.
Piech et al. (2015) mention that DKT accurately clustered items to skills in a synthetic data set.
Jiani Zhang et al. (2017) repeat the experiment for DKVMN and also show reasonable item clusters for Assistments data.
The figures shown in Jiani Zhang et al. (2017) use t-SNE (Van der Maaten and Hinton, 2008) to visualize neural network weights.
t-SNE is a very popular method, but the clusters it creates can be strongly influenced by the value of the perplexity parameter; lower values make t-SNE try harder to create clusters (illustrated in the snippet below).
In unpublished work, I used DKVMN on a large dataset with very reliable skill tags; the resultant clusters sometimes reflect the underlying skills, but sometimes do not.
DKVMN can cluster the exercises in the Synthetic-5 dataset into their five ground-truth concepts.
[Figures: t-SNE projections with perplexity=5 and perplexity=50; different colors are different skills]
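To see the perplexity caveat concretely, here is how one might project the same (stand-in) item embeddings at two perplexity values with scikit-learn; only the perplexity changes between the two projections.

```python
from sklearn.manifold import TSNE
import numpy as np

# Illustrative only: project hypothetical per-item network weights to 2-D
# with two different perplexity values; lower perplexity tends to produce
# tighter, more separated-looking clusters from the same weights.
item_weights = np.random.rand(100, 64)   # stand-in for learned item embeddings
low = TSNE(perplexity=5, random_state=0).fit_transform(item_weights)
high = TSNE(perplexity=50, random_state=0).fit_transform(item_weights)
```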
Finally, Karumbaiah, Ocumpaugh, and Baker (2022) found that DKVMN’s correctness predictions were more accurate when the model had no skill tags at all (it treated all items as belonging to the same skill) than when it had possibly-unreliable skill tags, or when it had accurate domain-level tags.
This suggests that deep learning algorithms may be well suited for data without good skill tags.
What information can DKT-family algorithms provide teachers?
What do you do for entirely new items?
Ding and Larson (2019) demonstrated theoretically that a lot of what DKT learns is how good a student is overall
They replicate that finding in a 2021 paper using a larger dataset.
Pandey and Karypis (2019) proposed a DKT variant, called SAKT, which fits attentional weights between exercises and more explicitly predicts performance on the current exercise from performance on past exercises
Gets a slightly better fit, and doubles down a little more on some limitations we’ve already discussed
Ghosh, Heffernan, and Lan (2020) proposed a DKT variant, called AKT, which
Explicitly stores and uses learner’s entire past practice history for each prediction
Uses an exponential decay curve to down-weight past actions (sketched below)
Uses Rasch-model embeddings to calculate item difficulty
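The decay idea can be sketched as follows. AKT's actual mechanism uses a learned, context-aware distance inside multi-head attention, so the simple index-based distance here is only an illustration.

```python
import torch
import torch.nn.functional as F

def decayed_attention_weights(scores, decay_rate=0.5):
    """Sketch of the AKT intuition: attention over past interactions is
    down-weighted by an exponential decay in temporal distance.
    (Illustration only; AKT learns a context-aware distance.)

    scores: (seq_len,) raw attention scores of past steps w.r.t. the current one
    """
    seq_len = scores.shape[0]
    # Distance in steps from each past interaction to the current one
    distance = torch.arange(seq_len - 1, -1, -1, dtype=torch.float)
    decayed = scores - decay_rate * distance   # decay applied in log-space
    return F.softmax(decayed, dim=0)

w = decayed_attention_weights(torch.randn(6))
```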
Scarlatos, Brinton, and Lan (2022) added timing information and the use of resources such as the calculator
The additional information leads to better predictive performance
Most DKT-family papers report large improvements over previous algorithms, including other DKT-family algorithms
Improvements that seem to mostly or entirely dissipate in the next paper
Poor validation and over-fitting
A lot of DKT-family papers don’t use student-level cross-validation
A lot of DKT-family papers fit their own hyperparameters but use past hyperparameters for other algorithms
Gervet et al. (2020) compares KT algorithms on several data sets
Key findings
Different data sets have different winners
DKT-family performs better than other algorithms on large data sets, but worse on smaller data sets
DKT-family algorithms perform worse than LKT-family on data sets with very high numbers of practices per skill (i.e. language learning)
DKT-family algorithms predict better when the exact order of items matters (which can occur if items within a skill vary a lot)
DKT-family algorithms reach peak performance faster than other algorithms (also seen in Jiayi Zhang et al. (2021))
Schmucker et al. (2021) compares KT algorithms on four large datasets
Their feature-based logistic regression model outperformed all other approaches on nearly all datasets tested.
DKT was the best-performing algorithm on one dataset.
Later DKT-family variants were outperformed by standard DKT on all datasets.
Open-Ended Knowledge Tracing (Liu et al. (2022)) integrates KT with a generative model of student code responses
In order to generate predicted student code that makes specific predicted errors
Dozens of recent papers trying to get better results by adjusting the deep learning framework in various ways
Better results = higher AUC values for predictions of next-item correctness on test data in selected datasets.
As shown in Schmucker and Mitchell (2022), better results on some datasets do not always translate to better results on all datasets.
Is the prediction of next-problem correctness the right thing to fit on?
How can you show that one DKT-family algorithm is better than another one?
Every paper will claim great performance.
Look at the methods. Do they mention student-level cross-validation? Hyperparameter fitting procedures?
Look at the results; find an algorithm and dataset that were also tested in another paper. Check to see if the numbers match.
If you actually want to use the algorithm yourself, I’d go a little deeper.
Download the implementation and try to replicate a result.
Try running it on one of the smaller Assistments datasets, making sure to use student-level cross-validation.
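A sketch of what student-level cross-validation looks like with scikit-learn's GroupKFold; the fit/predict callables are stand-ins for whichever KT implementation you are evaluating.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

def student_level_cv_auc(X, y, student_ids, fit_fn, predict_fn, n_folds=5):
    """Student-level CV sketch: each student's interactions fall entirely in
    either the train or the test fold, so the model is always evaluated on
    unseen students. X, y, student_ids are numpy arrays; fit_fn/predict_fn
    wrap the KT implementation being tested."""
    aucs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(
            X, y, groups=student_ids):
        model = fit_fn(X[train_idx], y[train_idx])
        preds = predict_fn(model, X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], preds))
    return np.mean(aucs)
```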
One more note: Implementation performance will vary. Some implementations are much faster than others.
When might a DKT-family algorithm be the right choice?
You care about predicting next-problem correctness
You may have unreliable skill tags, or no skill tags at all
Your dataset has a reasonably balanced number of attempts per item/skill, or you don’t care as much about items/skills with fewer attempts
Your dataset has students working through material in predefined sequences
When might you prefer another algorithm family?
You want interpretable parameters
You have a small dataset (<1M interactions)
You want to add new items without refitting the model.
You want an algorithm with more thoroughly-understood and more consistent behavior.