KT Learning Lab 4: A Conceptual Overview
Deep Knowledge Tracing (DKT): based on long short-term memory (LSTM) networks
Fit to sequences of student performance across skills
Can fit very complex functions (a minimal sketch follows below)
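To make this concrete, here is a minimal sketch of a DKT-style model in PyTorch: each interaction is encoded as a one-hot over (skill, correctness) pairs, fed through an LSTM, and mapped to a per-skill probability of correctness at each step. The layer sizes and names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class DKTSketch(nn.Module):
    """Minimal DKT-style model: LSTM over (skill, correctness) one-hots."""
    def __init__(self, n_skills: int, hidden_size: int = 100):
        super().__init__()
        # Input: one-hot of size 2 * n_skills (skill id x correct/incorrect)
        self.lstm = nn.LSTM(input_size=2 * n_skills,
                            hidden_size=hidden_size, batch_first=True)
        # Output: predicted probability of correctness for every skill
        self.out = nn.Linear(hidden_size, n_skills)

    def forward(self, x):                  # x: (batch, seq_len, 2 * n_skills)
        h, _ = self.lstm(x)                # (batch, seq_len, hidden_size)
        return torch.sigmoid(self.out(h))  # (batch, seq_len, n_skills)

# Usage: predictions at step t are compared to the observed correctness
# of the skill actually attempted at step t + 1.
model = DKTSketch(n_skills=50)
x = torch.zeros(1, 10, 100)   # one fake student, 10 interactions
p_correct = model(x)          # (1, 10, 50)
```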
A range of knowledge tracing algorithms based on different variants of Deep Learning
Now literally hundreds of published variants
Most of them tiny tweaks to get tiny gains in performance
But in aggregate, there appear to be some real improvements to predictive performance (see the comparison in Gervet et al. (2020), for example)
We will discuss some of the key issues that researchers have tried to address, and what their approaches were.
Yeung and Yeung (2018) reported degenerate behavior for DKT
Getting answers right leads to lower knowledge
Wild swings in probability estimates in short periods of time
They proposed adding two types of regularization to moderate these swings (sketched in the loss below)
Increasing the weight placed on correctly predicting the current interaction
Reducing how much the model is allowed to change its estimates from one time step to the next
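A schematic of what such a regularized loss can look like, assuming predictions have already been gathered to the attempted skills; the weight values are placeholders, not the ones from Yeung and Yeung (2018).

```python
import torch
import torch.nn.functional as F

def dkt_plus_style_loss(pred_seq, pred_next, y_next, pred_curr, y_curr,
                        lambda_r=0.1, lambda_w=0.03):
    """Schematic DKT+-style loss for one student (weights are placeholders).

    pred_seq:  (seq_len, n_skills) full per-skill prediction vectors
    pred_next: (seq_len,) prediction for the skill attempted at the next step
    y_next:    (seq_len,) observed correctness at the next step
    pred_curr: (seq_len,) prediction for the skill attempted at this step
    y_curr:    (seq_len,) observed correctness at this step
    """
    loss_next = F.binary_cross_entropy(pred_next, y_next)  # original DKT objective
    loss_curr = F.binary_cross_entropy(pred_curr, y_curr)  # reconstruction term
    # Waviness penalty: discourage large step-to-step swings
    # in the full per-skill prediction vector
    waviness = (pred_seq[1:] - pred_seq[:-1]).abs().mean()
    return loss_next + lambda_r * loss_curr + lambda_w * waviness
```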
The DKT family generally predicts individual item correctness, not skill mastery.
Some variants estimate skill ability, but those skills are not always the same as the pre-defined skills, and the estimates may not be accurate.
Jiani Zhang et al. (2017) proposed an extension to DKT, called DKVMN, that fits an item-skill mapping too
Based on a Memory-Augmented Neural Network, which keeps an external memory matrix that the network updates and reads from (see the read-step sketch below)
Latent skills are “discovered” by the algorithm and difficult to interpret.
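A rough sketch of the DKVMN read step, with illustrative dimensions and random stand-ins for learned embeddings; the real model also performs an erase/add write to the value matrix after each response.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the DKVMN "read" step (dimensions are illustrative).
n_concepts, key_dim, value_dim = 20, 50, 100
key_matrix = torch.randn(n_concepts, key_dim)      # static latent concepts
value_matrix = torch.randn(n_concepts, value_dim)  # dynamic mastery memory

item_embedding = torch.randn(key_dim)              # embedding of the attempted item
# Correlation weight: how strongly this item loads on each latent concept
w = F.softmax(key_matrix @ item_embedding, dim=0)  # (n_concepts,)
# Read: a mastery summary used (with the item embedding) to predict correctness
read_content = w @ value_matrix                    # (value_dim,)
# The write step (not shown) uses the same weights w to erase/add to value_matrix.
```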
Lee and Yeung (2019) proposed an alternative to DKT, called KQN, that attempts to output more interpretable latent skill estimates
Again, uses an external memory network to represent skills
Also attempts to estimate the amount of information transfer between skills
Still not that interpretable
Yeung (2019) proposed an alternative to DKT, called Deep-IRT, that attempts to output more interpretable latent skill estimates
Again, uses an external memory network to represent skills
Fits separate networks to estimate student ability and item difficulty
Uses the estimated ability and difficulty to predict correctness with an item response theory model (sketched below).
Somewhat more interpretable (the IRT half, at least)
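The IRT half can be sketched in a few lines: ability and difficulty come from the two sub-networks, and correctness is predicted with a Rasch-style model. The fixed ability multiplier below follows the published description, but treat the exact value as an assumption to check against the paper.

```python
import torch

def deep_irt_style_prediction(student_ability, item_difficulty, scale=3.0):
    """Sketch of the IRT half of Deep-IRT: ability and difficulty come from
    separate small networks; correctness is predicted Rasch-style.
    The scale constant is an assumption; verify against Yeung (2019)."""
    return torch.sigmoid(scale * student_ability - item_difficulty)

# Usage: ability and difficulty would come from the two sub-networks
p = deep_irt_style_prediction(torch.tensor(0.4), torch.tensor(-0.2))
```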
Some deep learning-based algorithms attempt to estimate skill level.
Their skill estimates are rarely, if ever, compared to post-tests or other estimates of skill level.
(Most large datasets don’t have that data available)
Therefore, we don’t really know if the estimates are any good.
Scruggs, Baker, and McLaren (2019) proposed AOA, an extension to any knowledge tracing algorithm
Uses a human-derived skill-item mapping
Averages predicted performance on all items in a skill to produce a skill estimate (see the sketch below)
Led to successful prediction of post-tests outside the learning system
In unpublished work, I used DKVMN’s internal concept estimates to predict a post-test, but they were less predictive than skill estimates generated by AOA.
In Scruggs et al. (2023) internal skill estimates from Elo and BKT were outperformed by AOA skill estimates generated from those algorithms’ correctness predictions.
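A minimal sketch of the AOA idea with pandas; the column names and mapping here are hypothetical, and in practice predictions are generated for every item mapped to a skill, not only the attempted ones.

```python
import pandas as pd

def aoa_skill_estimates(item_predictions: pd.DataFrame,
                        item_to_skill: dict) -> pd.DataFrame:
    """Sketch of AOA: average a model's predicted correctness over all items
    mapped to a skill to get a per-student skill estimate.
    `item_predictions` is assumed to have columns
    ['student_id', 'item_id', 'p_correct'] (illustrative names)."""
    df = item_predictions.copy()
    df["skill"] = df["item_id"].map(item_to_skill)   # human-derived mapping
    return (df.groupby(["student_id", "skill"])["p_correct"]
              .mean()
              .rename("skill_estimate")
              .reset_index())
```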
Automated skill discovery would make it a lot easier to use knowledge tracing on data without skill tags.
It could also show relationships between skills.
Piech et al. (2015) mention that DKT accurately clustered items to skills in a synthetic data set.
Jiani Zhang et al. (2017) repeat the experiment for DKVMN and also show reasonable item clusters for Assistments data.
The figures shown in Jiani Zhang et al. (2017) use t-SNE (Van der Maaten and Hinton, 2008) to visualize neural network weights.
t-SNE is a very popular method, but the clusters it creates can be strongly influenced by the value of the perplexity parameter; lower values make t-SNE try harder to create clusters (illustrated in the snippet below).
In unpublished work, I used DKVMN on a large dataset with very reliable skill tags; the resultant clusters sometimes reflect the underlying skills, but sometimes do not.
DKVMN can cluster the exercises in the Synthetic-5 dataset into their five ground-truth concepts.
[Figures: t-SNE projections with perplexity=5 and perplexity=50; different colors are different skills]
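To see the perplexity caveat concretely, here is how one might project the same (stand-in) item embeddings at two perplexity values with scikit-learn; only the perplexity changes between the two projections.

```python
from sklearn.manifold import TSNE
import numpy as np

# Illustrative only: project hypothetical per-item network weights to 2-D
# with two different perplexity values; lower perplexity tends to produce
# tighter, more separated-looking clusters from the same weights.
item_weights = np.random.rand(100, 64)   # stand-in for learned item embeddings
low = TSNE(perplexity=5, random_state=0).fit_transform(item_weights)
high = TSNE(perplexity=50, random_state=0).fit_transform(item_weights)
```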
Finally, Karumbaiah, Ocumpaugh, and Baker (2022) found that DKVMN’s correctness predictions were more accurate when the model had no skill tags at all (it treated all items as belonging to the same skill) than when it had possibly-unreliable skill tags, or when it had accurate domain-level tags.
This suggests that deep learning algorithms may be well suited for data without good skill tags.
What information can DKT-family algorithms provide teachers?
What do you do for entirely new items?
Ding and Larson (2019) demonstrated theoretically that a lot of what DKT learns is how good a student is overall
They replicate that finding in a 2021 paper using a larger dataset.
Pandey and Karypis (2019) proposed a DKT variant, called SAKT, which fits attentional weights between exercises and more explicitly predicts performance on the current exercise from performance on past exercises
Gets a slightly better fit, and doubles down a little more on some limitations we’ve already discussed
Ghosh, Heffernan, and Lan (2020) proposed a DKT variant, called AKT, which
Explicitly stores and uses learner’s entire past practice history for each prediction
Uses an exponential decay curve to down-weight past actions (sketched below)
Uses Rasch-model embeddings to calculate item difficulty
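The decay idea can be sketched as follows. AKT's actual mechanism uses a learned, context-aware distance inside multi-head attention, so the simple index-based distance here is only an illustration.

```python
import torch
import torch.nn.functional as F

def decayed_attention_weights(scores, decay_rate=0.5):
    """Sketch of the AKT intuition: attention over past interactions is
    down-weighted by an exponential decay in temporal distance.
    (Illustration only; AKT learns a context-aware distance.)

    scores: (seq_len,) raw attention scores of past steps w.r.t. the current one
    """
    seq_len = scores.shape[0]
    # Distance in steps from each past interaction to the current one
    distance = torch.arange(seq_len - 1, -1, -1, dtype=torch.float)
    decayed = scores - decay_rate * distance   # decay applied in log-space
    return F.softmax(decayed, dim=0)

w = decayed_attention_weights(torch.randn(6))
```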
Scarlatos, Brinton, and Lan (2022) added timing information and the use of resources such as the calculator
The additional information leads to better predictive performance
Most DKT-family papers report large improvements over previous algorithms, including other DKT-family algorithms
Improvements that seem to mostly or entirely dissipate in the next paper
Poor validation and over-fitting
A lot of DKT-family papers don’t use student-level cross-validation
A lot of DKT-family papers fit their own hyperparameters but use past hyperparameters for other algorithms
Gervet et al. (2020) compares KT algorithms on several data sets
Key findings
Different data sets have different winners
DKT-family performs better than other algorithms on large data sets, but worse on smaller data sets
DKT-family algorithms perform worse than LKT-family on data sets with very high numbers of practices per skill (i.e. language learning)
DKT-family algorithms predict better when the exact order of items matters (which can occur if items within a skill vary a lot)
DKT-family algorithms reach peak performance faster than other algorithms (also seen in Jiayi Zhang et al. (2021))
Schmucker et al. (2021) compares KT algorithms on four large datasets
Their feature-based logistic regression model outperformed all other approaches on nearly all datasets tested.
DKT was the best-performing algorithm on one dataset.
Later DKT-family variants were outperformed by standard DKT on all datasets.
Open-Ended Knowledge Tracing (Liu et al. (2022)) integrates KT with a generative model of student code responses
In order to generate predicted student code that makes specific predicted errors
Dozens of recent papers trying to get better results by adjusting the deep learning framework in various ways
Better results = higher AUC values for predictions of next-item correctness on test data in selected datasets.
As shown in Schmucker and Mitchell (2022), better results on some datasets do not always translate to better results on all datasets.
Is the prediction of next-problem correctness the right thing to fit on?
How can you show that one DKT-family algorithm is better than another one?
Every paper will claim great performance.
Look at the methods. Do they mention student-level cross-validation? Hyperparameter fitting procedures?
Look at the results; find an algorithm and dataset that were also tested in another paper. Check to see if the numbers match.
If you actually want to use the algorithm yourself, I’d go a little deeper.
Download the implementation and try to replicate a result.
Try running it on one of the smaller Assistments datasets, making sure to use student-level cross-validation.
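A sketch of what student-level cross-validation looks like with scikit-learn's GroupKFold; the fit/predict callables are stand-ins for whichever KT implementation you are evaluating.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

def student_level_cv_auc(X, y, student_ids, fit_fn, predict_fn, n_folds=5):
    """Student-level CV sketch: each student's interactions fall entirely in
    either the train or the test fold, so the model is always evaluated on
    unseen students. X, y, student_ids are numpy arrays; fit_fn/predict_fn
    wrap the KT implementation being tested."""
    aucs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(
            X, y, groups=student_ids):
        model = fit_fn(X[train_idx], y[train_idx])
        preds = predict_fn(model, X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], preds))
    return np.mean(aucs)
```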
One more note: Implementation performance will vary. Some implementations are much faster than others.
When might a DKT-family algorithm be the right choice?
You care about predicting next-problem correctness
You may have unreliable skill tags, or no skill tags at all
Your dataset has a reasonably balanced number of attempts per item/skill, or you don’t care as much about items/skills with fewer attempts
Your dataset has students working through material in predefined sequences
When might you prefer another algorithm family?
You want interpretable parameters
You have a small dataset (<1M interactions)
You want to add new items without refitting the model.
You want an algorithm with more thoroughly-understood and more consistent behavior.