Machine Learning & AI

Overview

Machine learning and artificial intelligence are transforming educational assessment by enabling automated scoring, adaptive learning, and novel ways to extract meaning from complex educational data.

Our lab explores the intersection of psychometrics and ML/AI — both using ML methods to solve measurement problems, and applying psychometric thinking to evaluate AI-generated outputs rigorously.

Recent work has examined how large language models perform on cognitively demanding tasks in science, how ML can score constructed-response items, and how physiological computing data can be analyzed to understand student engagement.

Key Research Themes

Automated Scoring

Using NLP and ML to score constructed-response and argumentation tasks, then validating scoring quality with cognitive diagnostic models (CDMs) and item response theory (IRT).

AI Evaluation

Assessing the capabilities and limitations of AI systems (e.g., large language models) on cognitively demanding educational tasks.

Physiological Computing

Analyzing eye-tracking and other physiological data to understand student engagement, with attention to equity and diverse learner populations.

Process Data

Leveraging log files and interaction data from digital assessments to enrich diagnostic measurement beyond final responses.

Core Questions

How can ML/AI be used to improve educational assessment practices?
Can ML/AI scoring match human rater reliability and validity?
How do we ensure fairness when ML/AI is used in assessment?

Paper Highlights

Representative recent publications from this research area

Featured · Science & Education, 34, 649–670

Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?

Zhai, X., Nyaaba, M., & Ma, W. (2025) · DOI: 10.1007/s11191-024-00496-1

A central assumption in AI-assisted education is that AI struggles with highly cognitively demanding tasks just as humans do. This study tests that assumption directly by evaluating ChatGPT and GPT-4 on 54 NAEP science assessment items coded by cognitive complexity and dimensionality, then comparing AI performance to actual student scores.

Results show both AI tools consistently outperformed most students across Grades 4, 8, and 12. Crucially, while students required higher ability to succeed on more cognitively demanding items, the AI tools showed no such sensitivity — their performance was largely unaffected by increases in cognitive demand.

Findings: Current cognitively intensive tasks are insufficient for differentiating human from AI capability, calling for a shift toward creativity, critical thinking, and novel problem-solving in science education and assessment.


Example NAEP science constructed-response item used in the study

Featured · Research in Science Education, 53, 405–424

Assessing Argumentation Using Machine Learning and Cognitive Diagnostic Modeling

Zhai, X., Haudek, K., & Ma, W. (2023) · DOI: 10.1007/s11165-022-10062-w

Scientific argumentation requires coordinating three skills: making claims (C), using evidence (E), and providing warrants (W). Traditional total scores obscure which of these skills individual students actually possess. This study builds ML scoring algorithms for 19 constructed-response argumentation items, then uses CDM to reveal fine-grained cognitive patterns.

ML scoring achieved strong human agreement (average Cohen's κ = 0.73). CDM analysis identified 21 distinct skill-mastery profiles among 932 Grades 5–8 students — with the 9 most common profiles covering over 70% of the sample. As the radar charts illustrate, three students with the same total score can have very different argumentation skill patterns.
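For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's κ between human and machine scores might be computed. The function name `cohens_kappa` and the ten toy scores are illustrative assumptions, not data or code from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same responses (unweighted)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    n = len(rater_a)
    # Observed agreement: share of responses that received identical scores.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal score distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined here,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy scores (1 = skill shown, 0 = not shown); not study data.
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
machine = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(human, machine)
print(round(kappa, 3))
```

In this toy case the raters agree on 8 of 10 responses, but chance agreement is 0.52, so κ is noticeably lower than the raw agreement rate. That correction for chance is why κ, rather than percent agreement, is typically reported for scoring models.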

Findings: CDM provides actionable diagnostic information that total scores cannot — enabling targeted instructional feedback on claims, evidence, or warrants for each student.


Radar charts showing argumentation skill profiles for three students with the same total score

Featured · Journal of Educational and Behavioral Statistics (2025)

Incorporating Process Information Into Cognitive Diagnostic Models: A Four-Component Joint Modeling Approach

Rajeb, M., Ma, W., He, Q., & Shi, Q. (2025) · DOI: 10.3102/10769986251334788

Computer-based testing generates rich process data beyond correctness — including response time and action sequences. This paper proposes a four-component joint model that simultaneously uses item responses, response time, response similarity (how closely actions resemble a reference sequence), and response efficiency to estimate both ability and attribute mastery.
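To make the idea of response similarity concrete, here is a rough sketch that scores how closely an observed action sequence resembles a reference sequence. It uses the matching-subsequence ratio from Python's standard `difflib` module as a stand-in; the paper's actual similarity measure is not specified here and may differ. The function name `response_similarity` and the action labels are hypothetical.

```python
from difflib import SequenceMatcher


def response_similarity(actions, reference):
    """Similarity in [0, 1] between an observed action sequence and a reference.

    Assumption: difflib's ratio (2 * matches / total length) is used purely
    for illustration; it is not claimed to be the measure used in the paper.
    """
    matcher = SequenceMatcher(None, actions, reference, autojunk=False)
    return matcher.ratio()


# Hypothetical log-file actions for one item; labels are invented for illustration.
reference = ["open", "select_tool", "run", "submit"]
observed = ["open", "run", "submit"]

sim = response_similarity(observed, reference)
print(round(sim, 3))
```

Here three of the observed actions line up with the four-step reference, giving 2 × 3 / 7 ≈ 0.857. A sequence identical to the reference scores 1.0, and one sharing no actions scores 0.0.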

The model is estimated via MCMC and captures both person-level (ability, speed, similarity, efficiency) and item-level parameters within a unified framework. The path diagram illustrates how these four data streams feed into a coherent latent structure for cognitive diagnosis.

Findings: Incorporating process data substantially improves attribute classification accuracy, particularly when response actions carry information beyond correctness alone.

Data & Code (OSF)

Path diagram of the four-component joint CDM incorporating process information