Comparative Study of AI and Human Evaluation for Student Website Projects

Authors

Lidia Feklistova, University of Tartu
Artur Kašnikov, University of Tartu

DOI:

https://doi.org/10.34190/icair.5.1.4301

Keywords:

Website development, UI/UX design, code quality, artificial intelligence tools, automatic vs human evaluation, large language models

Abstract

Artificial intelligence tools based on large language models are increasingly being adopted across a wide range of fields, including higher education. Given the substantial workload often faced by educators, these tools offer promising potential to assist in the evaluation of student work. However, empirical research on their reliability, particularly in assessing practical, design-oriented assignments such as student-developed websites, remains limited. This study investigated the ability of various AI tools to evaluate student website projects and the consistency between the evaluations given by AI tools and human instructors (HIs) using the same criteria. Based on a literature review, a set of evaluation criteria was developed across three categories: user interface (UI), user experience (UX), and code quality. Each student project included a website prototype and the corresponding implementation code. Nine student projects were evaluated independently by seven AI tools and the HIs, using a Likert scale. To reduce variability, all AI tools were provided with the same evaluation prompt. The Wilcoxon signed-rank test revealed no statistically significant differences between AI tools and HIs for many of the evaluation criteria, suggesting general similarity in overall scoring. On the other hand, the Spearman correlation analysis revealed low consistency in how AI tools and HIs evaluated specific aspects of the projects. This indicates that while the evaluations provided by AI tools and HIs may appear similar at a surface level, their underlying judgment patterns, particularly regarding certain UI/UX design and code quality criteria, can diverge. However, ChatGPT-4.5 and ChatGPT-4o delivered particularly promising outcomes. From an educational perspective, the study results highlight the importance of treating AI tools as supportive assistants rather than autonomous evaluators, at least for now, especially in domains involving subjective or context-sensitive judgment. Identifying where AI tools' evaluations align or conflict with human judgment provides valuable insight into the appropriate use, potential, and limitations of such tools in academic evaluation.
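
To illustrate the analysis the abstract describes, the sketch below shows how paired Likert ratings for a single evaluation criterion could be compared with a Wilcoxon signed-rank test (do the two raters score at similar levels?) and a Spearman rank correlation (do they order the projects similarly?). This is a minimal sketch, not the authors' code; the score arrays are hypothetical, standing in for the nine project ratings given by one AI tool and by the human instructors.

```python
# Minimal sketch: comparing paired AI and human Likert ratings for one
# evaluation criterion across nine student projects (scores are hypothetical).
import numpy as np
from scipy.stats import wilcoxon, spearmanr

ai_scores = np.array([4, 3, 5, 4, 2, 5, 3, 5, 4])      # one AI tool's ratings
human_scores = np.array([5, 4, 4, 2, 3, 3, 4, 4, 5])   # human instructors' ratings

# Wilcoxon signed-rank test on the paired differences: a non-significant
# p-value suggests similar overall scoring levels between the two raters.
stat, p_value = wilcoxon(ai_scores, human_scores)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p_value:.3f}")

# Spearman rank correlation: a low rho indicates diverging judgment
# patterns (different project orderings) even when score levels match.
rho, p_rho = spearmanr(ai_scores, human_scores)
print(f"Spearman correlation: rho={rho:.2f}, p={p_rho:.3f}")
```

Repeating both tests per criterion and per AI tool, as the study's design implies, would yield the pattern reported above: similar score levels overall, but weak rank agreement on specific UI/UX and code quality criteria.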

Author Biographies

Lidia Feklistova, University of Tartu

Lidia Feklistova is a lecturer in informatics at the University of Tartu, Estonia. She earned her PhD in Computer Science in 2022. Her research focuses on integrating AI tools into education and computational thinking education. She also serves as a national co-organizer of the Bebras Challenge on informatics and computational thinking.

Artur Kašnikov, University of Tartu

Artur Kašnikov is a Master's student in the Software Engineering programme at the University of Tartu, Estonia. His interests centre on software development, and he is passionate about exploring artificial intelligence applications to create innovative, intelligent solutions for real-world challenges.

Published

2025-12-04