essay grading software

Beware of websites that resemble EssayGrader.ai. Look for the domain name “EssayGrader.ai” without dashes or hyphens to make sure you are on the correct website. We noticed an imitator trying to mimic our name and branding to mislead users. They are not affiliated with us and we cannot guarantee the authenticity or security of the payment methods, products, or services offered on such websites.

The fastest way to grade essays

EssayGrader is an AI powered grading assistant that gives high quality, specific and accurate writing feedback for essays. Thousands of teachers use EssayGrader to manage their grading load everyday. On average it takes a teacher 10 minutes to grade a single essay, with EssayGrader that time is cut down to 30 seconds That's a 95% reduction in the time it takes to grade an essay, with the same results.

How we've done

EssayGrader analyzes essays with the power of AI. Our software is trained on massive amounts of diverse text data, inlcuding books, articles and websites. This gives us the ability to provide accurate and detailed writing feedback to students and save teachers loads of time. We are the perfect AI powered grading assitant.

EssayGrader analyzes essays for grammar, punctuation, spelling, coherence, clarity and writing style errors. We provide detailed reports of the errors found and suggestions on how to fix those errors. Our error reports help speed up grading times by quickly highlighting mistakes made in the essay.

Bulk uploading

Uploading a single essay at a time, then waiting for it to complete is a pain. Bulk uploading allows you to upload an entire class worth of essays at a single time. You can work on other important tasks, come back in a few minutes to see all the essays perfectly graded.

Custom rubrics

We don't assume how you want to grade your essays. Instead, we provide you with the ability to create the same rubrics you already use. Those rubrics are then used to grade essays with the same grading criteria you are already accustomed to.

Sometimes you don't want to read a 5000 word essay and you'd just like a quick summary. Or maybe you're a student that needs to provide a summary of your essay to your teacher. We can help with our summarizer feature. We can provide a concise summary including the most important information and unique phrases.

AI detector

Our AI detector feature allows teachers to identify if an essay was written by AI or if only parts of it were written by AI. AI is becoming very popular and teachers need to be able to detect if essays are being written by students or AI.

Create classes to neatly organize your students essays. This is an essential feature when you have multiple classes and need to be able to track down students essays quickly.

Our mission

At EssayGrader, our mission is crystal clear: we're transforming the grading experience for teachers and students alike. Picture a space where teachers can efficiently and accurately grade essays, lightening their workload, while empowering students to enhance their writing skills.

Our software is a dynamic work in progress, a testament to our commitment to constant improvement. We're dedicated to refining and enhancing our platform continually. With each update, we strive to simplify the lives of both educators and learners, making the process of grading and writing essays smoother and more efficient.

We recognize the immense challenges teachers face – the heavy burdens, the long hours, and the often underappreciated efforts. EssayGrader is our way of shouldering some of that load. We are here to support you, to make your tasks more manageable, and to give you the tools you need to excel in your teaching journey.

Join the Newsletter

Subscribe to get our latest content by email.

The Criterion ® Online Writing Evaluation Service streamlines the student writing experience.

The Criterion ® Online Writing Evaluation Service is a web-based, instructor-led automated writing tool that helps students plan, write and revise their essays. It offers immediate feedback, freeing up valuable class time by allowing instructors to concentrate on higher-level writing skills.

Automatically scores students’ writing, so instructors save time

Provides immediate, online feedback to improve student writing skills

Includes detailed reports to help administrators track performance

It gives immediate feedback to students … which in turn motivates them to fix mistakes and raise their scores. Criterion can do in seconds what would take teachers hours if they hand-graded and edited essays.

~ Institution in Oregon

I have the opportunity to jump in while they are producing their first drafts to give them my own feedback without penalty — that is, before they submit for a grade. This program is especially helpful for my virtual students.

~ Institution in Florida

LICENSE THE CRITERION SERVICE IP

Enhance your offerings with the Criterion writing tool, part of the ETS ® Assessment Services, and become a distributor.

Assessment API
Higher Education
Personnel Testing
Contact/Demo

Transforming Assessment With AI-Powered Essay Scoring

IntelliMetric ® delivers accurate, consistent, and reliable scores for writing prompts for K-12 school districts, for higher education, for personnel testing, and as an API for software.

IntelliMetric ® Is The Gold Standard In AI Scoring Of Written Prompts

Trusted by educational institutions for surpassing human expert scoring, IntelliMetric ® is the go-to essay scoring platform for colleges and universities. IntelliMetric ® also aids in hiring by identifying candidates with excellent communication skills. As an assessment API, it enhances software products and increases product value. Unlock its potential today.

Proven Capabilities Across Markets

Whether you’re a hiring manager, school district administrator, or higher education administrator, IntelliMetric® can help you meet your organization’s goals. Click below to learn how it works for your industry.

Powerful Features

IntelliMetric ® delivers relevant scores and expert feedback tailored to writers’ capabilities. IntelliMetric ® scores prompts of varying lengths, providing invaluable insights for both testing and writing improvement. Don't settle for less; unleash the power of IntelliMetric ® for scoring excellence.

IntelliMetric ® scores writing instantly to help organizations save time and money evaluating writing with the same level of accuracy and consistency as expert human scorers.

IntelliMetric ® can be used to either test writing skills or improve instruction by providing detailed and accurate feedback according to a rubric.

IntelliMetric ® gives teachers and business professionals the bandwidth to focus on other more impactful job duties by scoring writing prompts that would otherwise take countless hours each day.

Using Legitimacy detection, IntelliMetric ® ensures all writing is original without any plagiarism - and doesn’t contain any messages that diverge from the assigned writing prompt.

Case Studies and Testimonials

Below are use cases and testimonials from customers who used IntelliMetric ® to reach their goals by automating the process of analyzing and grading written responses. These users found IntelliMetric ® to be a vital tool in providing instant feedback and scoring written responses.

Santa Ana School Disctrict Through the use of IntelliMetric ® , the Santa Ana school district was able to evaluate student writing and their students were able to use the instantaneous feedback to drastically improve their writing. The majority of teachers found IntelliMetric to benefit their classrooms as an instructional tool and found that students were more motivated to write.

Santa Ana School District

I have worked with Vantage Learning’s MY Access Automated Essay Scoring product both as a teacher and as a secondary ELA Curriculum specialist for grades 6-12. As a teacher, I immediately saw the benefits of the program. My students were more motivated to write because they knew that they would receive immediate feedback upon completion and submission of their essays. I also taught my students how to use the “My Tutor” and “My Editor” feedback in order to revise their essays. In the past, I felt like Sisyphus pushing a boulder up the hill, but with MY Access that heavy weight was lifted and my students were revising using specific feedback from My Access and consequently their writing drastically improved. When it comes to giving instantaneous feedback, MY Access performed more efficiently than myself.

More than 350 research studies conducted both in-house and by third-party experts have determined that IntelliMetric® has levels of consistency, accuracy and reliability that meet, and more often exceed, those of human expert scorers. After performing the practice DWA within SAUSD, I surveyed our English teachers and asked them about their recent experience with MY Access. Of the 85 teachers that responded to the survey, 82% of the teachers felt that their students’ experience with MY Access was either fair, good or very good. Similarly, 75% of the teachers thought the accuracy of the scoring was fair, good, or very good. Lastly, 77% of the teachers surveyed said that they would like to use MY Access in their classrooms as an instructional tool.

Many of the teachers’ responses to the MY Access survey included a call for a plagiarism detector. At the time, we had not opted for the addition of Cite Smart, the onboard plagiarism detector for MY Access This year, however, we will be using it and teachers across the district are excited to have this much needed tool available.

As a teacher and as an ELA curriculum specialist, I know of no other writing tool available to teachers that is more helpful than MY Access. When I tell teachers that we will be using MY Access for instruction and not just benchmarking this year, the most common reply I receive is “Oh great! That means that I can teach a lot more writing!” Think about it - if a secondary teacher has 175 students (35 students in 5 periods) and the teacher takes 10 minutes to provide feedback on each student’s paper, then it would take the teacher 29 hours (1,750 minutes) to give effective feedback to his/her students. MY Access is a writing teacher’s best friend!

Jason Crabbe

Secondary Language Arts Curriculum Specialist

Santa Ana Unified School District

Arkansas State University “In 2018, our students performed poorly in style and mechanics. Other forms of intervention have not proven successful. We piloted IntelliWriter and IntelliMetric and produced great results. The University leadership has since implemented a full-scale rollout across all campuses” - Dr. Melodie Philhours, Arkansas State University

The United Nations The United Nations utilizes IntelliMetric® via the Adaptera Platform for real-time evaluation of personnel writing assessments, offering a cost-effective solution to ensure communication skills in the workforce.

The United Nations, Department of Homeland Security and the world's largest online retail store all access IntelliMetric ® for immediate scoring of personnel writing assessments scored by IntelliMetric ® using the Adaptera Platform. In a world where clear concise communication is essential in the workforce using IntelliMetric ® to score writing assessments provides immediate, cost-effective evaluation of your employee skills.

IntelliMetric ® Offers Multilingual Scoring & Support

Score written responses in your native language with IntelliMetric! The automated scoring platform offers instant feedback and scoring in several languages to provide more accuracy and consistency than human scoring wherever you’re located. Accessible any time or place for educational or professional needs, IntelliMetric® is the perfect solution to your scoring requirements.

Intelliwriter

IntelliMetric-powered Solutions

Automated Essay Scoring using AI.

Online Writing Instruction and Assessment.

Adaptive Learning and Assessment Platform.

District Level Capture and Assessment of Student Writing.

AI-Powered Assessment and Instruction APIs.

Advanced AI-Driven Writing Mastery Tool.

Read The Blog

Revolutionize Your Writing Process with Smodin AI Grader: A Smarter Way to get feedback and achieve academic excellence!

For Students

Stay ahead of the curve, with objective feedback and tools to improve your writing.

Your Virtual Tutor

Harness the expertise of a real-time virtual teacher who will guide every paragraph in your writing process, ensuring you produce an A+ masterpiece in a fraction of the time.

Unbiased Evaluation

Ensure an impartial and objective assessment, removing any potential bias or subjectivity that may be an influence in traditional grading methods.

Perfect your assignments

With the “Write with AI” tool, transform your ideas into words with a few simple clicks. Excel at all your essays, assignments, reports etc. and witness your writing skills soar to new heights

For teachers

Revolutionize your Teaching Methods

Spend less on grading

Embrace the power of efficiency and instant feedback with our cutting-edge tool, designed to save you time while providing a fair and unbiased evaluation, delivering consistent and objective feedback.

Reach out to more students

Upload documents in bulk and establish your custom assessment criteria, ensuring a tailored evaluation process. Expand your reach and impact by engaging with more students.

Focus on what you love

Let AI Grading handle the heavy lifting of assessments for you. With its data-driven algorithms and standardized criteria, it takes care of all your grading tasks, freeing up your valuable time to do what you're passionate about: teaching.

Grader Rubrics

Pick the systematic frameworks that work as guidelines for assessing and evaluating the quality, proficiency, and alignment of your work, allowing for consistent and objective grading without any bias.

Analytical Thinking

Originality

Organization

Focus Point

Write with AI

Set your tone and keywords, and generate brilliance through your words

AI Grader Average Deviation from Real Grade

Our AI grader matches human scores 82% of the time* AI Scores are 100% consistent**

Deviation from real grade (10 point scale)

Graph: A dataset of essays were graded by professional graders on a range of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores. Deviation was defined by the variation of scores from the real score. The graph contains an overall score (the average of all criterias) as well as each individual criteria. The criteria are the premade criteria available on Smodin's AI Grader, listed in the graph as column headings. The custom rubrics were made using Smodin's AI Grader custom criteria generator to produce each criteria listed in Smodin's premade criterias (the same criteria as the column headings). The overall score for Smodin Premade Rubrics matched human scores 73% of the time with our advanced AI, while custom rubrics generated by Smodin's custom rubric generator matched human grades 82% of the time with our advanced AI. The average deviation from the real scores for all criteria is shown above.

* Rubrics created using Smodin's AI custom criteria matched human scores 82% of the time on the advanced AI setting. Smodin's premade criteria matched human scores 73% of the time. When the AI score differed from the human scores, 86% of the time the score only differed by 1 point on a 10 point scale.

** The AI grader provides 100% consistency, meaning that the same essay will produce the same score every time it's graded. All grades used in the data were repeated 3 times and produced 100% consistency across all 3 grading attempts.

AI Feedback

Unleash the Power of Personalized Feedback: Elevate Your Writing with the Ultimate Web-based Feedback Tool

Elevate your essay writing skills with Smodin AI Grader, and achieve the success you deserve with Smodin. the ultimate AI-powered essay grader tool. Whether you are a student looking to improve your grades or a teacher looking to provide valuable feedback to your students, Smodin has got you covered. Get objective feedback to improve your essays and excel at writing like never before! Don't miss this opportunity to transform your essay-writing journey and unlock your full potential.

Smodin AI Grader: The Best AI Essay Grader for Writing Improvement

As a teacher or as a student, writing essays can be a daunting task. It takes time, effort, and a lot of attention to detail. But what if there was a tool that could make the process easier? Meet Smodin Ai Grader, the best AI essay grader on the market that provides objective feedback and helps you to improve your writing skills.

Objective Feedback with Smodin - The Best AI Essay Grader

Traditional grading methods can often be subjective, with different teachers providing vastly different grades for the same piece of writing. Smodin eliminates this problem by providing consistent and unbiased feedback, ensuring that all students are evaluated fairly. With advanced algorithms, Smodin can analyze and grade essays in real-time, providing instant feedback on strengths and weaknesses.

Improve Your Writing Skills with Smodin - The Best AI Essay Grader

Smodin can analyze essays quickly and accurately, providing detailed feedback on different aspects of your writing, including structure, grammar, vocabulary, and coherence. By identifying areas that need improvement and providing suggestions on how to make your writing more effective, if Smodin detects that your essay has a weak thesis statement, it will provide suggestions on how to improve it. If it detects that your essay has poor grammar, it will provide suggestions on how to correct the errors. This makes it easier for you to make improvements to your essay and get better grades and become a better writer.

Smodin Ai Grader for Teachers - The Best Essay Analysis Tool

For teachers, Smodin can be a valuable tool for grading essays quickly and efficiently, providing detailed feedback to students, and helping them improve their writing skills. With Smodin Ai Grader, teachers can grade essays in real-time, identify common errors, and provide suggestions on how to correct them.

Smodin Ai Grader for Students - The Best Essay Analysis Tool

For students, Smodin can be a valuable tool for improving your writing skills and getting better grades. By analyzing your essay's strengths and weaknesses, Smodin can help you identify areas that need improvement and provide suggestions on how to make your writing more effective. This can be especially useful for students who are struggling with essay writing and need extra help and guidance.

Increase your productivity - The Best AI Essay Grader

Using Smodin can save you a lot of time and effort. Instead of spending hours grading essays manually or struggling to improve your writing without feedback, you can use Smodin to get instant and objective feedback, allowing you to focus on other important tasks.

Smodin is the best AI essay grader on the market that uses advanced algorithms to provide objective feedback and help improve writing skills. With its ability to analyze essays quickly and accurately, Smodin can help students and teachers alike to achieve better results in essay writing.

PAPER GRADER FOR TEACHERS

Measuring student progress

There are a multitude of different ways to measure student comprehension and teacher performance in the classroom, from student self-assessments to daily quizzes. Unfortunately, many of the most common performance measures involve a great deal of paper grading for teachers. The upside is that the data that can be gathered from the administration of frequent check-ins, formative assessments, and other testing tools can be helpful for much more than simply measuring achievement or assigning grades. They are also particularly useful for providing feedback and insight into what lessons are being learned, what content areas are creating stumbling blocks, and what teaching methods are proving to be most successful. It is getting those papers graded in a timely manner in order to access actionable information that is the challenge.

To ensure that teachers can respond to the formative feedback available, student assessments and other assignments must be scored promptly. That is why so many teachers rely on some type of paper grader app, software, or other paper grading system. Depending on the level of flexibility and functionality, this can greatly speed up the grading process. It is not uncommon for these test grader systems to involve pre-printed bubble sheets that allow teachers to use basic multiple choice or true/false question types that students can answer. These answer forms can be scanned by a special document camera or other scanning device that can fairly speedily score the correct responses. This expedites the time it takes for teachers to grade papers, but that is where the benefits tend to stop.

Before and after grading papers

Expediting the grading of quizzes, tests, homework, papers, and other assignments is certainly a valuable timesaver for teachers, but it is really only one step in a rather complex assessment administration process. Before assessments can be even given, they must first be created. But limiting the answer options to standard multiple choice means also limiting the usability for answers that do not neatly fit within a standard format, like number grids, rubrics, and handwritten responses. Additionally, it is important for today’s teachers to be able to connect test questions to one or more mandated or custom standards for tracking and reporting purposes. And, of course, once papers are scored, the data they provide still has to be compiled, analyzed, and recorded.

The data educators ultimately gather from grading papers often also needs to be able to be shared in a variety of different ways for parent/teacher conferences, Professional Learning Communities (PLCs), administrative reviews, mandated reporting requirements, etc. These preparatory and post-test tasks can take up an enormous amount of time above and beyond the actual grading, despite the paper grading component being the primary focus of attention in any assessment conversation. In fact, even the manual transfer of student scores into a teacher gradebook is a more burdensome task than most administrators appreciate according to a recent Tech & Learning report. In the larger context of assessment processes and corresponding teacher needs, any paper grader technology than just grades papers is only saving time at one stage of a multilayered, time-consuming process.

ASSISTANCE AT EVERY STEP

Fortunately, GradeCam took all of these needs, and more, into consideration in order to develop a truly comprehensive assessment solution that saves time at every opportunity in the assessment process. Sure, GradeCam is an easy grader system to use for quick scanning and scoring of assessments, but it’s also an amazing teacher app for creating tests , with enormous flexibility for customization. It easily accommodates traditional multiple choice and true/false type answer forms. Plus, it allows for customization of the contents inside the bubbles. It also greatly expands upon answer type possibilities to allow for customized answer grids, teacher-graded rubrics, and even handwritten answers that are perfect for fill-in-the-blanks, math solutions, and other short written responses. Grading papers by scanning and scoring has never been so fast and flexible!

GradeCam doesn’t stop there, though. It is also capable of tracking existing and custom standards, generating a variety of flexible data reports, and automatically recording grades in any digital gradebook with the touch of a button. (Literally, transferring an entire class of grades only requires a single key stroke.) Using GradeCam’s Student Portal, students are even able to grade their own papers as they complete their tests for immediate feedback. Of course, the teacher is able to choose whether students are able to view their missed questions or simply their final scores. But the fact that paper grading for an entire class of students could be completed the moment the last assessment is turned in – without the teacher ever touching them – is a pretty remarkable feat to witness.

Paper Grader Highlights:

See GradeCam in Action

No more road trips with stacks of paper in the passenger seat and sorted piles in the floorboard - Rodney Crouse, Teacher

Try GradeCam Gradient free for 60 days.

Find the solution that’s right for you.

Supported by

Essay-Grading Software Offers Professors a Break

Share full article

By John Markoff

April 4, 2013

Imagine taking a college exam, and, instead of handing in a blue book and getting a grade from a professor a few weeks later, clicking the “send” button when you are done and receiving a grade back instantly, your essay scored by a software program.

And then, instead of being done with that exam, imagine that the system would immediately let you rewrite the test to try to improve your grade.

EdX , the nonprofit enterprise founded by Harvard and the Massachusetts Institute of Technology to offer courses on the Internet, has just introduced such a system and will make its automated software available free on the Web to any institution that wants to use it. The software uses artificial intelligence to grade student essays and short written answers, freeing professors for other tasks.

The new service will bring the educational consortium into a growing conflict over the role of automation in education. Although automated grading systems for multiple-choice and true-false tests are now widespread, the use of artificial intelligence technology to grade essay answers has not yet received widespread endorsement by educators and has many critics.

Anant Agarwal, an electrical engineer who is president of EdX, predicted that the instant-grading software would be a useful pedagogical tool, enabling students to take tests and write essays over and over and improve the quality of their answers. He said the technology would offer distinct advantages over the traditional classroom system, where students often wait days or weeks for grades.

“There is a huge value in learning with instant feedback,” Dr. Agarwal said. “Students are telling us they learn much better with instant feedback.”

But skeptics say the automated system is no match for live teachers. One longtime critic, Les Perelman, has drawn national attention several times for putting together nonsense essays that have fooled software grading programs into giving high marks. He has also been highly critical of studies that purport to show that the software compares well to human graders.

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.

He is among a group of educators who last month began circulating a petition opposing automated assessment software. The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment , has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.

“Let’s face the realities of automatic essay scoring,” the group’s statement reads in part. “Computers cannot ‘read.’ They cannot measure the essentials of effective written communication: accuracy, reasoning, adequacy of evidence, good sense, ethical stance, convincing argument, meaningful organization, clarity, and veracity, among others.”

But EdX expects its software to be adopted widely by schools and universities. EdX offers free online classes from Harvard, M.I.T. and the University of California, Berkeley; this fall, it will add classes from Wellesley, Georgetown and the University of Texas. In all, 12 universities participate in EdX, which offers certificates for course completion and has said that it plans to continue to expand next year, including adding international schools.

The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.

The software will assign a grade depending on the scoring system created by the teacher, whether it is a letter grade or numerical rank. It will also provide general feedback, like telling a student whether an answer was on topic or not.

Dr. Agarwal said he believed that the software was nearing the capability of human grading.

“This is machine learning and there is a long way to go, but it’s good enough and the upside is huge,” he said. “We found that the quality of the grading is similar to the variation you find from instructor to instructor.”

EdX is not the first to use automated assessment technology, which dates to early mainframe computers in the 1960s. There is now a range of companies offering commercial programs to grade written test answers, and four states — Louisiana, North Dakota, Utah and West Virginia — are using some form of the technology in secondary schools. A fifth, Indiana, has experimented with it. In some cases the software is used as a “second reader,” to check the reliability of the human graders.

But the growing influence of the EdX consortium to set standards is likely to give the technology a boost. On Tuesday, Stanford announced that it would work with EdX to develop a joint educational system that will incorporate the automated assessment technology.

Two start-ups, Coursera and Udacity , recently founded by Stanford faculty members to create “massive open online courses,” or MOOCs, are also committed to automated assessment systems because of the value of instant feedback.

“It allows students to get immediate feedback on their work, so that learning turns into a game, with students naturally gravitating toward resubmitting the work until they get it right,” said Daphne Koller, a computer scientist and a founder of Coursera.

Last year the Hewlett Foundation, a grant-making organization set up by one of the Hewlett-Packard founders and his wife, sponsored two $100,000 prizes aimed at improving software that grades essays and short answers. More than 150 teams entered each category. A winner of one of the Hewlett contests, Vik Paruchuri, was hired by EdX to help design its assessment software.

“One of our focuses is to help kids learn how to think critically,” said Victor Vuchic, a program officer at the Hewlett Foundation. “It’s probably impossible to do that with multiple-choice tests. The challenge is that this requires human graders, and so they cost a lot more and they take a lot more time.”

Mark D. Shermis, a professor at the University of Akron in Ohio, supervised the Hewlett Foundation’s contest on automated essay scoring and wrote a paper about the experiment. In his view, the technology — though imperfect — has a place in educational settings.

With increasingly large classes, it is impossible for most teachers to give students meaningful feedback on writing assignments, he said. Plus, he noted, critics of the technology have tended to come from the nation’s best universities, where the level of pedagogy is much better than at most schools.

“Often they come from very prestigious institutions where, in fact, they do a much better job of providing feedback than a machine ever could,” Dr. Shermis said. “There seems to be a lack of appreciation of what is actually going on in the real world.”

Explore Our Coverage of Artificial Intelligence

News and Analysis

Saudi Arabia is plowing money into glitzy events, computing power and artificial intelligence research, putting it in the middle of an escalating U.S.-China struggle for technological influence.

Microsoft gave more signs that its hefty investments in A.I. were beginning to bear fruit, as it reported a 17 percent jump in revenue and a 20 percent increase in profit for the first three months of the year.

Meta projected that revenue for the current quarter would be lower than what Wall Street anticipated and said it would spend billions of dollars more on its artificial intelligence efforts, even as it reported robust revenue and profits for the first three months of the year.

The Age of A.I.

A new category of apps promises to relieve parents of drudgery, with an assist from A.I . But a family’s grunt work is more human, and valuable, than it seems.

Despite Mark Zuckerberg’s hope for Meta’s A.I. assistant to be the smartest , it struggles with facts, numbers and web search.

Much as ChatGPT generates poetry, a new A.I. system devises blueprints for microscopic mechanisms that can edit your DNA.

Could A.I. change India’s elections? Avatars are addressing voters by name, in whichever of India’s many languages they speak. Experts see potential for misuse in a country already rife with disinformation.

Which A.I. system writes the best computer code or generates the most realistic image? Right now, there’s no easy way to answer those questions, our technology columnist writes .

Introducing Gradescope

Gradescope is the transformative paper-to-digital grading platform that revolutionizes the way you approach grading and assessment.

Discover the power of Gradescope

With Gradescope, you can unlock the power of scalable reporting, harness the potential of data collection, and seamlessly integrate our platform with other crucial institutional systems, such as Learning Management Systems (LMSs). This seamless integration ensures that your institution's investment is maximized, and central support becomes a cornerstone of your grading workflows.

The core of Gradescope's mission is elevating both instructors and students. With our pioneering grading platform, instructors become more impactful, delivering faster, clearer, and more consistent feedback. Anonymous grading reduces unintended bias, fostering an environment of equitable evaluation. Customized accommodations ensure that every student has the opportunity to excel, driving a level playing field.

Unveiling benefits at every turn

With Gradescope’s grading platform, instructors and administrators can:

By utilizing "show your work" assessments, instructors gain swift insights into learning outcomes, aligning seamlessly with accreditation standards and institutional objectives.

Embrace electronic returns that protect FERPA-sensitive data, ensuring a secure environment for both students and instructors.

Sync rosters and grades effortlessly between Gradescope and your Learning Management System (LMS), ensuring a seamless experience for instructors and students alike.

Experience Gradescope for yourself

By completing this form, you agree to Turnitin's Privacy Policy . Turnitin uses the information you provide to contact you with relevant information. You may unsubscribe from these communications at any time.

Recaptcha Error

What is automated essay scoring?

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it’s been around far longer than “machine learning” and “artificial intelligence” have been buzzwords in the general public! The field of psychometrics has been doing such groundbreaking work for decades.

So how does AES work, and how can you apply it?

The first and most critical thing to know is that there is not an algorithm that “reads” the student essays. Instead, you need to train an algorithm. That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them in an essay scoring system. You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm. Data scientists use the term train the model , which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

There are three steps for automated essay scoring:

Establish your data set (collate student essays and grade them).
Determine the features (predictor variables that you want to pick up on).
Train the machine learning model.

Here’s an extremely oversimplified example:

You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them “credit” in the model if they use words like: Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, St. Helena. You might also add other Features such as Word Count, number of grammar errors, number of spelling errors, etc.
You create a map of which students used each of these words, as 0/1 indicator variables. You can then fit a multiple regression with 7 predictor variables (did they use each of the 7 words) and the 5 point scale as your criterion variable. You can then use this model to predict each student’s score from just their essay text.

Obviously, this example is too simple to be of use, but the same general idea is done with massive, complex studies. The establishment of the core features (predictive variables) can be much more complex, and models are going to be much more complex than multiple regression (neural networks, random forests, support vector machines).

Here’s an example of the very start of a data matrix for features, from an actual student essay. Imagine that you also have data on the final scores, 0 to 5 points. You can see how this is then a regression situation.

How do you score the essay?

If they are on paper, then automated essay scoring won’t work unless you have an extremely good software for character recognition that converts it to a digital database of text. Most likely, you have delivered the exam as an online assessment and already have the database. If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics. An example of our FastTest platform is provided below.

Some rubrics you might use:

Supporting arguments
Organization
Vocabulary / word choice

How do you pick the Features?

This is one of the key research problems. In some cases, it might be something similar to the Napoleon example. Suppose you had a complex item on Accounting, where examinees review reports and spreadsheets and need to summarize a few key points. You might pull out a few key terms as features (mortgage amortization) or numbers (2.375%) and consider them to be Features. I saw a presentation at Innovations In Testing 2022 that did exactly this. Think of them as where you are giving the students “points” for using those keywords, though because you are using complex machine learning models, it is not simply giving them a single unit point. It’s contributing towards a regression-like model with a positive slope.

In other cases, you might not know. Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday. You have no idea what they will write about. But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with that. Maybe words like “jetlag” or “edification” show up in students that tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores. The AI might also pick up on spelling errors. I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (name of an American rap artist) instead of “ludicrous” when trying to describe an argument. They had literally never seen the word used or spelled correctly. Maybe the AI model finds to give that a negative weight. That’s the next section!

How do you train a model?

Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options. This is where more research is required. What model works the best on your particular essay, and doesn’t take 5 days to run on your data set? That’s for you to figure out. There is a trade-off between simplicity and accuracy. Complex models might be accurate but take days to run. A simpler model might take 2 hours but with a 5% drop in accuracy. It’s up to you to evaluate.

If you have experience with Python and R, you know that there are many packages which provide this analysis out of the box – it is a matter of selecting a model that works.

How well does automated essay scoring work?

Well, as psychometricians love to say, “it depends.” You need to do the model fitting research for each prompt and rubric. It will work better for some than others. The general consensus in research is that AES algorithms work as well as a second human, and therefore serve very well in that role. But you shouldn’t use them as the only score; of course, that’s impossible in many cases.

Here’s a graph from some research we did on our algorithm, showing the correlation of human to AES. The three lines are for the proportion of sample used in the training set; we saw decent results from only 10% in this case! Some of the models correlated above 0.80 with humans, even though this is a small data set. We found that the Cubist model took a fraction of the time needed by complex models like Neural Net or Random Forest; in this case it might be sufficiently powerful.

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market. Some are standalone, some are integrated with a human-based essay scoring platform. ASC’s platform for automated essay scoring is SmartMarq; click here to learn more . It is currently in a standalone approach like you see below, making it extremely easy to use. It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more? Contact us to request a demonstration .

Latest Posts

Nathan Thompson, PhD

Latest posts by nathan thompson, phd ( see all ).

What is a T score? - April 15, 2024
Item Review Workflow for Exam Development - April 8, 2024
Likert Scale Items - February 9, 2024

Online Assessment

Psychometrics.

An automated essay scoring systems: a systematic literature review

Published: 23 September 2021
Volume 55 , pages 2495–2527, ( 2022 )

Cite this article

Dadi Ramesh ORCID: orcid.org/0000-0002-3967-8914 1 , 2 &
Suresh Kumar Sanampudi 3

35k Accesses

85 Citations

5 Altmetric

Explore all metrics

Assessment in the Education system plays a significant role in judging student performance. The present evaluation system is through human assessment. As the number of teachers' student ratio is gradually increasing, the manual evaluation process becomes complicated. The drawback of manual evaluation is that it is time-consuming, lacks reliability, and many more. This connection online examination system evolved as an alternative tool for pen and paper-based methods. Present Computer-based evaluation system works only for multiple-choice questions, but there is no proper evaluation system for grading essays and short answers. Many researchers are working on automated essay grading and short answer scoring for the last few decades, but assessing an essay by considering all parameters like the relevance of the content to the prompt, development of ideas, Cohesion, and Coherence is a big challenge till now. Few researchers focused on Content-based evaluation, while many of them addressed style-based assessment. This paper provides a systematic literature review on automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used to evaluate automatic essay scoring and analyzed the limitations of the current studies and research trends. We observed that the essay evaluation is not done based on the relevance of the content and coherence.

Automated Essay Scoring Systems

Automated Essay Scoring System Based on Rubric

Avoid common mistakes on your manuscript.

1 Introduction

Due to COVID 19 outbreak, an online educational system has become inevitable. In the present scenario, almost all the educational institutions ranging from schools to colleges adapt the online education system. The assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remain a challenge. The education system is changing its shift to online-mode, like conducting computer-based exams and automatic evaluation. It is a crucial application related to the education domain, which uses natural language processing (NLP) and Machine Learning techniques. The evaluation of essays is impossible with simple programming languages and simple techniques like pattern matching and language processing. Here the problem is for a single question, we will get more responses from students with a different explanation. So, we need to evaluate all the answers concerning the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. ( 1973 ). PEG evaluates the writing characteristics such as grammar, diction, construction, etc., to grade the essay. A modified version of the PEG by Shermis et al. ( 2001 ) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. ( 1999 ) introduced an Intelligent Essay Assessor (IEA) by evaluating content using latent semantic analysis to produce an overall score. Powers et al. ( 2002 ) proposed E-rater and Intellimetric by Rudner et al. ( 2006 ) and Bayesian Essay Test Scoring System (BESTY) by Rudner and Liang ( 2002 ), these systems use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of the essay scoring systems in the 1990s followed traditional approaches like pattern matching and a statistical-based approach. Since the last decade, the essay grading systems started using regression-based and natural language processing techniques. AES systems like Dong et al. ( 2017 ) and others developed from 2014 used deep learning techniques, inducing syntactic and semantic features resulting in better results than earlier systems.

Ohio, Utah, and most US states are using AES systems in school education, like Utah compose tool, Ohio standardized test (an updated version of PEG), evaluating millions of student's responses every year. These systems work for both formative, summative assessments and give feedback to students on the essay. Utah provided basic essay evaluation rubrics (six characteristics of essay writing): Development of ideas, organization, style, word choice, sentence fluency, conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and designed an algorithm to evaluate essays on different domains and providing an opportunity for test-takers to improve their writing skills. In addition, they are current research content-based evaluation.

The evaluation of essay and short answer scoring should consider the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge. Proper assessment of the parameters mentioned above defines the accuracy of the evaluation system. But all these parameters cannot play an equal role in essay scoring and short answer scoring. In a short answer evaluation, domain knowledge is required, like the meaning of "cell" in physics and biology is different. And while evaluating essays, the implementation of ideas with respect to prompt is required. The system should also assess the completeness of the responses and provide feedback.

Several studies examined AES systems, from the initial to the latest AES systems. In which the following studies on AES systems are Blood ( 2011 ) provided a literature review from PEG 1984–2010. Which has covered only generalized parts of AES systems like ethical aspects, the performance of the systems. Still, they have not covered the implementation part, and it’s not a comparative study and has not discussed the actual challenges of AES systems.

Burrows et al. ( 2015 ) Reviewed AES systems on six dimensions like dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They have not covered feature extraction techniques and challenges in features extractions. Covered only Machine Learning models but not in detail. This system not covered the comparative analysis of AES systems like feature extraction, model building, and level of relevance, cohesion, and coherence not covered in this review.

Ke et al. ( 2019 ) provided a state of the art of AES system but covered very few papers and not listed all challenges, and no comparative study of the AES model. On the other hand, Hussein et al. in ( 2019 ) studied two categories of AES systems, four papers from handcrafted features for AES systems, and four papers from the neural networks approach, discussed few challenges, and did not cover feature extraction techniques, the performance of AES models in detail.

Klebanov et al. ( 2020 ). Reviewed 50 years of AES systems, listed and categorized all essential features that need to be extracted from essays. But not provided a comparative analysis of all work and not discussed the challenges.

This paper aims to provide a systematic literature review (SLR) on automated essay grading systems. An SLR is an Evidence-based systematic review to summarize the existing research. It critically evaluates and integrates all relevant studies' findings and addresses the research domain's specific research questions. Our research methodology uses guidelines given by Kitchenham et al. ( 2009 ) for conducting the review process; provide a well-defined approach to identify gaps in current research and to suggest further investigation.

We addressed our research method, research questions, and the selection process in Sect. 2 , and the results of the research questions have discussed in Sect. 3 . And the synthesis of all the research questions addressed in Sect. 4 . Conclusion and possible future work discussed in Sect. 5 .

2 Research method

We framed the research questions with PICOC criteria.

Population (P) Student essays and answers evaluation systems.

Intervention (I) evaluation techniques, data sets, features extraction methods.

Comparison (C) Comparison of various approaches and results.

Outcomes (O) Estimate the accuracy of AES systems,

Context (C) NA.

2.1 Research questions

To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1 what are the datasets available for research on automated essay grading?

The answer to the question can provide a list of the available datasets, their domain, and access to the datasets. It also provides a number of essays and corresponding prompts.

RQ2 what are the features extracted for the assessment of essays?

The answer to the question can provide an insight into various features so far extracted, and the libraries used to extract those features.

RQ3, which are the evaluation metrics available for measuring the accuracy of algorithms?

The answer will provide different evaluation metrics for accurate measurement of each Machine Learning approach and commonly used measurement technique.

RQ4 What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

It can provide insights into various Machine Learning techniques like regression models, classification models, and neural networks for implementing essay grading systems. The response to the question can give us different assessment approaches for automated essay grading systems.

RQ5 What are the challenges/limitations in the current research?

The answer to the question provides limitations of existing research approaches like cohesion, coherence, completeness, and feedback.

2.2 Search process

We conducted an automated search on well-known computer science repositories like ACL, ACM, IEEE Explore, Springer, and Science Direct for an SLR. We referred to papers published from 2010 to 2020 as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. Also, the availability of free data sets like Kaggle (2012), Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. ( 2011 ) led to research this domain.

Search Strings : We used search strings like “Automated essay grading” OR “Automated essay scoring” OR “short answer scoring systems” OR “essay scoring systems” OR “automatic essay evaluation” and searched on metadata.

2.3 Selection criteria

After collecting all relevant documents from the repositories, we prepared selection criteria for inclusion and exclusion of documents. With the inclusion and exclusion criteria, it becomes more feasible for the research to be accurate and specific.

Inclusion criteria 1 Our approach is to work with datasets comprise of essays written in English. We excluded the essays written in other languages.

Inclusion criteria 2 We included the papers implemented on the AI approach and excluded the traditional methods for the review.

Inclusion criteria 3 The study is on essay scoring systems, so we exclusively included the research carried out on only text data sets rather than other datasets like image or speech.

Exclusion criteria We removed the papers in the form of review papers, survey papers, and state of the art papers.

2.4 Quality assessment

In addition to the inclusion and exclusion criteria, we assessed each paper by quality assessment questions to ensure the article's quality. We included the documents that have clearly explained the approach they used, the result analysis and validation.

The quality checklist questions are framed based on the guidelines from Kitchenham et al. ( 2009 ). Each quality assessment question was graded as either 1 or 0. The final score of the study range from 0 to 3. A cut off score for excluding a study from the review is 2 points. Since the papers scored 2 or 3 points are included in the final evaluation. We framed the following quality assessment questions for the final study.

Quality Assessment 1: Internal validity.

Quality Assessment 2: External validity.

Quality Assessment 3: Bias.

The two reviewers review each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the final agreement between the two reviewers. The average resulted from the kappa score is 0.6942, a substantial agreement between the reviewers. The result of evolution criteria shown in Table 1 . After Quality Assessment, the final list of papers for review is shown in Table 2 . The complete selection process is shown in Fig. 1 . The total number of selected papers in year wise as shown in Fig. 2 .

Selection process

Year wise publications

3.1 What are the datasets available for research on automated essay grading?

To work with problem statement especially in Machine Learning and deep learning domain, we require considerable amount of data to train the models. To answer this question, we listed all the data sets used for training and testing for automated essay grading systems. The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) Yannakoudakis et al. ( 2011 ) developed corpora that contain 1244 essays and ten prompts. This corpus evaluates whether a student can write the relevant English sentences without any grammatical and spelling mistakes. This type of corpus helps to test the models built for GRE and TOFEL type of exams. It gives scores between 1 and 40.

Bailey and Meurers ( 2008 ), Created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems. The corpus consists of 566 responses from intermediate students. Mohler and Mihalcea ( 2009 ). Created a dataset for the computer science domain consists of 630 responses for data structure assignment questions. The scores are range from 0 to 5 given by two human raters.

Dzikovska et al. ( 2012 ) created a Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus consists of 56 questions and approximately 3000 responses from students in the electrical and electronics domain. The second one is the SCIENTSBANK(SemEval-2013) (Dzikovska et al. 2013a ; b ) corpus consists of 10,000 responses on 197 prompts on various science domains. The student responses ladled with "correct, partially correct incomplete, Contradictory, Irrelevant, Non-domain."

In the Kaggle (2012) competition, released total 3 types of corpuses on an Automated Student Assessment Prize (ASAP1) (“ https://www.kaggle.com/c/asap-sas/ ” ) essays and short answers. It has nearly 17,450 essays, out of which it provides up to 3000 essays for each prompt. It has eight prompts that test 7th to 10th grade US students. It gives scores between the [0–3] and [0–60] range. The limitations of these corpora are: (1) it has a different score range for other prompts. (2) It uses statistical features such as named entities extraction and lexical features of words to evaluate essays. ASAP + + is one more dataset from Kaggle. It is with six prompts, and each prompt has more than 1000 responses total of 10,696 from 8th-grade students. Another corpus contains ten prompts from science, English domains and a total of 17,207 responses. Two human graders evaluated all these responses.

Correnti et al. ( 2013 ) created a Response-to-Text Assessment (RTA) dataset used to check student writing skills in all directions like style, mechanism, and organization. 4–8 grade students give the responses to RTA. Basu et al. ( 2013 ) created a power grading dataset with 700 responses for ten different prompts from US immigration exams. It contains all short answers for assessment.

The TOEFL11 corpus Blanchard et al. ( 2013 ) contains 1100 essays evenly distributed over eight prompts. It is used to test the English language skills of a candidate attending the TOFEL exam. It scores the language proficiency of a candidate as low, medium, and high.

International Corpus of Learner English (ICLE) Granger et al. ( 2009 ) built a corpus of 3663 essays covering different dimensions. It has 12 prompts with 1003 essays that test the organizational skill of essay writing, and13 prompts, each with 830 essays that examine the thesis clarity and prompt adherence.

Argument Annotated Essays (AAE) Stab and Gurevych ( 2014 ) developed a corpus that contains 102 essays with 101 prompts taken from the essayforum2 site. It tests the persuasive nature of the student essay. The SCIENTSBANK corpus used by Sakaguchi et al. ( 2015 ) available in git-hub, containing 9804 answers to 197 questions in 15 science domains. Table 3 illustrates all datasets related to AES systems.

3.2 RQ2 what are the features extracted for the assessment of essays?

Features play a major role in the neural network and other supervised Machine Learning approaches. The automatic essay grading systems scores student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics and they are categorized into three groups. 1. statistical-based features Contreras et al. ( 2018 ); Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ) 2. Style-based (Syntax) features Cummins et al. ( 2016 ); Darwish and Mohamed ( 2020 ); Ke et al. ( 2019 ). 3. Content-based features Dong et al. ( 2017 ). A good set of features appropriate models evolved better AES systems. The vast majority of the researchers are using regression models if features are statistical-based. For Neural Networks models, researches are using both style-based and content-based features. The following table shows the list of various features used in existing AES Systems. Table 4 represents all set of features used for essay grading.

We studied all the feature extracting NLP libraries as shown in Fig. 3 . that are used in the papers. The NLTK is an NLP tool used to retrieve statistical features like POS, word count, sentence count, etc. With NLTK, we can miss the essay's semantic features. To find semantic features Word2Vec Mikolov et al. ( 2013 ), GloVe Jeffrey Pennington et al. ( 2014 ) is the most used libraries to retrieve the semantic text from the essays. And in some systems, they directly trained the model with word embeddings to find the score. From Fig. 4 as observed that non-content-based feature extraction is higher than content-based.

Usages of tools

Number of papers on content based features

3.3 RQ3 which are the evaluation metrics available for measuring the accuracy of algorithms?

The majority of the AES systems are using three evaluation metrics. They are (1) quadrated weighted kappa (QWK) (2) Mean Absolute Error (MAE) (3) Pearson Correlation Coefficient (PCC) Shehab et al. ( 2016 ). The quadratic weighted kappa will find agreement between human evaluation score and system evaluation score and produces value ranging from 0 to 1. And the Mean Absolute Error is the actual difference between human-rated score to system-generated score. The mean square error (MSE) measures the average squares of the errors, i.e., the average squared difference between the human-rated and the system-generated scores. MSE will always give positive numbers only. Pearson's Correlation Coefficient (PCC) finds the correlation coefficient between two variables. It will provide three values (0, 1, − 1). "0" represents human-rated and system scores that are not related. "1" represents an increase in the two scores. "− 1" illustrates a negative relationship between the two scores.

3.4 RQ4 what are the Machine Learning techniques being used for automatic essay grading, and how are they implemented?

After scrutinizing all documents, we categorize the techniques used in automated essay grading systems into four baskets. 1. Regression techniques. 2. Classification model. 3. Neural networks. 4. Ontology-based approach.

All the existing AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods viewed the AES system as either regression or classification task. The goal of the regression task is to predict the score of an essay. The classification task is to classify the essays belonging to (low, medium, or highly) relevant to the question's topic. Since the last three years, most AES systems developed made use of the concept of the neural network.

3.4.1 Regression based models

Mohler and Mihalcea ( 2009 ). proposed text-to-text semantic similarity to assign a score to the student essays. There are two text similarity measures like Knowledge-based measures, corpus-based measures. There eight knowledge-based tests with all eight models. They found the similarity. The shortest path similarity determines based on the length, which shortest path between two contexts. Leacock & Chodorow find the similarity based on the shortest path's length between two concepts using node-counting. The Lesk similarity finds the overlap between the corresponding definitions, and Wu & Palmer algorithm finds similarities based on the depth of two given concepts in the wordnet taxonomy. Resnik, Lin, Jiang&Conrath, Hirst& St-Onge find the similarity based on different parameters like the concept, probability, normalization factor, lexical chains. In corpus-based likeness, there LSA BNC, LSA Wikipedia, and ESA Wikipedia, latent semantic analysis is trained on Wikipedia and has excellent domain knowledge. Among all similarity scores, correlation scores LSA Wikipedia scoring accuracy is more. But these similarity measure algorithms are not using NLP concepts. These models are before 2010 and basic concept models to continue the research automated essay grading with updated algorithms on neural networks with content-based features.

Adamson et al. ( 2014 ) proposed an automatic essay grading system which is a statistical-based approach in this they retrieved features like POS, Character count, Word count, Sentence count, Miss spelled words, n-gram representation of words to prepare essay vector. They formed a matrix with these all vectors in that they applied LSA to give a score to each essay. It is a statistical approach that doesn’t consider the semantics of the essay. The accuracy they got when compared to the human rater score with the system is 0.532.

Cummins et al. ( 2016 ). Proposed Timed Aggregate Perceptron vector model to give ranking to all the essays, and later they converted the rank algorithm to predict the score of the essay. The model trained with features like Word unigrams, bigrams, POS, Essay length, grammatical relation, Max word length, sentence length. It is multi-task learning, gives ranking to the essays, and predicts the score for the essay. The performance evaluated through QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. ( 2016 ). Proposed a Ridge regression model to find short answer scoring with Question Demoting. Question Demoting is the new concept included in the essay's final assessment to eliminate duplicate words from the essay. The extracted features are Text Similarity, which is the similarity between the student response and reference answer. Question Demoting is the number of repeats in a student response. With inverse document frequency, they assigned term weight. The sentence length Ratio is the number of words in the student response, is another feature. With these features, the Ridge regression model was used, and the accuracy they got 0.887.

Contreras et al. ( 2018 ). Proposed Ontology based on text mining in this model has given a score for essays in phases. In phase-I, they generated ontologies with ontoGen and SVM to find the concept and similarity in the essay. In phase II from ontologies, they retrieved features like essay length, word counts, correctness, vocabulary, and types of word used, domain information. After retrieving statistical data, they used a linear regression model to find the score of the essay. The accuracy score is the average of 0.5.

Darwish and Mohamed ( 2020 ) proposed the fusion of fuzzy Ontology with LSA. They retrieve two types of features, like syntax features and semantic features. In syntax features, they found Lexical Analysis with tokens, and they construct a parse tree. If the parse tree is broken, the essay is inconsistent—a separate grade assigned to the essay concerning syntax features. The semantic features are like similarity analysis, Spatial Data Analysis. Similarity analysis is to find duplicate sentences—Spatial Data Analysis for finding Euclid distance between the center and part. Later they combine syntax features and morphological features score for the final score. The accuracy they achieved with the multiple linear regression model is 0.77, mostly on statistical features.

Süzen Neslihan et al. ( 2020 ) proposed a text mining approach for short answer grading. First, their comparing model answers with student response by calculating the distance between two sentences. By comparing the model answer with student response, they find the essay's completeness and provide feedback. In this approach, model vocabulary plays a vital role in grading, and with this model vocabulary, the grade will be assigned to the student's response and provides feedback. The correlation between the student answer to model answer is 0.81.

3.4.2 Classification based Models

Persing and Ng ( 2013 ) used a support vector machine to score the essay. The features extracted are OS, N-gram, and semantic text to train the model and identified the keywords from the essay to give the final score.

Sakaguchi et al. ( 2015 ) proposed two methods: response-based and reference-based. In response-based scoring, the extracted features are response length, n-gram model, and syntactic elements to train the support vector regression model. In reference-based scoring, features such as sentence similarity using word2vec is used to find the cosine similarity of the sentences that is the final score of the response. First, the scores were discovered individually and later combined two features to find a final score. This system gave a remarkable increase in performance by combining the scores.

Mathias and Bhattacharyya ( 2018a ; b ) Proposed Automated Essay Grading Dataset with Essay Attribute Scores. The first concept features selection depends on the essay type. So the common attributes are Content, Organization, Word Choice, Sentence Fluency, Conventions. In this system, each attribute is scored individually, with the strength of each attribute identified. The model they used is a random forest classifier to assign scores to individual attributes. The accuracy they got with QWK is 0.74 for prompt 1 of the ASAS dataset ( https://www.kaggle.com/c/asap-sas/ ).

Ke et al. ( 2019 ) used a support vector machine to find the response score. In this method, features like Agreeability, Specificity, Clarity, Relevance to prompt, Conciseness, Eloquence, Confidence, Direction of development, Justification of opinion, and Justification of importance. First, the individual parameter score obtained was later combined with all scores to give a final response score. The features are used in the neural network to find whether the sentence is relevant to the topic or not.

Salim et al. ( 2019 ) proposed an XGBoost Machine Learning classifier to assess the essays. The algorithm trained on features like word count, POS, parse tree depth, and coherence in the articles with sentence similarity percentage; cohesion and coherence are considered for training. And they implemented K-fold cross-validation for a result the average accuracy after specific validations is 68.12.

3.4.3 Neural network models

Shehab et al. ( 2016 ) proposed a neural network method that used learning vector quantization to train human scored essays. After training, the network can provide a score to the ungraded essays. First, we should process the essay to remove Spell checking and then perform preprocessing steps like Document Tokenization, stop word removal, Stemming, and submit it to the neural network. Finally, the model will provide feedback on the essay, whether it is relevant to the topic. And the correlation coefficient between human rater and system score is 0.7665.

Kopparapu and De ( 2016 ) proposed the Automatic Ranking of Essays using Structural and Semantic Features. This approach constructed a super essay with all the responses. Next, ranking for a student essay is done based on the super-essay. The structural and semantic features derived helps to obtain the scores. In a paragraph, 15 Structural features like an average number of sentences, the average length of sentences, and the count of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score. A similarity score is used as semantic features to calculate the overall score.

Dong and Zhang ( 2016 ) proposed a hierarchical CNN model. The model builds two layers with word embedding to represents the words as the first layer. The second layer is a word convolution layer with max-pooling to find word vectors. The next layer is a sentence-level convolution layer with max-pooling to find the sentence's content and synonyms. A fully connected dense layer produces an output score for an essay. The accuracy with the hierarchical CNN model resulted in an average QWK of 0.754.

Taghipour and Ng ( 2016 ) proposed a first neural approach for essay scoring build in which convolution and recurrent neural network concepts help in scoring an essay. The network uses a lookup table with the one-hot representation of the word vector of an essay. The final efficiency of the network model with LSTM resulted in an average QWK of 0.708.

Dong et al. ( 2017 ). Proposed an Attention-based scoring system with CNN + LSTM to score an essay. For CNN, the input parameters were character embedding and word embedding, and it has attention pooling layers and used NLTK to obtain word and character embedding. The output gives a sentence vector, which provides sentence weight. After CNN, it will have an LSTM layer with an attention pooling layer, and this final layer results in the final score of the responses. The average QWK score is 0.764.

Riordan et al. ( 2017 ) proposed a neural network with CNN and LSTM layers. Word embedding, given as input to a neural network. An LSTM network layer will retrieve the window features and delivers them to the aggregation layer. The aggregation layer is a superficial layer that takes a correct window of words and gives successive layers to predict the answer's sore. The accuracy of the neural network resulted in a QWK of 0.90.

Zhao et al. ( 2017 ) proposed a new concept called Memory-Augmented Neural network with four layers, input representation layer, memory addressing layer, memory reading layer, and output layer. An input layer represents all essays in a vector form based on essay length. After converting the word vector, the memory addressing layer takes a sample of the essay and weighs all the terms. The memory reading layer takes the input from memory addressing segment and finds the content to finalize the score. Finally, the output layer will provide the final score of the essay. The accuracy of essay scores is 0.78, which is far better than the LSTM neural network.

Mathias and Bhattacharyya ( 2018a ; b ) proposed deep learning networks using LSTM with the CNN layer and GloVe pre-trained word embeddings. For this, they retrieved features like Sentence count essays, word count per sentence, Number of OOVs in the sentence, Language model score, and the text's perplexity. The network predicted the goodness scores of each essay. The higher the goodness scores, means higher the rank and vice versa.

Nguyen and Dery ( 2016 ). Proposed Neural Networks for Automated Essay Grading. In this method, a single layer bi-directional LSTM accepting word vector as input. Glove vectors used in this method resulted in an accuracy of 90%.

Ruseti et al. ( 2018 ) proposed a recurrent neural network that is capable of memorizing the text and generate a summary of an essay. The Bi-GRU network with the max-pooling layer molded on the word embedding of each document. It will provide scoring to the essay by comparing it with a summary of the essay from another Bi-GRU network. The result obtained an accuracy of 0.55.

Wang et al. ( 2018a ; b ) proposed an automatic scoring system with the bi-LSTM recurrent neural network model and retrieved the features using the word2vec technique. This method generated word embeddings from the essay words using the skip-gram model. And later, word embedding is used to train the neural network to find the final score. The softmax layer in LSTM obtains the importance of each word. This method used a QWK score of 0.83%.

Dasgupta et al. ( 2018 ) proposed a technique for essay scoring with augmenting textual qualitative Features. It extracted three types of linguistic, cognitive, and psychological features associated with a text document. The linguistic features are Part of Speech (POS), Universal Dependency relations, Structural Well-formedness, Lexical Diversity, Sentence Cohesion, Causality, and Informativeness of the text. The psychological features derived from the Linguistic Information and Word Count (LIWC) tool. They implemented a convolution recurrent neural network that takes input as word embedding and sentence vector, retrieved from the GloVe word vector. And the second layer is the Convolution Layer to find local features. The next layer is the recurrent neural network (LSTM) to find corresponding of the text. The accuracy of this method resulted in an average QWK of 0.764.

Liang et al. ( 2018 ) proposed a symmetrical neural network AES model with Bi-LSTM. They are extracting features from sample essays and student essays and preparing an embedding layer as input. The embedding layer output is transfer to the convolution layer from that LSTM will be trained. Hear the LSRM model has self-features extraction layer, which will find the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. ( 2019 ) proposed two-stage learning. In the first stage, they are assigning a score based on semantic data from the essay. The second stage scoring is based on some handcrafted features like grammar correction, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Pedro Uria Rodriguez et al. ( 2019 ) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics from a sentence from both directions. And XLnet sequence to sequence learning model to extract features like the next sentence in an essay. With this pre-trained model, they attained coherence from the essay to give the final score. The average QWK score of the model is 75.5.

Xia et al. ( 2019 ) proposed a two-layer Bi-directional LSTM neural network for the scoring of essays. The features extracted with word2vec to train the LSTM and accuracy of the model in an average of QWK is 0.870.

Kumar et al. ( 2019 ) Proposed an AutoSAS for short answer scoring. It used pre-trained Word2Vec and Doc2Vec models trained on Google News corpus and Wikipedia dump, respectively, to retrieve the features. First, they tagged every word POS and they found weighted words from the response. It also found prompt overlap to observe how the answer is relevant to the topic, and they defined lexical overlaps like noun overlap, argument overlap, and content overlap. This method used some statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, statistics of the sentence, word length, and logical operator-based features. This method uses a random forest model to train the dataset. The data set has sample responses with their associated score. The model will retrieve the features from both responses like graded and ungraded short answers with questions. The accuracy of AutoSAS with QWK is 0.78. It will work on any topics like Science, Arts, Biology, and English.

Jiaqi Lun et al. ( 2020 ) proposed an automatic short answer scoring with BERT. In this with a reference answer comparing student responses and assigning scores. The data augmentation is done with a neural network and with one correct answer from the dataset classifying reaming responses as correct or incorrect.

Zhu and Sun ( 2020 ) proposed a multimodal Machine Learning approach for automated essay scoring. First, they count the grammar score with the spaCy library and numerical count as the number of words and sentences with the same library. With this input, they trained a single and Bi LSTM neural network for finding the final score. For the LSTM model, they prepared sentence vectors with GloVe and word embedding with NLTK. Bi-LSTM will check each sentence in both directions to find semantic from the essay. The average QWK score with multiple models is 0.70.

3.4.4 Ontology based approach

Mohler et al. ( 2011 ) proposed a graph-based method to find semantic similarity in short answer scoring. For the ranking of answers, they used the support vector regression model. The bag of words is the main feature extracted in the system.

Ramachandran et al. ( 2015 ) also proposed a graph-based approach to find lexical based semantics. Identified phrase patterns and text patterns are the features to train a random forest regression model to score the essays. The accuracy of the model in a QWK is 0.78.

Zupanc et al. ( 2017 ) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola ( 2017 ) recommended an ontology-based information extraction approach and domain-based ontology to find the score.

3.4.5 Speech response scoring

Automatic scoring is in two ways one is text-based scoring, other is speech-based scoring. This paper discussed text-based scoring and its challenges, and now we cover speech scoring and common points between text and speech-based scoring. Evanini and Wang ( 2013 ), Worked on speech scoring of non-native school students, extracted features with speech ratter, and trained a linear regression model, concluding that accuracy varies based on voice pitching. Loukina et al. ( 2015 ) worked on feature selection from speech data and trained SVM. Malinin et al. ( 2016 ) used neural network models to train the data. Loukina et al. ( 2017 ). Proposed speech and text-based automatic scoring. Extracted text-based features, speech-based features and trained a deep neural network for speech-based scoring. They extracted 33 types of features based on acoustic signals. Malinin et al. ( 2017 ). Wu Xixin et al. ( 2020 ) Worked on deep neural networks for spoken language assessment. Incorporated different types of models and tested them. Ramanarayanan et al. ( 2017 ) worked on feature extraction methods and extracted punctuation, fluency, and stress and trained different Machine Learning models for scoring. Knill et al. ( 2018 ). Worked on Automatic speech recognizer and its errors how its impacts the speech assessment.

3.4.5.1 The state of the art

This section provides an overview of the existing AES systems with a comparative study w. r. t models, features applied, datasets, and evaluation metrics used for building the automated essay grading systems. We divided all 62 papers into two sets of the first set of review papers in Table 5 with a comparative study of the AES systems.

3.4.6 Comparison of all approaches

In our study, we divided major AES approaches into three categories. Regression models, classification models, and neural network models. The regression models failed to find cohesion and coherence from the essay because it trained on BoW(Bag of Words) features. In processing data from input to output, the regression models are less complicated than neural networks. There are unable to find many intricate patterns from the essay and unable to find sentence connectivity. If we train the model with BoW features in the neural network approach, the model never considers the essay's coherence and coherence.

First, to train a Machine Learning algorithm with essays, all the essays are converted to vector form. We can form a vector with BoW and Word2vec, TF-IDF. The BoW and Word2vec vector representation of essays represented in Table 6 . The vector representation of BoW with TF-IDF is not incorporating the essays semantic, and it’s just statistical learning from a given vector. Word2vec vector comprises semantic of essay in a unidirectional way.

In BoW, the vector contains the frequency of word occurrences in the essay. The vector represents 1 and more based on the happenings of words in the essay and 0 for not present. So, in BoW, the vector does not maintain the relationship with adjacent words; it’s just for single words. In word2vec, the vector represents the relationship between words with other words and sentences prompt in multiple dimensional ways. But word2vec prepares vectors in a unidirectional way, not in a bidirectional way; word2vec fails to find semantic vectors when a word has two meanings, and the meaning depends on adjacent words. Table 7 represents a comparison of Machine Learning models and features extracting methods.

In AES, cohesion and coherence will check the content of the essay concerning the essay prompt these can be extracted from essay in the vector from. Two more parameters are there to access an essay is completeness and feedback. Completeness will check whether student’s response is sufficient or not though the student wrote correctly. Table 8 represents all four parameters comparison for essay grading. Table 9 illustrates comparison of all approaches based on various features like grammar, spelling, organization of essay, relevance.

3.5 What are the challenges/limitations in the current research?

From our study and results discussed in the previous sections, many researchers worked on automated essay scoring systems with numerous techniques. We have statistical methods, classification methods, and neural network approaches to evaluate the essay automatically. The main goal of the automated essay grading system is to reduce human effort and improve consistency.

The vast majority of essay scoring systems are dealing with the efficiency of the algorithm. But there are many challenges in automated essay grading systems. One should assess the essay by following parameters like the relevance of the content to the prompt, development of ideas, Cohesion, Coherence, and domain knowledge.

No model works on the relevance of content, which means whether student response or explanation is relevant to the given prompt or not if it is relevant to how much it is appropriate, and there is no discussion about the cohesion and coherence of the essays. All researches concentrated on extracting the features using some NLP libraries, trained their models, and testing the results. But there is no explanation in the essay evaluation system about consistency and completeness, But Palma and Atkinson ( 2018 ) explained coherence-based essay evaluation. And Zupanc and Bosnic ( 2014 ) also used the word coherence to evaluate essays. And they found consistency with latent semantic analysis (LSA) for finding coherence from essays, but the dictionary meaning of coherence is "The quality of being logical and consistent."

Another limitation is there is no domain knowledge-based evaluation of essays using Machine Learning models. For example, the meaning of a cell is different from biology to physics. Many Machine Learning models extract features with WordVec and GloVec; these NLP libraries cannot convert the words into vectors when they have two or more meanings.

3.5.1 Other challenges that influence the Automated Essay Scoring Systems.

All these approaches worked to improve the QWK score of their models. But QWK will not assess the model in terms of features extraction and constructed irrelevant answers. The QWK is not evaluating models whether the model is correctly assessing the answer or not. There are many challenges concerning students' responses to the Automatic scoring system. Like in evaluating approach, no model has examined how to evaluate the constructed irrelevant and adversarial answers. Especially the black box type of approaches like deep learning models provides more options to the students to bluff the automated scoring systems.

The Machine Learning models that work on statistical features are very vulnerable. Based on Powers et al. ( 2001 ) and Bejar Isaac et al. ( 2014 ), the E-rater was failed on Constructed Irrelevant Responses Strategy (CIRS). From the study of Bejar et al. ( 2013 ), Higgins and Heilman ( 2014 ), observed that when student response contain irrelevant content or shell language concurring to prompt will influence the final score of essays in an automated scoring system.

In deep learning approaches, most of the models automatically read the essay's features, and some methods work on word-based embedding and other character-based embedding features. From the study of Riordan Brain et al. ( 2019 ), The character-based embedding systems do not prioritize spelling correction. However, it is influencing the final score of the essay. From the study of Horbach and Zesch ( 2019 ), Various factors are influencing AES systems. For example, there are data set size, prompt type, answer length, training set, and human scorers for content-based scoring.

Ding et al. ( 2020 ) reviewed that the automated scoring system is vulnerable when a student response contains more words from prompt, like prompt vocabulary repeated in the response. Parekh et al. ( 2020 ) and Kumar et al. ( 2020 ) tested various neural network models of AES by iteratively adding important words, deleting unimportant words, shuffle the words, and repeating sentences in an essay and found that no change in the final score of essays. These neural network models failed to recognize common sense in adversaries' essays and give more options for the students to bluff the automated systems.

Other than NLP and ML techniques for AES. From Wresch ( 1993 ) to Madnani and Cahill ( 2018 ). discussed the complexity of AES systems, standards need to be followed. Like assessment rubrics to test subject knowledge, irrelevant responses, and ethical aspects of an algorithm like measuring the fairness of student response.

Fairness is an essential factor for automated systems. For example, in AES, fairness can be measure in an agreement between human score to machine score. Besides this, From Loukina et al. ( 2019 ), the fairness standards include overall score accuracy, overall score differences, and condition score differences between human and system scores. In addition, scoring different responses in the prospect of constructive relevant and irrelevant will improve fairness.

Madnani et al. ( 2017a ; b ). Discussed the fairness of AES systems for constructed responses and presented RMS open-source tool for detecting biases in the models. With this, one can change fairness standards according to their analysis of fairness.

From Berzak et al.'s ( 2018 ) approach, behavior factors are a significant challenge in automated scoring systems. That helps to find language proficiency, word characteristics (essential words from the text), predict the critical patterns from the text, find related sentences in an essay, and give a more accurate score.

Rupp ( 2018 ), has discussed the designing, evaluating, and deployment methodologies for AES systems. They provided notable characteristics of AES systems for deployment. They are like model performance, evaluation metrics for a model, threshold values, dynamically updated models, and framework.

First, we should check the model performance on different datasets and parameters for operational deployment. Selecting Evaluation metrics for AES models are like QWK, correlation coefficient, or sometimes both. Kelley and Preacher ( 2012 ) have discussed three categories of threshold values: marginal, borderline, and acceptable. The values can be varied based on data size, model performance, type of model (single scoring, multiple scoring models). Once a model is deployed and evaluates millions of responses every time for optimal responses, we need a dynamically updated model based on prompt and data. Finally, framework designing of AES model, hear a framework contains prompts where test-takers can write the responses. One can design two frameworks: a single scoring model for a single methodology and multiple scoring models for multiple concepts. When we deploy multiple scoring models, each prompt could be trained separately, or we can provide generalized models for all prompts with this accuracy may vary, and it is challenging.

4 Synthesis

Our Systematic literature review on the automated essay grading system first collected 542 papers with selected keywords from various databases. After inclusion and exclusion criteria, we left with 139 articles; on these selected papers, we applied Quality assessment criteria with two reviewers, and finally, we selected 62 writings for final review.

Our observations on automated essay grading systems from 2010 to 2020 are as followed:

The implementation techniques of automated essay grading systems are classified into four buckets; there are 1. regression models 2. Classification models 3. Neural networks 4. Ontology-based methodology, but using neural networks, the researchers are more accurate than other techniques, and all the methods state of the art provided in Table 3 .

The majority of the regression and classification models on essay scoring used statistical features to find the final score. It means the systems or models trained on such parameters as word count, sentence count, etc. though the parameters extracted from the essay, the algorithm are not directly training on essays. The algorithms trained on some numbers obtained from the essay and hear if numbers matched the composition will get a good score; otherwise, the rating is less. In these models, the evaluation process is entirely on numbers, irrespective of the essay. So, there is a lot of chance to miss the coherence, relevance of the essay if we train our algorithm on statistical parameters.

In the neural network approach, the models trained on Bag of Words (BoW) features. The BoW feature is missing the relationship between a word to word and the semantic meaning of the sentence. E.g., Sentence 1: John killed bob. Sentence 2: bob killed John. In these two sentences, the BoW is "John," "killed," "bob."

In the Word2Vec library, if we are prepared a word vector from an essay in a unidirectional way, the vector will have a dependency with other words and finds the semantic relationship with other words. But if a word has two or more meanings like "Bank loan" and "River Bank," hear bank has two implications, and its adjacent words decide the sentence meaning; in this case, Word2Vec is not finding the real meaning of the word from the sentence.

The features extracted from essays in the essay scoring system are classified into 3 type's features like statistical features, style-based features, and content-based features, which are explained in RQ2 and Table 3 . But statistical features, are playing a significant role in some systems and negligible in some systems. In Shehab et al. ( 2016 ); Cummins et al. ( 2016 ). Dong et al. ( 2017 ). Dong and Zhang ( 2016 ). Mathias and Bhattacharyya ( 2018a ; b ) Systems the assessment is entirely on statistical and style-based features they have not retrieved any content-based features. And in other systems that extract content from the essays, the role of statistical features is for only preprocessing essays but not included in the final grading.

In AES systems, coherence is the main feature to be considered while evaluating essays. The actual meaning of coherence is to stick together. That is the logical connection of sentences (local level coherence) and paragraphs (global level coherence) in a story. Without coherence, all sentences in a paragraph are independent and meaningless. In an Essay, coherence is a significant feature that is explaining everything in a flow and its meaning. It is a powerful feature in AES system to find the semantics of essay. With coherence, one can assess whether all sentences are connected in a flow and all paragraphs are related to justify the prompt. Retrieving the coherence level from an essay is a critical task for all researchers in AES systems.

In automatic essay grading systems, the assessment of essays concerning content is critical. That will give the actual score for the student. Most of the researches used statistical features like sentence length, word count, number of sentences, etc. But according to collected results, 32% of the systems used content-based features for the essay scoring. Example papers which are on content-based assessment are Taghipour and Ng ( 2016 ); Persing and Ng ( 2013 ); Wang et al. ( 2018a , 2018b ); Zhao et al. ( 2017 ); Kopparapu and De ( 2016 ), Kumar et al. ( 2019 ); Mathias and Bhattacharyya ( 2018a ; b ); Mohler and Mihalcea ( 2009 ) are used content and statistical-based features. The results are shown in Fig. 3 . And mainly the content-based features extracted with word2vec NLP library, but word2vec is capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other terms, but word2vec is capable of capturing the context word in a uni-direction either left or right. If a word has multiple meanings, there is a chance of missing the context in the essay. After analyzing all the papers, we found that content-based assessment is a qualitative assessment of essays.

On the other hand, Horbach and Zesch ( 2019 ); Riordan Brain et al. ( 2019 ); Ding et al. ( 2020 ); Kumar et al. ( 2020 ) proved that neural network models are vulnerable when a student response contains constructed irrelevant, adversarial answers. And a student can easily bluff an automated scoring system by submitting different responses like repeating sentences and repeating prompt words in an essay. From Loukina et al. ( 2019 ), and Madnani et al. ( 2017b ). The fairness of an algorithm is an essential factor to be considered in AES systems.

While talking about speech assessment, the data set contains audios of duration up to one minute. Feature extraction techniques are entirely different from text assessment, and accuracy varies based on speaking fluency, pitching, male to female voice and boy to adult voice. But the training algorithms are the same for text and speech assessment.

Once an AES system evaluates essays and short answers accurately in all directions, there is a massive demand for automated systems in the educational and related world. Now AES systems are deployed in GRE, TOEFL exams; other than these, we can deploy AES systems in massive open online courses like Coursera(“ https://coursera.org/learn//machine-learning//exam ”), NPTEL ( https://swayam.gov.in/explorer ), etc. still they are assessing student performance with multiple-choice questions. In another perspective, AES systems can be deployed in information retrieval systems like Quora, stack overflow, etc., to check whether the retrieved response is appropriate to the question or not and can give ranking to the retrieved answers.

5 Conclusion and future work

As per our Systematic literature review, we studied 62 papers. There exist significant challenges for researchers in implementing automated essay grading systems. Several researchers are working rigorously on building a robust AES system despite its difficulty in solving this problem. All evaluating methods are not evaluated based on coherence, relevance, completeness, feedback, and knowledge-based. And 90% of essay grading systems are used Kaggle ASAP (2012) dataset, which has general essays from students and not required any domain knowledge, so there is a need for domain-specific essay datasets to train and test. Feature extraction is with NLTK, WordVec, and GloVec NLP libraries; these libraries have many limitations while converting a sentence into vector form. Apart from feature extraction and training Machine Learning models, no system is accessing the essay's completeness. No system provides feedback to the student response and not retrieving coherence vectors from the essay—another perspective the constructive irrelevant and adversarial student responses still questioning AES systems.

Our proposed research work will go on the content-based assessment of essays with domain knowledge and find a score for the essays with internal and external consistency. And we will create a new dataset concerning one domain. And another area in which we can improve is the feature extraction techniques.

This study includes only four digital databases for study selection may miss some functional studies on the topic. However, we hope that we covered most of the significant studies as we manually collected some papers published in useful journals.

Adamson, A., Lamb, A., & December, R. M. (2014). Automated Essay Grading.

Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development

Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE

Alva-Manchego F, et al. (2019) EASSE: Easier Automatic Sentence Simplification Evaluation.” ArXiv abs/1908.04567 (2019): n. pag

Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115

Basu S, Jacobs C, Vanderwende L (2013) Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 1:391–402

Article Google Scholar

Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48-59.

Bejar I, et al. (2013) Length of Textual Response as a Construct-Irrelevant Response Strategy: The Case of Shell Language. Research Report. ETS RR-13-07.” ETS Research Report Series (2013): n. pag

Berzak Y, et al. (2018) “Assessing Language Proficiency from Eye Movements in Reading.” ArXiv abs/1804.07329 (2018): n. pag

Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013

Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).

Burrows S, Gurevych I, Stein B (2015) The eras and trends of automatic short answer grading. Int J Artif Intell Educ 25:60–117. https://doi.org/10.1007/s40593-014-0026-8

Cader, A. (2020, July). The Potential for the Use of Deep Neural Networks in e-Learning Student Evaluation with New Data Augmentation Method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.

Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications (2019): n. pag.

Chen M, Li X (2018) "Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, p 378–383, doi: https://doi.org/10.1109/IALP.2018.8629256

Chen Z, Zhou Y (2019) "Research on Automatic Essay Scoring of Composition Based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18, doi: https://doi.org/10.1109/ICAIBD.2019.8837007

Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1-6

Correnti R, Matsumura LC, Hamilton L, Wang E (2013) Assessing students’ skills at writing analytically in response to texts. Elem Sch J 114(2):142–177

Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.

Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications

Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102

Ding Y, et al. (2020) "Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input." In: Proceedings of the 28th International Conference on Computational Linguistics

Dong F, Zhang Y (2016) Automatic features for essay scoring–an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077

Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162

Dzikovska M, Nielsen R, Brew C, Leacock C, Gi ampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge

Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics

Educational Testing Service (2008) CriterionSM online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf .

Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).

Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 2, http://imej.wfu.edu/articles/1999/2/04/ index.asp

Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.

Higgins, D., & Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. Educational Measurement: Issues and Practice, 33(3), 36–46.

Horbach A, Zesch T (2019) The influence of variance in learner answers on automatic content scoring. Front Educ 4:28. https://doi.org/10.3389/feduc.2019.00028

https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables/attempt

Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208.

Ke Z, Ng V (2019) “Automated essay scoring: a survey of the state of the art.” IJCAI

Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994-4004).

Kelley K, Preacher KJ (2012) On effect size. Psychol Methods 17(2):137–152

Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering–a systematic literature review. Inf Softw Technol 51(1):7–15

Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing–50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).

Knill K, Gales M, Kyriakopoulos K, et al. (4 more authors) (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018.02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)

Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523

Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).

Kumar Y, et al. (2020) “Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing.” ArXiv abs/2007.06796

Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-Based Automated Essay Scoring Using Self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. https://doi.org/10.1007/978-3-030-01716-3_32

Liang G, On B, Jeong D, Kim H, Choi G (2018) Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry 10:682

Liua, H., Yeb, Y., & Wu, M. (2018, April). Ensemble Learning on Scoring Student Essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.

Liu J, Xu Y, Zhao L (2019) Automated Essay Scoring based on Two-Stage Learning. ArXiv, abs/1901.07744

Loukina A, et al. (2015) Feature selection for automated speech scoring.” BEA@NAACL-HLT

Loukina A, et al. (2017) “Speech- and Text-driven Features for Automated Scoring of English-Speaking Tasks.” SCNLP@EMNLP 2017

Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL

Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389-13396

Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).

Madnani N, et al. (2017b) “Building better open-source tools to support fairness in automated scoring.” EthNLP@EACL

Malinin A, et al. (2016) “Off-topic response detection for spontaneous spoken english assessment.” ACL

Malinin A, et al. (2017) “Incorporating uncertainty into deep learning for spoken language assessment.” ACL

Mathias S, Bhattacharyya P (2018a) Thank “Goodness”! A Way to Measure Style in Student Essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41

Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Mikolov T, et al. (2013) “Efficient Estimation of Word Representations in Vector Space.” ICLR

Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575

Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies p 752–762

Muangkammuen P, Fukumoto F (2020) Multi-task Learning for Automated Essay Scoring with Sentiment Analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123

Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.

Palma D, Atkinson J (2018) Coherence-based automatic essay assessment. IEEE Intell Syst 33(5):26–36

Parekh S, et al (2020) My Teacher Thinks the World Is Flat! Interpreting Automatic Essay Scoring Mechanism.” ArXiv abs/2012.13872 (2020): n. pag

Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

Persing I, Ng V (2013) Modeling thesis clarity in student essays. In:Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269

Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K (2001) Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser 2001(1):i–44

Google Scholar

Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002). Stumping e-rater: challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134.

Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106

Ramanarayanan V, et al. (2017) “Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions.” INTERSPEECH

Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168

Riordan B, Flor M, Pugh R (2019) "How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models."In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Rodriguez P, Jafari A, Ormerod CM (2019) Language models and Automated Essay Scoring. ArXiv, abs/1909.09482

Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).

Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).

Rupp A (2018) Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ 31:191–214

Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham

Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 conference of the North American Chapter of the association for computational linguistics: Human language technologies p 1049–1054

Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English Digital Essay Grader Using Machine Learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.

Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for Automated Essay Grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65-70

Shermis MD, Mzumara HR, Olson J, Harrington S (2001) On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ 26(3):247–259

Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56

Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075

Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 conference on empirical methods in natural language processing p 1882–1891

Tashu TM (2020) "Off-Topic Essay Detection Using C-BGRU Siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225, doi: https://doi.org/10.1109/ICSC.2020.00046

Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham

Tashu TM, Horváth T (2020) Semantic-Based Feedback Recommendation for Automatic Essay Evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham

Uto M, Okano M (2020) Robust Neural Automated Essay Scoring Using Item Response Theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham

Wang Z, Liu J, Dong R (2018a) Intelligent Auto-grading System. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE.

Wang Y, et al. (2018b) “Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning.” EMNLP

Zhu W, Sun Y (2020) Automated essay scoring system using multi-model Machine Learning, david c. wyld et al. (eds): mlnlp, bdiot, itccma, csity, dtmn, aifz, sigpro

Wresch W (1993) The Imminence of Grading Essays by Computer-25 Years Later. Comput Compos 10:45–58

Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.

Xia L, Liu J, Zhang Z (2019) Automatic Essay Scoring Model Based on Two-Layer Bi-directional Long-Short Term Memory Network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137

Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies p 180–189

Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale p 189–192

Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE.

Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72

Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 200-210).

Kumar, N., & Dey, L. (2013, November). Automatic Quality Assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.

Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125-129).

Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim I., Husna D., Dewi Purnamasari P. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm, 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 565–569. https://doi.org/10.1109/ISEMANTIC.2018.8549789 .

Sharma A., & Jayagopi D. B. (2018). Automated Grading of Handwritten Essays 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp 279–284. https://doi.org/10.1109/ICFHR-2018.2018.00056

Download references

Not Applicable.

Author information

Authors and affiliations.

School of Computer Science and Artificial Intelligence, SR University, Warangal, TS, India

Dadi Ramesh

Research Scholar, JNTU, Hyderabad, India

Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS, India

Suresh Kumar Sanampudi

You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dadi Ramesh .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 80 KB)

Rights and permissions.

Reprints and permissions

About this article

Ramesh, D., Sanampudi, S.K. An automated essay scoring systems: a systematic literature review. Artif Intell Rev 55 , 2495–2527 (2022). https://doi.org/10.1007/s10462-021-10068-2

Download citation

Published : 23 September 2021

Issue Date : March 2022

DOI : https://doi.org/10.1007/s10462-021-10068-2

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Short answer scoring
Essay grading
Natural language processing
Deep learning
Find a journal
Publish with us
Track your research

Essay-Grading Software Seen as Time-Saving Tool

Jeff Pence knows the best way for his 7th grade English students to improve their writing is to do more of it. But with 140 students, it would take him at least two weeks to grade a batch of their essays.

So the Canton, Ga., middle school teacher uses an online, automated essay-scoring program that allows students to get feedback on their writing before handing in their work.

“It doesn’t tell them what to do, but it points out where issues may exist,” said Mr. Pence, who says the a Pearson WriteToLearn program engages the students almost like a game.

With the technology, he has been able to assign an essay a week and individualize instruction efficiently. “I feel it’s pretty accurate,” Mr. Pence said. “Is it perfect? No. But when I reach that 67th essay, I’m not real accurate, either. As a team, we are pretty good.”

With the push for students to become better writers and meet the new Common Core State Standards, teachers are eager for new tools to help out. Pearson, which is based in London and New York City, is one of several companies upgrading its technology in this space, also known as artificial intelligence, AI, or machine-reading. New assessments to test deeper learning and move beyond multiple-choice answers are also fueling the demand for software to help automate the scoring of open-ended questions.

Critics contend the software doesn’t do much more than count words and therefore can’t replace human readers , so researchers are working hard to improve the software algorithms and counter the naysayers.

While the technology has been developed primarily by companies in proprietary settings, there has been a new focus on improving it through open-source platforms. New players in the market, such as the startup venture LightSide and edX , the nonprofit enterprise started by Harvard University and the Massachusetts Institute of Technology, are openly sharing their research. Last year, the William and Flora Hewlett Foundation sponsored an open-source competition to spur innovation in automated writing assessments that attracted commercial vendors and teams of scientists from around the world. (The Hewlett Foundation supports coverage of “deeper learning” issues in Education Week .)

“We are seeing a lot of collaboration among competitors and individuals,” said Michelle Barrett, the director of research systems and analysis for CTB/McGraw-Hill, which produces the Writing Roadmap for use in grades 3-12. “This unprecedented collaboration is encouraging a lot of discussion and transparency.”

Mark D. Shermis, an education professor at the University of Akron, in Ohio, who supervised the Hewlett contest, said the meeting of top public and commercial researchers, along with input from a variety of fields, could help boost performance of the technology. The recommendation from the Hewlett trials is that the automated software be used as a “second reader” to monitor the human readers’ performance or provide additional information about writing, Mr. Shermis said.

“The technology can’t do everything, and nobody is claiming it can,” he said. “But it is a technology that has a promising future.”

‘Hot Topic’

The first automated essay-scoring systems go back to the early 1970s, but there wasn’t much progress made until the 1990s with the advent of the Internet and the ability to store data on hard-disk drives, Mr. Shermis said. More recently, improvements have been made in the technology’s ability to evaluate language, grammar, mechanics, and style; detect plagiarism; and provide quantitative and qualitative feedback.

The computer programs assign grades to writing samples, sometimes on a scale of 1 to 6, in a variety of areas, from word choice to organization. The products give feedback to help students improve their writing. Others can grade short answers for content. To save time and money, the technology can be used in various ways on formative exercises or summative tests.

The Educational Testing Service first used its e-rater automated-scoring engine for a high-stakes exam in 1999 for the Graduate Management Admission Test, or GMAT, according to David Williamson, a senior research director for assessment innovation for the Princeton, N.J.-based company. It also uses the technology in its Criterion Online Writing Evaluation Service for grades 4-12.

Over the years, the capabilities changed substantially, evolving from simple rule-based coding to more sophisticated software systems. And statistical techniques from computational linguists, natural language processing, and machine learning have helped develop better ways of identifying certain patterns in writing.

But challenges remain in coming up with a universal definition of good writing, and in training a computer to understand nuances such as “voice.”

In time, with larger sets of data, more experts can identify nuanced aspects of writing and improve the technology, said Mr. Williamson, who is encouraged by the new era of openness about the research.

“It’s a hot topic,” he said. “There are a lot of researchers and academia and industry looking into this, and that’s a good thing.”

High-Stakes Testing

In addition to using the technology to improve writing in the classroom, West Virginia employs automated software for its statewide annual reading language arts assessments for grades 3-11. The state has worked with CTB/McGraw-Hill to customize its product and train the engine, using thousands of papers it has collected, to score the students’ writing based on a specific prompt.

“We are confident the scoring is very accurate,” said Sandra Foster, the lead coordinator of assessment and accountability in the West Virginia education office, who acknowledged facing skepticism initially from teachers. But many were won over, she said, after a comparability study showed that the accuracy of a trained teacher and the scoring engine performed better than two trained teachers. Training involved a few hours in how to assess the writing rubric. Plus, writing scores have gone up since implementing the technology.

Automated essay scoring is also used on the ACT Compass exams for community college placement, the new Pearson General Educational Development tests for a high school equivalency diploma, and other summative tests. But it has not yet been embraced by the College Board for the SAT or the rival ACT college-entrance exams.

The two consortia delivering the new assessments under the Common Core State Standards are reviewing machine-grading but have not committed to it.

Jeffrey Nellhaus, the director of policy, research, and design for the Partnership for Assessment of Readiness for College and Careers, or PARCC, wants to know if the technology will be a good fit with its assessment, and the consortium will be conducting a study based on writing from its first field test to see how the scoring engine performs.

Likewise, Tony Alpert, the chief operating officer for the Smarter Balanced Assessment Consortium, said his consortium will evaluate the technology carefully.

Open-Source Options

With his new company LightSide, in Pittsburgh, owner Elijah Mayfield said his data-driven approach to automated writing assessment sets itself apart from other products on the market.

“What we are trying to do is build a system that instead of correcting errors, finds the strongest and weakest sections of the writing and where to improve,” he said. “It is acting more as a revisionist than a textbook.”

The new software, which is available on an open-source platform, is being piloted this spring in districts in Pennsylvania and New York.

In higher education, edX has just introduced automated software to grade open-response questions for use by teachers and professors through its free online courses. “One of the challenges in the past was that the code and algorithms were not public. They were seen as black magic,” said company President Anant Argawal, noting the technology is in an experimental stage. “With edX, we put the code into open source where you can see how it is done to help us improve it.”

Still, critics of essay-grading software, such as Les Perelman, want academic researchers to have broader access to vendors’ products to evaluate their merit. Now retired, the former director of the MIT Writing Across the Curriculum program has studied some of the devices and was able to get a high score from one with an essay of gibberish.

“My main concern is that it doesn’t work,” he said. While the technology has some limited use with grading short answers for content, it relies too much on counting words and reading an essay requires a deeper level of analysis best done by a human, contended Mr. Perelman.

“The real danger of this is that it can really dumb down education,” he said. “It will make teachers teach students to write long, meaningless sentences and not care that much about actual content.”

Sign Up for EdWeek Update

Edweek top school jobs.

Sign Up & Sign In

Strengthen Writing Skills and Increase Student Confidence with MI Write

Our web-based learning environment saves teachers valuable time without sacrificing meaningful instruction and interaction with students. MI Write helps students in grades 3-12 to improve their writing through practice, timely feedback, and guided support anytime and anywhere.

How MI Write Works

MI Write is powered through our automated essay scoring engine (PEG), which instantly reviews essays and provides immediate feedback and recommendations to both students and teachers.

Engaging Content

Interactive lessons, prompts, and stimulus materials keep students engaged and interested.

Immediate Feedback

With instant scoring, suggestions for improvement, and recommended lessons, MI Write keeps your students on track in the writing and revision process.

Enhanced Collaboration

An updated peer review, instant messaging, and cross-curricular instruction encourages effective collaboration between students and teachers.

Admin Reports

MI Write gives you a bird’s eye view of progress and usage at the school, district, and state level through our wide selection of detailed reports. Click here to see how reports can change your instruction.

New to online writing or automated scoring? We can help!

Requesting a Trial

Do you want to try MI Write before you purchase a subscription? Then look no further! Free, 30-day trials are available for teachers with up to 30 students.

Scheduling a Demo

Want to see MI Write in action? One of our experts will guide you through a quick 20–30-minute demo and answer any questions you may have about the program.

Requesting a Quote

Whether you’re budgeting classroom tools or ready to get started with MI Write, we can help you with the next steps. Title 1 discounts are available!

If you would like to request a trial, schedule a demo, request a quote, or to request more information about the product, please send us an email at [email protected]

Our Industry-Leading Automated Essay Scoring System Inspires More Writing and Revision

Our approach

Responsibility
Infrastructure
Try Meta AI

IMAGES

5 Best Automated AI Essay Grader Software in 2024
70 Best Automated essay grading software AI tools
Best automated essay grading software in 2022
Creating and Grading Essays with an Exam Creation Software
5 Best Automated AI Essay Grader Software in 2024
ExamSoft

VIDEO

Essay Grading Tip ✏️
Essay Grading Demo
Essay Grading Scale and Writing Tips
AI Essay Grading and Suggestions
A Philosophy Professor Does a Deep Dive Into An A-grade Essay
AI-Powered Grading in eLearning: Rubric-Based Assessment with Articulate Storyline 360 & AIReady

COMMENTS

Online Essay Grading App that Scores Essays and Papers
GradeCam is a solution for teachers who want to score essays and papers using rubric forms and a live scanner. It allows teachers to create, fill, scan, and grade assignments with custom answers, instant feedback, and easy transfer to digital gradebooks.
EssayGrader
The fastest way to grade essays. EssayGrader is an AI powered grading assistant that gives high quality, specific and accurate writing feedback for essays. Thousands of teachers use EssayGrader to manage their grading load everyday. On average it takes a teacher 10 minutes to grade a single essay, with EssayGrader that time is cut down to 30 ...
5 Best Automated AI Essay Grader Software in 2024
Project Essay Grade by Measurement Incorporated (MI), is a great automated grading software that uses AI technology to read, understand, process and give you results. By the use of the advanced statistical techniques found in this software, PEG can analyze written prose, make calculations based on more than 300 measurements (fluency, diction ...
AI Essay Grader
In summary, ClassX's AI Essay Grader represents a groundbreaking leap in the evolution of educational assessment. By seamlessly integrating advanced AI technology with the art of teaching, this tool unburdens educators from the arduous task of essay grading, while maintaining the highest standards of accuracy and fairness.
About the e-rater Scoring Engine
The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer's grammar, mechanics, word use and complexity, style, organization and more.
Gradescope
Transform grading into learning from anywhere with Gradescope's modern assessment platform, efficient grading workflows, and actionable student data. ... Papers In-depth investigations into the pressing issues of education and technology today. ... ExamSoft is the leading provider of assessment software for on-campus and remote programs. Similarity
SmartMarq: Essay marking with rubrics and AI
SmartMarq will streamline your essay marking process. SmartMarq makes it easy to implement large-scale, professional essay scoring. Once raters are done, run the results through our AI to train a custom machine learning model for your data, obtaining a second "rater.". Note that our powerful AI scoring is customized, specific to each one of ...
The e-rater Scoring Engine
ETS is a global leader in educational assessment, measurement and learning science. Our AI technology, such as the e-rater ® scoring engine, informs decisions and creates opportunities for learners around the world. The e-rater engine automatically: assess and nurtures key writing skills. scores essays and provides feedback on writing using a ...
The Criterion Online Writing Evaluation Service
The Criterion ® Online Writing Evaluation Service is a web-based, instructor-led automated writing tool that helps students plan, write and revise their essays. It offers immediate feedback, freeing up valuable class time by allowing instructors to concentrate on higher-level writing skills. About the Criterion Service.
Home
Trusted by educational institutions for surpassing human expert scoring, IntelliMetric® is the go-to essay scoring platform for colleges and universities. IntelliMetric® also aids in hiring by identifying candidates with excellent communication skills. As an assessment API, it enhances software products and increases product value.
AI Grader
Our AI grader matches human scores 82% of the time*AI Scores are 100% consistent**. Standard AI Advanced AI. Deviation from real grade (10 point scale) Real grade. Graph: A dataset of essays were graded by professional graders on a range of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores.
The Paper Grading App for Teachers to Easily Grades Papers
Grading papers, reports, and essays is simple thanks to GradeCam's innovative assessment technology. Using the easy paper grader app, teachers can assess papers, quickly scan scores to give students prompt feedback, and automatically record grades and generate reporting. The app allows teachers more time to focus on improving student writing, grammar, spelling, and more.
Essay-Grading Software Offers Professors a Break
The software uses artificial intelligence to grade student essays and short written answers, freeing professors for other tasks. The new service will bring the educational consortium into a ...
Free Online Paper and Essay Checker
PaperRater's online essay checker is built for easy access and straightforward use. Get quick results and reports to turn in assignments and essays on time. 2. Advanced Checks. Experience in-depth analysis and detect even the most subtle errors with PaperRater's comprehensive essay checker and grader. 3.
Elevate your workflows with Gradescope's grading software NOA
The core of Gradescope's mission is elevating both instructors and students. With our pioneering grading platform, instructors become more impactful, delivering faster, clearer, and more consistent feedback. Anonymous grading reduces unintended bias, fostering an environment of equitable evaluation. Customized accommodations ensure that every ...
What is Automated Essay Scoring, Marking, Grading?
Nathan Thompson, PhDApril 25, 2023. Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it's been around far longer than "machine learning" and "artificial intelligence" have been buzzwords in the general public!
An automated essay scoring systems: a systematic literature review
This paper aims to provide a systematic literature review (SLR) on automated essay grading systems. An SLR is an Evidence-based systematic review to summarize the existing research. It critically evaluates and integrates all relevant studies' findings and addresses the research domain's specific research questions.
Automated essay scoring
Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting.It is a form of educational assessment and an application of natural language processing.Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6.
Essay-Grading Software Seen as Time-Saving Tool
Essay-Grading Software Seen as Time-Saving Tool. Jeff Pence knows the best way for his 7th grade English students to improve their writing is to do more of it. But with 140 students, it would take ...
New edX software would automate essay grading
New edX software would automate essay grading. Dive Summary: MOOC provider edX will release free software that grades student essays and short written answers using artificial intelligence. While automated grading software for multiple choice and true-false tests is widely used, systems that grade written answers have many critics who say such ...
MI Write
Strengthen Writing Skills and Increase Student Confidence with MI Write. Our web-based learning environment saves teachers valuable time without sacrificing meaningful instruction and interaction with students. MI Write helps students in grades 3-12 to improve their writing through practice, timely feedback, and guided support anytime and anywhere.
Automated Essay Scoring Software
I'm starting my eleventh year of teaching English Language Arts. I've graded thousands of essays, from 9th grade to 12th grade. The process of grading, for me, always starts with correcting grammar. I have a hard time judging an essay for its merits when it's a hot mess of grammar mistakes.
Introducing Meta Llama 3: The most capable openly available LLM to date
Today, we're introducing Meta Llama 3, the next generation of our state-of-the-art open source large language model. Llama 3 models will soon be available on AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, NVIDIA NIM, and Snowflake, and with support from hardware platforms offered by AMD, AWS, Dell, Intel, NVIDIA, and Qualcomm.