How We Verified 2,583 Claims Across 200 Topics

TL;DR

We tested Novavoy's AI-generated lessons across 200 topics and 10 subject categories. Accuracy rate: 99% across 2,583 individually verified claims.
Every lesson is built from verified knowledge sources before the AI writes anything. We call this Source-First architecture.
We looked and found no other AI learning app that publishes verified accuracy data. As far as we can tell, this is the first public benchmark of its kind in consumer EdTech.

The Problem

You've probably asked ChatGPT something and gotten an answer that sounded perfect. Confident tone. Clean formatting. Completely wrong.

AI hallucination — where language models generate plausible-sounding but false information — is a known problem. It's annoying when you're asking about a recipe. It's a real problem when you're trying to learn quantum mechanics, or the causes of World War I, or how compound interest works.

Most AI-powered learning tools don't acknowledge this. They don't publish error rates. They don't explain how they handle hallucination. They just... hope you don't notice.

We think that's a bad foundation for education.

If you're going to spend your time learning something, you should be able to trust what you're being taught. So we built a system to verify that trust, ran it at scale, and decided to publish the results — even though no one in our space does this.

Our Approach: Source-First Architecture

Most AI learning tools work like this: they prompt a language model, the model generates content, and the user gets whatever comes out. If the model hallucinates, the user learns something false.

We do it differently.

We build the knowledge base before we build the lesson.

Before our AI writes a single sentence about any topic, we retrieve verified information from multiple authoritative sources — Wikipedia, PubMed, Nature, MIT OpenCourseWare, Britannica, and other educational domains. This retrieved knowledge becomes the foundation the AI must build on.

Think of it like the difference between asking someone to give a lecture from memory versus giving them a stack of vetted research papers and saying "teach from these."

The AI still does the creative work — designing engaging explanations, crafting game questions, building scenarios. But it does that work while anchored to verified facts, not floating on whatever its training data vaguely remembers.

We call this Source-First architecture because the sources come first. Always.

Knowledge Catalog: Retrieve verified sources (source_retrieval)
Wikipedia, PubMed, Nature, MIT OCW, Britannica. 6-10 articles parsed and indexed per topic. Zero LLM cost.

Curriculum Design: Tag concepts to sources (Roman_army:1)
Chunk tags assigned at design time. Every concept mapped to source material. Zero hallucination surface.

Lesson Generation: Teach from sources, not memory (contradiction_check)
Deterministic retrieval by chunk ID. Post-generation contradiction checker. AI teaches from sources, not memory.
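To make "deterministic retrieval by chunk ID" concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Novavoy's implementation: the `Chunk` class, the `index_article` and `build_prompt_context` helpers, and the reading of `Roman_army:1` as "article name, then chunk number" are all hypothetical, and the chunk text is placeholder content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # assumed format: "<article>:<n>", e.g. "Roman_army:1"
    source: str    # where the article was retrieved from
    text: str


def index_article(article: str, source: str, paragraphs: list[str]) -> dict[str, Chunk]:
    """Split one retrieved article into chunks keyed by '<article>:<n>'."""
    return {
        f"{article}:{n}": Chunk(f"{article}:{n}", source, text)
        for n, text in enumerate(paragraphs, start=1)
    }


def build_prompt_context(catalog: dict[str, Chunk], chunk_tags: list[str]) -> str:
    """Fetch chunks by exact ID (no similarity search); fail on unknown tags."""
    missing = [tag for tag in chunk_tags if tag not in catalog]
    if missing:
        raise KeyError(f"concept tagged to unknown chunks: {missing}")
    return "\n".join(f"[{tag}] {catalog[tag].text}" for tag in chunk_tags)


# Placeholder article text, for illustration only.
catalog = index_article(
    "Roman_army",
    "Wikipedia",
    [
        "The Roman army was the armed forces of ancient Rome.",
        "A legion was the largest unit of the Roman army.",
    ],
)

# Hypothetical curriculum entry: a concept mapped to chunk tags at design time.
curriculum = {"What a legion was": ["Roman_army:2"]}
context = build_prompt_context(catalog, curriculum["What a legion was"])
print(context)
```

Because the lookup is a plain dictionary access, the same concept always receives the same source text, and a concept tagged to a chunk that does not exist raises an error instead of silently falling back to the model's memory.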

The Results

We ran a comprehensive accuracy benchmark across 200 topics spanning 10 subject categories. Here's what we found.

Overall

Claims verified: 2,583
Accuracy rate: 99%

Hallucination rate

Contradiction rate

By Category

Humanities: 100%
Physics: 100%
History: 99%
Technology: 99%
Business & Economics: 99%
Biology & Life Sciences: 99%
Psychology & Social Sciences: 99%
Earth & Environmental: 99%
Mathematics & Logic: 99%
Arts & Creative Fields: 99%

Two categories hit 100%. The other eight sit at 99%. Our goal is 100% across the board, but we're publishing the real numbers, not rounded-up ones.

How It Works

Step 1: Build the Knowledge Base

When you pick a topic, our system retrieves verified information from multiple authoritative sources. This isn't a single Google search — it's a structured retrieval process that pulls from educational databases, academic publishers, and curated reference works.

The result is a knowledge base specific to your topic, built from sources that have editorial oversight and fact-checking built into their own processes.

Step 2: Design the Curriculum

The AI designs your learning path — what concepts to cover, in what order, at what depth — using the knowledge base as its source of truth. The curriculum structure is generated, but the facts it's built around are retrieved, not generated.

Step 3: Ground Every Lesson

Every lesson goes through our source-first grounding process. That means every claim, example, and explanation in your lesson is anchored to the verified knowledge base. The AI can be creative in how it teaches. It cannot be creative with what it teaches.

100% of lessons go through this grounding process. Not most. All of them.

What We Check

Grounding the AI to verified sources handles most of the problem. But we added a second layer.

Post-generation contradiction detection scans every completed lesson for internal contradictions and factual inconsistencies. If a lesson says "water boils at 100 degrees Celsius at sea level" in one card and then implies a different temperature in a game question, the system catches it before you see it.
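The boiling-point example can be sketched as a small consistency check. This is a toy illustration under assumptions, not the production checker: the regular expression only recognizes sentences of the form "X boils at N degrees Celsius", and the `find_contradictions` function and the lesson part names are hypothetical. A real checker would need a far more general way to extract and compare claims.

```python
import re
from collections import defaultdict

# Matches statements such as "water boils at 100 degrees Celsius".
BOILING_CLAIM = re.compile(
    r"(?P<subject>\w+) boils at (?P<value>\d+(?:\.\d+)?) degrees Celsius",
    re.IGNORECASE,
)


def find_contradictions(lesson_parts: dict[str, str]) -> list[str]:
    """Return one message per subject that is given conflicting values."""
    # subject -> value -> names of the lesson parts that state that value
    seen: dict[str, dict[float, list[str]]] = defaultdict(lambda: defaultdict(list))
    for part_name, text in lesson_parts.items():
        for match in BOILING_CLAIM.finditer(text):
            subject = match["subject"].lower()
            seen[subject][float(match["value"])].append(part_name)

    problems = []
    for subject, values in seen.items():
        if len(values) > 1:  # more than one distinct value for the same subject
            detail = "; ".join(
                f"{value:g} in {', '.join(parts)}"
                for value, parts in sorted(values.items())
            )
            problems.append(f"{subject}: {detail}")
    return problems


# Hypothetical lesson with a card and a game question that disagree.
lesson = {
    "card 3": "Water boils at 100 degrees Celsius at sea level.",
    "game question 2": "Since water boils at 90 degrees Celsius at sea level, "
                       "how long should pasta cook?",
}
print(find_contradictions(lesson))
```

A lesson that produces any message here would be held back for regeneration or review rather than shown to the learner.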

This two-layer approach — source grounding plus contradiction detection — is why we hit 99% accuracy at scale, not 90% or 95%.

The Difference Source-First Makes

Without Source-First: 57% (SimpleQA baseline, Gemini Flash Lite, ungrounded)
With Source-First: 99% (verified accuracy, same model plus the Source-First pipeline)
Difference: +42 percentage points

What This Means For You

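Every Novavoy lesson is generated from retrieved sources and checked for contradictions, and in our benchmark 99% of the extracted claims were supported by reference material. Not "the AI is pretty good so it's probably fine." Verified.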

When you learn that the speed of light is 299,792,458 meters per second, that number came from a retrieved source. When a history lesson tells you the Treaty of Westphalia was signed in 1648, that date was grounded in verified reference material. When a business lesson explains how a balance sheet works, the definitions match what you'd find in an accounting textbook.

You can learn with confidence that what you're being taught is true. That's the whole point.

Methodology

For full transparency, here's how we ran the benchmark.

Scope: 200 topics across 10 subject categories, selected to represent the breadth of subjects users explore on Novavoy. Topics range from introductory ("basics of cooking techniques") to advanced ("quantum entanglement").
Claim extraction: We generated complete lessons for each topic using the same pipeline that serves real users. From each lesson, we extracted individual factual claims — specific statements that can be independently verified as true or false. This produced 2,583 claims total.
Verification: Each claim was evaluated against authoritative reference sources using an LLM-as-judge methodology. The evaluating model checks whether each claim is supported by, contradicted by, or not addressed in the reference material.
Classification: Claims were categorized as Accurate (supported by reference sources), Hallucinated (stated something not supported by any reference source), or Contradicted (directly conflicted with reference source material).
Multi-source validation: Claims were checked against multiple sources, not just one. A claim needed source support to be classified as accurate — the absence of contradiction alone wasn't sufficient.
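The classification and multi-source rules above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark code: the `Verdict` enum, the `classify` and `summarize` functions, and the toy verdicts are hypothetical. It also assumes that a single contradicting source outweighs any supporting ones, which the methodology does not spell out.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    """What the judging model says about one claim against one source."""
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    NOT_ADDRESSED = "not addressed"


def classify(verdicts: list[Verdict]) -> str:
    """Combine per-source verdicts into one label for the claim."""
    if Verdict.CONTRADICTED in verdicts:
        return "Contradicted"  # assumption: contradiction takes precedence
    if Verdict.SUPPORTED in verdicts:
        return "Accurate"      # requires positive support from a source
    return "Hallucinated"      # no source supports the claim


def summarize(claims: dict[str, list[Verdict]]) -> dict[str, float]:
    """Return the share of claims that falls under each label."""
    if not claims:
        raise ValueError("no claims to summarize")
    counts = Counter(classify(verdicts) for verdicts in claims.values())
    labels = ("Accurate", "Hallucinated", "Contradicted")
    return {label: counts[label] / len(claims) for label in labels}


# Toy verdicts for illustration only; these are not benchmark data.
claims = {
    "claim A": [Verdict.SUPPORTED, Verdict.SUPPORTED],
    "claim B": [Verdict.SUPPORTED, Verdict.NOT_ADDRESSED],
    "claim C": [Verdict.NOT_ADDRESSED, Verdict.NOT_ADDRESSED],
    "claim D": [Verdict.SUPPORTED, Verdict.CONTRADICTED],
}
print(summarize(claims))
```

Note that "claim C" is counted as Hallucinated even though nothing contradicts it, which mirrors the rule that the absence of contradiction alone is not enough for a claim to count as accurate.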

Limitations: This benchmark tests factual accuracy of generated content. It does not measure pedagogical effectiveness (whether you learn better), engagement quality, or the accuracy of dynamically generated game questions (which use the same grounding pipeline but were not included in this specific benchmark). We plan to publish those separately.

What Comes Next

We're committed to publishing updated accuracy benchmarks as we expand our topic coverage and improve our verification pipeline. This isn't a one-time marketing exercise — it's an ongoing commitment to transparency.

If you have questions about our methodology or want to see specific topics tested, reach out. We'd rather have the conversation than avoid it.