TL;DR
- We tested Novavoy's AI-generated lessons across 200 topics and 10 subject categories. Accuracy rate: 99% across 2,583 individually verified claims.
- Every lesson is built from verified knowledge sources before the AI writes anything. We call this Source-First architecture.
- We looked, and we could not find another AI learning app that publishes verified accuracy data. As far as we know, this is the first public benchmark of its kind in consumer EdTech.
The Problem
You've probably asked ChatGPT something and gotten an answer that sounded perfect. Confident tone. Clean formatting. Completely wrong.
AI hallucination — where language models generate plausible-sounding but false information — is a known problem. It's annoying when you're asking about a recipe. It's a real problem when you're trying to learn quantum mechanics, or the causes of World War I, or how compound interest works.
Most AI-powered learning tools don't acknowledge this. They don't publish error rates. They don't explain how they handle hallucination. They just... hope you don't notice.
We think that's a bad foundation for education.
If you're going to spend your time learning something, you should be able to trust what you're being taught. So we built a system to verify that trust, ran it at scale, and decided to publish the results — even though no one in our space does this.
Our Approach: Source-First Architecture
Most AI learning tools work like this: they prompt a language model, the model generates content, and the user gets whatever comes out. If the model hallucinates, the user learns something false.
We do it differently.
We build the knowledge base before we build the lesson.
Before our AI writes a single sentence about any topic, we retrieve verified information from multiple authoritative sources — Wikipedia, PubMed, Nature, MIT OpenCourseWare, Britannica, and other educational domains. This retrieved knowledge becomes the foundation the AI must build on.
Think of it like the difference between asking someone to give a lecture from memory versus giving them a stack of vetted research papers and saying "teach from these."
The AI still does the creative work — designing engaging explanations, crafting game questions, building scenarios. But it does that work while anchored to verified facts, not floating on whatever its training data vaguely remembers.
We call this Source-First architecture because the sources come first. Always.
- Knowledge Catalog: Retrieve verified sources. Wikipedia, PubMed, Nature, MIT OCW, Britannica. 6-10 articles parsed and indexed per topic. Zero LLM cost. (source_retrieval)
- Curriculum Design: Tag concepts to sources. Chunk tags assigned at design time. Every concept mapped to source material. Zero hallucination surface. (Roman_army:1)
- Lesson Generation: Teach from sources, not memory. Deterministic retrieval by chunk ID. Post-generation contradiction checker. (contradiction_check)
The Results
We ran a comprehensive accuracy benchmark across 200 topics spanning 10 subject categories. Here's what we found.
[Figure: overall accuracy and accuracy by category]
Two categories hit 100%. The rest sit at 99%. We won't be satisfied until every category reaches 100%, but we're publishing the real numbers, not rounded-up ones.
How It Works
Step 1: Build the Knowledge Base
When you pick a topic, our system retrieves verified information from multiple authoritative sources. This isn't a single Google search — it's a structured retrieval process that pulls from educational databases, academic publishers, and curated reference works.
The result is a knowledge base specific to your topic, built from sources that have editorial oversight and fact-checking built into their own processes.
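As a rough illustration of what a topic-specific knowledge base could look like, here is a minimal sketch that splits already-retrieved articles into paragraph chunks with stable IDs. The `Chunk` class, the `build_knowledge_base` function, the article dictionary layout, and the placeholder article text are all assumptions made for this example. Only the `Title:number` ID shape (as in `Roman_army:1`) comes from the pipeline description above, and the real retrieval and indexing code is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # stable ID in the "Title:number" shape, e.g. "Roman_army:1"
    source: str    # where the text came from, e.g. "Wikipedia"
    text: str      # one paragraph of source text


def build_knowledge_base(articles: list[dict]) -> dict[str, Chunk]:
    """Split each retrieved article into paragraph chunks keyed by chunk ID."""
    knowledge_base: dict[str, Chunk] = {}
    for article in articles:
        slug = article["title"].strip().replace(" ", "_")
        paragraphs = [p.strip() for p in article["text"].split("\n\n") if p.strip()]
        for number, paragraph in enumerate(paragraphs, start=1):
            chunk_id = f"{slug}:{number}"
            knowledge_base[chunk_id] = Chunk(chunk_id, article["source"], paragraph)
    return knowledge_base


# Hypothetical article that has already been retrieved (no network access here).
articles = [
    {
        "title": "Roman army",
        "source": "Wikipedia",
        "text": "The Roman army was the armed forces of ancient Rome.\n\n"
                "Its core unit was the legion.",
    }
]

kb = build_knowledge_base(articles)
print(sorted(kb))  # ['Roman_army:1', 'Roman_army:2']
```

Because the IDs depend only on the article title and paragraph order, the same retrieved text always yields the same chunk IDs, which is what makes later lookups by ID repeatable.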
Step 2: Design the Curriculum
The AI designs your learning path — what concepts to cover, in what order, at what depth — using the knowledge base as its source of truth. The curriculum structure is generated, but the facts it's built around are retrieved, not generated.
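One way to picture "generated structure, retrieved facts" is a curriculum where every concept carries tags pointing at source chunks, plus a check that no concept is left without source material. The sketch below is hypothetical: the `find_ungrounded_concepts` function, the concept names, and the data layout are assumptions for illustration, not Novavoy's actual schema.

```python
def find_ungrounded_concepts(
    curriculum: dict[str, list[str]],
    known_chunk_ids: set[str],
) -> dict[str, list[str]]:
    """Return concepts with no chunk tags, or with tags that match no known chunk."""
    problems: dict[str, list[str]] = {}
    for concept, chunk_ids in curriculum.items():
        if not chunk_ids:
            problems[concept] = ["<no source chunks tagged>"]
            continue
        missing = [cid for cid in chunk_ids if cid not in known_chunk_ids]
        if missing:
            problems[concept] = missing
    return problems


# Hypothetical chunk IDs produced when the knowledge base was built.
known_chunk_ids = {"Roman_army:1", "Roman_army:2"}

# Hypothetical generated curriculum: concept -> chunk tags assigned at design time.
curriculum = {
    "What the Roman army was": ["Roman_army:1"],
    "The legion": ["Roman_army:2"],
    "Siege tactics": ["Roman_army:7"],  # tag points at a chunk that does not exist
}

print(find_ungrounded_concepts(curriculum, known_chunk_ids))
# {'Siege tactics': ['Roman_army:7']}
```

A concept flagged here would either be dropped or sent back for more retrieval before any lesson text is written.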
Step 3: Ground Every Lesson
Every lesson goes through our source-first grounding process. That means every claim, example, and explanation in your lesson is anchored to the verified knowledge base. The AI can be creative in how it teaches. It cannot be creative with what it teaches.
100% of lessons go through this grounding process. Not most. All of them.
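To make the grounding step concrete, here is a minimal sketch of looking up source text by chunk ID and placing it in front of the model. The `build_grounded_prompt` function, the prompt wording, and the sample chunk text are assumptions for illustration. The production prompt and model call are not described in this post.

```python
def build_grounded_prompt(
    concept: str,
    chunk_ids: list[str],
    knowledge_base: dict[str, str],
) -> str:
    """Look up chunks by exact ID and build a prompt restricted to those sources."""
    missing = [cid for cid in chunk_ids if cid not in knowledge_base]
    if missing:
        # Refuse to generate rather than let the model fill the gap from memory.
        raise KeyError(f"unknown chunk IDs: {missing}")
    sources = "\n".join(f"[{cid}] {knowledge_base[cid]}" for cid in chunk_ids)
    return (
        f"Teach the concept: {concept}\n"
        "Use only the facts in the sources below and cite chunk IDs.\n"
        "If the sources do not cover something, leave it out.\n\n"
        f"Sources:\n{sources}"
    )


# Hypothetical knowledge base: chunk ID -> source text.
knowledge_base = {
    "Roman_army:1": "The Roman army was the armed forces of ancient Rome.",
    "Roman_army:2": "Its core unit was the legion.",
}

prompt = build_grounded_prompt("The legion", ["Roman_army:2"], knowledge_base)
print(prompt)
```

The lookup is a plain dictionary access, so the same concept and chunk IDs always produce the same prompt. Only the model's phrasing can vary, not the facts it is handed.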
What We Check
Grounding the AI to verified sources handles most of the problem. But we added a second layer.
Post-generation contradiction detection scans every completed lesson for internal contradictions and factual inconsistencies. If a lesson says "water boils at 100 degrees Celsius at sea level" in one card and then implies a different temperature in a game question, the system catches it before you see it.
This two-layer approach — source grounding plus contradiction detection — is why we hit 99% accuracy at scale, not 90% or 95%.
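The contradiction check described above can be pictured with a toy example. This sketch assumes each card and game question has already been reduced to normalized (location, fact key, value) triples. That extraction step is the hard part and is not shown, and the `find_contradictions` function and the sample data are hypothetical rather than the production checker.

```python
from collections import defaultdict


def find_contradictions(claims: list[tuple[str, str, float]]) -> dict[str, dict[float, list[str]]]:
    """Group claims by fact key and return keys stated with more than one value."""
    seen: dict[str, dict[float, list[str]]] = defaultdict(dict)
    for location, key, value in claims:
        seen[key].setdefault(value, []).append(location)
    # A key is contradictory when the lesson gives it two or more distinct values.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical claims extracted from one lesson: (location, fact key, value).
claims = [
    ("card 2", "boiling point of water at sea level (degrees C)", 100.0),
    ("card 3", "freezing point of water at sea level (degrees C)", 0.0),
    ("game question 4", "boiling point of water at sea level (degrees C)", 90.0),
]

for key, values in find_contradictions(claims).items():
    print(key, values)
# boiling point of water at sea level (degrees C) {100.0: ['card 2'], 90.0: ['game question 4']}
```

A lesson with any flagged key would be held back and regenerated or corrected before it reaches the learner.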
The Difference Source-First Makes
[Figure: Gemini Flash Lite ungrounded vs. the same model with the Source-First pipeline]
What This Means For You
Every Novavoy lesson is built from retrieved sources and then checked for contradictions, and in our benchmark 99% of the claims held up against reference material. Not "the AI is pretty good so it's probably fine." Verified.
When you learn that the speed of light is 299,792,458 meters per second, that number came from a retrieved source. When a history lesson tells you the Treaty of Westphalia was signed in 1648, that date was grounded in verified reference material. When a business lesson explains how a balance sheet works, the definitions match what you'd find in an accounting textbook.
You can learn with confidence that what you're being taught is true. That's the whole point.
Methodology
For full transparency, here's how we ran the benchmark.
- Scope: 200 topics across 10 subject categories, selected to represent the breadth of subjects users explore on Novavoy. Topics range from introductory ("basics of cooking techniques") to advanced ("quantum entanglement").
- Claim extraction: We generated complete lessons for each topic using the same pipeline that serves real users. From each lesson, we extracted individual factual claims — specific statements that can be independently verified as true or false. This produced 2,583 claims total.
- Verification: Each claim was evaluated against authoritative reference sources using an LLM-as-judge methodology. The evaluating model checks whether each claim is supported by, contradicted by, or not addressed in the reference material.
- Classification: Claims were categorized as Accurate (supported by reference sources), Hallucinated (stated something not supported by any reference source), or Contradicted (directly conflicted with reference source material).
- Multi-source validation: Claims were checked against multiple sources, not just one. A claim needed source support to be classified as accurate — the absence of contradiction alone wasn't sufficient.
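The classification and scoring rules in the list above can be sketched as follows. The `Verdict` enum, the `classify` and `accuracy` functions, and the sample verdicts are assumptions for illustration. In particular, letting a contradiction from any source override support from another is a conservative assumption of this sketch, since the methodology does not say how mixed verdicts were resolved.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    NOT_ADDRESSED = "not addressed"


def classify(verdicts: list[Verdict]) -> str:
    """Combine per-source judge verdicts for one claim into a single label."""
    if Verdict.CONTRADICTED in verdicts:
        return "Contradicted"  # assumption: any contradiction wins
    if Verdict.SUPPORTED in verdicts:
        return "Accurate"      # needs positive support from at least one source
    return "Hallucinated"      # no support anywhere, even without a contradiction


def accuracy(labels: list[str]) -> float:
    """Share of claims labeled Accurate."""
    if not labels:
        raise ValueError("no claims to score")
    return Counter(labels)["Accurate"] / len(labels)


S, C, N = Verdict.SUPPORTED, Verdict.CONTRADICTED, Verdict.NOT_ADDRESSED

# Hypothetical verdicts for four claims, each checked against two sources.
per_claim_verdicts = [[S, N], [N, N], [C, S], [S, S]]

labels = [classify(v) for v in per_claim_verdicts]
print(labels)            # ['Accurate', 'Hallucinated', 'Contradicted', 'Accurate']
print(accuracy(labels))  # 0.5
```

Note that the second claim is counted against accuracy even though no source contradicts it, which mirrors the rule that absence of contradiction alone is not enough.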
Limitations: This benchmark tests factual accuracy of generated content. It does not measure pedagogical effectiveness (whether you learn better), engagement quality, or the accuracy of dynamically generated game questions (which use the same grounding pipeline but were not included in this specific benchmark). We plan to publish those separately.
What Comes Next
We're committed to publishing updated accuracy benchmarks as we expand our topic coverage and improve our verification pipeline. This isn't a one-time marketing exercise — it's an ongoing commitment to transparency.
If you have questions about our methodology or want to see specific topics tested, reach out. We'd rather have the conversation than avoid it.