Crowdsourced accuracy
The community checks the transcripts — and learns while it does.
A model is only as trustworthy as the transcripts it produces, and you can't improve what no one has checked. So the checking is done by the people who care most about getting Torah right: learners. They read along, correct what's wrong, confirm what's right — and every correction makes both that transcript and the next model better.
The problem it solves
You can't trust — or improve — a transcript no one has read.
A raw machine transcript is a starting point, not a finished text. Some of it is exactly right; some of it mishears a name, a pasuk, or a Gemara phrase. Without a way to find and fix those errors, two things stay broken at once: the transcripts can't be trusted for serious use, and the model can't get better, because there's no corrected data to learn from. Checking solves both.
How it works
Read along, fix the errors, confirm the rest.
The model produces a draft transcript aligned to the audio, segment by segment. A learner listens and reads along, correcting any segment that's wrong and confirming the ones that are right — the same close attention a person already gives when reviewing a shiur. Where the audio follows a known text, the draft is anchored to the canonical version in Sefaria, so checking a daf or a parsha is also a way of learning it.
- Aligned to the audio. Each segment of text is tied to its moment in the recording, so a checker can hear exactly what was said and judge the transcript against it.
- Anchored to known texts. When the source is a pasuk, a Mishnah, or a Gemara, the draft is matched to the canonical text, so correcting it is mostly confirming, and the standard for accuracy is the canonical text itself.
- Small, doable pieces. Checking happens a segment at a time. A person can contribute in the few minutes between sedarim, not only in long sittings.
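For readers who think in code, the checking flow described above could be modeled roughly as follows. This is a minimal sketch, not the platform's actual data model: the names Segment, Status, canonical_ref, confirm, correct, and progress are all hypothetical, chosen only to illustrate a segment tied to its audio, optionally anchored to a known text, and either confirmed or corrected by a learner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    UNCHECKED = "unchecked"   # no person has reviewed this segment yet
    CONFIRMED = "confirmed"   # a checker agreed with the model's draft
    CORRECTED = "corrected"   # a checker replaced the model's draft


@dataclass
class Segment:
    """One small piece of a transcript (hypothetical structure)."""
    start_s: float                       # where the segment starts in the audio
    end_s: float                         # where the segment ends in the audio
    draft_text: str                      # what the model produced
    canonical_ref: Optional[str] = None  # assumed label for a known source text
    checked_text: Optional[str] = None   # filled in once a person has checked it
    status: Status = Status.UNCHECKED

    def confirm(self) -> None:
        """The checker listened and the draft was right."""
        self.checked_text = self.draft_text
        self.status = Status.CONFIRMED

    def correct(self, new_text: str) -> None:
        """The checker listened and the draft was wrong."""
        new_text = new_text.strip()
        if not new_text:
            raise ValueError("a correction needs some text")
        if new_text == self.draft_text:
            # Nothing actually changed, so treat it as a confirmation.
            self.confirm()
            return
        self.checked_text = new_text
        self.status = Status.CORRECTED


def progress(segments: List[Segment]) -> float:
    """Fraction of segments that a person has checked, from 0.0 to 1.0."""
    if not segments:
        return 0.0
    checked = sum(1 for s in segments if s.status is not Status.UNCHECKED)
    return checked / len(segments)
```

Because each segment carries its own status, a contribution of two or three segments counts just as cleanly as a full sitting.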
The double benefit
Every correction works twice.
This is the heart of it. The moment someone fixes a segment, that transcript is improved for everyone who reads it afterward. But the same correction is also a piece of high-quality, human-verified training data — so it feeds back into the next version of the model and makes its first drafts more accurate.
Accuracy compounds. A better model produces cleaner drafts, which are faster to check, which yield more corrected data, which trains a still-better model. The work of the community doesn't just clean up today's transcripts — it raises the floor for every transcript that comes after.
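The "works twice" idea can be sketched in a few lines. The example below is illustrative only and assumes a hypothetical CheckedSegment record and a JSON Lines export; neither is a confirmed part of the project. It shows how the same human-checked text that readers see could also be collected as training examples, with unchecked segments left out.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class CheckedSegment:
    """Hypothetical record of one transcript segment."""
    audio_file: str
    start_s: float
    end_s: float
    draft_text: str               # the model's first attempt
    checked_text: Optional[str]   # None until a person has reviewed it


def training_records(segments: Iterable[CheckedSegment]) -> Iterator[Dict]:
    """Yield one training example per human-checked segment."""
    for seg in segments:
        if seg.checked_text is None:
            continue  # unverified drafts are not used as training data
        yield {
            "audio": seg.audio_file,
            "start_s": seg.start_s,
            "end_s": seg.end_s,
            "text": seg.checked_text,
            # Records where the model was wrong are the most informative.
            "was_corrected": seg.checked_text != seg.draft_text,
        }


def to_jsonl(segments: Iterable[CheckedSegment]) -> str:
    """Serialize the checked segments as JSON Lines, one example per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False)
        for record in training_records(segments)
    )
```

Confirmed segments are kept alongside corrected ones in this sketch, since a draft that a person has verified is also trustworthy data.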
An open repository
A communal asset, not a company's private dataset.
The corrected transcripts are public. Anyone can read them, search them, quote them, and preserve them; the model and the data alike belong to the community rather than to a vendor. What the community builds together, it keeps together, open to use and open to build on.
And the labor is limud. People doing this aren't annotating data in the abstract — they're sitting with a shiur, a daf, a parsha, and learning it closely enough to get every word right. The checking is a mitzvah loop: the act of validating the text is itself an act of learning Torah, and the fruit of that learning is an accurate, open record that serves the next person.
Starting honestly
We begin where the texts already exist.
Any community effort faces a chicken-and-egg problem: the platform is most useful once many people use it, and people join once it's useful. We don't pretend that away. The way through is to start with the material that already has a canonical written form — Tanach, Mishnah, Gemara — so the first pass is correction against a known text, not transcription from a blank page. That makes the earliest contributions easy and rewarding, and builds the corrected base that the harder, less-documented material — Yiddish, Aramaic, Yeshivish — can grow from over time.