Ever since BERT, NLP practitioners have enjoyed a golden age of text classification: simply download your favorite pre-trained model from HuggingFace and fine-tune it on training data for any number of text classification problems. This has allowed practitioners to set aside the fun game of eking out architectural improvements and focus on the highest-leverage move: collecting high-quality training and evaluation data. Unfortunately, that collection has remained a fundamentally slow and expensive process. Aligning human annotators on guidelines, ensuring high-quality work, and simply sitting down and looking at the data are all slow, difficult steps.
As an example of the cost and effort to develop a high quality training corpus, consider the NCBI Disease Corpus, a standard benchmark for biomedical NER. Its creators employed 14 annotators with biomedical informatics backgrounds across two summers (2011 and 2012), each labeling ~120 PubMed abstracts. Every document was independently annotated by two people, disagreements were discussed to reach consensus, and then a final pass checked consistency across the entire corpus. The result, 793 annotated abstracts, required hundreds of person-hours of expert work.
This is the norm in NER. Building a high-quality training corpus means recruiting domain experts, developing annotation guidelines through iterative pilot rounds, running multi-annotator campaigns, adjudicating disagreements, and auditing for consistency. For specialized domains like healthcare or finance, qualified annotators are scarce and expensive. The annotation bottleneck is often the single biggest barrier to deploying custom NER.
What if you could skip most of that?
Tonic Textual’s custom entity workflow replaces the human annotation campaign with an LLM. The process: write annotation guidelines, let the LLM label the training corpus against them, score the result on a small human-labeled validation set, and refine the guidelines until quality plateaus.
The expensive human annotation is confined to a small validation set. The bulk training data is annotated by the LLM, which follows your guidelines consistently across thousands of documents.
We tested this on the NCBI Disease Corpus: 593 training documents, 100 validation, 100 held-out test, all PubMed abstracts annotated with disease mentions. The corpus includes genuinely hard cases: abbreviations that refer to both genes and diseases (“APC”, “VHL”), disease names used as modifiers (“HD gene”), and composite mentions (“colorectal and ovarian cancers”).
In biomedical text, abbreviations like “APC,” “VHL,” and “DM” name both a gene and its associated disease. The NCBI corpus labels these as disease mentions even when they appear in gene context: “the APC gene,” “DM protein kinase,” and “A-T allele” all contain disease spans. An NER model therefore has to learn to tag “APC” as a disease even inside “the APC gene,” whereas a naive model learns to suppress tagging near words like “gene,” “kinase,” and “allele.” Our error analysis found that this gene/disease polysemy accounts for 60% of complete misses.
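A concrete way to see what the model must learn: below is an illustrative, made-up sentence (not from the corpus) with NCBI-style character-offset disease spans, where an abbreviation sitting in gene context is still labeled as a disease mention.

```python
# Illustrative example (not actual corpus data): NCBI-style character-offset
# spans in which "APC" is a disease mention even in gene context.
text = "Mutations in the APC gene predispose carriers to APC."

# Gold spans: both occurrences of "APC" are labeled Disease, including the
# first one, which appears in the gene-context phrase "the APC gene".
gold_spans = [(start, start + len("APC"), "Disease")
              for start in (text.find("APC"), text.rfind("APC"))]

for start, end, label in gold_spans:
    print(text[start:end], (start, end), label)
```

A naive model that learns "never tag near the word gene" would miss the first span entirely, which is exactly the failure mode behind most complete misses.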
To measure how much human labeling actually matters, we ran a controlled experiment. We took the LLM annotations produced by the custom entity workflow (F1 = 0.71 against ground truth on the training set) and trained RoBERTa LoRA models in our own training loop so we could precisely control the label mix. All 593 training documents were always used; we varied what fraction had their LLM labels replaced with human ground truth.
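A minimal sketch of how such a label mix can be constructed (the function and names here are ours for illustration, not the actual training-loop code):

```python
import random

def mix_labels(docs, llm_labels, human_labels, human_frac, seed=0):
    """Replace LLM labels with human ground truth for a random fraction of
    documents; every document is always used for training either way."""
    rng = random.Random(seed)
    n_human = round(human_frac * len(docs))
    human_ids = set(rng.sample(range(len(docs)), n_human))
    return [human_labels[i] if i in human_ids else llm_labels[i]
            for i in range(len(docs))]

# e.g. 25% human labels over 593 docs -> 148 human-labeled documents
labels = mix_labels(list(range(593)), ["llm"] * 593, ["human"] * 593, 0.25)
print(labels.count("human"))  # -> 148
```

Fixing the seed keeps the human-labeled subset nested across fractions, so each point on the learning curve differs only in how many documents got upgraded.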

| % Human Labels | # Human-Labeled Docs | Test F1 |
|---|---|---|
| 0% | 0 | 0.70 |
| 5% | 30 | 0.70 |
| 10% | 59 | 0.71 |
| 25% | 148 | 0.73 |
| 50% | 297 | 0.77 |
| 75% | 445 | 0.79 |
| 100% | 593 | 0.81 |
With zero human-labeled training data, LLM annotations alone produce a model at 0.70 F1. Adding human labels helps (the fully human-labeled set reaches 0.81), but the returns diminish steeply: the first 148 human-labeled documents (25%) buy only 3 F1 points, and the remaining 445 buy just 8 more.
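The diminishing returns are easy to quantify. This quick script (data transcribed from the table above) computes each configuration's F1 gain over the LLM-only baseline: on average, an extra human-labeled document buys only about 0.02 F1 points.

```python
# Learning-curve data transcribed from the table above:
# {human-labeled docs: test F1}.
curve = {0: 0.70, 30: 0.70, 59: 0.71, 148: 0.73,
         297: 0.77, 445: 0.79, 593: 0.81}

baseline = curve[0]  # LLM labels only
for docs, f1 in curve.items():
    if docs == 0:
        continue
    gain = (f1 - baseline) * 100  # F1 points over the LLM-only baseline
    print(f"{docs:>3} human docs: +{gain:.0f} F1 pts ({gain / docs:.3f} pts/doc)")
```

Even the full 593-document human annotation effort amortizes to under 0.02 F1 points per document over the free LLM baseline.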
The cost of NER is no longer “annotate all your training data.” It’s “annotate a good validation set, write good guidelines, and let the LLM handle the rest.”
How does 0.70 F1 compare? We benchmarked several approaches on the same test set:
| Approach | Test F1 | Human Annotation Required |
|---|---|---|
| GLiNER2 zero-shot | 0.395 | None |
| Custom entity workflow (Textual SDK) | 0.71 | Validation set only (~10 docs) |
| Same training loop, 100% human labels | 0.81 | Full training set (593 docs) |
GLiNER2, a recent zero-shot NER model, achieves high precision (0.82) but misses three-quarters of disease mentions (recall = 0.26). It catches obvious cases like “colorectal cancer” but misses abbreviations, modifiers, and disease classes. Zero-shot NER on specialized domains remains unreliable.
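GLiNER2's 0.395 falls straight out of the definition of F1 as the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# GLiNER2 zero-shot on the NCBI test set: high precision, low recall.
print(round(f1(0.82, 0.26), 3))  # -> 0.395
```

The harmonic mean punishes imbalance hard: a model that is usually right when it speaks but usually silent still scores poorly.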
Meanwhile, the Textual SDK workflow using just 10 human-labeled documents for guideline refinement achieves 0.71 F1, matching the 0% human learning curve result. The human effort that matters isn’t labeling thousands of training examples; it’s writing precise guidelines and validating them against a curated set.
We should be honest: 0.70-0.81 F1 is below the ~0.87-0.90 that specialized biomedical models (BioBERT, PubMedBERT) achieve with full fine-tuning on this benchmark. Two factors explain most of the gap:
The first is the base model: the encoder is roberta-base, a general-purpose model, and a biomedical base model would close much of this gap. The second is label noise: the LLM annotations themselves score only 0.71 F1 against ground truth, which caps what the trained model can learn. Both are solvable (swap the base model, refine the guidelines) but beside the point. This benchmark isn’t about setting a new SOTA. It’s about showing that the annotation bottleneck no longer blocks training useful NER models.
The traditional NER pipeline looks like:
Recruit annotators -> Develop guidelines -> Pilot annotation -> Adjudicate -> Full annotation campaign -> Train model -> Evaluate -> Iterate
This takes weeks or months of calendar time and thousands of dollars in annotation costs per entity type.
The custom entity workflow compresses this to:
Write guidelines -> Label ~100 validation examples -> Refine guidelines -> Train -> Evaluate
The total human effort is a validation set of 10-100 documents, plus the intellectual work of writing good guidelines. The LLM annotates hundreds or thousands of training documents in minutes instead of weeks.
Our two case studies bracket the range of what to expect. On a difficult academic benchmark with context-dependent labels and gene/disease ambiguity, LLM annotations alone produce a 0.70 F1 model, and mixing in human labels pushes steadily toward 0.81. On a practical production entity type with clear structural patterns, the workflow exceeds production release thresholds on the first try.
The cost of performance is now concentrated where it should be: understanding what you want to extract and validating that understanding against a high-quality dataset.
The NCBI benchmark tests a hard public dataset, but there’s an elephant in the room. The NCBI Disease Corpus is well-known in NLP, and the LLM annotator has almost certainly seen it (or data very like it) during pretraining. How much of the 0.70 F1 comes from the workflow, and how much from the LLM recognizing familiar examples?
To control for this, we also evaluated on a private dataset that cannot have leaked into any LLM’s training data: healthcare identifiers (MRNs, account numbers, unit/bed numbers, encounter IDs) in clinical EHR documents.
Starting from our existing production guidelines for `healthcare_number`, the LLM refinement loop rapidly improved annotation quality:
| Iteration | LLM F1 |
|---|---|
| v1 (seed) | 0.896 |
| v2 | 0.946 |
| v3 | 0.960 |
Three iterations, each taking a few minutes. The system identified edge cases (distinguishing healthcare IDs from procedure codes like CPT and ICD, handling alphanumeric formats like “482-A”) and tightened the guidelines automatically.
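The refinement loop itself is conceptually simple. Here is a sketch (the function names are ours, not the Textual SDK's API): annotate the validation set with the current guidelines, score against the human labels, let the LLM revise the guidelines from its own errors, and stop once the gain plateaus.

```python
# Illustrative sketch of the guideline-refinement loop; annotate/score/revise
# are placeholders for the LLM annotator, the span-level F1 scorer, and the
# LLM-driven guideline reviser.
def refine(guidelines, validation_docs, annotate, score, revise,
           max_iters=5, min_gain=0.005):
    best_f1 = 0.0
    for i in range(max_iters):
        predictions = annotate(validation_docs, guidelines)
        f1 = score(predictions, validation_docs)
        print(f"v{i + 1}: F1={f1:.3f}")
        if f1 - best_f1 < min_gain:
            break  # quality has plateaued
        best_f1 = f1
        # The LLM inspects its own errors and tightens the guidelines,
        # e.g. "exclude CPT/ICD procedure codes".
        guidelines = revise(guidelines, predictions, validation_docs)
    return guidelines
```

Each iteration takes minutes rather than a multi-week annotation pilot, which is what makes three rounds of refinement cheap enough to run routinely.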
The encoder model trained on 1,119 LLM-annotated documents achieved 0.947 F1, exceeding our production release threshold of 0.914.
| Metric | Custom Entity Model | Production Threshold |
|---|---|---|
| F1 | 0.947 | 0.914 |
| Precision | – | 0.900 |
| Recall | – | 0.929 |
This is a production-ready model, trained with zero human-labeled training data. The only human effort was the 123-document validation set, which we already had from our existing annotation pipeline.
The NCBI benchmark is intentionally hard: gene/disease polysemy, subtle context-dependent labels, a corpus designed to stress-test NER systems. Healthcare IDs are a more typical production entity type. They are structurally distinctive (numbers, alphanumeric codes), appear in predictable contexts (after “MRN:”, “Account#:”), and have clear inclusion/exclusion criteria.
For entity types like this, where the surface patterns are distinctive and the guidelines are relatively unambiguous, the workflow delivers production-quality models with zero human-labeled training data.
Ready to train your own custom NER model without the annotation bottleneck? Tonic Textual’s custom entity workflow lets you go from guidelines to a production-ready model in hours, not months. Start your free trial and see how far you can get with just a small validation set.