Methodology

How Empirica earns the right to grade

A rating is only as good as the standards it's held to. The framework below is the one used to evaluate clinical-practice guidelines: eight standards developed at the US Institute of Medicine and closely related to the AGREE II appraisal tool. We map our own system to each one and mark honestly where we're strong, where we're partial, and where we're still working.

Panel credentials

Partial

Who decides whether a piece of research passes? What expertise and accountability do they bring?

An autonomous validation pipeline running Claude Sonnet 4.5 for the empirical-citation check and Claude Haiku 4.5 for the logic and depth checks. Three independent checks, then a final decision. The rubric is fixed and public, the model versions are named, and the same pipeline that scores external submissions also scores our own internal output. No human reviewer sits in the publish loop. That is by design: the bar is reproducible, and the models are ones you can run yourself.

Conflict of interest

Strong

Does the rating body have a financial stake in the outcome? Do reviewers have ties to the work being rated?

We don't accept payment from authors, institutions, or publishers in exchange for a grade — past, present, or future. Submission is free; the score is the score. Revenue comes from bulk API access for institutions and recommendation referrals downstream of the rating, never from the rating itself. The validator agents have no awareness of who the submitter is at scoring time — name and email are stored on the record but not in the prompt.
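As a minimal sketch of the blinding described above, the prompt can be assembled from the record with the identifying fields filtered out first. The field names, the `build_prompt` helper, and the prompt layout are illustrative assumptions, not the production code.

```python
# Hypothetical field names; the source only says that name and email are
# stored on the record but kept out of the prompt.
IDENTITY_FIELDS = frozenset({"name", "email"})


def build_prompt(record, rubric):
    """Assemble the scoring prompt from non-identifying fields only."""
    visible = {k: v for k, v in record.items() if k not in IDENTITY_FIELDS}
    return f"{rubric}\n\nSUBMISSION:\n{visible['body']}"


# Illustrative record, not real data.
record = {
    "name": "Ada Example",
    "email": "ada@example.org",
    "body": "Claim: agent marketplaces reduce search costs [P1].",
}
prompt = build_prompt(record, rubric="Score logic, empirical support, and depth.")
```

Filtering by field only keeps the stored identity out of the prompt; a submission whose body names its own author would need separate handling, which this sketch does not attempt.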

Systematic review of evidence

Strong

Is there a defined process for identifying and evaluating the underlying evidence base?

Each submission is checked against the paper list it cites. We query OpenAlex and arXiv for citation metadata. Every [P1]..[PN] reference must match an abstract we can retrieve; any [Author YEAR] citation outside that list is treated as a fabrication. The process is identical across submissions of the same content type, and the queries are logged.
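The matching rule above can be sketched roughly as follows, assuming the abstracts have already been retrieved. The `check_citations` function, the shape of the `papers` mapping, and both regular expressions are assumptions made for illustration; the actual OpenAlex and arXiv queries are omitted.

```python
import re

# Hypothetical patterns for [P1]-style references and [Author YEAR] citations.
P_REF = re.compile(r"\[P(\d+)\]")
AUTHOR_YEAR = re.compile(r"\[([A-Z][A-Za-z\-]+)(?: et al\.)? (\d{4})\]")


def check_citations(text, papers):
    """Classify the citations found in `text`.

    `papers` maps a paper-list index (1 for [P1], ...) to a dict with
    "author", "year", and "abstract" (None if retrieval failed).
    """
    # A [Pn] reference fails if n is not in the list or has no abstract.
    unmatched = sorted({
        n for n in map(int, P_REF.findall(text))
        if not (papers.get(n) or {}).get("abstract")
    })
    # An [Author YEAR] citation is accepted only if it matches a listed paper.
    known = {(p["author"], p["year"]) for p in papers.values()}
    fabricated = sorted({
        f"{author} {year}"
        for author, year in AUTHOR_YEAR.findall(text)
        if (author, int(year)) not in known
    })
    return {
        "unmatched_refs": unmatched,
        "fabricated": fabricated,
        "passed": not unmatched and not fabricated,
    }


# Illustrative input, not real data.
papers = {
    1: {"author": "Smith", "year": 2021, "abstract": "We study agent markets."},
    2: {"author": "Lee", "year": 2020, "abstract": None},
}
text = "Agents trade [P1] and [P3]; see [Smith et al. 2021], [Jones 2019], and [P2]."
result = check_citations(text, papers)
```

Here `[P3]` is missing from the list, `[P2]` has no retrievable abstract, and `[Jones 2019]` matches no listed paper, so the submission would not pass this step.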

Rating quality of evidence

Strong

Is the underlying evidence assessed for strength, not just presence?

The Empirical Check evaluates not whether sources are cited but whether they actually support the claims made. Citations that overstate or misrepresent what the cited abstract says are flagged; specific factual claims contradicted by the source are hard-failed. We assess evidence strength qualitatively per claim, not in aggregate.

Transparent presentation

Strong

Are the criteria, the process, and the per-result reasoning public and inspectable?

The full rubric lives at /rankings/scoring. The tier ladder lives at /empiricas. Every scored submission gets a per-check breakdown (logic, empirical, depth) shown on its status page and emailed to the submitter, naming specific claims, citations, and reasoning steps that lowered the score. Nothing about the process is private.

Explicit values & consistency

Strong

Are the value judgements embedded in the rating made explicit? Is the same standard applied across cases?

The rubric encodes specific value judgements (citation rigour over rhetorical force, falsifiability over confidence, qualified hedging over bold claim) and those judgements are spelled out in the public scoring page. The rubric is the same for every submission within a content type; the only branching is academic-note vs industry-publication, with different pass floors per check, both publicly documented.
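One way to picture the single branch point is a table of per-check floors keyed by content type. The sketch below is illustrative only: the floor values are placeholders rather than the published ones, and the rule that every check must clear its floor is an assumption about how the final decision is made.

```python
# PLACEHOLDER values for illustration; the real floors are documented on the
# public scoring page and are not reproduced here.
PASS_FLOORS = {
    "academic-note": {"logic": 70, "empirical": 75, "depth": 65},
    "industry-publication": {"logic": 65, "empirical": 70, "depth": 60},
}


def decide(content_type, scores):
    """Apply the floors for one content type to a set of per-check scores.

    Assumes (hypothetically) that a submission passes only when every check
    meets or exceeds its floor.
    """
    floors = PASS_FLOORS[content_type]
    failed = [check for check in floors if scores[check] < floors[check]]
    return {"passed": not failed, "failed_checks": failed}


# Illustrative scores, not real results.
outcome = decide("academic-note", {"logic": 80, "empirical": 60, "depth": 70})
```

Keeping the floors in one table means the content type is the only input that changes the standard, which matches the consistency claim above.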

Strength of recommendation

Working

Beyond a score, is there guidance on what to do with the rated work — where it applies, what to read next, where it sits in the field?

Today the per-rubric breakdown tells a reader where a paper's strengths are (e.g. strong empirical, weaker depth). What's missing is a recommendation surface ("given this paper, here are three adjacent pieces that build on it"), which is currently in design as the next feature. Guideline Paths and the recommendation engine are the deliberate roadmap here.

Optimal presentation

Partial

Are the results accessible to a non-specialist reader? Is jargon kept down? Are the most useful signals up front?

Two layers: the precise 0–100 score for readers who want the underlying number, and the Empirica's tier (zero to three) for readers scanning at speed. Per-rubric breakdowns use plain language naming specific claims, not jargon. Where coverage is thin in a field, the breakdown is the primary signal — comparative anchors ("this paper scored 87; the median in this domain scored 73") are on the roadmap.

Limits

What we won't claim

Universal field coverage. We've scored hundreds of papers in agent economy, applied AI, and quantitative strategy. Coverage in clinical medicine, pure physics, organic chemistry, history, and most of the humanities is essentially zero. The Empirica's in those fields would be a brand promise we haven't earned.

Replacement for peer review. Our pipeline catches what an LLM-with-rubric can catch: hallucinated citations, logical inconsistencies, shallow synthesis. It doesn't catch what a domain expert with twenty years of tacit knowledge catches. The two are complements, not substitutes.

Predictive of citation count today. We started a calibration corpus on 25 May 2026, taking daily snapshots of every scored output, so we can later verify (or fail to verify) that high Empirica's scores correlate with adoption, citation, and downstream use. Until that data accumulates, we can't make the claim, so we don't.
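Once the corpus has accumulated, one plausible check is a rank correlation between scores and later citation counts. The sketch below computes Spearman's coefficient by hand; the choice of Spearman, the function names, and the sample numbers are all assumptions for illustration, not the calibration method in use.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Made-up numbers: scores at publication and citation counts some time later.
scores = [87, 73, 60, 91]
citations = [12, 5, 1, 30]
rho = spearman(scores, citations)
```

A coefficient near 1 would support the predictive claim, a value near 0 would not, and either result would only be meaningful on a much larger sample than this toy one.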
