How Does Turnitin Check Against Published Papers?

When your university runs your essay through Turnitin, the software does not just check it against websites and other student submissions. It also compares your text against a database of tens of millions of published academic papers, journal articles, and books, many of which are behind paywalls you cannot access yourself. Understanding how this works helps explain why a paraphrase from a journal article can still get flagged, and why a 0% similarity score does not guarantee that your use of sources is beyond question.

The published paper database

Turnitin's database of academic literature is built through a programme called Similarity Check, run in partnership with Crossref — the not-for-profit organisation that assigns DOIs (Digital Object Identifiers) to academic publications. Through this programme, academic publishers contribute their content directly into Turnitin's database in exchange for access to the checking service at reduced cost.

The scale is significant: the Similarity Check database currently contains over 78 million full-text scholarly content items. These are not just titles and abstracts — publishers provide the complete text of their articles, which Turnitin indexes and makes searchable for similarity comparison. This means Turnitin can match your submission against the actual body of a journal article, not just its title or metadata.

The database grows continuously. As more publishers join the Similarity Check programme, their entire catalogue — current publications and historical archives — gets added. Turnitin's Content Intake System accesses Crossref's metadata daily, follows the full-text URLs provided by publishers in that metadata, and indexes the content it finds. The result is a database that is both large and regularly updated.
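As a rough illustration of that intake step, the sketch below picks out the full-text links a publisher has marked for similarity checking in a Crossref-style metadata record. It assumes the record follows the shape of Crossref's public works metadata, where each entry in `link` carries a `URL` and an `intended-application` value. The sample record, its DOI, and the function name `similarity_check_urls` are hypothetical, and this is not Turnitin's actual Content Intake System.

```python
from typing import Any


def similarity_check_urls(record: dict[str, Any]) -> list[str]:
    """Return the full-text URLs flagged for similarity checking in one record."""
    urls = []
    for link in record.get("link", []):
        # Publishers can label each full-text link with its intended use.
        if link.get("intended-application") == "similarity-checking":
            url = link.get("URL")
            if url:
                urls.append(url)
    return urls


# Hypothetical record, trimmed to the fields this sketch reads.
record = {
    "DOI": "10.5555/example",
    "link": [
        {
            "URL": "https://publisher.example/fulltext/example.pdf",
            "content-type": "application/pdf",
            "intended-application": "similarity-checking",
        },
        {
            "URL": "https://publisher.example/tdm/example.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
    ],
}

print(similarity_check_urls(record))
```

Only the first link is returned, because the second is labelled for text mining rather than similarity checking.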

Which publishers and journals are included

The Similarity Check programme includes major academic publishers across virtually every discipline. Participants include large commercial publishers such as Elsevier, Wiley, Springer Nature, Taylor & Francis, and SAGE, as well as university presses, learned societies, and independent scholarly publishers. Because the programme is run through Crossref, its content comes from publishers that register DOIs for their work, a group that accounts for the vast majority of peer-reviewed academic output.

Beyond the Similarity Check corpus, Turnitin also checks against:

Open access repositories. Content available publicly — including preprint servers like arXiv, bioRxiv, and SSRN — is indexed as part of Turnitin's broader web crawl, since this content is freely accessible online.
Books and book chapters. Where publishers have contributed book content through the programme, these are included in the comparison database alongside journal articles.
Institutional repositories. Theses, dissertations, and working papers deposited in university repositories and indexed on the web are also captured.
Previously submitted student papers. Turnitin's own repository of over 1.6 billion student submissions is checked in parallel. It is a separate database from the published literature.

Full text — not just abstracts

This is the point that surprises most students. When people think of accessing journal articles, they think of the paywall experience: you can see the title, the authors, and perhaps the abstract for free, but the full article requires a subscription. It is natural to assume Turnitin faces the same barrier.

It does not. Publishers who participate in Similarity Check grant Turnitin read-only access to the full text of their articles specifically for comparison purposes. This access is part of the licensing arrangement. The article may be behind a paywall for general readers, but Turnitin's matching engine has already indexed the entire text and can compare your submission against it. A paragraph you paraphrased from page 7 of a subscription-only journal article is as visible to Turnitin as a paragraph copied from a public webpage.

How the matching works

Turnitin's matching against published papers works the same way as its matching against any other source — through a combination of string matching and semantic analysis. The system breaks your submitted text into segments and compares them against the indexed content in its database, identifying overlaps above a minimum word threshold.
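To make the segment-and-compare idea concrete, here is a minimal sketch of the string-matching half only. It splits text into overlapping runs of `n` words and reports what fraction of the submission's runs also appear in a source. The five-word window, the function names, and the sample sentences are all assumptions chosen for illustration. Turnitin's real algorithm and threshold are not public, and this sketch does no semantic analysis.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return every run of n consecutive words in the text."""
    w = words(text)
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}


def similarity(submission: str, source: str, n: int = 5) -> float:
    """Fraction of the submission's n-word runs that also occur in the source."""
    sub = shingles(submission, n)
    if not sub:
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)


source = ("The rapid growth of urban populations has placed considerable "
          "strain on public transport networks.")
partial = ("Researchers note that the rapid growth of urban populations "
           "has strained buses.")
rewritten = ("Public transport systems are under heavy pressure because "
             "cities have grown so quickly.")

print(similarity(source, source))     # 1.0
print(similarity(partial, source))    # 0.375
print(similarity(rewritten, source))  # 0.0
```

The verbatim copy matches completely, the partly reused sentence matches on three of its eight five-word runs, and the fully reworded sentence shares no run of five words with the source even though it conveys the same idea.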

Verbatim matches — passages copied directly from a source — are the most straightforward to detect. Turnitin flags these regardless of whether the source is a website, a student paper, or a journal article behind a paywall.

Paraphrased content is more nuanced. Turnitin uses semantic analysis alongside string matching, which means it can detect content where the wording has been changed but the underlying argument, structure, and ideas closely mirror a source. A paraphrase that changes most of the words while keeping the same sentence structure and sequence of ideas may still generate a partial match. The degree to which this occurs depends on how thoroughly the source text has been rewritten and how distinctive the original phrasing was.

When a match is found against a published paper, the Similarity Report identifies the source — typically showing the journal name, article title, and publication details. Instructors can click through to see the specific matching passages highlighted side by side with the source text.

What Turnitin cannot catch from published papers

Despite the size of its database, Turnitin does not index every published paper ever written. A few important gaps exist:

Publishers not in the Similarity Check programme. If a publisher has not contributed their content to Crossref's programme and their articles are not publicly accessible online, Turnitin cannot check against them. Smaller publishers, some regional journals, and older historical publications may not be represented.
Content not yet indexed. Very recently published articles may not yet have been indexed by Turnitin's Content Intake System. There is typically a lag between publication and indexing.
Truly original paraphrasing. If a student reads a source, fully understands it, and reconstructs the argument from scratch in their own words and structure, Turnitin is unlikely to flag it as a match — even if the underlying ideas came from that source. This is what proper academic writing looks like, and the similarity score is not designed to penalise it.

A 0% similarity score does not mean a paper contains no borrowed ideas — it means no borrowed text patterns were detected. Plagiarism of ideas without textual overlap is an academic integrity concern that falls outside what Turnitin can detect. As University College London notes in its Turnitin guidance, Turnitin has a large database of sources but does not — and cannot — include everything ever written.

What this means for your citations

The practical implication for students is straightforward: cite everything, regardless of whether you think Turnitin can see the source. Turnitin's database covers far more published literature than most students realise, and paywalled content is not a shield. A properly cited quote or paraphrase from a journal article will appear in the similarity report but will be immediately recognisable to your instructor as attributed material. An uncited passage from the same article looks identical in the report but carries a very different implication.

The exclusion filters available in Turnitin's report (exclude bibliography, exclude quotes) exist specifically so instructors can distinguish between acknowledged source material and potential problems. Proper citation is not just an academic convention; it is what determines how a match in the similarity report is interpreted. For a detailed look at how scores are interpreted, our guide on understanding your Turnitin similarity score explains every percentage band and what drives it.
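The effect of those two filters can be approximated with a short sketch: remove quoted passages and cut off everything from the reference list onward, so that only the student's own prose is left for comparison. The heading names, the quote patterns, and both function names are assumptions made for this example. Turnitin's own filters are configured in the report interface and may follow different rules.

```python
import re


def exclude_quotes(text: str) -> str:
    """Blank out passages enclosed in straight or curly double quotes."""
    return re.sub(r'"[^"]*"|“[^”]*”', " ", text)


def exclude_bibliography(text: str) -> str:
    """Drop everything from a standalone reference-list heading onward."""
    heading = re.search(
        r"^[ \t]*(references|bibliography|works cited)[ \t]*$",
        text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return text[:heading.start()] if heading else text


# Hypothetical essay fragment with one quotation and a short reference list.
essay = (
    'Smith argues that "cities grow faster than their transit systems" '
    "in most regions.\n"
    "My own view differs.\n"
    "References\n"
    "Smith, J. (2020). Urban Transit. Example Press.\n"
)

filtered = exclude_bibliography(exclude_quotes(essay))
print(filtered)
```

After filtering, the quoted phrase and the reference entry are gone, while the student's own sentences remain available for matching.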

Frequently asked questions

Can Turnitin see journal articles that are behind a paywall?

Yes. Publishers who participate in Crossref's Similarity Check programme grant Turnitin read-only access to the full text of their articles for comparison purposes. The paywall applies to general readers, not to Turnitin's indexing system. A passage from a subscription-only journal article is fully visible to Turnitin's matching engine.

How many published papers does Turnitin check against?

The Similarity Check database — which powers Turnitin's academic literature checking — contains over 78 million full-text scholarly content items. This includes journal articles, books, and other DOI-assigned academic content contributed by publishers through the Crossref Similarity Check programme. The database is indexed and updated daily.

Does Turnitin detect paraphrasing from journal articles?

Partially. Turnitin uses both string matching and semantic analysis, which means closely paraphrased content — where sentence structure and argument sequence mirror the source — can still generate a partial match. Thoroughly rewritten content that reconstructs the argument from scratch in genuinely original phrasing is much less likely to be flagged. The degree of detection depends on how distinctively the source was written and how substantially the text was rewritten.

Are preprints like arXiv and bioRxiv in Turnitin's database?

Yes. Preprint servers like arXiv, bioRxiv, and SSRN make their content publicly accessible online, which means it is captured through Turnitin's general web indexing. Preprints do not need to be part of the Similarity Check programme to be indexed — their public accessibility is sufficient for Turnitin to include them in its comparisons.