You've lost hours extracting data from invoice PDFs—copying totals, line items, and client details your application needs but can't access programmatically. PDFs lock structured data behind a format built for printing, not parsing. Your code can't natively read what your eyes see.
PDF parsing libraries solve this by converting static documents into extractable text, metadata, and structured content. The challenge: each library excels at different tasks.
Some handle straightforward text extraction efficiently. Others preserve coordinates and layout for structured data extraction. A few prioritize streaming architectures for memory-constrained environments.
This guide compares seven PDF parsing libraries for Node.js, examining their capabilities, trade-offs, and practical use cases.
In brief:
- Match parser to document type - simple text extraction versus coordinate-aware parsing for invoices, forms, and structured data
- Seven Node.js parsing libraries compared - pdf-parse, pdfjs-dist, pdf2json, pdfreader, unpdf, pdf.js-extract, and pdf-text-extract
- Selection criteria you can apply immediately - deployment constraints, memory limits, bundle sizes, and system dependencies
- Integration patterns for Strapi - transform uploaded PDFs into structured content entries through lifecycle hooks and entity services
1. pdf-parse
When your goal is to pull plain text out of a PDF and move on, pdf-parse is the fastest way to get there. The library is published as a single Node.js module, so you don't have to wrestle with native binaries or external tools before you can start coding. Installation is the usual one-liner:
npm install pdf-parse
Because pdf-parse exposes a Promise-based API, you can drop it straight into an existing async workflow:
import fs from 'node:fs/promises';
import pdf from 'pdf-parse';

const dataBuffer = await fs.readFile('./contract.pdf');

const { text, numpages, info } = await pdf(dataBuffer);

// `text` → full document text
// `numpages` → page count
// `info` → metadata (author, creation date, etc.)

console.log(`Pages: ${numpages}`);
console.log(`Author: ${info.Author}`);
console.log(text.slice(0, 200)); // preview first 200 chars
That handful of lines delivers three things most parsing tasks start with: the entire text content, a reliable page count, and basic metadata.
For uploads you need to index, searchable archives you plan to build, or quick "does this file contain phrase X?" checks, that's often all you need.
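If all you need is that phrase check, it's a one-liner on top of the text value from the snippet above (the helper name here is purely illustrative):
// Illustrative helper: case-insensitive "does this file mention X?" check.
const containsPhrase = (haystack, phrase) =>
  haystack.toLowerCase().includes(phrase.toLowerCase());

console.log(containsPhrase(text, 'net 30')); // true or false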
The lightweight design has trade-offs. Benchmarks aggregated in PDF parsing overviews show that pdf-parse struggles when documents contain more complex text.
Scanned contracts or receipts that require Optical Character Recognition (OCR), multi-column scientific papers, or invoices packed with complex tables will push it past its comfort zone.
In those cases, you'll need to bolt on an OCR layer or reach for a more layout-aware library like pdf2json, or a rendering engine such as Mozilla's PDF.js.
Use pdf-parse when you control the document format and it's text-based, especially in environments like CI pipelines and serverless functions where installing extra binaries is painful, or when you want to feed results quickly into downstream text analysis or search indexing jobs.
Looking for the same "no fuss" experience in Python? PyPDF2 occupies the same spot on that ecosystem's spectrum, giving you basic extraction without external dependencies or heavyweight setup.
pdf-parse won't solve every PDF headache, but when you just need the words and need them now, it's the simplest, cleanest tool you can add to your Node.js toolbox.
2. pdfjs-dist
pdfjs-dist packages Mozilla's PDF.js engine for Node.js development. You're working with the same rendering core that powers Firefox's PDF viewer, which means you get capabilities most text-only parsers miss: precise layout coordinates, font data, image streams, and page rasterization.
This depth makes pdfjs-dist the right choice when you need reliable extraction from designer-heavy reports, marketing brochures, or any file where placement matters as much as content.
Installation is straightforward:
npm install pdfjs-dist
Once installed, you can load a document, grab metadata, and walk through each page to collect text items. The snippet below shows a minimal async workflow; adapt it to capture images or glyph positions just as easily:
// parse-pdf.js
import { readFile } from 'node:fs/promises';
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs'; // Node-friendly build

const filePath = './sample.pdf';

async function extract() {
  const data = new Uint8Array(await readFile(filePath));
  const loadingTask = getDocument({ data });
  const pdf = await loadingTask.promise;

  console.log(`Pages: ${pdf.numPages}`);
  console.log('Metadata:', await pdf.getMetadata());

  for (let pageNo = 1; pageNo <= pdf.numPages; pageNo += 1) {
    const page = await pdf.getPage(pageNo);
    const content = await page.getTextContent();

    const text = content.items
      .filter(item => 'str' in item) // keep text items, skip marked-content entries
      .map(item => item.str)
      .join(' ');
    console.log(`Page ${pageNo}: ${text.slice(0, 80)}…`);
  }
}

extract().catch(console.error);
The power comes with complexity. You work with low-level operator lists (OPS) and transform matrices, so parsing logic feels more verbose than high-level wrappers like pdf-parse.
Bundle size also jumps: even the slim build weighs around 2 MB gzipped, which impacts serverless cold starts and browser bundles. If deployment size matters, tree-shake unused components or offload parsing to a microservice.
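As a rough sketch of what working with those matrices involves, the helper below (hypothetical, operating on the getTextContent() result from the example above) reads each item's position straight out of its transform:
// item.transform is a 6-element matrix [a, b, c, d, e, f];
// the last two entries are the x/y translation in PDF user space.
function itemsWithPositions(textContent) {
  return textContent.items
    .filter(item => 'str' in item) // keep text items, skip marked content
    .map(item => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
    }));
}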
Choose pdfjs-dist when you need to:
- Build custom document viewers that replicate original layouts
- Extract embedded images or fonts for downstream processing
- Parse multi-column financial statements where position determines meaning
- Generate thumbnails or page previews on demand
If you only need raw text for search indexing, a lighter tool works better. But when precision rules, pdfjs-dist delivers—much like Python developers reach for PyMuPDF when pdfminer-six isn't enough.
Master its API once, and you own a single library capable of everything from pixel-perfect rendering to deep structural analysis, all inside your Node.js workflow.
3. pdf2json
Processing invoices requires more than extracting plain text. You need spatial context to determine if "$145.00" represents the total amount or just a line item.
pdf2json solves this by converting documents into structured JSON that preserves coordinates, font details, and styling information for each text element on the page.
This structure lets you query elements precisely instead of relying on brittle regular expressions that break when a vendor changes their layout.
Installing the library follows standard Node.js conventions:
npm install pdf2json
A minimal extraction script looks like this:
import fs from 'node:fs';
import PDFParser from 'pdf2json';

const pdfParser = new PDFParser();

pdfParser.on('pdfParser_dataError', errData => console.error(errData.parserError));

pdfParser.on('pdfParser_dataReady', pdf => {
  // The JSON retains x & y coordinates, font size, and raw text
  fs.writeFileSync('./invoice.json', JSON.stringify(pdf, null, 2));
});

pdfParser.loadPDF('./invoice.pdf');
Open the generated invoice.json and you'll find an array of pages, each containing text objects like this:
{
  "x": 72.12,
  "y": 640.28,
  "w": 47.16,
  "sw": 1,
  "clr": 0,
  "A": "left",
  "R": [{ "T": "Subtotal", "S": -1, "TS": [0, 13, 0, 0] }]
}
Those x and y values map directly to PDF coordinate space, letting you pinpoint fields the way your eyes do. Need the total amount? Filter all text objects on the last page where T matches a currency pattern and x sits inside the right-aligned column.
Because the geometry lives in the JSON, you avoid false positives that plague plain-text outputs.
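For example, a small filter over that JSON can isolate money-like values in the right-hand column of the last page. This is a sketch: it assumes a recent pdf2json release where the ready payload exposes Pages at the top level, and the minX column boundary and currency pattern are values you'd tune to your own invoices.
// Sketch: right-column currency values on the last page.
function findTotals(pdfJson, minX) {
  const lastPage = pdfJson.Pages[pdfJson.Pages.length - 1];
  const currency = /^\$?\d[\d,]*\.\d{2}$/;

  return lastPage.Texts
    .map(t => ({ x: t.x, y: t.y, text: decodeURIComponent(t.R[0].T) })) // T is URI-encoded
    .filter(t => t.x >= minX && currency.test(t.text))
    .sort((a, b) => a.y - b.y); // top-to-bottom order
}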
The trade-off: richer output means a steeper learning curve and larger payloads. For quick, single-column docs, a simpler extractor like pdf-parse gets you to "hello world" faster.
But when you're pulling tax IDs from government forms or line totals from multi-page invoices, pdf2json's coordinate fidelity pays dividends.
Think of it as a Node.js library for extracting text and positional data from PDFs; it doesn't offer the word-, line-, and rectangle-level segmentation found in Python's pdfplumber.
Reach for pdf2json when positional accuracy matters more than raw throughput: invoice automation, form ingestion, or any pipeline where numbers must match their labels every single time.
4. pdfreader
Processing a 500-page PDF on a low-memory server demands a smarter approach than loading the entire file at once. pdfreader offers a streaming, event-driven solution that processes files incrementally.
It emits one page or text item at a time while keeping the rest of the document on disk. This approach maintains constant memory usage, keeping the heap small regardless of document size.
// server/parsers/invoice.js
import { PdfReader } from 'pdfreader';

export function parseInvoice(path) {
  return new Promise((resolve, reject) => {
    const rows = {};
    new PdfReader().parseFileItems(path, (err, item) => {
      if (err) return reject(err);
      if (!item) return resolve(Object.values(rows)); // EOF
      if (item.text) { // text chunk
        (rows[item.y] = rows[item.y] || []).push(item.text); // group by row
      }
    });
  });
}

// later, inside a background job
const rows = await parseInvoice('/tmp/invoice-2024-04.pdf');
console.log(rows.map(row => row.join(' ')).join('\n'));
The callback fires for every token, ensuring memory usage never spikes—ideal for serverless functions, batch pipelines, or resource-constrained deployments.
The event-driven API requires adjustment if you're accustomed to async/await, but the trade-off becomes worthwhile when streaming hundreds of documents in parallel.
Recent benchmarks demonstrate the value of low-overhead parsers. Stream-oriented libraries consistently match or exceed competitors in speed while maintaining high accuracy across document types (F1 ≥ 0.96). pdfreader brings this same philosophy to Node.js: process what you need, discard the rest.
pdfreader fits projects with large PDFs and tight memory limits. It works well for batch processing pipelines and environments where system dependencies cannot be installed. It's pure JavaScript with no external binaries required.
For quick metadata scraping on small documents, streaming may be unnecessary; promise-based wrappers provide less code overhead. But when simpler libraries fail on large research archives, pdfreader's incremental parser continues processing one event at a time.
5. unpdf
When you need a document parser that fits a modern TypeScript stack without compromise, unpdf delivers exactly that: a lean, ESM-first library with a consistent async/await API that runs in Node.js, browsers, and serverless workers alike.
Under the hood, unpdf rides on a serverless-friendly build of Mozilla's PDF.js, so you get a battle-tested extraction engine without the heavyweight setup. Need quick text for search indexing? Pull the whole document as one merged string. Need per-page output, metadata, or embedded images? The same API exposes those too, so changing approaches never forces a complete rewrite.
Installation follows the familiar Node.js pattern:
npm install unpdf
The API stays clean and direct:
import { readFile } from 'node:fs/promises';
import { extractText, getDocumentProxy } from 'unpdf';

const buffer = await readFile('./invoice.pdf');
const pdf = await getDocumentProxy(new Uint8Array(buffer));

const { totalPages, text } = await extractText(pdf, { mergePages: true });
console.log(`Pages: ${totalPages}`);
console.log(text.slice(0, 200)); // preview the merged text
That simplicity masks a flexible architecture. Because the document proxy is a real PDF.js object, you can drop down to page-level access, metadata, or image extraction when plain text isn't enough, while staying fast for digitally-born PDFs. Scanned pages still need a separate OCR layer, such as Tesseract, in front of the parser.
Because unpdf is newer, it lacks the decade-plus of community knowledge around PDF.js or PDFBox. The benefit is velocity: the project iterates quickly and prioritizes TypeScript support, ESM modules, and developer experience.
Choose unpdf when you value modern APIs and a lightweight, runtime-agnostic setup over battle-tested legacy packages. If you need maximum maturity today, pdfjs-dist remains the safer choice.
unpdf is a modern parser built for TypeScript projects: async/await APIs you already know, ESM-first packaging, and the full PDF.js engine underneath when you need more than plain text.
6. pdf.js-extract
pdf.js-extract wraps Mozilla's PDF.js, the same engine rendering PDFs in Firefox, into a Node.js-friendly API.
Unlike basic text extractors, it captures each glyph's position, font, and styling alongside the raw text.
This spatial context lets you target specific data locations instead of running fragile regex patterns on text blobs.
Here's how to extract text with coordinates from a single page:
import { PDFExtract } from 'pdf.js-extract';
const pdfExtract = new PDFExtract();
const options = { firstPage: 1, lastPage: 1 }; // narrow scope if you like

pdfExtract.extract('invoices/2024-02-12.pdf', options).then((data) => {
  data.pages[0].content.forEach((item) => {
    console.log(
      `${item.str.padEnd(25)} x:${item.x.toFixed(1)} y:${item.y.toFixed(1)}`
    );
  });
});
Every text element includes x, y, width, and height properties. You can write simple predicates like "grab the number in the total column" instead of training ML models or hand-coding table parsers. Your extraction logic stays deterministic and transparent.
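Here's what such a predicate could look like. It works on the data shape from the example above; the x threshold and currency regex are assumptions you'd tune per layout:
// Sketch: collect text items that sit in a right-hand column and look like money.
function findColumnAmounts(data, minX = 400) {
  const currency = /^\$?\d[\d,]*\.\d{2}$/;
  return data.pages.flatMap(page =>
    page.content.filter(item => item.x >= minX && currency.test(item.str.trim()))
  );
}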
The trade-off is size and complexity. pdf.js-extract bundles the full PDF.js runtime, making your deployment artifact significantly larger than lightweight text-only tools. If you're targeting memory-constrained environments or edge functions, measure the impact first.
Choose pdf.js-extract when layout matters: invoices, purchase orders, or any document where columns, headers, or fonts carry meaning. It hits the sweet spot between pdf-parse (too basic for structured data) and pdfjs-dist (overkill for most extraction tasks).
For Python developers, pdfminer.six serves the same role—detailed coordinate output without the complexity of a full rendering engine. pdf.js-extract brings that same balance to JavaScript.
7. pdf-text-extract
For rock-solid text extraction, pdf-text-extract wraps the proven pdftotext utility from Poppler. This approach sidesteps the parsing issues you'll encounter with pure JavaScript solutions and delivers faster throughput on complex documents.
Since pdftotext runs outside Node.js, you need both the system package and the npm wrapper:
# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y poppler-utils

# macOS (Homebrew)
brew install poppler
# project directory
npm install pdf-text-extract
The extraction API stays simple:
// extract.js
import pdfText from 'pdf-text-extract';

pdfText('contract.pdf', (err, pages) => {
  if (err) throw err;
  console.log(pages.join('\n'));
});
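If your pipeline is promise-based, a thin wrapper keeps the call style consistent. This sketch uses Node's built-in util.promisify, which works because the callback above follows the standard (err, result) signature:
// extract-async.js
import { promisify } from 'node:util';
import pdfText from 'pdf-text-extract';

const extractPages = promisify(pdfText);

const pages = await extractPages('contract.pdf'); // array of page strings
console.log(`Extracted ${pages.length} pages`);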
The library streams output page by page, handling multi-hundred-page documents without memory issues. Poppler's text layout engine manages rotated pages, unusual encodings, and ligatures that break JavaScript-only parsers.
Studies show Poppler-based tools maintain higher recall on edge-case characters than most open-source alternatives.
The trade-off? You must deploy Poppler binaries with your application. This blocks many serverless platforms and complicates Windows deployments unless you add WSL. In controlled container environments, the dependency rarely matters, and in high-volume pipelines, the performance gains justify the complexity.
Choose pdf-text-extract for problematic documents: contracts with custom fonts, files with unusual encodings, or international text that pure JavaScript libraries mangle. For simpler workloads or serverless constraints, pdf-parse offers better portability.
But when precision and speed matter most, wrapping the battle-tested pdftotext engine is hard to beat.
Questions to Ask Before You Choose the Right PDF Parsing Library
1. "Are your PDFs digital or scanned?" Digital PDFs work with any library; scanned documents require OCR tools like pdf-text-extract or unpdf with OCR strategy.
2. "Do you need text position and coordinates?" If yes, use pdf2json or pdf.js-extract. If you just need the words, pdf-parse is simpler and faster.
3. "What's your deployment memory limit?" Tight memory constraints require pdfreader's streaming architecture. Standard environments work with any library.
4. "Can you install system packages in production?" If yes, pdf-text-extract offers maximum reliability. If no, stick with pure JavaScript libraries.
5. "How many PDFs will you process daily?" Low volume favors convenience (pdf-parse, unpdf). High volume demands streaming (pdfreader) or native speed (pdf-text-extract).
6. "Do your PDFs come from controlled sources?" Consistent formats work with lightweight tools. Unpredictable documents need battle-tested parsers like pdfjs-dist.
7. "Is your codebase TypeScript-first?" TypeScript projects benefit from unpdf's excellent type definitions. JavaScript projects have more options.
Transform Parsed PDFs into API-Driven Content with Strapi
Parsing extracts data from PDFs. Strapi v5 turns that data into managed, deliverable content.
The workflow without a CMS: you parse PDFs successfully, but the extracted data sits in temporary storage or gets written to files.
Content teams can't access it, you can't search it, and delivering it to multiple platforms requires custom API code for every endpoint.
Strapi completes the pipeline:
- Upload PDFs through the Media Library - files arrive through Strapi's built-in upload system
- Parse automatically with lifecycle hooks - extraction runs when documents are uploaded, no manual processing
- Store as structured content - extracted text, metadata, and fields become queryable Strapi entries
- Deliver through auto-generated APIs - REST and GraphQL endpoints expose parsed content to any platform
- Let content teams manage extracted data - the admin panel makes parsed PDFs editable without developer involvement
A contract arrives as PDF, your parser extracts parties and terms, Strapi stores it as structured content, and your web app queries it through GraphQL, all without building custom infrastructure.
The parsing library handles extraction; Strapi handles storage, management, and delivery.
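To make that concrete, here's a minimal sketch of the parse-on-upload step: a database lifecycle subscription on the upload plugin's file model that feeds pdf-parse output into a content type. The api::document.document content type, its fields, and the local file path handling are assumptions you'd adapt to your own project and storage provider.
// src/index.js (sketch only; content type, fields, and paths are hypothetical)
import fs from 'node:fs/promises';
import pdf from 'pdf-parse';

export default {
  register() {},

  bootstrap({ strapi }) {
    strapi.db.lifecycles.subscribe({
      models: ['plugin::upload.file'], // fires for Media Library uploads
      async afterCreate(event) {
        const file = event.result;
        if (file.mime !== 'application/pdf') return;

        // Read the uploaded file from local storage (adjust for cloud providers).
        const buffer = await fs.readFile(`./public${file.url}`);
        const { text, numpages, info } = await pdf(buffer);

        // Store the extraction as a structured, queryable entry.
        await strapi.entityService.create('api::document.document', {
          data: {
            title: info?.Title || file.name,
            pageCount: numpages,
            body: text,
          },
        });
      },
    });
  },
};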