You open a legacy project and find a folder stuffed with 3,700 .rtf files. The content team needs everything in Markdown before tomorrow's deploy to the new headless CMS, and the migration script is stuck waiting on you. Every hour this conversion drags, developers stay blocked, the marketing site stays offline, and client confidence erodes.
Manually copying and pasting between Word and a text editor isn't an option—you need a repeatable, verifiable pipeline that finishes in minutes, not days.
This guide shows you exactly how to get there: auditing what's inside those RTF files, choosing the right tools, scripting large-scale batch jobs, automating cleanup, and integrating the entire process into your CI/CD workflow.
In brief:
- RTF and Markdown have fundamental structural differences that complicate conversion, especially for tables, lists, and embedded media
- Assess your RTF content complexity first to choose the right conversion approach: command-line (Pandoc), Node.js, or Python
- Automate post-conversion cleanup to fix HTML fragments, heading inconsistencies, and structural issues
- Integrate your conversion pipeline into CI/CD workflows for ongoing maintenance and quality assurance
What Are RTF and Markdown?
Rich Text Format (RTF) is a proprietary document format that embeds formatting codes directly in text files, while Markdown is a lightweight markup language designed for plain text that can be easily converted to HTML and other formats. Understanding how each stores information will help you predict and prevent conversion headaches.
RTF packs every keystroke into a dense tangle of control words and grouped braces like {\rtf1\ansi…}. Those codes track fonts, colors, margins, even embedded OLE objects, all in the same file. Microsoft has tweaked the spec across Word releases, so two "identical" RTF files may render differently.
The format's hybrid text-plus-binary structure bloats file sizes and complicates parsing. Anything beyond plain text—images, charts, audio—gets stored in proprietary blobs with no web equivalent. RTF was built for GUI editors that can interpret that richness; outside that context, the markup becomes unreadable and verbose.
Markdown takes the opposite approach: plain text first, markup second. A heading is simply ##, bold text is **bold**, and your raw file stays readable in any editor. You can version, diff, and merge your docs just like code.
Different flavors—CommonMark for strict compliance, GitHub Flavored Markdown for issue trackers, MultiMarkdown for extended features—add optional syntax rather than hidden metadata.
Because Markdown prioritizes semantics over styling, it integrates smoothly with developer workflows: static site generators and headless CMSs treat it as a first-class content type. For automation, that plain-text nature is invaluable—you can grep, lint, or transform thousands of files without specialized parsers.
Key differences that impact RTF to Markdown conversion:
- Format philosophy: RTF embeds visual formatting while Markdown emphasizes semantic structure
- Table support: RTF supports complex tables with merged cells while Markdown uses simple pipe syntax
- Image handling: RTF embeds binary data directly while Markdown requires external file references
- List formatting: RTF allows custom bullet symbols and complex nesting that Markdown can't represent
- Styling options: RTF supports granular text styling (colors, fonts) that Markdown generally lacks
The conversion challenges surface immediately when moving between these formats. RTF tables support merged cells and nested grids, while Markdown tables are simple pipe-delimited matrices.
Converters often bail out to raw HTML or produce corrupted layouts. RTF bullets can be custom symbols with arbitrary indentation, but Markdown expects plain - or 1. markers; deeply nested lists routinely collapse during conversion.
RTF can color individual characters or embed charts, while Markdown has no spec-level answer, so that fidelity gets lost. You must extract binary streams from RTF images and rewrite them as external links. Legacy Windows-1252 or mixed encodings can garble smart quotes and symbols.
Finally, a converter that handles CommonMark perfectly might break when you switch to GitHub Flavored Markdown.
Grasping these architectural differences now prevents mysterious line breaks, mangled tables, and invisible images when you convert at scale.
Step-by-Step RTF to Markdown Conversion Workflow
The RTF to Markdown conversion process includes auditing source files for complexity, selecting the right conversion method (Pandoc, Node.js, or Python), automating post-conversion cleanup to fix structural issues, implementing comprehensive testing and verification, and integrating the complete pipeline into your CI/CD workflow for production deployment.
Step 1: Assess Your RTF Content
Before you touch a conversion script, take a hard look at what's inside your RTF archive. A quick audit now saves hours of rewrites later and keeps nasty surprises from slipping into production.
Start by classifying complexity. Open a representative slice—about five to ten percent of files, plus any outliers you already suspect are messy. As you skim the raw RTF (or preview it in a word processor), flag anything that stretches beyond plain paragraphs.
Tables with merged cells or nested grids often break during conversion because the RTF table model is far richer than Markdown's simple pipes, a gap highlighted in Pandoc's parser.
Multi-level or custom-bullet lists can mis-nest or flatten, as documented in another long-standing Pandoc ticket. Embedded images, OLE objects, or colored text signal extra extraction work, since RTF handles these features natively but Markdown can only reference or approximate them.
Legacy control words, odd encodings, or vendor-specific extensions betray the "binary-and-text hybrid" nature of RTF and may require pre-processing.
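The audit pass above can be sketched as a short script. This is a minimal sketch: the marker list is a small illustrative sample of RTF control words (`\trowd` for table rows, `\pict` for embedded images, `\object` for OLE objects), not an exhaustive grammar.

```python
import re
from collections import Counter

# A few RTF control words that signal conversion risk (illustrative, not exhaustive)
MARKERS = {"tables": rb"\\trowd", "images": rb"\\pict", "ole_objects": rb"\\object"}

def audit(raw: bytes) -> Counter:
    """Count risk markers in one RTF file's raw bytes."""
    return Counter({name: len(re.findall(pat, raw)) for name, pat in MARKERS.items()})

sample = rb"{\rtf1\ansi \trowd\cell Hello \pict 0102abcd}"
print(dict(audit(sample)))  # {'tables': 1, 'images': 1, 'ole_objects': 0}
```

Run this over your whole archive (for example with `Path("rtf").glob("*.rtf")`) and sort by counts to find the files most likely to need manual attention.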
Next, map what you find to a Markdown flavor. If you need fenced code blocks, task lists, or tables that accept inline HTML, GitHub Flavored Markdown (GFM) is the safer bet. For strict doc sites where portability matters more than features, CommonMark keeps things lean. MultiMarkdown helps when you rely on footnotes or citations.
Finally, decide your conversion path. Minimal formatting with few tables means web converters work fine. Moderate complexity with images and basic tables means batch Pandoc commands cover most cases.
Heavy tables, deep lists, or custom styling means leaning on programmatic pipelines in Node.js or Python, where you can insert regex fixes and fall back to HTML when Markdown gives up. Document these findings—they become the spec for every following step in the migration.
Step 2: Choose and Execute Your Conversion Method
Choose the conversion method that fits your stack—command-line for speed, JavaScript for build integration, or Python for scalability and data processing.
Before you dive into code, decide how deeply you want to integrate the conversion into your tooling. You can stick with a CLI approach, wire the process into a JavaScript build step, or treat the task like any other data pipeline in Python. The three options below map to those mindsets—pick the one that matches your stack and scale requirements.
Option A: Using Pandoc (Command-Line)
Pandoc offers the fastest path to conversion with minimal setup, handling thousands of files per hour while preserving tables and extracting images.
Pandoc works well when you need thousands of files converted quickly. The binary handles RTF parsing, Markdown rendering, and image extraction in one pass, which cuts out a lot of glue code. Because it's a single command, you can run it locally, inside Docker, or on any CI agent that gives you shell access.
```shell
# macOS (Homebrew) or Debian-based Linux
brew install pandoc   # or: sudo apt-get install pandoc
```

The two flags you'll use most often are --wrap=none (keep long lines intact for cleaner diffs) and --extract-media (dump embedded images into a folder you choose). A typical one-off conversion looks like this:
```shell
pandoc input.rtf \
  --from=rtf \
  --to=markdown \
  --wrap=none \
  --extract-media=./media \
  --output=output.md
```

For batch work, wrap the call in a script so you can watch logs scroll instead of hammering the up-arrow key.
```bash
#!/usr/bin/env bash
set -euo pipefail

SRC_DIR="rtf"
DST_DIR="md"
LOG="conversion.log"
mkdir -p "$DST_DIR" media

for file in "$SRC_DIR"/*.rtf; do
  base=$(basename "$file" .rtf)
  if ! pandoc "$file" -f rtf -t markdown \
      --wrap=none \
      --extract-media=media/"$base" \
      -o "$DST_DIR/$base.md" 2>>"$LOG"; then
    echo "❌ Failed: $file" >>"$LOG"
  fi
done
```

This approach routinely processes thousands of documents per hour on a modern CPU, and it handles complex tables better than most alternatives—though you'll still want the post-conversion cleanup step for edge cases.
Option B: Node.js Programmatic Conversion
JavaScript-based conversion fits seamlessly into existing frontend build chains, offering greater control over the transformation pipeline and error handling.
If your build chain already lives in npm scripts or webpack, stay in JavaScript. The common pattern is to convert RTF to HTML first, then run that HTML through a Markdown converter such as Turndown. Because you control the whole pipeline, you can patch odd tags, add custom Markdown rules, or stream progress into your existing logging setup.
```javascript
// convert.js — requires "type": "module" in package.json (top-level await)
import fs from 'node:fs/promises';
import { execFile } from 'node:child_process';
import TurndownService from 'turndown'; // npm install turndown

const src = 'docs/rtf';
const dst = 'docs/md';
await fs.mkdir(dst, { recursive: true });
const td = new TurndownService({ headingStyle: 'atx' });

for (const file of await fs.readdir(src)) {
  if (!file.endsWith('.rtf')) continue;

  const rtfPath = `${src}/${file}`;
  // textutil ships with macOS; on Linux, substitute another RTF-to-HTML
  // converter such as LibreOffice (soffice --convert-to html)
  const html = await new Promise((resolve, reject) =>
    execFile('textutil', ['-convert', 'html', '-stdout', rtfPath],
      (err, stdout) => (err ? reject(err) : resolve(stdout)))
  );

  const md = td.turndown(html);
  const outPath = `${dst}/${file.replace(/\.rtf$/, '.md')}`;
  await fs.writeFile(outPath, md);
  console.log(`✅ ${outPath}`);
}
```

You can call this script from package.json like any other build step:
```json
{
  "scripts": {
    "convert:rtf": "node convert.js"
  }
}
```

Because everything is JavaScript, you can decorate the pipeline with progress bars or enqueue jobs for parallel execution. The flip side: malformed RTF can crash the HTML stage, so keep a fallback plan—dropping the file into pandoc or quarantining it for manual review—to avoid blocking the entire run.
Option C: Python Programmatic Conversion
Python excels at large-scale batch conversion with parallel processing, character encoding normalization, and integration with data science workflows.
Python works well when content migration feels more like data engineering. You get tight integration with scientific libraries, multiprocessing for speed, and mature text-processing tooling.
The two libraries that matter are pypandoc (a thin wrapper around the Pandoc binary) and Microsoft's markitdown for advanced document conversion pipelines. Start simple, then layer custom filters as needed.
```python
# convert.py
import concurrent.futures as cf
import subprocess
from pathlib import Path

SRC = Path("rtf")
DST = Path("md")
DST.mkdir(exist_ok=True)

def convert(path: Path) -> None:
    out = DST / f"{path.stem}.md"
    try:
        # Shell out to the pandoc binary directly; pypandoc wraps the
        # same call if you prefer a library interface
        subprocess.run(
            ["pandoc", str(path), "-f", "rtf", "-t", "markdown",
             "--wrap=none", "-o", str(out)],
            check=True,
        )
        print(f"✓ {out}")
    except subprocess.CalledProcessError:
        print(f"✗ failed: {path}")

if __name__ == "__main__":  # guard required for ProcessPoolExecutor on spawn platforms
    with cf.ProcessPoolExecutor() as pool:
        list(pool.map(convert, SRC.glob("*.rtf")))
```

This approach offers several advantages: Python lets you normalize everything to UTF-8, dodging Windows-1252 encoding issues. Multiprocessing scales across CPU cores, so you can process an archive in parallel without crafting shell loops.
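As a minimal sketch of the Windows-1252 normalization mentioned above—the byte string here is an invented example of what a legacy editor might have saved:

```python
# Bytes as a legacy Windows editor might have saved them
raw = b"caf\xe9 \x93smart quotes\x94"
text = raw.decode("cp1252")   # é and curly quotes decode correctly
utf8 = text.encode("utf-8")   # now safe to write into .md files
print(text)  # café “smart quotes”
```

Decoding with the wrong codec (or UTF-8 by default) is what produces the garbled symbols flagged earlier, so normalize before, not after, conversion.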
Because the whole job runs inside Python, you can push status into a database, enqueue follow-up tasks, or feed the Markdown straight into a static-site generator.
For teams already using notebooks or data pipelines, this option keeps the conversion close to your analytics stack while taking advantage of Pandoc's proven parser under the hood. Pick the path that best fits your workflow—you can switch later, but making a clear choice now keeps the rest of the migration far more predictable.
Step 3: Automate Post-Conversion Cleanup
Automated cleanup scripts fix common conversion problems like HTML fragments, uneven headings, and orphaned list items to ensure your Markdown files render correctly.
Your converted .md files will contain baggage—HTML fragments, uneven headings, orphaned list items. Fix this systematically with quick, repeatable passes.
Start with show-stoppers. Tables that fell back to raw HTML break site layouts. Pandoc occasionally nests body text inside <td> tags when parsing RTF cell merges. Identify and repair these blocks before touching cosmetic issues.
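That triage pass can be sketched as a few lines of Python; the directory name and tag list are assumptions to adapt to your layout:

```python
import re
from pathlib import Path

def find_html_tables(md_dir: str) -> list[Path]:
    """Return converted files where tables fell back to raw HTML."""
    return [p for p in sorted(Path(md_dir).rglob("*.md"))
            if re.search(r"<(table|tr|td)\b", p.read_text(encoding="utf-8"), re.I)]
```

Feed the resulting file list to whoever (or whatever script) repairs tables, so layout-breaking files get fixed before cosmetic passes run.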
Next, fix structural problems. Run a Markdown linter like markdownlint or remark over your directory to flag duplicate headings, mixed indentation, and rogue HTML. Add the linter to your pre-commit hook so every future change gets vetted automatically.
Automate cosmetic polish last. This one-liner removes invisible span tags that survived conversion:
```shell
# The empty '' after -i is BSD/macOS sed syntax; on GNU sed use -i alone
grep -rlE '</?span' docs/markdown | xargs sed -E -i '' 's/<\/?span[^>]*>//g'
```

Heading levels often drift during conversion. To collapse level-1 headings to level-2 for clean document hierarchy, use:
```shell
perl -pi -e 's/^# /## /' $(git ls-files '*.md')
```

After automation, spot-check a sample set visually. Render the files in VS Code or Obsidian and compare them with your original RTF. Check for broken links, missing images, and metadata corruption that generic linters miss.
Finish by pushing the cleaned files through your CI pipeline. The same scripts that rescued today's batch will guard tomorrow's edits.
Step 4: Test and Verify Your Conversions
Implement a comprehensive testing strategy combining visual inspection, automated validation, and sample verification to catch rendering issues before they reach production.
Even reliable conversion scripts can corrupt nested tables or drop image links. Before publishing thousands of .md files, build a repeatable test suite that proves every document renders correctly.
Start with a visual sweep. Open high-risk files—those with tables, lists, or images—in a Markdown viewer like VS Code or Typora. Look for empty pages, misaligned headings, or missing media. This subjective pass catches problems automation misses.
Add structured checks next. Two find commands flag files that are missing a top-level # heading or came out empty:

```shell
find docs -name '*.md' -exec grep -L '^# ' {} +   # files missing a top-level heading
find docs -name '*.md' -size 0                    # empty files
```

Automate deeper validation with markdownlint for style rules and a link checker for broken references. For tables, round-trip a sample through pandoc (md → html → md) to surface cell-merge bugs. Apply the same approach to lists to catch nesting quirks.
When your corpus is huge, random-sample 2–5% of files per content type and compare the original RTF with rendered Markdown side-by-side using a diff tool. This keeps review time manageable while uncovering systemic errors.
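The sampling step can be sketched like this; the rate and seed defaults are illustrative choices, and the fixed seed keeps the sample reproducible between review sessions:

```python
import random
from pathlib import Path

def sample_for_review(md_dir: str, rate: float = 0.03, seed: int = 42) -> list[Path]:
    """Pick a reproducible random sample of converted files to diff by hand."""
    files = sorted(Path(md_dir).rglob("*.md"))
    k = max(1, round(len(files) * rate))
    return random.Random(seed).sample(files, k)
```

Stratify by content type (one call per subdirectory, say) if some document classes are riskier than others.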
Wire every check into your CI. A GitHub Action installs Pandoc, runs validation scripts, and blocks merges when failures appear. Publishing the test report as a build artifact gives your team a single source of truth and keeps formatting glitches out of production.
Step 5: Optimize and Automate Your Pipeline
Integrate your conversion pipeline into your CI/CD workflow as version-controlled code to ensure reliability, maintainability, and consistent results across all environments.
Once you have a reliable RTF-to-Markdown workflow, automate it into your daily tooling so the conversion runs itself. Treat the pipeline like application code: test it, version it, and deploy it on every push.
Start by wrapping the commands you trust—whether that's a pandoc invocation or a custom script built with RtfPipe or MarkItDown—in a repeatable task runner. An npm script delivers the quickest win:
```json
{
  "scripts": {
    "convert": "node scripts/rtf-to-md.js"
  }
}
```

For polyglot repos, a Makefile keeps language barriers out of the conversation:
```make
# Recipe lines must be indented with tabs
convert:
	mkdir -p docs/md
	for f in docs/rtf/*.rtf; do pandoc "$$f" -f rtf -t markdown -o "docs/md/$$(basename "$$f" .rtf).md"; done
```

From here, integrate the job into CI so every merge converts, lints, and pushes clean Markdown. A minimal GitHub Actions step looks like this:
```yaml
- uses: actions/checkout@v4
- name: Install Pandoc
  run: sudo apt-get update && sudo apt-get install -y pandoc
- name: Run conversion
  run: npm run convert
```

Containerize when you need parity across machines. A slim Docker image with pandoc baked in ensures identical results on laptops, build agents, and staging servers.
Long-term sustainability comes from observability. Emit logs for each file, surface warnings when tables drop to raw HTML, and expose metrics—file count, failure rate—in your monitoring stack. Version the pipeline itself; breaking a conversion flag is a code change, not a mystery.
Document every flag and environment variable right next to the source code. Clear docs turn a one-off script into a portfolio-ready tool that you or the next developer can pick up without context-switching.
From Migration to Modern CMS: Powering Your Content with Strapi
Your converted Markdown files need a CMS that handles Markdown natively or via custom fields or plugins to avoid transformation overhead. Strapi's classic rich-text field stores Markdown, while the newer blocks editor stores structured JSON; with appropriate plugins or custom fields, raw Markdown can be stored and exposed directly. Strapi generates REST endpoints automatically (and GraphQL via its official plugin), so your content becomes immediately accessible to any frontend framework.
Implementation requires defining a Collection Type that matches your front-matter structure, then importing content through either the Admin Panel or direct API calls. You can automate this process by triggering POST requests when new .md files appear in your repository. With content delivery handled by Strapi's API layer, you can focus on building application features rather than maintaining document infrastructure.
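A hedged sketch of that trigger: build a POST against Strapi's REST entry-creation endpoint. The collection name ("articles") and field names ("title", "body") are assumptions—match them to your own Collection Type.

```python
import json
from pathlib import Path
from urllib import request

def build_import_request(md_file: Path, base_url: str, token: str) -> request.Request:
    """Build (but don't send) a POST creating one entry from a Markdown file."""
    payload = {"data": {"title": md_file.stem,                      # assumed field
                        "body": md_file.read_text(encoding="utf-8")}}  # assumed field
    return request.Request(
        f"{base_url}/api/articles",  # assumed collection type: 'articles'
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(...)`, or swap in requests/httpx inside the repository webhook that fires when new .md files land.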