Crawlboy

Crawl sitemaps to Markdown with Crawl4AI. Free on PyPI.

Open Source

About Crawlboy

Crawlboy is a Python command-line interface (CLI) tool that converts the pages listed in a website's sitemap into structured Markdown. It uses Crawl4AI to render each page and writes one Markdown file per URL, mirroring the site's path structure. This suits bulk conversion of static sites, documentation portals, and blogs into a corpus for search, Retrieval-Augmented Generation (RAG) pipelines, mirroring, or offline reading, without the need to build a custom crawler.
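To make "mirroring the site's path structure" concrete, here is a minimal Python sketch of how a page URL could map to a file under md/. It is not Crawlboy's actual code. The function name markdown_path_for, the use of index.md for directory-style URLs, and the stripping of a trailing .html are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def markdown_path_for(url: str, root: str = "md") -> str:
    """Map a page URL to a Markdown path that mirrors the URL path.

    Hypothetical helper: the real tool may name files differently.
    """
    path = urlsplit(url).path
    # Assumption: the site root and directory-style URLs become index pages.
    if path == "" or path.endswith("/"):
        path += "index"
    rel = PurePosixPath(path.lstrip("/"))
    # Assumption: drop .html/.htm so the result is not "page.html.md".
    if rel.suffix.lower() in {".html", ".htm"}:
        rel = rel.with_suffix("")
    return str(PurePosixPath(root) / (str(rel) + ".md"))


if __name__ == "__main__":
    for page in (
        "https://example.com/",
        "https://example.com/docs/intro/",
        "https://example.com/blog/post.html",
    ):
        print(page, "->", markdown_path_for(page))
```

Keeping the URL path in the file path means relative locations on the site stay recognizable on disk, which helps when the corpus is later indexed or browsed offline.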

Sitemap Discovery: Automatically detects sitemaps from robots.txt or common paths, and recursively follows nested <sitemapindex> entries.
Markdown Output: Creates one Markdown file per page, preserving the URL path structure.
Optional HTML: Can save raw HTML files alongside Markdown using the --save-html flag.
Image Handling: With the --download-images flag, downloads images and stores them under the media/ directory in content-addressed form, so identical files are saved only once.
Error Tracking: Logs crawl failures to errors.jsonl, with an option to stop immediately on the first error using --fail-fast.
Interactive Mode: Offers a guided wizard (-i or --interactive) for a more user-friendly setup, prompting for sitemap source, output directory, and options.
Output Structure: Organizes output into md/, html/ (optional), media/ (optional), and errors.jsonl.
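Two of the features above, sitemap discovery and content-addressed image storage, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Crawlboy's implementation. The function names, the injected fetch callable, and the choice of SHA-256 for file names are hypothetical. Only the robots.txt "Sitemap:" convention and the sitemaps.org XML namespace are standard.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list:
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs, recursing into nested <sitemapindex> entries.

    `fetch` is a caller-supplied function (hypothetical) that returns the
    XML text for a URL, so the sketch stays independent of any HTTP client.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemap indexes that loop
        return
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == SITEMAP_NS + "sitemapindex":
        for loc in root.iterfind(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
            yield from iter_page_urls((loc.text or "").strip(), fetch, seen)
    else:
        for loc in root.iterfind(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
            yield (loc.text or "").strip()


def media_name(data: bytes, ext: str) -> str:
    """Name a file by the hash of its bytes (assumption: SHA-256).

    Identical images produce identical names, so each is stored once.
    """
    return "media/" + hashlib.sha256(data).hexdigest() + ext
```

Passing fetch in as a parameter keeps the traversal testable with in-memory XML. In practice it would wrap whatever HTTP client the caller prefers.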

Crawlboy is open source under the MIT license and can be installed from PyPI with pip install crawlboy. It requires a one-time Playwright and Chromium setup, which is run with crawl4ai-setup.

Built with

Python

CLI

Crawl4AI

Playwright

Chromium

Docker

PyPI

Markdown

HTML

JSONL