About Crawlboy
Crawlboy is a Python command-line interface (CLI) tool that converts a website's sitemap into structured Markdown content. It uses Crawl4AI to render each page and writes one Markdown file per URL, mirroring the site's path structure. This makes it well suited to batch-converting static sites, documentation portals, and blogs into a corpus for search, Retrieval-Augmented Generation (RAG) pipelines, mirroring, or offline reading, without building a custom crawler.
- Sitemap Discovery: Automatically detects sitemaps from robots.txt or common paths, and recursively follows nested <sitemapindex> entries.
- Markdown Output: Creates one Markdown file per page, preserving the URL path structure.
- Optional HTML: Can save raw HTML files alongside Markdown using the --save-html flag.
- Image Handling: With the --download-images flag, downloads images and stores them under the media/ directory using deduplicated, content-addressed filenames.
- Error Tracking: Logs crawl failures to errors.jsonl, with an option to stop immediately on the first error using --fail-fast.
- Interactive Mode: Offers a guided wizard (-i or --interactive) for a more user-friendly setup, prompting for sitemap source, output directory, and options.
- Output Structure: Organizes output into md/, html/ (optional), media/ (optional), and errors.jsonl.
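The path mirroring and content-addressed media storage listed above can be illustrated with a minimal sketch. This is not Crawlboy's actual implementation: the function names, the index rule for directory-style URLs, the stripping of .html/.htm, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def markdown_path_for(url: str) -> PurePosixPath:
    """Map a page URL to a relative path under md/ (hypothetical rule)."""
    path = urlsplit(url).path
    # Assumption: the site root and directory-style URLs become "index" pages.
    if path == "" or path.endswith("/"):
        path += "index"
    # Drop leading slashes so the result stays relative to md/.
    relative = PurePosixPath(path.lstrip("/"))
    name = relative.name
    # Assumption: an .html/.htm extension is replaced rather than kept.
    if name.endswith((".html", ".htm")):
        name = name.rsplit(".", 1)[0]
    return PurePosixPath("md") / relative.with_name(name + ".md")


def media_path_for(content: bytes, source_url: str) -> PurePosixPath:
    """Name an image by the hash of its bytes so identical files share one path."""
    digest = hashlib.sha256(content).hexdigest()
    # Keep the original extension (lowercased) so viewers still recognize the file.
    suffix = PurePosixPath(urlsplit(source_url).path).suffix.lower()
    return PurePosixPath("media") / f"{digest}{suffix}"


if __name__ == "__main__":
    print(markdown_path_for("https://example.com/docs/intro/"))
    print(markdown_path_for("https://example.com/blog/post.html"))
    print(media_path_for(b"fake image bytes", "https://example.com/img/logo.png"))
```

Because the media filename depends only on the file's bytes and extension, the same image referenced from many pages resolves to a single file, which is what makes the storage deduplicated.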
Crawlboy is open source under the MIT license and can be installed from PyPI with pip install crawlboy. Before the first crawl, it requires a one-time setup of Playwright and Chromium, which is done by running crawl4ai-setup.
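Once a crawl has finished, the output directory can be consumed with the standard library alone. The sketch below is one possible way to load the md/ tree into memory and read errors.jsonl. It assumes the Markdown files end in .md and that each line of errors.jsonl is one JSON object (the usual JSON Lines convention). The function names are hypothetical, and no particular fields are assumed in the error records, since the description above does not list them.

```python
import json
from pathlib import Path


def load_markdown_corpus(output_dir: str) -> dict[str, str]:
    """Return {relative path: Markdown text} for every file under md/."""
    md_root = Path(output_dir) / "md"
    corpus = {}
    for md_file in sorted(md_root.rglob("*.md")):
        # Key by the path relative to md/, which mirrors the site's URL paths.
        key = md_file.relative_to(md_root).as_posix()
        corpus[key] = md_file.read_text(encoding="utf-8")
    return corpus


def read_error_log(output_dir: str) -> list[dict]:
    """Parse errors.jsonl, assuming one JSON object per non-empty line."""
    log_path = Path(output_dir) / "errors.jsonl"
    if not log_path.exists():
        return []  # No log file is treated here as "no recorded failures".
    records = []
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

A corpus loaded this way can be handed to a chunker or indexer for search or RAG, and the parsed error records give a quick count of pages that failed to crawl.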