domdistill is a Python library designed for intent-driven semantic chunk distillation of DOM/HTML content. It intelligently splits HTML into sections based on headings, scores merged text chunks by their relevance to a given query and local headings, and employs dynamic programming to select the most pertinent set of chunks.
Cuts out the noise and keeps only what's necessary.
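To make the heading-based splitting step concrete, here is a minimal sketch that uses only the standard library's html.parser. It is not domdistill's own code: the names HeadingSplitter and split_by_headings are hypothetical, and the library's real splitter may treat nesting, merging, and non-content tags differently.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSplitter(HTMLParser):
    """Hypothetical helper: collects (heading, body_text) sections."""

    def __init__(self):
        super().__init__()
        self.sections = []        # finished (heading, body) pairs
        self._heading = ""        # heading of the section being built
        self._body = []           # text fragments of the current section
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()         # a new heading closes the previous section
            self._heading = ""
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            self._heading = (self._heading + " " + text).strip()
        else:
            self._body.append(text)

    def _flush(self):
        # Store the current section only if it has a heading or some text.
        if self._heading or self._body:
            self.sections.append((self._heading, " ".join(self._body)))
        self._body = []

    def close(self):
        super().close()           # lets the parser emit any buffered text
        self._flush()
        self._heading = ""


def split_by_headings(html):
    """Return a list of (heading, body_text) tuples in document order."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Text that appears before the first heading ends up in a section with an empty heading, so nothing is dropped at this stage.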
Key Features:
Splits HTML into heading-aware sections.
Scores text chunks based on query and heading relevance.
Uses dynamic programming for optimal chunk selection.
Provides both a high-level API (HTMLIntentChunker) and discrete building blocks for custom pipelines.
Supports custom embedder injection for flexible model integration.
Includes benchmarking tools for performance analysis.
Offers detailed guidance on concurrency tuning for optimal performance.
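The description does not spell out what the dynamic programming step optimizes. One plausible formulation, assumed here purely for illustration, is a 0/1 knapsack: each chunk has a relevance score and an integer cost (for example a token count), and the goal is the highest-scoring subset that fits a budget. The function name select_chunks and its parameters are hypothetical, not part of domdistill's API.

```python
def select_chunks(scores, costs, budget):
    """Pick chunk indices maximizing total score with total cost <= budget.

    Assumed formulation (0/1 knapsack); costs and budget are
    non-negative integers, scores are floats.
    """
    n = len(scores)
    # best[i][b] = best total score using the first i chunks with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score, cost = scores[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]              # option 1: skip the chunk
            if cost <= b:
                take = best[i - 1][b - cost] + score
                if take > best[i][b]:
                    best[i][b] = take                # option 2: take the chunk

    # Walk back through the table to recover which chunks were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen)
```

The table has (number of chunks + 1) x (budget + 1) entries, so time and memory grow linearly with the budget; a coarser cost unit keeps it small for long pages.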
The library is built with Python and leverages technologies like numpy for numerical operations and SentenceTransformer for embeddings. It is suitable for developers looking to extract meaningful information from web content efficiently.
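As a rough illustration of how query and heading relevance could be combined with an injectable embedder, the sketch below blends two cosine similarities. Everything in it is an assumption made for the example: the Embedder signature, the toy_embed stand-in, the score_chunk function, and the heading_weight value of 0.3 are hypothetical and are not taken from domdistill. It uses plain Python lists so that it runs without numpy or a model download.

```python
import math
from typing import Callable, Sequence

# Hypothetical embedder signature: text in, dense vector out.
Embedder = Callable[[str], Sequence[float]]


def toy_embed(text: str, dim: int = 32) -> list:
    """Stand-in embedder: bucket each word by the sum of its character codes."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_chunk(query, heading, chunk,
                embed: Embedder = toy_embed,
                heading_weight: float = 0.3) -> float:
    """Blend chunk-to-query and chunk-to-heading similarity into one score."""
    chunk_vec = embed(chunk)
    query_sim = cosine(embed(query), chunk_vec)
    heading_sim = cosine(embed(heading), chunk_vec)
    return (1.0 - heading_weight) * query_sim + heading_weight * heading_sim
```

Any callable with the same shape can be passed as embed. For instance, a thin wrapper around a SentenceTransformer model's encode method should fit, since it also maps a string to a vector.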
Built with