Comprehensive Data: Includes 180 English source segments and 1,080 segment-level measurement rows, an average of six rows per source segment.
Focus on Practicality: Uses internally authored, instructional text to mirror real-world localization challenges.
Key Metrics: Features `expansion_ratio` (translated character count / source character count) and `translated_chars_per_second` for readability assessment.
Detailed Documentation: Provides CSV files for `subtitle_expansion_ratios`, `source_segment_catalog`, `language_summary`, `language_profile`, and `DATA_DICTIONARY`, along with a JSON schema and license information.
Clear Provenance: Linked to the `AI Translate Video` project, offering context on the research and operational needs driving the dataset's creation.
Flexible Use: Suitable for estimating subtitle line growth, stress-testing layouts, planning QA, and comparing caption density for dubbing and lip-sync.
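As a rough illustration of the two key metrics, the sketch below computes them for a single segment. The `expansion_ratio` formula follows the definition given above. The formula for `translated_chars_per_second` (translated character count divided by on-screen duration in seconds) is an assumption, as are the function names, the `duration_seconds` parameter, and the choice to count every character including spaces and punctuation. Check the data dictionary for the dataset's actual counting rules.

```python
def expansion_ratio(source_text: str, translated_text: str) -> float:
    """Translated character count divided by source character count."""
    if not source_text:
        raise ValueError("source_text must be non-empty")
    # len() counts Unicode code points, including spaces and punctuation
    # (an assumption; the dataset may count characters differently).
    return len(translated_text) / len(source_text)


def translated_chars_per_second(translated_text: str, duration_seconds: float) -> float:
    """Assumed definition: translated characters per second of display time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(translated_text) / duration_seconds


if __name__ == "__main__":
    # Hypothetical segment, not taken from the dataset.
    source = "Save your changes."       # 18 characters
    translated = "Guarde sus cambios."  # 19 characters
    print(round(expansion_ratio(source, translated), 3))   # 1.056
    print(translated_chars_per_second(translated, 2.0))    # 9.5
```

A ratio above 1.0 means the translation is longer than the source, so it may need extra lines or a longer display time to stay readable.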
This dataset is intended as a structured reference for planning, QA, and benchmarking in multilingual publishing workflows. It helps users understand the impact of subtitle expansion on readability, timing, and localization costs.