subtitle-expansion-ratios-by-language

Subtitle expansion ratios by language dataset.

Open Source

Analytics Tool • Productivity

This repository provides a structured dataset for analyzing subtitle expansion ratios across languages. It is designed for research, quality assurance, and subtitle layout optimization for YouTube, TikTok, and webinar video. The dataset measures how subtitle length changes when English video scripts are translated into six target languages: Spanish, German, French, Portuguese, Chinese, and Japanese.

Key Features:

Comprehensive Data: Includes 180 English source segments and 1,080 segment-level measurement rows (180 segments × 6 target languages).
Focus on Practicality: Uses internally authored instructional text to mirror real-world localization challenges.
Key Metrics: Features `expansion_ratio` (translated character count / source character count) and `translated_chars_per_second` for readability assessment.
Detailed Documentation: Provides CSV files for `subtitle_expansion_ratios`, `source_segment_catalog`, `language_summary`, `language_profile`, and `DATA_DICTIONARY`, along with a JSON schema and license information.
Clear Provenance: Linked to the `AI Translate Video` project, offering context on the research and operational needs driving the dataset's creation.
Flexible Use: Suitable for estimating subtitle line growth, stress-testing layouts, planning QA, and comparing caption density for dubbing and lip-sync.
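
As a rough illustration of the two key metrics above, here is a minimal Python sketch. `expansion_ratio` follows the definition given in the feature list (translated character count / source character count). The formula for `translated_chars_per_second` is an assumption (translated character count divided by on-screen duration in seconds), as is counting characters with Python's `len`, which counts Unicode code points. The example strings and the 2-second duration are made up for illustration and are not rows from the dataset.

```python
def expansion_ratio(source_text: str, translated_text: str) -> float:
    """Translated character count divided by source character count."""
    if not source_text:
        raise ValueError("source_text must not be empty")
    # Assumption: characters are counted as Unicode code points via len().
    return len(translated_text) / len(source_text)


def translated_chars_per_second(translated_text: str, duration_seconds: float) -> float:
    """Assumed definition: translated characters per second of display time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(translated_text) / duration_seconds


# Hypothetical segment, not taken from the dataset.
source = "Click the Export button."
translated = "Cliquez sur le bouton Exporter."
duration = 2.0  # seconds on screen (made-up value)

print(f"expansion_ratio = {expansion_ratio(source, translated):.3f}")
print(f"translated_chars_per_second = {translated_chars_per_second(translated, duration):.1f}")
```

A ratio above 1.0 means the translation is longer than the English source. The dataset's own column definitions in `DATA_DICTIONARY` take precedence over the assumptions in this sketch.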

This dataset is intended as a structured reference for planning, QA, and benchmarking in multilingual publishing workflows. It helps users understand the impact of subtitle expansion on readability, timing, and localization costs.

Built with

GitHub

CSV

JSON Schema

AI Translate Video