Innovative Software Technology-Generating Synthetic OCR Data for Right-to-Left Languages with SynthDoG-RTL

Developing robust Optical Character Recognition (OCR) models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, and Hebrew often runs into one obstacle: a scarcity of high-quality, annotated training data. SynthDoG-RTL addresses this with synthetic document generation. It is an extension of SynthDoG, the generator from the Donut project, engineered to render and process RTL text correctly. This guide walks advanced developers through using SynthDoG-RTL to create large synthetic datasets compatible with the Donut framework.

What is SynthDoG-RTL?

SynthDoG, or Synthetic Document Generator, was developed alongside the Donut model to generate training data on the fly for document understanding tasks. SynthDoG-RTL extends that concept to RTL languages. Its key additions include:

Comprehensive support for right-to-left text directionality and intelligent contextual script shaping, essential for accurate rendering of many RTL scripts.
A rich collection of sample corpora, diverse fonts, and pre-designed templates tailored for languages like Arabic, Urdu, Persian, and Hebrew.
The flexibility of custom YAML configurations, empowering users to define specific page layouts, introduce visual distortions, and apply various stylistic effects.
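To make the notion of text directionality concrete, here is a minimal sketch that estimates whether a line of corpus text is predominantly RTL, using only the Unicode bidirectional classes exposed by Python's standard library. It is not part of SynthDoG-RTL; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import unicodedata

# Unicode bidirectional classes of strongly right-to-left characters:
# "R" covers Hebrew letters, "AL" covers Arabic-script letters
# (used by Arabic, Persian, and Urdu).
RTL_CLASSES = {"R", "AL"}


def rtl_ratio(text: str) -> float:
    """Return the share of strongly directional characters that are RTL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # Keep only strong characters; digits, spaces, and punctuation are neutral or weak.
    strong = [cls for cls in classes if cls in RTL_CLASSES or cls == "L"]
    if not strong:
        return 0.0
    return sum(cls in RTL_CLASSES for cls in strong) / len(strong)


def is_mostly_rtl(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when at least `threshold` of strong characters are RTL."""
    return rtl_ratio(text) >= threshold
```

A check like this can be run over a corpus file before generation to catch lines that ended up in the wrong language file.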

Installation and Setup

To get started with SynthDoG-RTL, clone the repository and change into its directory. Then create a dedicated Conda environment and install the Python dependencies, including synthtiger. Install libraqm as well: it provides the complex text layout needed for Arabic and other RTL script shaping, and without it letters may not join or order correctly. macOS users may also need to set the OBJC_DISABLE_INITIALIZE_FORK_SAFETY environment variable (commonly to YES) to prevent multiprocessing issues.
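After installation, a quick sanity check can save a failed generation run. The sketch below assumes text rendering goes through Pillow, whose `features.check("raqm")` reports whether libraqm support is available; the helper names `raqm_available` and `generation_env` are hypothetical and not part of SynthDoG-RTL.

```python
import os
import sys


def raqm_available() -> bool:
    """Report whether the installed Pillow build can use libraqm for text layout."""
    try:
        from PIL import features
    except ImportError:
        return False  # Pillow itself is not installed
    return bool(features.check("raqm"))


def generation_env(base=None) -> dict:
    """Copy an environment mapping and add the macOS fork-safety override."""
    env = dict(os.environ if base is None else base)
    if sys.platform == "darwin":
        # Only applied on macOS; an existing value is left untouched.
        env.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")
    return env
```

The dictionary returned by `generation_env()` can be passed as the `env` argument of `subprocess.run` when launching the generator from a script. If `raqm_available()` returns False, RTL shaping is unlikely to work until libraqm is installed.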

Preparing Resources

Before generating data, you must prepare the essential resources for each target language. This includes:

Corpus: A UTF-8 encoded text file containing example text for your chosen language, placed within the resources/corpus/ directory (e.g., urdu.txt, arabic.txt).
Fonts: TrueType (.ttf) or OpenType (.otf) font files specific to the language, organized under resources/font/<lang_code>/.
Backgrounds: Optional texture images used as document backgrounds, stored in resources/backgrounds/.

Keeping each resource type in its own directory lets the configuration files reference the corpus, fonts, and backgrounds by path.
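The layout described above can be verified before a long run. This is a minimal sketch, assuming the directory names given in this guide (resources/corpus/ and resources/font/<lang_code>/); `check_resources` is a hypothetical helper, and backgrounds are skipped because they are optional.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}


def check_resources(root, lang_code, corpus_name):
    """Return a list of human-readable problems found under the resources root."""
    root = Path(root)
    problems = []

    corpus = root / "corpus" / corpus_name
    if not corpus.is_file():
        problems.append(f"missing corpus file: {corpus}")
    else:
        try:
            # Reading with an explicit encoding confirms the file is valid UTF-8.
            if not corpus.read_text(encoding="utf-8").strip():
                problems.append(f"corpus file is empty: {corpus}")
        except UnicodeDecodeError:
            problems.append(f"corpus file is not valid UTF-8: {corpus}")

    font_dir = root / "font" / lang_code
    fonts = []
    if font_dir.is_dir():
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in FONT_SUFFIXES]
    if not fonts:
        problems.append(f"no .ttf or .otf fonts found in: {font_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_resources("resources", "ur", "urdu.txt"):
        print(problem)
```

An empty list means the corpus and at least one font are in place for that language.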

Configuring Generation

Generation is controlled by YAML configuration files such as config_ur.yaml. These files define parameters like page dimensions (width and height), the allowed range of font sizes, the distortion effects that add realism, and the paths to your prepared resources. An Urdu configuration, for instance, might specify the corpus path, font directory, page dimensions, font size limits, rotation angles, and background textures.
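The kinds of parameters listed above lend themselves to a quick consistency check once the YAML has been loaded into a dictionary. The sketch below is illustrative only: the key names (`width`, `height`, `font_size`, `rotation`, `corpus`, `font_dir`) are hypothetical stand-ins and are not the actual schema of config_ur.yaml, so adapt them to the keys your configuration really uses.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of problems found in a loaded configuration dictionary."""
    errors = []

    # Page dimensions must be positive integers (hypothetical key names).
    for key in ("width", "height"):
        value = cfg.get(key)
        if not isinstance(value, int) or value <= 0:
            errors.append(f"{key} must be a positive integer, got {value!r}")

    # Ranges are expected as [minimum, maximum] pairs.
    for key in ("font_size", "rotation"):
        bounds = cfg.get(key)
        if not (isinstance(bounds, (list, tuple)) and len(bounds) == 2):
            errors.append(f"{key} must be a [min, max] pair, got {bounds!r}")
        elif bounds[0] > bounds[1]:
            errors.append(f"{key} minimum {bounds[0]} exceeds maximum {bounds[1]}")

    # Resource paths must be present and non-empty.
    for key in ("corpus", "font_dir"):
        if not cfg.get(key):
            errors.append(f"{key} path is missing")

    return errors


# Example values for an Urdu setup; all numbers are placeholders.
urdu_cfg = {
    "width": 1024,
    "height": 1448,
    "font_size": [24, 48],
    "rotation": [-3, 3],
    "corpus": "resources/corpus/urdu.txt",
    "font_dir": "resources/font/ur",
}
```

Calling `validate_config(urdu_cfg)` returns an empty list when every check passes, which makes it easy to gate a generation run on a clean configuration.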

Generating Synthetic Data

Once resources are prepared and configurations are set, generation is started from the command line with synthtiger. A typical command specifies the output directory, the number of samples to generate, the number of parallel workers, the template script, the name of the template class defined in that script (SynthDoG), and the YAML configuration file (e.g., config_ur.yaml). The command writes the requested number of synthetic document images and their corresponding ground truth text to the output folder. Repeat it with a different configuration file and output directory to generate datasets for other RTL languages.
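The command described above can also be assembled from a script, which helps when generating several languages in a row. This sketch assumes SynthDoG-RTL keeps the synthtiger command-line interface used by the upstream SynthDoG (`-o` for the output directory, `-c` for the sample count, `-w` for the worker count, `-v` for verbose output, then the script, template class name, and config); `build_command` and `generate` are hypothetical helpers.

```python
import subprocess


def build_command(output_dir, count, workers, config,
                  script="template.py", template_name="SynthDoG"):
    """Assemble the synthtiger invocation as an argument list."""
    return [
        "synthtiger",
        "-o", str(output_dir),   # where images and metadata are written
        "-c", str(count),        # number of samples to generate
        "-w", str(workers),      # number of parallel worker processes
        "-v",                    # verbose progress output
        script,                  # template script
        template_name,           # template class defined in the script
        config,                  # YAML configuration file
    ]


def generate(output_dir, count, workers, config):
    """Run the generator and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(output_dir, count, workers, config), check=True)
```

For example, `generate("outputs/SynthDoG_ur", 100, 4, "config_ur.yaml")` would request 100 Urdu samples with four workers; the output directory name here is only an example.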

Formatting for Donut

To be compatible with the Donut framework, the generated dataset must follow a specific structure in which every sample pairs an image with its ground truth. Organize the dataset into train, validation, and test subdirectories, each holding its images and a metadata.jsonl file. Every line of that file is a JSON object that links the file_name of a generated image to its ground_truth. In Donut's dataset convention, ground_truth is itself a JSON-encoded string whose gt_parse object carries the transcription under the text_sequence key, and that text must match the RTL content rendered in the image. Donut handles tokenization internally, so the key requirement is that images and text are correctly matched.
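A metadata line in that shape can be produced as follows. This is a minimal sketch, assuming the nested `gt_parse` / `text_sequence` convention from Donut's dataset documentation; `metadata_line` and `write_metadata` are hypothetical helper names, and the image file names are placeholders.

```python
import json
from pathlib import Path


def metadata_line(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line for an image and its transcription."""
    # ground_truth is a JSON string nested inside the outer JSON object.
    ground_truth = json.dumps({"gt_parse": {"text_sequence": text}}, ensure_ascii=False)
    # ensure_ascii=False keeps RTL text readable instead of \u escapes.
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth},
                      ensure_ascii=False)


def write_metadata(split_dir, samples):
    """Write metadata.jsonl for one split from (file_name, text) pairs."""
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    path = split_dir / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for file_name, text in samples:
            handle.write(metadata_line(file_name, text) + "\n")
    return path
```

Calling `write_metadata("dataset/train", [("image_0.jpg", "سلام دنیا")])` creates the split directory if needed and writes one line per sample. Writing the file as UTF-8 without ASCII escaping makes it easier to inspect the RTL text by eye.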

Advanced Tips

For even more sophisticated synthetic data generation, consider these advanced strategies:

Custom Layouts: Modify template.py to create complex document structures, including multi-column formats, headers, footers, and tables, enhancing the diversity of your dataset.
Realistic Effects: Incorporate various visual effects like noise, blur, and perspective distortions directly within your YAML configuration files to mimic real-world document imperfections and boost model robustness.
Font Diversity: Utilize a wide array of fonts for each language to prevent your model from overfitting to specific typographic styles.
Bilingual Documents: Integrate corpora from non-RTL languages (e.g., English) to simulate mixed-script documents, preparing your model for real-world scenarios that often involve multiple languages.
Large-Scale Generation: Aim to generate substantial datasets, typically ranging from 10,000 to 100,000 samples, for effective pre-training of Donut models, leading to better generalization and performance.
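As one way to approach the bilingual tip above, corpus lines from two languages can be interleaved before generation. This is a rough sketch rather than a SynthDoG-RTL feature: `mix_corpora` and its `ltr_ratio` parameter are hypothetical, and the mixed lines would still need to be written to a corpus file that your configuration points to.

```python
import random


def mix_corpora(rtl_lines, ltr_lines, ltr_ratio=0.2, seed=0):
    """Return the RTL lines with left-to-right lines inserted at random points.

    After each RTL line, one LTR line is inserted with probability `ltr_ratio`.
    A fixed seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    mixed = []
    for line in rtl_lines:
        mixed.append(line)
        if ltr_lines and rng.random() < ltr_ratio:
            mixed.append(rng.choice(ltr_lines))
    return mixed
```

For instance, `mix_corpora(urdu_lines, english_lines, ltr_ratio=0.1)` would add an English line after roughly one in ten Urdu lines, where both arguments are lists of strings read from the respective corpus files.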

Conclusion

SynthDoG-RTL offers a practical way to build large synthetic OCR datasets for major right-to-left languages. The generated data is designed to plug into the Donut framework, so developers can train or fine-tune document understanding models even where real, annotated data is scarce. That makes it a useful tool for advancing OCR for RTL scripts despite common data limitations.