# ocr_finetune_example

**Repository Path**: hf-datasets/ocr_finetune_example

## Basic Information

- **Project Name**: ocr_finetune_example
- **Description**: Mirror of https://huggingface.co/datasets/datalab-to/ocr_finetune_example
- **License**: Not specified
- **Default Branch**: main
- **Created**: 2025-11-06
- **Last Updated**: 2025-11-06

## README

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8961757.0
    num_examples: 4
  download_size: 8964031
  dataset_size: 8961757.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Example Dataset for Surya OCR Finetuning

This dataset is an example that lays out the expected format for finetuning Surya OCR.

## Data Requirements

- **Image column**: The input images (full pages, blocks, or single text lines; mix freely).
- **Text column**: The transcription corresponding to each image. For math content, wrap the LaTeX in math tags.

## Surya OCR supports

- Various aspect ratios
- Different image types and qualities
- Full-page documents
- Cropped blocks of text
- Single-line snippets

The base Surya model is trained on a wide range of samples from all of these categories, so you can combine any of them in your training dataset for more robust performance, as this example dataset demonstrates.
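The two-column format described above can be checked before training with a small validator. The following is a minimal sketch, not part of Surya: the function name `validate_row` is hypothetical, and it assumes each row is a plain dictionary whose `image` value holds the encoded image bytes and whose `text` value is a string. Only the column names `image` and `text` come from the dataset card.

```python
# Column names come from the dataset card's `features` list.
# Representing the image as raw encoded bytes is an assumption of this sketch.
REQUIRED_COLUMNS = {"image": bytes, "text": str}


def validate_row(row):
    """Return a list of problems found in one training row (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # A transcription made only of whitespace carries no training signal.
    if isinstance(row.get("text"), str) and not row["text"].strip():
        problems.append("text is empty")
    return problems


if __name__ == "__main__":
    rows = [
        {"image": b"\x89PNG...", "text": "A single line of text"},
        {"image": b"\x89PNG...", "text": "   "},
    ]
    for index, row in enumerate(rows):
        print(index, validate_row(row))
```

A check like this is deliberately loose about image size and content, since the card states that full pages, cropped blocks, and single lines can be mixed freely.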