# ocr_finetune_example
**Repository Path**: hf-datasets/ocr_finetune_example
## Basic Information
- **Project Name**: ocr_finetune_example
- **Description**: Mirror of https://huggingface.co/datasets/datalab-to/ocr_finetune_example
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-06
- **Last Updated**: 2025-11-06
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8961757.0
    num_examples: 4
  download_size: 8964031
  dataset_size: 8961757.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Example Dataset for Surya OCR Finetuning
This dataset is an example that lays out the expected format for finetuning Surya OCR.
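The metadata above declares a single `train` split with an `image` feature and a `text` feature. As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository named in the mirror description is reachable, the split could be loaded and checked as follows. The helper names `missing_columns` and `load_example` are hypothetical and exist only for illustration.

```python
EXPECTED_COLUMNS = {"image", "text"}  # feature names taken from the card metadata


def missing_columns(column_names):
    """Return the expected columns that are absent from `column_names`, sorted."""
    return sorted(EXPECTED_COLUMNS - set(column_names))


def load_example(repo_id="datalab-to/ocr_finetune_example"):
    """Download the train split and verify it has the expected columns."""
    # Lazy import: needs `pip install datasets` and network access.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split="train")
    missing = missing_columns(ds.column_names)
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    return ds
```

According to the `num_examples` field in the metadata, the returned split should contain 4 rows.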
## Data Requirements
- **Image column**: The input images (full pages, blocks, or single text lines; mix freely).
- **Text column**: The transcription corresponding to each image.
- For math content, wrap the LaTeX in math tags.
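The two-column layout described above can be assembled from your own files. This is a sketch only, assuming the Hugging Face `datasets` library; the helpers `make_columns` and `build_dataset` and any file paths passed to them are hypothetical. Only the column names `image` and `text` come from this dataset card.

```python
def make_columns(pairs):
    """Turn (image_path, transcription) pairs into the two expected columns."""
    images, texts = [], []
    for image_path, text in pairs:
        # Every image needs a non-empty transcription.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"empty transcription for {image_path}")
        images.append(str(image_path))
        texts.append(text)
    return {"image": images, "text": texts}


def build_dataset(pairs):
    """Build a dataset with an `image` column and a `text` column."""
    # Lazy import so make_columns works without the library installed.
    from datasets import Dataset, Image

    # Casting the path column to Image makes rows decode to images on access.
    return Dataset.from_dict(make_columns(pairs)).cast_column("image", Image())
```

For instance, `build_dataset([("page_001.png", "First page text")])` would produce a one-row dataset, provided the (hypothetical) file exists when the row is read.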
## Surya OCR supports:
- Various aspect ratios
- Different image types and qualities
- Full-page documents
- Cropped blocks of text
- Single-line snippets
The base Surya model is trained on a wide range of samples from all of these categories. You can combine any of these types in your training dataset for more robust performance, as this example dataset demonstrates.
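One way to combine page-level, block-level, and line-level data is to keep them as separate datasets and concatenate them before training. The following is a rough sketch under the assumption that each source is a `datasets.Dataset` with the same `image` and `text` columns; `check_same_columns` and `combine` are hypothetical helper names, not part of Surya.

```python
def check_same_columns(column_sets):
    """Return the shared column names, or raise if the sources disagree."""
    expected = set(column_sets[0])
    for i, columns in enumerate(column_sets[1:], start=1):
        if set(columns) != expected:
            raise ValueError(
                f"source {i} has columns {sorted(columns)}, "
                f"expected {sorted(expected)}"
            )
    return sorted(expected)


def combine(datasets_to_mix, seed=0):
    """Concatenate several datasets and shuffle the result."""
    # Lazy import: only needed when combining real Dataset objects.
    from datasets import concatenate_datasets

    check_same_columns([ds.column_names for ds in datasets_to_mix])
    return concatenate_datasets(datasets_to_mix).shuffle(seed=seed)
```

Shuffling after concatenation keeps pages, blocks, and lines interleaved rather than grouped by source.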