# ocr_finetune_example

**Repository Path**: hf-datasets/ocr_finetune_example

## Basic Information

- **Project Name**: ocr_finetune_example
- **Description**: Mirror of https://huggingface.co/datasets/datalab-to/ocr_finetune_example
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-06
- **Last Updated**: 2025-11-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8961757.0
    num_examples: 4
  download_size: 8964031
  dataset_size: 8961757.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Example Dataset for Surya OCR Finetuning

This dataset is an example that lays out the expected format for finetuning Surya OCR. 

## Data Requirements
- **Image column** (`image`): The input images. These can be full pages, blocks, or single text lines, and you can mix them freely.
- **Text column** (`text`): The transcription corresponding to each image.
  - For math content, wrap the LaTeX in `<math display="inline"></math>` or `<math display="block"></math>` tags.
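The math-tag convention above can be illustrated with a small sketch. The helper names `wrap_math` and `math_tags_balanced` are hypothetical and not part of Surya. The balance check only compares tag counts, so treat it as a rough sanity check rather than a full validator.

```python
import re

# Matches the two opening tags named in the data requirements.
_OPEN_TAG = re.compile(r'<math display="(?:inline|block)">')
_CLOSE_TAG = "</math>"


def wrap_math(latex, display="inline"):
    """Wrap a LaTeX string in a <math> tag (hypothetical helper)."""
    if display not in ("inline", "block"):
        raise ValueError("display must be 'inline' or 'block'")
    return f'<math display="{display}">{latex}</math>'


def math_tags_balanced(text):
    """Rough check: as many opening <math ...> tags as closing </math> tags."""
    return len(_OPEN_TAG.findall(text)) == text.count(_CLOSE_TAG)


# Build a transcription string for the text column.
line = "Energy is " + wrap_math("E = mc^2") + "."
print(line)
print(math_tags_balanced(line))
```

Running a check like this over the text column before training can catch transcriptions where a tag was opened but never closed.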

## Surya OCR supports:
- Various aspect ratios
- Different image types and qualities
- Full-page documents
- Cropped blocks of text
- Single-line snippets

The base Surya model is trained on a wide range of samples from all of these categories, and you can combine any of them in your training dataset for more robust performance, as this example dataset demonstrates.
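To see how varied a mixed dataset is, a minimal sketch like the following may help. It assumes the Hugging Face `datasets` package for loading and that each row's `image` decodes to a PIL image with a `.size` of `(width, height)`. The function names `load_example` and `summarize` are hypothetical, and the demo rows are stand-ins rather than real data.

```python
from types import SimpleNamespace


def load_example(split="train"):
    """Load this example dataset from the Hub (needs `datasets` and network access)."""
    # Imported lazily so summarize() can be used without the package installed.
    from datasets import load_dataset

    return load_dataset("datalab-to/ocr_finetune_example", split=split)


def summarize(rows):
    """Return the row count and the range of width/height aspect ratios."""
    ratios = []
    for row in rows:
        width, height = row["image"].size  # PIL reports size as (width, height)
        ratios.append(width / height)
    if not ratios:
        return {"num_rows": 0, "min_aspect": None, "max_aspect": None}
    return {
        "num_rows": len(ratios),
        "min_aspect": min(ratios),
        "max_aspect": max(ratios),
    }


# Stand-in rows with made-up sizes: a portrait page and a wide text line.
demo_rows = [
    {"image": SimpleNamespace(size=(1000, 2000)), "text": "A full page"},
    {"image": SimpleNamespace(size=(800, 100)), "text": "A single line"},
]
summary = summarize(demo_rows)
print(summary)
```

With the real data, `summarize(load_example())` would report the same fields for the `train` split.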