# ocr-table

**Repository Path**: github_repo/ocr-table

## Basic Information

- **Project Name**: ocr-table
- **License**: MIT
- **Default Branch**: master

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-12
- **Last Updated**: 2025-07-12

## README

# ocr-table
This project aims to extract tables from scanned-image PDFs using Optical Character Recognition (OCR).

# Install Requirements

1. Tesseract OCR
	```sh
	sudo apt-get install tesseract-ocr
	```

2. ImageMagick
	```sh
	sudo apt-get install imagemagick
	```

3. PDF Utilities
	```sh
	sudo apt-get install poppler-utils
	```

4. Python packages
	```sh
	sudo python3 -m pip install -r requirements.txt
	```
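
Before running the OCR, it can help to confirm that the command-line prerequisites above are actually on your `PATH`. The following is a minimal sketch, not part of this repository. The executable names (`tesseract`, `convert`, `pdftoppm`) are assumptions based on what these apt packages usually install; the README itself only names the packages.

```python
import shutil

# Assumed executable names for each apt package listed above.
# The README names the packages, not the binaries, so adjust as needed.
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "convert": "imagemagick",
    "pdftoppm": "poppler-utils",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {executable: apt_package} for every executable not found on PATH."""
    return {exe: pkg for exe, pkg in tools.items() if which(exe) is None}


if __name__ == "__main__":
    missing = missing_tools()
    for exe, pkg in missing.items():
        print(f"missing '{exe}': try sudo apt-get install {pkg}")
    if not missing:
        print("all command-line prerequisites found")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.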

# Usage

1. Clear the [pdf/](pdf) folder, then copy all the PDF files you want to scan into it.

2. Run the OCR:
	```sh
	python3 shellocr.py
	```

3. Once the process completes, the extracted text files are available in the [txt/](txt) folder.
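
For readers curious how a PDF-to-text OCR pass over the `pdf/` folder can be wired together with the tools installed above, here is a rough sketch. It is an illustration under assumptions, not the actual contents of `shellocr.py`: the function names, the intermediate `pages` folder, and the 300 DPI setting are all hypothetical. It rasterizes each PDF with `pdftoppm` and then runs `tesseract` on every page image.

```python
import glob
import subprocess
from pathlib import Path


def rasterize_command(pdf, image_prefix, dpi=300):
    """argv for pdftoppm: writes <image_prefix>-<page>.png for each page."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(image_prefix)]


def ocr_command(image, output_base):
    """argv for tesseract: writes the recognized text to <output_base>.txt."""
    return ["tesseract", str(image), str(output_base)]


def ocr_folder(pdf_dir="pdf", txt_dir="txt", work_dir="pages"):
    """OCR every PDF in pdf_dir, one text file per page in txt_dir.

    work_dir is a hypothetical scratch folder for the page images.
    """
    pdf_dir, txt_dir, work_dir = Path(pdf_dir), Path(txt_dir), Path(work_dir)
    txt_dir.mkdir(exist_ok=True)
    work_dir.mkdir(exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        prefix = work_dir / pdf.stem
        subprocess.run(rasterize_command(pdf, prefix), check=True)
        # Escape the stem so characters like "[" in file names are matched literally.
        pattern = f"{glob.escape(pdf.stem)}-*.png"
        for image in sorted(work_dir.glob(pattern)):
            subprocess.run(ocr_command(image, txt_dir / image.stem), check=True)
```

Calling `ocr_folder()` from the repository root would process everything in `pdf/`. Keeping the command construction in separate functions makes it easy to inspect or adjust the exact arguments without running the tools.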

# Alternate

If the steps above don't work for you, try this alternate method:

1. Save your file as `input.pdf` in the repository root directory.

2. Run:
	```sh
	python3 pdf_miner.py
	```
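
If your file lives elsewhere or has a different name, the "save as input.pdf" step can be scripted. This is a small sketch with a hypothetical helper, `stage_input`, which is not part of this repository. It only copies the file into place and makes no assumption about what `pdf_miner.py` does with it or where it writes its output.

```python
import shutil
from pathlib import Path


def stage_input(source_pdf, repo_root="."):
    """Copy source_pdf to <repo_root>/input.pdf and return the destination path."""
    source = Path(source_pdf)
    if source.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF file: {source}")
    destination = Path(repo_root) / "input.pdf"
    # Skip the copy when the file is already in place under the expected name.
    if source.resolve() != destination.resolve():
        shutil.copyfile(source, destination)
    return destination
```

After calling, for example, `stage_input("scans/report.pdf")` from the repository root, run `python3 pdf_miner.py` as described above.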