**Work in Progress**
Prerequisites
Some image pre-processing steps use Tesseract. Install it with:
sudo apt install tesseract-ocr
Set up a virtual environment with uv:
- Make sure uv is installed
- Run 'uv sync' in the project directory
- You may need to activate the virtual environment manually the first time by running
source .venv/bin/activate
Running from the CLI
Single file:
olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/{FILE_NAME}.pdf --gpu-memory-utilization .9
Multi file:
olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/*.pdf --gpu-memory-utilization .85 --workers 2 --pages_per_group 3
Breakdown of the CLI command
- olmocr: The name of the program you are telling the computer to run.
- OUTPUT_DIRECTORY: The folder where the finished files will be saved.
- --markdown: An instruction telling the program to save the OCR output in Markdown format.
- --pdfs {INPUT_FILES}/*.pdf: The location of your source files. The * is a wildcard that tells the program to grab every PDF file in that folder.
- --gpu-memory-utilization .85: A limit that tells the program it can use up to 85% of your graphics card's memory, leaving some room for other tasks.
- --workers 2: Tells the program to use 2 "workers" (simultaneous processes) at once to make the job go faster. How high you can set this will depend on the capabilities of your hardware.
- --pages_per_group 3: Tells the program to process the pages in batches of 3 for better efficiency. How high you can set this will depend on the capabilities of your hardware.
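If you prefer to launch the run from Python (for example from a notebook), the command above can be assembled as an argument list. This is a minimal sketch that uses only the flags listed in this README. The helper name build_olmocr_command is hypothetical and is not part of olmOCR.

```python
import shlex


def build_olmocr_command(output_dir, pdf_paths, gpu_memory_utilization=0.85,
                         workers=None, pages_per_group=None):
    """Assemble an olmocr argument list from the flags described in this README.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")

    cmd = ["olmocr", str(output_dir), "--markdown", "--pdfs"]
    # The shell expands *.pdf into separate arguments; here we pass them explicitly.
    cmd += [str(p) for p in pdf_paths]
    cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if pages_per_group is not None:
        cmd += ["--pages_per_group", str(pages_per_group)]
    return cmd


if __name__ == "__main__":
    cmd = build_olmocr_command("output", ["input/a.pdf", "input/b.pdf"],
                               workers=2, pages_per_group=3)
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems with file names that contain spaces.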
After running
- Check the FINAL METRICS SUMMARY output in the terminal
- Extract markdown to zip with
extract_ocr_output_markdown.py
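For orientation, collecting Markdown files into a zip archive can be sketched as below. This is not the contents of extract_ocr_output_markdown.py, which may work differently. It assumes the Markdown files have a .md extension and sit somewhere under the output directory, and the function name zip_markdown is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_markdown(output_dir, zip_path):
    """Add every .md file found under output_dir to a zip archive.

    Returns the number of files written. Paths inside the archive are
    stored relative to output_dir.
    """
    output_dir = Path(output_dir)
    # Collect the file list first so the archive itself is never scanned.
    md_files = sorted(output_dir.rglob("*.md"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for md_file in md_files:
            archive.write(md_file, arcname=md_file.relative_to(output_dir).as_posix())
    return len(md_files)


if __name__ == "__main__":
    count = zip_markdown("output", "ocr_markdown.zip")
    print(f"Archived {count} Markdown file(s)")
```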
Troubleshooting
Adjust --workers and --pages_per_group to suit your system, and make sure the run does not exhaust your RAM or VRAM.
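One way to watch VRAM while tuning these values is to poll nvidia-smi. The sketch below assumes an NVIDIA GPU with nvidia-smi on the PATH. The function names are hypothetical and nothing here is part of olmOCR.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into (used, total, fraction_used) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, used / total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage. Requires an NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total, fraction) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB ({fraction:.0%})")
```

If the reported fraction stays close to the --gpu-memory-utilization limit and the run fails or stalls, try lowering --workers or --pages_per_group.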
Make sure you are in the virtual environment when executing olmOCR commands.
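A quick way to confirm this from Python is sketched below. It relies on the standard-library behaviour that sys.prefix differs from sys.base_prefix inside a virtual environment. The function names are hypothetical.

```python
import shutil
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def olmocr_on_path():
    """Return True when an olmocr executable can be found on the PATH."""
    return shutil.which("olmocr") is not None


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("olmocr found on PATH:", olmocr_on_path())
```

If either check prints False, run source .venv/bin/activate in the project directory and try again.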