
**Work in Progress**

Prerequisites

Some image pre-processing steps use Tesseract. On Debian or Ubuntu, install it with: sudo apt install tesseract-ocr

Set up a virtual environment with uv:

  • Make sure uv is installed
  • Run 'uv sync' in the project directory
  • You may need to activate the virtual environment manually the first time by running: source .venv/bin/activate

Running from the CLI

Single file: olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/{FILE_NAME}.pdf --gpu-memory-utilization .9

Multi file: olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/*.pdf --gpu-memory-utilization .85 --workers 2 --pages_per_group 3

Breakdown of the CLI command

  • olmocr: The name of the program you are telling the computer to run.
  • OUTPUT_DIRECTORY: The folder where the finished files will be saved.
  • --markdown: An instruction telling the program to save the OCR output in Markdown format.
  • --pdfs {INPUT_FILES}/*.pdf: The location of your source files. The * is a wildcard that tells the program to grab every PDF file in that folder.
  • --gpu-memory-utilization .85: A limit that tells the program it can use up to 85% of your graphics card's memory, leaving some room for other tasks.
  • --workers 2: Tells the program to use 2 "workers" (simultaneous processes) at once to make the job go faster. How high you can set this will depend on the capabilities of your hardware.
  • --pages_per_group 3: Tells the program to process the pages in batches of 3 for better efficiency. How high you can set this will depend on the capabilities of your hardware.
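The flags above can also be assembled from Python, which helps when the same run is repeated with different settings. This is only a sketch: build_olmocr_command is a hypothetical helper, not part of olmOCR or this repository, and it uses only the flags shown in the commands above. It builds the argument list without running anything.

```python
import glob
import shlex


def build_olmocr_command(output_dir, pdf_pattern, gpu_memory_utilization=0.85,
                         workers=None, pages_per_group=None):
    """Return the olmocr invocation as an argument list (nothing is executed)."""
    # Expand the wildcard here, because no shell does it for us when the list
    # is later passed to subprocess.run. Keep the raw pattern if nothing matches.
    pdfs = sorted(glob.glob(pdf_pattern)) or [pdf_pattern]
    cmd = ["olmocr", output_dir, "--markdown", "--pdfs", *pdfs,
           "--gpu-memory-utilization", str(gpu_memory_utilization)]
    # The two tuning flags are optional, as in the single-file command.
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if pages_per_group is not None:
        cmd += ["--pages_per_group", str(pages_per_group)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own directories.
    cmd = build_olmocr_command("./output", "./input/*.pdf",
                               workers=2, pages_per_group=3)
    print(shlex.join(cmd))  # review the command before running it
```

Once the printed command looks right, the list can be handed to subprocess.run(cmd, check=True) from inside the virtual environment.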

After running

  • Check the FINAL METRICS SUMMARY output in the terminal
  • Extract markdown to zip with extract_ocr_output_markdown.py
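The contents of extract_ocr_output_markdown.py are not shown here, so the following is not that script. It is a minimal sketch of what such an extraction step might look like, assuming the Markdown files sit somewhere under the output directory. The name zip_markdown is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_markdown(output_dir, zip_path):
    """Collect every .md file under output_dir into one zip; return the count."""
    output_dir = Path(output_dir)
    md_files = sorted(output_dir.rglob("*.md"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for md_file in md_files:
            # Store paths relative to output_dir so the archive keeps the layout.
            archive.write(md_file, md_file.relative_to(output_dir).as_posix())
    return len(md_files)
```

For example, zip_markdown("./output", "./ocr_markdown.zip") would write one archive and report how many Markdown files it found.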

Troubleshooting

Adjust --workers and --pages_per_group to suit your system, and lower them if a run uses too much RAM or VRAM.
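One way to watch VRAM while tuning these values is to poll nvidia-smi. The sketch below assumes an NVIDIA GPU with nvidia-smi on the PATH. The function names are hypothetical, and the parser expects the plain "used, total" output produced by the query shown in the code.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB lines from nvidia-smi into (used, total) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage in MiB (one tuple per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() in a second terminal during a run shows how close each GPU is to its limit. If usage sits near the total, that suggests trying smaller values.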

Make sure you are in the virtual environment when executing olmOCR commands.
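A quick sanity check along these lines can confirm the setup before a long run. It is a sketch with hypothetical helper names. It assumes the script is started with the same Python interpreter you use for olmOCR, and that the olmocr and tesseract executables are expected on the PATH.

```python
import shutil
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != sys.base_prefix


def missing_tools(tools=("olmocr", "tesseract")):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("missing tools:", missing_tools() or "none")
```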