**Work in Progress**

Prerequisites

Some image pre-processing steps use Tesseract. Install it with: sudo apt install tesseract-ocr

Set up a virtual environment with uv:

Make sure uv is installed
Run 'uv sync' in the project directory
You may need to activate the virtual environment manually the first time by running: source .venv/bin/activate
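
As a quick sanity check before going further, the prerequisites above can be verified from Python. This is only a minimal sketch: the helper name missing_tools is hypothetical and not part of this repository, and it assumes the executables are named uv and tesseract on your PATH (which is what the apt package tesseract-ocr installs).

```python
import shutil


def missing_tools(tools=("uv", "tesseract")):
    """Return the names of required executables that are not found on PATH."""
    # shutil.which returns None when the executable cannot be located.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("uv and tesseract were both found.")
```

If anything is reported missing, install it before running 'uv sync'.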

Running from the CLI

Single file: olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/{FILE_NAME}.pdf --gpu-memory-utilization .9

Multi file: olmocr {OUTPUT_DIRECTORY} --markdown --pdfs {INPUT_FILES}/*.pdf --gpu-memory-utilization .85 --workers 2 --pages_per_group 3

olmocr: The name of the program you are telling the computer to run.
OUTPUT_DIRECTORY: The folder where the finished files will be saved.
--markdown: An instruction telling the program to save the OCR output in Markdown format.
--pdfs {INPUT_FILES}/*.pdf: The location of your source files. The * is a wildcard that tells the program to grab every PDF file in that folder.
--gpu-memory-utilization .85: A limit that tells the program it can use up to 85% of your graphics card's memory, leaving some room for other tasks.
--workers 2: Tells the program to use 2 "workers" (simultaneous processes) at once to make the job go faster. How high you can set this will depend on the capabilities of your hardware.
--pages_per_group 3: Tells the program to process the pages in batches of 3 for better efficiency. How high you can set this will depend on the capabilities of your hardware.
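
If you would rather launch the multi-file command from a script or notebook, the flags described above can be assembled into an argument list. This is a rough sketch, not part of this repository: build_olmocr_command is a hypothetical helper, the directory names "input" and "output" are placeholders, and it only uses the flags already shown in this README. Because no shell is involved, the * wildcard is expanded in Python with glob.

```python
import glob
import shlex


def build_olmocr_command(output_dir, pdf_paths, gpu_memory_utilization=0.85,
                         workers=None, pages_per_group=None):
    """Assemble the olmocr argument list using the flags from this README."""
    cmd = [
        "olmocr", str(output_dir),
        "--markdown",
        "--pdfs", *[str(p) for p in pdf_paths],
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # The tuning flags are optional; only add them when a value is given.
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if pages_per_group is not None:
        cmd += ["--pages_per_group", str(pages_per_group)]
    return cmd


if __name__ == "__main__":
    # "input" and "output" are placeholder directory names.
    pdfs = sorted(glob.glob("input/*.pdf"))
    cmd = build_olmocr_command("output", pdfs, workers=2, pages_per_group=3)
    # Print the command for review; to execute it, pass cmd to
    # subprocess.run(cmd, check=True) from inside the virtual environment.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to confirm the file list and settings before committing GPU time to a long run.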

Adjust --workers and --pages_per_group to suit your system, and check that the run does not exhaust your RAM or VRAM.

Make sure you are in the virtual environment when executing olmOCR commands.
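
One way to confirm this from a script or notebook cell is to compare Python's prefix paths. The sketch below uses a hypothetical helper name, in_virtualenv, and relies on the standard behaviour of venv-style environments (which is what uv creates in .venv): inside one, sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("Virtual environment active:", sys.prefix)
    else:
        print("No virtual environment detected; run: source .venv/bin/activate")
```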