google_drive_ocr package

Submodules

google_drive_ocr.application module

Google OCR Application

Create a project on Google Cloud Platform

Wizard: https://console.developers.google.com/start/api?id=drive

Instructions:

https://cloud.google.com/genomics/downloading-credentials-for-api-access
Select application type as “Installed Application”
Create credentials OAuth consent screen –> OAuth client ID
Save client_secret.json

References

class google_drive_ocr.application.Status(value)

Bases: enum.Enum

An enumeration.

SUCCESS = 'Done!'

ALREADY = 'Already done!'

ERROR = 'Something went wrong!'

class google_drive_ocr.application.GoogleOCRApplication(client_secret: str, upload_folder_id: Optional[str] = None, ocr_suffix: str = '.google.txt', temporary_upload: bool = False, credentials_path: Optional[str] = None, scopes: Optional[str] = None)

Bases: object

Google OCR Application

Perform OCR using Google-Drive API v3

client_secret: str

upload_folder_id: str = None

ocr_suffix: str = '.google.txt'

temporary_upload: bool = False

credentials_path: str = None

scopes: str = None

get_output_path(img_path: str) → str

Get the output path

Output path is constructed by replacing the extension in img_path with ocr_suffix

Parameters: img_path (str) – Path to the input image file
Returns: Output path
Return type: str
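
The extension replacement described above can be illustrated with a minimal sketch. It assumes the extension is split off with os.path.splitext(), which the documentation does not confirm; get_output_path_sketch is a hypothetical stand-alone function, not the library's method.

```python
import os


def get_output_path_sketch(img_path: str, ocr_suffix: str = ".google.txt") -> str:
    """Replace the extension of img_path with ocr_suffix.

    Assumption: the extension is everything from the last dot onward,
    as determined by os.path.splitext().
    """
    root, _extension = os.path.splitext(img_path)
    return root + ocr_suffix


print(get_output_path_sketch("scans/page-01.png"))  # scans/page-01.google.txt
```

With the default suffix, the OCR text ends up next to the image under the same base name.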

get_credentials() → google.oauth2.credentials.Credentials

Get valid user credentials

If no (valid) credentials are available:

* Log the user in
* Store the credentials for future use

Returns: Valid user credentials
Return type: Credentials or None

upload_image_as_document(img_path: str) → str

Upload an image file as a Google Document

Parameters: img_path (str) – Path to the image file
Returns: ID of the uploaded Google document
Return type: str

download_document_as_text(file_id: str, output_path: str)

Download a Google Document as text

Parameters

file_id (str) – ID of the Google document
output_path (str) – Path to where the document should be downloaded

delete_file(file_id: str)

Delete a file from Google Drive

Parameters: file_id (str) – ID of the file on Google Drive to be deleted

perform_ocr(img_path: str, output_path: Optional[str] = None) → google_drive_ocr.application.Status

Perform OCR on a single image

Upload the image to Google Drive as google-document
[Google adds OCR layer to the image]
Download the google-document as plain text

Parameters

img_path (str or Path) – Path to the image file
output_path (str or Path, optional) – Path where the OCR text should be stored. If None, a new file will be created beside the image. The default is None.

Returns

status – Status of the OCR operation

Return type

Status

_worker_ocr_batch(worker_arguments: dict) → float

Worker to perform OCR on multiple files

Parameters: worker_arguments (dict) – Arguments for the worker
Returns: Time taken in seconds
Return type: float

perform_ocr_batch(image_files: list, workers: int = 1, disable_tqdm: Optional[bool] = None)

Perform OCR on multiple files

Parameters

image_files (list) – List of paths to image files
workers (int, optional) – Number of workers. The default is 1.
disable_tqdm (bool, optional) – If True, the progress bars from tqdm will be disabled. The default is None.
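
A typical session might look like the following sketch. It uses only the class, methods and parameters documented above. The wrapper function run_ocr and its arguments are hypothetical, and running it requires the google_drive_ocr package plus a client_secret.json obtained as described at the top of this page.

```python
def run_ocr(image_files, client_secret="client_secret.json"):
    """Run OCR on one or more images (hypothetical helper, not part of the package)."""
    # Imported lazily so the sketch can be defined without the package installed.
    from google_drive_ocr.application import GoogleOCRApplication, Status

    app = GoogleOCRApplication(client_secret=client_secret)

    if len(image_files) == 1:
        # Single image: perform_ocr() returns a Status member.
        status = app.perform_ocr(image_files[0])
        if status is Status.SUCCESS:
            print("OCR text written to", app.get_output_path(image_files[0]))
        else:
            print(status.value)
        return status

    # Several images: spread the work over two workers.
    app.perform_ocr_batch(image_files, workers=2)
    return None


if __name__ == "__main__":
    run_ocr(["scans/page-01.png", "scans/page-02.png"])
```

The image paths in the last line are placeholders.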

google_drive_ocr.cli module

Console script for Google OCR (Drive API v3)

google_drive_ocr.cli.main()

google_drive_ocr.errors module

HTTP Errors

List of HTTP errors that can be fixed in most cases by trying again.

Provides a @retry decorator, which applies exponential backoff to a function.

google_drive_ocr.errors.retry(attempts: int = 4, delay: int = 1, backoff: int = 2, hook: Optional[Callable[[int, Exception, int], Any]] = None) → Callable

Decorator to Retry with Exponential Backoff (on Exception)

When a function that raises an exception on failure is decorated with this decorator, it is retried until it returns without raising or the number of attempts runs out.

The decorator will call the function up to attempts times if it raises an exception.

By default it catches instances of the Exception class and subclasses. This will recover after all but the most fatal errors. You may specify a custom tuple of exception classes with the exceptions argument; the function will only be retried if it raises one of the specified exceptions.

Additionally, you may specify a hook function, which is called before each retry with the number of remaining tries, the exception instance, and the delay. This is primarily intended to give the opportunity to log the failure. The hook is not called after a failure if no retries remain.

Parameters

attempts (int, optional) – Number of attempts in case of failure. The default is 4.
delay (int, optional) – Initial delay in seconds. The default is 1.
backoff (int, optional) – Backoff multiplication factor. The default is 2.
hook (Callable[[int, Exception, int], Any], optional) – Function with the parameters (tries_remaining, exception, delay). The default is None.

Returns

Decorator function

Return type

Callable

Raises

ValueError – If the backoff multiplication factor is less than 1.
ValueError – If the number of attempts is less than 0.
ValueError – If the initial delay is less than or equal to 0.
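
The behaviour described above can be approximated by the following sketch of a retry decorator with exponential backoff. It is an independent illustration that follows the documented parameters and ValueError conditions, not the package's actual source; the name retry_sketch is hypothetical, and catching every Exception subclass is an assumption taken from the description.

```python
import functools
import time
from typing import Any, Callable, Optional


def retry_sketch(
    attempts: int = 4,
    delay: float = 1,
    backoff: float = 2,
    hook: Optional[Callable[[int, Exception, float], Any]] = None,
) -> Callable:
    """Retry the decorated function with exponentially growing delays."""
    if backoff < 1:
        raise ValueError("backoff must be at least 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    if delay <= 0:
        raise ValueError("delay must be greater than 0")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay
            # tries_remaining counts down: attempts - 1, ..., 1, 0.
            # Assumption: with attempts=0 the loop never runs, so the
            # function is not called and the wrapper returns None.
            for tries_remaining in range(attempts - 1, -1, -1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if tries_remaining == 0:
                        raise  # no retries left: propagate, skip the hook
                    if hook is not None:
                        hook(tries_remaining, exc, current_delay)
                    time.sleep(current_delay)
                    current_delay *= backoff  # wait longer before the next try

        return wrapper

    return decorator


def log_failure(tries_remaining: int, exc: Exception, wait: float) -> None:
    """Example hook: report the failure before the next attempt."""
    print(f"{exc!r}: retrying in {wait}s ({tries_remaining} tries left)")
```

With the default arguments, a function that keeps failing is called four times, with waits of 1, 2 and 4 seconds between the calls, and the last exception is re-raised.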

google_drive_ocr.utils module

Utility Functions

google_drive_ocr.utils.get_files(topdir: str, extn: str) → Generator[str, None, None]

Search topdir recursively for all files with extension extn

The extension is checked with str.endswith(), instead of the supposedly better os.path.splitext(), so that extn may contain multiple dots.

For example, get_files(topdir, ".xyz.txt") would not work as expected if splitext() were used.

Parameters

topdir (str) – Path of the directory to search files in
extn (str) – Extension to look for

Returns

Matching file paths

Return type

Generator[str, None, None]
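
The difference between the two checks can be seen in a short sketch. It assumes the directory tree is traversed with os.walk(), which the documentation does not state; get_files_sketch is a hypothetical re-implementation for illustration only.

```python
import os
from typing import Generator


def get_files_sketch(topdir: str, extn: str) -> Generator[str, None, None]:
    """Yield paths of all files under topdir whose names end with extn."""
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for filename in filenames:
            # endswith() compares the whole suffix, so ".xyz.txt" matches.
            if filename.endswith(extn):
                yield os.path.join(dirpath, filename)


# splitext() only splits at the last dot, so a multi-dot extension
# such as ".xyz.txt" can never equal the extension it returns.
print(os.path.splitext("page.xyz.txt"))  # ('page.xyz', '.txt')
```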

google_drive_ocr.utils.list_to_range(list_of_int: List[int]) → List[Tuple[int, int]]

Convert a list of integers into a list of ranges

A range is a tuple (start, end).

Parameters: list_of_int (List[int]) – List of integers
Returns: List of ranges
Return type: List[Tuple[int, int]]
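
One plausible reading of this conversion is sketched below. It assumes that runs of consecutive integers are merged, that both ends of a range are inclusive, and that the input may be unsorted or contain duplicates; none of these details are stated in the documentation, and list_to_range_sketch is a hypothetical name.

```python
from typing import List, Tuple


def list_to_range_sketch(list_of_int: List[int]) -> List[Tuple[int, int]]:
    """Merge runs of consecutive integers into inclusive (start, end) tuples."""
    ranges: List[Tuple[int, int]] = []
    for number in sorted(set(list_of_int)):
        if ranges and number == ranges[-1][1] + 1:
            # Extends the current run: move its end forward.
            ranges[-1] = (ranges[-1][0], number)
        else:
            # Gap found (or first element): start a new run.
            ranges.append((number, number))
    return ranges


print(list_to_range_sketch([1, 2, 3, 5, 7, 8]))  # [(1, 3), (5, 5), (7, 8)]
```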

google_drive_ocr.utils.extract_pages(pdf_path: str, pages: Optional[Iterator[Tuple[int, int]]] = None) → Set[str]

Extract pages from a PDF file as image files

Pages are saved in the same directory as the PDF file, with the suffix .page-[number].jpg

Parameters

pdf_path (str) – Path to the PDF file
pages (Iterator[Tuple[int, int]], optional) – Page ranges to extract. If None, all pages will be extracted. The default is None.

Returns

Set of paths to extracted pages

Return type

Set[str]

Module contents

Google OCR (Drive API v3).