google_drive_ocr package

Submodules

google_drive_ocr.application module

Google OCR Application

Create a project on Google Cloud Platform

Wizard: https://console.developers.google.com/start/api?id=drive

Instructions:

References

class google_drive_ocr.application.Status(value)

Bases: enum.Enum

An enumeration.

SUCCESS = 'Done!'
ALREADY = 'Already done!'
ERROR = 'Something went wrong!'
class google_drive_ocr.application.GoogleOCRApplication(client_secret: str, upload_folder_id: Optional[str] = None, ocr_suffix: str = '.google.txt', temporary_upload: bool = False, credentials_path: Optional[str] = None, scopes: Optional[str] = None)

Bases: object

Google OCR Application

Perform OCR using Google-Drive API v3

client_secret: str
upload_folder_id: str = None
ocr_suffix: str = '.google.txt'
temporary_upload: bool = False
credentials_path: str = None
scopes: str = None
get_output_path(img_path: str) → str

Get the output path

Output path is constructed by replacing the extension in img_path with ocr_suffix

Parameters

img_path (str) – Path to the input image file

Returns

Output path

Return type

str
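The extension replacement described above can be sketched as follows. This is a minimal illustration, not the library's own code: the function name output_path_for is hypothetical, and the use of os.path.splitext() to strip the extension is an assumption.

```python
import os


def output_path_for(img_path: str, ocr_suffix: str = ".google.txt") -> str:
    """Replace the extension of img_path with ocr_suffix (hypothetical helper)."""
    # Assumption: "extension" means the last dotted component of the file name.
    root, _extension = os.path.splitext(img_path)
    return root + ocr_suffix


example_output = output_path_for("scans/page-001.png")
print(example_output)  # scans/page-001.google.txt
```

With the default suffix, the OCR text for an image lands beside the image under the same base name.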

get_credentials() → google.oauth2.credentials.Credentials

Get valid user credentials

If no (valid) credentials are available,

  • Log the user in

  • Store the credentials for future use

Returns

Valid user credentials

Return type

Credentials or None

upload_image_as_document(img_path: str) → str

Upload an image file as a Google Document

Parameters

img_path (str) – Path to the image file

Returns

ID of the uploaded Google document

Return type

str

download_document_as_text(file_id: str, output_path: str)

Download a Google Document as text

Parameters
  • file_id (str) – ID of the Google document

  • output_path (str) – Path to where the document should be downloaded

delete_file(file_id: str)

Delete a file from Google Drive

Parameters

file_id (str) – ID of the file on Google Drive to be deleted

perform_ocr(img_path: str, output_path: Optional[str] = None) → google_drive_ocr.application.Status

Perform OCR on a single image

  • Upload the image to Google Drive as a Google document

  • [Google adds an OCR layer to the image]

  • Download the Google document as plain text

Parameters
  • img_path (str or Path) – Path to the image file

  • output_path (str or Path, optional) – Path where the OCR text should be stored. If None, a new file will be created beside the image. The default is None.

Returns

status – Status of the OCR operation

Return type

Status
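The upload, download, and optional delete sequence can be sketched with stand-in callables in place of the Drive API. This is an illustration under stated assumptions, not the library's implementation: the function ocr_flow and the fake helpers are hypothetical, and the rules for returning ALREADY (output file exists) and ERROR (no file ID, or an exception) are guesses based on the Status values documented above.

```python
import os
import tempfile
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    SUCCESS = "Done!"
    ALREADY = "Already done!"
    ERROR = "Something went wrong!"


def ocr_flow(
    img_path: str,
    output_path: str,
    upload: Callable[[str], Optional[str]],
    download: Callable[[str, str], None],
    delete: Callable[[str], None],
    temporary_upload: bool = False,
) -> Status:
    """Hypothetical outline of a single-image OCR run."""
    # Assumption: an existing output file means the image was already processed.
    if os.path.isfile(output_path):
        return Status.ALREADY
    try:
        file_id = upload(img_path)          # image becomes a Google document
        if not file_id:
            return Status.ERROR
        download(file_id, output_path)      # document exported as plain text
        if temporary_upload:
            delete(file_id)                 # remove the uploaded copy afterwards
    except Exception:
        return Status.ERROR
    return Status.SUCCESS


# Stand-ins for the Drive calls, so the sketch runs offline.
deleted: List[str] = []


def fake_upload(img_path: str) -> str:
    return "fake-id"


def fake_download(file_id: str, output_path: str) -> None:
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write("recognized text")


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "page.google.txt")
    first = ocr_flow("page.png", out, fake_upload, fake_download,
                     deleted.append, temporary_upload=True)
    second = ocr_flow("page.png", out, fake_upload, fake_download,
                      deleted.append, temporary_upload=True)
```

In this sketch the second call returns early because the output file from the first call is already present.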

_worker_ocr_batch(worker_arguments: dict) → float

Worker to perform OCR on multiple files

Parameters

worker_arguments (dict) – Arguments for the worker

Returns

Time taken in seconds

Return type

float

perform_ocr_batch(image_files: list, workers: int = 1, disable_tqdm: Optional[bool] = None)

Perform OCR on multiple files

Parameters
  • image_files (list) – List of paths to image files

  • workers (int, optional) – Number of workers. The default is 1.

  • disable_tqdm (bool, optional) – If True, the progress bars from tqdm will be disabled. The default is None.
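One way to spread a list of images over several workers is sketched below. The names split_round_robin and run_batch are hypothetical, the thread pool is a swapped-in technique chosen for a self-contained example, and ocr_one stands in for a per-image OCR call. How the library itself distributes work is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_round_robin(items: List[str], workers: int) -> List[List[str]]:
    """Deal items out to `workers` chunks, one at a time."""
    return [items[i::workers] for i in range(workers)]


def run_batch(
    image_files: List[str],
    ocr_one: Callable[[str], str],
    workers: int = 1,
) -> Dict[str, str]:
    """Apply ocr_one to every file, using one chunk of files per worker."""
    if workers < 1:
        raise ValueError("workers must be at least 1")

    def work(chunk: List[str]) -> Dict[str, str]:
        return {path: ocr_one(path) for path in chunk}

    # Drop empty chunks (more workers than files).
    chunks = [chunk for chunk in split_round_robin(image_files, workers) if chunk]
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(work, chunks):
            results.update(partial)
    return results


# Stand-in OCR function: upper-cases the path instead of calling any API.
batch_results = run_batch(["a.png", "b.png", "c.png"], str.upper, workers=2)
```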

google_drive_ocr.cli module

Console script for Google OCR (Drive API v3)

google_drive_ocr.cli.main()

google_drive_ocr.errors module

HTTP Errors

List of HTTP errors that can be fixed in most cases by trying again.

Provides a @retry decorator, which applies exponential backoff to a function.

google_drive_ocr.errors.retry(attempts: int = 4, delay: int = 1, backoff: int = 2, hook: Optional[Callable[[int, Exception, int], Any]] = None) → Callable

Decorator to Retry with Exponential Backoff (on Exception)

A function that raises an exception on failure will, when decorated with this decorator, be retried until it returns without raising or the number of attempts runs out.

The decorator will call the function up to attempts times if it raises an exception.

By default it catches instances of the Exception class and subclasses. This will recover after all but the most fatal errors. You may specify a custom tuple of exception classes with the exceptions argument; the function will only be retried if it raises one of the specified exceptions.

Additionally, you may specify a hook function, which will be called prior to retrying with the number of remaining tries and the exception instance. This is primarily intended to give the opportunity to log the failure. The hook is not called after a failure if no retries remain.

Parameters
  • attempts (int, optional) – Number of attempts in case of failure. The default is 4.

  • delay (int, optional) – Initial delay in seconds. The default is 1.

  • backoff (int, optional) – Backoff multiplication factor. The default is 2.

  • hook (Callable[[int, Exception, int], Any], optional) – Function with the parameters (tries_remaining, exception, delay). The default is None.

Returns

Decorator function

Return type

Callable

Raises
  • ValueError – If the backoff multiplication factor is less than 1.

  • ValueError – If the number of attempts is less than 0.

  • ValueError – If the initial delay is less than or equal to 0.
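A retry decorator with exponential backoff along these lines might look like the sketch below. It is named retry_sketch to make clear that it is not the library's code. It follows the documented parameters, hook arguments, and ValueError conditions, but the loop structure and the choice to re-raise the last exception are assumptions.

```python
import functools
import time
from typing import Any, Callable, Optional


def retry_sketch(
    attempts: int = 4,
    delay: float = 1,
    backoff: float = 2,
    hook: Optional[Callable[[int, Exception, float], Any]] = None,
) -> Callable:
    """Illustrative retry decorator with exponential backoff."""
    if backoff < 1:
        raise ValueError("backoff must be at least 1")
    if attempts < 0:
        raise ValueError("attempts must be 0 or greater")
    if delay <= 0:
        raise ValueError("delay must be greater than 0")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            current_delay = delay
            # In this sketch, attempts=0 never calls func and returns None.
            for tries_remaining in range(attempts - 1, -1, -1):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    if tries_remaining == 0:
                        raise  # assumption: the last failure propagates
                    if hook is not None:
                        hook(tries_remaining, error, current_delay)
                    time.sleep(current_delay)
                    current_delay *= backoff  # wait longer before each retry

        return wrapper

    return decorator


# Usage: a function that fails twice, then succeeds on the third call.
calls = []
seen = []


@retry_sketch(attempts=4, delay=0.01, backoff=2,
              hook=lambda remaining, error, wait: seen.append((remaining, wait)))
def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
```

The hook receives the delay that is about to be slept, so a logging hook can report how long the next wait will be.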

google_drive_ocr.utils module

Utility Functions

google_drive_ocr.utils.get_files(topdir: str, extn: str) → Generator[str, None, None]

Search topdir recursively for all files with extension extn

The extension is checked with str.endswith() rather than os.path.splitext(), so that extn may contain multiple dots.

For example, get_files(topdir, ".xyz.txt") would not work as expected if splitext() were used, because splitext() treats only the last dotted component (".txt") as the extension.

Parameters
  • topdir (str) – Path of the directory to search files in

  • extn (str) – Extension to look for

Returns

Matching file paths

Return type

Generator[str, None, None]
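The difference between the two checks is easy to see in a small sketch. The generator find_files below is a hypothetical stand-in built on os.walk(), not the library's code, and it runs against a temporary directory created for the example.

```python
import os
import tempfile
from typing import Generator


def find_files(topdir: str, extn: str) -> Generator[str, None, None]:
    """Yield paths under topdir whose file names end with extn."""
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for name in filenames:
            if name.endswith(extn):  # works for multi-dot extensions
                yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as topdir:
    os.makedirs(os.path.join(topdir, "sub"))
    for relative in ("a.xyz.txt", "b.txt", os.path.join("sub", "c.xyz.txt")):
        with open(os.path.join(topdir, relative), "w", encoding="utf-8"):
            pass  # create an empty file
    found = sorted(
        os.path.relpath(path, topdir) for path in find_files(topdir, ".xyz.txt")
    )

# splitext() only sees the last component, so it can never equal ".xyz.txt".
split_extension = os.path.splitext("a.xyz.txt")[1]
print(found)
print(split_extension)  # .txt
```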

google_drive_ocr.utils.list_to_range(list_of_int: List[int]) → List[Tuple[int, int]]

Convert a list of integers into a list of ranges

A range is a tuple (start, end).

Parameters

list_of_int (List[int]) – List of integers

Returns

List of ranges

Return type

List[Tuple[int, int]]
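Collapsing integers into ranges can be sketched as below. The function to_ranges is hypothetical, and two details are assumptions that may differ from the library: the input is sorted and de-duplicated first, and end is treated as inclusive.

```python
from typing import List, Tuple


def to_ranges(numbers: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive integers into (start, end) tuples, end inclusive."""
    ranges: List[Tuple[int, int]] = []
    for number in sorted(set(numbers)):
        if ranges and number == ranges[-1][1] + 1:
            # Extends the current run of consecutive integers.
            ranges[-1] = (ranges[-1][0], number)
        else:
            # Starts a new run.
            ranges.append((number, number))
    return ranges


page_ranges = to_ranges([1, 2, 3, 5, 7, 8])
print(page_ranges)  # [(1, 3), (5, 5), (7, 8)]
```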

google_drive_ocr.utils.extract_pages(pdf_path: str, pages: Optional[Iterator[Tuple[int, int]]] = None) → Set[str]

Extract pages from a PDF file as image files

Pages are saved in the same directory as the PDF file, with the suffix .page-[number].jpg

Parameters
  • pdf_path (str) – Path to the PDF file

  • pages (Iterator[Tuple[int, int]], optional) – Page ranges to extract. If None, all pages will be extracted. The default is None.

Returns

Set of paths to extracted pages

Return type

Set[str]

Module contents

Google OCR (Drive API v3).