google_drive_ocr package
Submodules
google_drive_ocr.application module
Google OCR Application
Create a project on Google Cloud Platform
Wizard: https://console.developers.google.com/start/api?id=drive
Instructions:
https://cloud.google.com/genomics/downloading-credentials-for-api-access
Select application type as “Installed Application”
Create credentials OAuth consent screen –> OAuth client ID
Save client_secret.json
References
https://developers.google.com/api-client-library/python/start/get_started
https://developers.google.com/drive/v3/web/quickstart/python
- class google_drive_ocr.application.Status(value)[source]
Bases: enum.Enum
An enumeration.
- SUCCESS = 'Done!'
- ALREADY = 'Already done!'
- ERROR = 'Something went wrong!'
- class google_drive_ocr.application.GoogleOCRApplication(client_secret: str, upload_folder_id: Optional[str] = None, ocr_suffix: str = '.google.txt', temporary_upload: bool = False, credentials_path: Optional[str] = None, scopes: Optional[str] = None)[source]
Bases: object
Google OCR Application
Perform OCR using Google-Drive API v3
- client_secret: str
- upload_folder_id: str = None
- ocr_suffix: str = '.google.txt'
- temporary_upload: bool = False
- credentials_path: str = None
- scopes: str = None
- get_output_path(img_path: str) str [source]
Get the output path
Output path is constructed by replacing the extension in img_path with ocr_suffix.
- Parameters
img_path (str) – Path to the input image file
- Returns
Output path
- Return type
str
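The extension replacement described for get_output_path can be sketched as follows. This is a minimal illustration, not the package's implementation: the name output_path_sketch is hypothetical, and the use of os.path.splitext to strip the extension is an assumption.

```python
import os


def output_path_sketch(img_path: str, ocr_suffix: str = ".google.txt") -> str:
    """Hypothetical helper: swap the last extension of img_path for ocr_suffix."""
    # splitext only removes the final extension, so "a.b.png" keeps its "a.b" stem
    # (assumed behaviour; the real method may differ).
    root, _extension = os.path.splitext(img_path)
    return root + ocr_suffix


print(output_path_sketch("scans/page-1.jpg"))
```

With the default suffix, an image and its OCR text end up side by side in the same directory.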
- get_credentials() google.oauth2.credentials.Credentials [source]
Get valid user credentials
If no (valid) credentials are available:
- Log the user in
- Store the credentials for future use
- Returns
Valid user credentials
- Return type
Credentials or None
- upload_image_as_document(img_path: str) str [source]
Upload an image file as a Google Document
- Parameters
img_path (str) – Path to the image file
- Returns
ID of the uploaded Google document
- Return type
str
- download_document_as_text(file_id: str, output_path: str)[source]
Download a Google Document as text
- Parameters
file_id (str) – ID of the Google document
output_path (str) – Path to where the document should be downloaded
- delete_file(file_id: str)[source]
Delete a file from Google Drive
- Parameters
file_id (str) – ID of the file on Google Drive to be deleted
- perform_ocr(img_path: str, output_path: Optional[str] = None) google_drive_ocr.application.Status [source]
Perform OCR on a single image
Upload the image to Google Drive as google-document
[Google adds OCR layer to the image]
Download the google-document as plain text
- Parameters
img_path (str or Path) – Path to the image file
output_path (str or Path, optional) – Path where the OCR text should be stored. If None, a new file will be created beside the image. The default is None.
- Returns
status – Status of the OCR operation
- Return type
Status
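The upload, download, and optional delete steps of perform_ocr can be sketched with stand-in callables in place of the Drive API. Everything here is illustrative: perform_ocr_sketch and its upload, download, and delete parameters are hypothetical, and both the early ALREADY return when the output file exists and the mapping of any exception to ERROR are assumptions about how the statuses are used.

```python
import os
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    SUCCESS = "Done!"
    ALREADY = "Already done!"
    ERROR = "Something went wrong!"


def perform_ocr_sketch(
    img_path: str,
    upload: Callable[[str], str],
    download: Callable[[str, str], None],
    delete: Optional[Callable[[str], None]] = None,
    output_path: Optional[str] = None,
    ocr_suffix: str = ".google.txt",
) -> Status:
    """Hypothetical pipeline: upload image, download text, optionally delete."""
    if output_path is None:
        # Default: place the text file beside the image.
        output_path = os.path.splitext(img_path)[0] + ocr_suffix
    if os.path.exists(output_path):
        return Status.ALREADY  # assumption: existing output is not redone
    try:
        file_id = upload(img_path)       # stand-in for upload_image_as_document
        download(file_id, output_path)   # stand-in for download_document_as_text
        if delete is not None:
            delete(file_id)              # stand-in for delete_file (temporary upload)
    except Exception:
        return Status.ERROR              # assumption: failures map to ERROR
    return Status.SUCCESS
```

Passing the three operations as arguments keeps the sketch runnable without network access or credentials.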
- _worker_ocr_batch(worker_arguments: dict) float [source]
Worker to perform OCR on multiple files
- Parameters
worker_arguments (dict) – Arguments for the worker
- Returns
Time taken in seconds
- Return type
float
- perform_ocr_batch(image_files: list, workers: int = 1, disable_tqdm: Optional[bool] = None)[source]
Perform OCR on multiple files
- Parameters
image_files (list) – List of paths to image files
workers (int, optional) – Number of workers. The default is 1.
disable_tqdm (bool, optional) – If True, the progress bars from tqdm will be disabled. The default is None.
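Distributing a list of images over several workers might look like the sketch below. It assumes a thread pool and a per-file callable; the names ocr_batch_sketch and ocr_one are hypothetical, and the package itself may split the work differently (for example, in per-worker batches with progress bars).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ocr_batch_sketch(
    image_files: List[str],
    ocr_one: Callable[[str], object],
    workers: int = 1,
) -> list:
    """Hypothetical batch runner: apply ocr_one to every file using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as image_files.
        return list(pool.map(ocr_one, image_files))
```

Threads suit this kind of work because each file mostly waits on network I/O rather than on the CPU.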
google_drive_ocr.cli module
Console script for Google OCR (Drive API v3)
google_drive_ocr.errors module
HTTP Errors
List of HTTP errors that can be fixed in most cases by trying again.
Provides a @retry decorator, which applies exponential backoff to a function.
- google_drive_ocr.errors.retry(attempts: int = 4, delay: int = 1, backoff: int = 2, hook: Optional[Callable[[int, Exception, int], Any]] = None) Callable [source]
Decorator to Retry with Exponential Backoff (on Exception)
A function that raises an exception on failure, when decorated with this decorator, will be retried until it returns True or the number of attempts runs out.
The decorator will call the function up to attempts times if it raises an exception.
By default it catches instances of the Exception class and subclasses. This will recover after all but the most fatal errors. You may specify a custom tuple of exception classes with the exceptions argument; the function will only be retried if it raises one of the specified exceptions.
Additionally, you may specify a hook function which will be called prior to retrying with the number of remaining tries, the exception instance, and the delay. This is primarily intended to give the opportunity to log the failure. The hook is not called after failure if no retries remain.
- Parameters
attempts (int, optional) – Number of attempts in case of failure. The default is 4.
delay (int, optional) – Initial delay in seconds. The default is 1.
backoff (int, optional) – Backoff multiplication factor. The default is 2.
hook (Callable[[int, Exception, int], Any], optional) – Function with the parameters (tries_remaining, exception, delay). The default is None.
- Returns
Decorator function
- Return type
Callable
- Raises
ValueError – If the backoff multiplication factor is less than 1.
ValueError – If the number of attempts is less than 0.
ValueError – If the initial delay is less than or equal to 0.
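A retry decorator with exponential backoff, following the parameters and ValueError conditions documented above, can be sketched as below. This is an illustration rather than the module's source: retry_sketch is a hypothetical name, and the injectable sleep parameter is an addition made here so the behaviour can be exercised without waiting.

```python
import functools
import time
from typing import Any, Callable, Optional


def retry_sketch(
    attempts: int = 4,
    delay: float = 1,
    backoff: float = 2,
    hook: Optional[Callable[[int, Exception, float], Any]] = None,
    sleep: Callable[[float], None] = time.sleep,  # hypothetical extra, eases testing
) -> Callable:
    # Validation mirrors the documented ValueError conditions.
    if backoff < 1:
        raise ValueError("backoff must be at least 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    if delay <= 0:
        raise ValueError("delay must be greater than 0")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            # With attempts=0 the loop body never runs and the wrapper returns None.
            for tries_remaining in range(attempts - 1, -1, -1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if tries_remaining == 0:
                        raise  # no retries left: propagate the last error
                    if hook is not None:
                        hook(tries_remaining, exc, wait)
                    sleep(wait)
                    wait *= backoff  # exponential growth of the delay
        return wrapper

    return decorator
```

With the defaults, a function that keeps failing is called four times, with waits of 1, 2, and 4 seconds between calls.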
google_drive_ocr.utils module
Utility Functions
- google_drive_ocr.utils.get_files(topdir: str, extn: str) Generator[str, None, None] [source]
Search topdir recursively for all files with extension extn.
The extension is checked with str.endswith(), instead of the supposedly better os.path.splitext(), in order to facilitate the search with multiple dots in the extn, i.e.
>>> get_files(topdir, ".xyz.txt")
wouldn’t have worked as expected if splitext() was used.
- Parameters
topdir (str) – Path of the directory to search files in
extn (str) – Extension to look for
- Returns
Matching file paths
- Return type
Generator[str, None, None]
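The reason for preferring str.endswith() can be seen in a short sketch. The function get_files_sketch is a hypothetical stand-in that assumes os.walk for the recursive search; the splitext comparison uses only standard-library behaviour.

```python
import os
from typing import Generator


def get_files_sketch(topdir: str, extn: str) -> Generator[str, None, None]:
    """Hypothetical stand-in: yield paths under topdir whose names end with extn."""
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for name in filenames:
            # endswith matches multi-dot suffixes such as ".xyz.txt" directly.
            if name.endswith(extn):
                yield os.path.join(dirpath, name)


# splitext only ever reports the last dotted component, so comparing its
# result with ".xyz.txt" could never succeed.
stem, last_extension = os.path.splitext("report.xyz.txt")
print(stem, last_extension)
```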
- google_drive_ocr.utils.list_to_range(list_of_int: List[int]) List[Tuple[int, int]] [source]
Convert a list of integers into a list of ranges
A range is a tuple (start, end).
- Parameters
list_of_int (List[int]) – List of integers
- Returns
List of ranges
- Return type
List[Tuple[int, int]]
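One plausible reading of this conversion is that runs of consecutive integers collapse into a single (start, end) tuple. The sketch below assumes exactly that, with inclusive ends and sorted, de-duplicated input; list_to_range_sketch is a hypothetical name and the real function may treat ordering or end points differently.

```python
from typing import List, Tuple


def list_to_range_sketch(list_of_int: List[int]) -> List[Tuple[int, int]]:
    """Hypothetical stand-in: collapse consecutive integers into inclusive ranges."""
    ranges: List[Tuple[int, int]] = []
    for number in sorted(set(list_of_int)):
        if ranges and number == ranges[-1][1] + 1:
            # Extend the current run by one.
            ranges[-1] = (ranges[-1][0], number)
        else:
            # Start a new run of length one.
            ranges.append((number, number))
    return ranges


print(list_to_range_sketch([1, 2, 3, 5, 6, 8]))  # [(1, 3), (5, 6), (8, 8)]
```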
- google_drive_ocr.utils.extract_pages(pdf_path: str, pages: Optional[Iterator[Tuple[int, int]]] = None) Set[str] [source]
Extract pages from a PDF file as image files
Pages are saved in the same directory as the PDF file, with the suffix .page-[number].jpg
- Parameters
pdf_path (str) – Path to the PDF file
pages (Iterator[Tuple[int, int]], optional) – Page ranges to extract. If None, all pages will be extracted. The default is None.
- Returns
Set of paths to extracted pages
- Return type
Set[str]
Module contents
Google OCR (Drive API v3).