pw.io.gdrive

read(object_id, *, mode='streaming', object_size_limit=None, refresh_interval=30, service_user_credentials_file, with_metadata=False, file_name_pattern=None, name=None, max_backlog_size=None, **kwargs)

Reads a table from a Google Drive directory or file.

It returns a table with a single column, data, containing the contents of each file in binary format.

Parameters
- object_id (str) – id of a directory or file. Directories will be scanned recursively.
- mode (Literal['streaming', 'static']) – denotes how the engine polls the source for new data. Currently “streaming” and “static” are supported. If set to “streaming”, the connector checks for updates, deletions and new files every refresh_interval seconds. In “static” mode, it only considers the data available at the time of reading and ingests all of it in one commit. The default value is “streaming”.
- object_size_limit (int | None) – maximum size (in bytes) of a file that will be processed by this connector, or None if files should not be filtered by size.
- refresh_interval (int) – time in seconds between scans. Applicable only if mode is set to “streaming”.
- service_user_credentials_file (str) – path to the Google API service user JSON credentials file. Please follow the instructions provided in the developer’s user guide to obtain it.
- with_metadata (bool) – when set to True, the connector will add an additional column named _metadata to the table. This column will contain file metadata, such as: id, name, mimeType, parents, modifiedTime, thumbnailLink, lastModifyingUser.
- file_name_pattern (list | str | None) – glob pattern (or list of patterns) used to filter files by name. Defaults to None, which doesn’t filter anything. The pattern doesn’t apply to folder names. For example, *.pdf will only return files that have a .pdf extension.
- name (str | None) – A unique name for the connector. If provided, this name will be used in logs and monitoring dashboards. Additionally, if persistence is enabled, it will be used as the name for the snapshot that stores the connector’s progress.
- max_backlog_size (int | None) – Limit on the number of entries read from the input source and kept in processing at any moment. Reading pauses when the limit is reached and resumes as processing of some entries completes. Useful with large sources that emit an initial burst of data to avoid memory spikes.
Returns
The table read.
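
The combined effect of file_name_pattern and object_size_limit can be pictured with a small standalone sketch. It is only an illustration, not the connector's implementation: would_be_ingested is a hypothetical helper, the glob matching uses Python's fnmatch module as a stand-in for whatever matcher the connector uses, and treating a file whose size equals the limit as accepted is an assumption.

```python
from fnmatch import fnmatchcase


def would_be_ingested(name, size, file_name_pattern=None, object_size_limit=None):
    """Hypothetical helper mimicking the documented filters (not Pathway code)."""
    # Size filter: None means no filtering by size.
    # Assumption: a file exactly at the limit is still accepted.
    if object_size_limit is not None and size > object_size_limit:
        return False
    # Name filter: None means every file name is accepted.
    if file_name_pattern is None:
        return True
    # Accept either a single glob pattern or a list of patterns.
    if isinstance(file_name_pattern, str):
        patterns = [file_name_pattern]
    else:
        patterns = list(file_name_pattern)
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# A small PDF passes both filters; an oversized one is skipped.
print(would_be_ingested("report.pdf", 1_000, "*.pdf", 2_000))
print(would_be_ingested("report.pdf", 3_000, "*.pdf", 2_000))
```

A file is kept only when it passes both checks, and a list of patterns is treated here as "match any of them".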

Example:

import pathway as pw

table = pw.io.gdrive.read(
    object_id="0BzDTMZY18pgfcGg4ZXFRTDFBX0j",
    service_user_credentials_file="credentials.json"
)
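
A variation that combines several of the optional parameters is sketched below. It uses only arguments listed in the signature above. The concrete values (a 10 MiB size limit, the *.pdf pattern) and the names READ_OPTIONS and read_pdfs_once are illustrative assumptions, not part of the library.

```python
# Options taken from the parameter list above; the values are illustrative.
READ_OPTIONS = {
    "mode": "static",                       # ingest what is available in one commit
    "with_metadata": True,                  # adds the _metadata column
    "file_name_pattern": "*.pdf",           # only files with a .pdf extension
    "object_size_limit": 10 * 1024 * 1024,  # skip files larger than 10 MiB
}


def read_pdfs_once(object_id, credentials_file):
    """Hypothetical wrapper around pw.io.gdrive.read using the options above."""
    # Imported inside the function so that the options can be inspected
    # without Pathway installed.
    import pathway as pw

    return pw.io.gdrive.read(
        object_id=object_id,
        service_user_credentials_file=credentials_file,
        **READ_OPTIONS,
    )
```

Calling read_pdfs_once("0BzDTMZY18pgfcGg4ZXFRTDFBX0j", "credentials.json") would then yield a table with the data column plus the _metadata column described under with_metadata.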