pw.io.gdrive

read(object_id, *, mode='streaming', format='binary', object_size_limit=None, refresh_interval=30, service_user_credentials_file, with_metadata=False, file_name_pattern=None, name=None, max_backlog_size=None, **kwargs)

Reads a table from a Google Drive directory or file.

Returns a table with a binary column, data, holding the contents of each object in the specified directory, and a dict column, _metadata, holding the metadata of each object. The _metadata column is present only if the with_metadata flag is set or the "only_metadata" format is chosen.

Note that if you only need to monitor changes in the given directory, you can use the "only_metadata" format, in which case the table will contain only metadata, and no time or resources will be spent downloading the objects.
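
A minimal sketch of the metadata-only setup described above. The helper name watch_folder and the 60-second interval are illustrative choices, not part of the library, and the pathway import is deferred into the function so the options can be inspected without the package installed.

```python
# Options for a metadata-only scan; each key is a documented parameter of read().
METADATA_ONLY_OPTIONS = {
    "format": "only_metadata",  # do not download object contents
    "mode": "streaming",        # keep polling for changes
    "refresh_interval": 60,     # seconds between scans (the default is 30)
}


def watch_folder(object_id: str, credentials_file: str):
    """Hypothetical helper: return a table holding only the _metadata column."""
    import pathway as pw  # imported lazily; requires the pathway package

    return pw.io.gdrive.read(
        object_id=object_id,
        service_user_credentials_file=credentials_file,
        **METADATA_ONLY_OPTIONS,
    )
```

Calling watch_folder with a real folder id and credentials file should yield a table whose rows change as files in the folder are added, modified, or deleted.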

  • Parameters
    • object_id (str) – id of a directory or file. Directories will be scanned recursively.
    • mode (Literal['streaming', 'static']) – denotes how the engine polls the new data from the source. Currently "streaming" and "static" are supported. If set to "streaming", it will check for updates, deletions and new files every refresh_interval seconds. "static" mode will only consider the available data and ingest all of it in one commit. The default value is "streaming".
    • format (Literal['binary', 'only_metadata']) – the format of the resulting table. Can be either "binary", which corresponds to a table with a data column containing the object’s contents, or "only_metadata", which corresponds to a table that has only the _metadata column with the objects’ metadata, without downloading the objects themselves.
    • object_size_limit (int | None) – Maximum size (in bytes) of a file that this connector will process, or None to disable filtering by size.
    • refresh_interval (int) – time in seconds between scans. Applicable if mode is set to "streaming".
    • service_user_credentials_file (str) – Google API service user JSON file. Please follow the instructions provided in the developer’s user guide to obtain it.
    • with_metadata (bool) – when set to True, the connector will add an additional column named _metadata to the table. This column will contain file metadata, such as: id, name, mimeType, parents, modifiedTime, thumbnailLink, lastModifyingUser.
    • file_name_pattern (list | str | None) – glob pattern (or list of patterns) used to filter files by name. Defaults to None, which doesn’t filter anything. Doesn’t apply to folder names. For example, *.pdf will only return files that have the .pdf extension.
    • name (str | None) – A unique name for the connector. If provided, this name will be used in logs and monitoring dashboards. Additionally, if persistence is enabled, it will be used as the name for the snapshot that stores the connector’s progress.
    • max_backlog_size (int | None) – Limit on the number of entries read from the input source and kept in processing at any moment. Reading pauses when the limit is reached and resumes as processing of some entries completes. Useful with large sources that emit an initial burst of data to avoid memory spikes.
  • Returns
    The table read.
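
To preview which file names a file_name_pattern value would let through, the glob semantics can be approximated locally with Python's standard fnmatch module. This is only a sketch: the helper matches_any is hypothetical, and it is an assumption that the connector's matching follows the same rules (including case sensitivity) as fnmatchcase.

```python
from fnmatch import fnmatchcase


def matches_any(file_name, patterns):
    """Approximate the file_name_pattern filter: None keeps every file."""
    if patterns is None:
        return True
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single pattern or a list of patterns
    return any(fnmatchcase(file_name, pattern) for pattern in patterns)


names = ["report.pdf", "notes.txt", "slides.pptx"]

only_pdf = [n for n in names if matches_any(n, "*.pdf")]
pdf_or_txt = [n for n in names if matches_any(n, ["*.pdf", "*.txt"])]
unfiltered = [n for n in names if matches_any(n, None)]
```

Here only_pdf keeps just report.pdf, pdf_or_txt keeps report.pdf and notes.txt, and unfiltered keeps all three names.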

Example:

import pathway as pw

table = pw.io.gdrive.read(
    object_id="0BzDTMZY18pgfcGg4ZXFRTDFBX0j",
    service_user_credentials_file="credentials.json"
)
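
A further sketch combining several of the documented parameters: a one-off ("static") read of PDF files under a size cap, with metadata attached. The helper names, the 10 MB cap, and the use of 1024 * 1024 bytes per megabyte are assumptions made for illustration; only the keyword arguments come from the signature above.

```python
MAX_MB = 10  # illustrative cap, not a library default


def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to bytes, since object_size_limit is given in bytes."""
    return megabytes * 1024 * 1024


def read_small_pdfs_once(object_id: str, credentials_file: str):
    """Hypothetical helper: ingest matching files in a single commit."""
    import pathway as pw  # imported lazily; requires the pathway package

    return pw.io.gdrive.read(
        object_id=object_id,
        mode="static",                          # read the available data once
        object_size_limit=mb_to_bytes(MAX_MB),  # limit on processed file size
        file_name_pattern="*.pdf",              # filter files by name
        with_metadata=True,                     # add the _metadata column
        service_user_credentials_file=credentials_file,
    )
```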