pw.io.gdrive
read(object_id, *, mode='streaming', format='binary', object_size_limit=None, refresh_interval=30, service_user_credentials_file, with_metadata=False, file_name_pattern=None, name=None, max_backlog_size=None, **kwargs)
Reads a table from a Google Drive directory or file.
Returns a table containing a binary column data with the binary contents of objects in the specified directory, as well as a dict column _metadata that contains metadata corresponding to each object. Metadata is reported only if the with_metadata flag is set, or if the "only_metadata" format is chosen.
Note that if you only need to monitor changes in the given directory, you can use the "only_metadata" format, in which case the table will contain only metadata, and no time or resources will be spent downloading the objects.
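As a hedged illustration of the metadata-only setup described above, the sketch below collects the relevant options in a plain dictionary and passes them to pw.io.gdrive.read. The helper names (metadata_only_options, monitor_folder) and the 60-second interval are assumptions made for this example, not part of the Pathway API; only the keyword names come from the signature above.

```python
def metadata_only_options(object_id, credentials_file, refresh_interval=60):
    """Options for watching a Drive folder without downloading file contents.

    Hypothetical helper: only the keyword names come from the read() signature.
    """
    return {
        "object_id": object_id,
        "service_user_credentials_file": credentials_file,
        "mode": "streaming",        # keep polling for new, updated and deleted files
        "format": "only_metadata",  # the table has only the _metadata column
        "refresh_interval": refresh_interval,  # seconds between scans
    }


def monitor_folder(object_id, credentials_file):
    """Return a metadata-only table for the given folder (hypothetical wrapper)."""
    import pathway as pw  # imported lazily so the options can be inspected without Pathway

    return pw.io.gdrive.read(**metadata_only_options(object_id, credentials_file))
```

Calling monitor_folder requires valid service user credentials. The returned table would then be connected to an output and executed with pw.run().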
- Parameters
  - object_id (str) – ID of a directory or file. Directories will be scanned recursively.
  - mode (Literal['streaming', 'static']) – denotes how the engine polls the source for new data. Currently "streaming" and "static" are supported. If set to "streaming", it will check for updates, deletions and new files every refresh_interval seconds. "static" mode will only consider the available data and ingest all of it in one commit. The default value is "streaming".
  - format (Literal['binary', 'only_metadata']) – the format of the resulting table. Can be either "binary", which corresponds to a table with a data column containing the object's contents, or "only_metadata", which corresponds to a table that has only the _metadata column with the objects' metadata, without downloading the objects themselves.
  - object_size_limit (int | None) – maximum size (in bytes) of a file that will be processed by this connector, or None if no filtering by size should be made.
  - refresh_interval (int) – time in seconds between scans. Applicable if mode is set to "streaming".
  - service_user_credentials_file (str) – Google API service user JSON file. Please follow the instructions provided in the developer's user guide to obtain it.
  - with_metadata (bool) – when set to True, the connector will add an additional column named _metadata to the table. This column will contain file metadata, such as: id, name, mimeType, parents, modifiedTime, thumbnailLink, lastModifyingUser.
  - file_name_pattern (list | str | None) – glob pattern (or list of patterns) used to filter files based on their names. Defaults to None, which doesn't filter anything. Doesn't apply to folder names. For example, *.pdf will only return files that have the .pdf extension.
  - name (str | None) – a unique name for the connector. If provided, this name will be used in logs and monitoring dashboards. Additionally, if persistence is enabled, it will be used as the name for the snapshot that stores the connector's progress.
  - max_backlog_size (int | None) – limit on the number of entries read from the input source and kept in processing at any moment. Reading pauses when the limit is reached and resumes as processing of some entries completes. Useful with large sources that emit an initial burst of data, to avoid memory spikes.
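To make the two filtering parameters concrete, the following sketch mimics, locally and without touching Google Drive, how a glob pattern and a size limit could select files. It uses Python's standard fnmatch module as a stand-in. The connector's exact matching rules (for example, case sensitivity, or whether a file of exactly the limit size is kept) are not stated in this section, and the function name would_be_read is hypothetical.

```python
from fnmatch import fnmatchcase


def would_be_read(name, size, file_name_pattern=None, object_size_limit=None):
    """Approximate the effect of file_name_pattern and object_size_limit on one file."""
    # None means "no filtering by size", mirroring the parameter description.
    # Assumption: a file exactly at the limit is still processed.
    if object_size_limit is not None and size > object_size_limit:
        return False
    # None means "no filtering by name".
    if file_name_pattern is None:
        return True
    # A single string is treated as a one-item list of patterns.
    if isinstance(file_name_pattern, str):
        patterns = [file_name_pattern]
    else:
        patterns = list(file_name_pattern)
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

For instance, would_be_read("report.pdf", 1000, "*.pdf") is true, while would_be_read("notes.txt", 1000, "*.pdf") is false.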
- Returns
The table read.
Example:
import pathway as pw
table = pw.io.gdrive.read(
    object_id="0BzDTMZY18pgfcGg4ZXFRTDFBX0j",
    service_user_credentials_file="credentials.json",
)
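A more selective variant is sketched below: a one-off static read that keeps only PDF files up to a size cap and attaches metadata. The wrapper name read_small_pdfs and the 10 MiB cap are assumptions for illustration; the keyword arguments follow the signature above. The final select assumes that _metadata can be indexed by key, as Pathway JSON columns generally can.

```python
MAX_BYTES = 10 * 1024 * 1024  # assumed 10 MiB cap; choose a value that fits your data


def read_small_pdfs(object_id, credentials_file):
    """One-off read of PDF files no larger than MAX_BYTES (hypothetical wrapper)."""
    import pathway as pw  # lazy import: Pathway and credentials are needed only when called

    files = pw.io.gdrive.read(
        object_id=object_id,
        service_user_credentials_file=credentials_file,
        mode="static",              # ingest the currently available data in one commit
        file_name_pattern="*.pdf",  # applies to file names only, not folder names
        object_size_limit=MAX_BYTES,
        with_metadata=True,         # adds the _metadata column
    )
    # Keep the binary contents and pull the file name out of the metadata column.
    return files.select(data=pw.this.data, name=pw.this._metadata["name"])
```

Because mode is "static", refresh_interval has no effect here; it applies only in "streaming" mode.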