pw.xpacks.connectors

This page provides the documentation of connectors in the Live Data Framework that are available as an xpack. This module is available when using one of the following licenses only: Pathway Scale, Pathway Enterprise.

read(url, *, tenant, client_id, cert_path, thumbprint, root_path, mode='streaming', format='binary', recursive=True, object_size_limit=None, with_metadata=False, refresh_interval=30, max_failed_attempts_in_row=8, max_backlog_size=None)

Reads a table from a directory or a file in a Microsoft SharePoint site. Requires a valid Pathway Live Data Framework Scale license key.

It returns a table with a single column, data, containing the contents of each file in binary format.

Note that if you only need to monitor changes in the given directory, you can use the "only_metadata" format, in which case the table will contain only the _metadata column, and no time or traffic will be spent on downloading the files’ contents.
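To illustrate what one entry of the _metadata column may carry, here is a minimal sketch in plain Python. It assumes the metadata is available as a mapping with the path, created_at and modified_at fields listed under with_metadata below, with both times given as UNIX timestamps. The sample values and the helper name describe_metadata are hypothetical and are not part of the connector.

```python
from datetime import datetime, timezone


def describe_metadata(metadata: dict) -> str:
    """Render one metadata entry as a human-readable line.

    Assumes `created_at` and `modified_at` are UNIX timestamps
    (seconds since the epoch), as described for the `_metadata` column.
    """
    created = datetime.fromtimestamp(metadata["created_at"], tz=timezone.utc)
    modified = datetime.fromtimestamp(metadata["modified_at"], tz=timezone.utc)
    return (
        f"{metadata['path']}: created {created.isoformat()}, "
        f"modified {modified.isoformat()}"
    )


# Hypothetical sample entry; real values come from the connector.
sample = {
    "path": "Shared Documents/Data/report.csv",
    "created_at": 1700000000,
    "modified_at": 1700003600,
}
print(describe_metadata(sample))
```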

Parameters
- url (str) – URL of the SharePoint site including the path to the site. For example: https://company.sharepoint.com/sites/MySite;
- tenant (str) – ID of the SharePoint tenant. It is normally a GUID;
- client_id (str) – ClientID of the SharePoint application that has the required grants and will be used to access the data;
- cert_path (str) – Path to the certificate, normally a .pem file, added to the application specified above and used to authenticate;
- thumbprint (str) – Thumbprint for the specified certificate;
- root_path (str) – The path for a directory or a file within the SharePoint space to be read;
- mode (str) – Denotes how the engine polls the new data from the source. Currently "streaming" and "static" are supported. If set to "streaming", it will check for updates, deletions and new files every refresh_interval seconds. "static" mode will only consider the available data and ingest all of it in one commit. The default value is "streaming";
- format (Literal['binary', 'only_metadata']) – The format of the resulting table. Can be either "binary", which corresponds to a table with a data column containing each file’s contents, or "only_metadata", which corresponds to a table that has only the _metadata column with the objects’ metadata, without downloading the objects themselves;
- recursive (bool) – If set to True, the connector will scan the nested directories. Otherwise it will only process files that are placed in the specified directory;
- object_size_limit (int | None) – Maximum size (in bytes) of a file that will be processed by this connector or None if no filtering by size should be made;
- with_metadata (bool) – When set to True, the connector will add an additional column named _metadata to the table. This column will contain file metadata, such as: path, modified_at, created_at. The creation and modification times will be given as UNIX timestamps;
- refresh_interval (int | float | timedelta) – Time between scans, given as a number of seconds or a datetime.timedelta / pw.Duration. Applicable if mode is set to "streaming".
- max_failed_attempts_in_row (int | None) – The maximum number of consecutive read errors before the connector terminates with an error. If set to None, the connector tries to read data indefinitely, regardless of possible errors in the provided credentials.
- max_backlog_size (int | None) – Limit on the number of entries read from the input source and kept in processing at any moment. Reading pauses when the limit is reached and resumes as processing of some entries completes. Useful with large sources that emit an initial burst of data to avoid memory spikes.
Returns
The table read.
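The thumbprint argument can be cross-checked against the certificate file before the connector is started. The sketch below assumes that the expected thumbprint is the uppercase hexadecimal SHA-1 digest of the DER-encoded certificate, and that the .pem file contains a CERTIFICATE block (possibly next to a private key). The helper names pem_thumbprint and thumbprint_of_file are hypothetical and rely only on the Python standard library.

```python
import base64
import hashlib
import re

# Matches the first certificate block in PEM text.
_CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.+?)-----END CERTIFICATE-----",
    re.DOTALL,
)


def pem_thumbprint(pem_text: str) -> str:
    """Return the SHA-1 thumbprint of the first certificate in a PEM string."""
    match = _CERT_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no CERTIFICATE block found in the PEM data")
    # The PEM body is the base64 encoding of the DER certificate.
    der_bytes = base64.b64decode("".join(match.group(1).split()))
    return hashlib.sha1(der_bytes).hexdigest().upper()


def thumbprint_of_file(cert_path: str) -> str:
    """Read a .pem file and compute the thumbprint of its certificate."""
    with open(cert_path, "r", encoding="ascii") as pem_file:
        return pem_thumbprint(pem_file.read())
```

Under these assumptions, the result of thumbprint_of_file("certificate.pem") should match the value passed as thumbprint.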

Example:

Suppose that a dataset is stored in the SharePoint site Datasets. The example below reads this dataset in streaming mode. You can use it as a reference for how the parameters should look:

t = pw.xpacks.connectors.sharepoint.read(  
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Shared Documents/Data",
)

In the example above, the dataset is assumed to be located at the path Shared Documents/Data. This code will also recursively scan the subdirectories of the given directory.

We can change it a little. Let’s suppose that we need to take the dataset from the directory Datasets/Animals/2023 and not take the nested subdirectories into consideration. That leads us to the following snippet:

t = pw.xpacks.connectors.sharepoint.read(  
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Datasets/Animals/2023",
    recursive=False,
)

SharePoint sites are often used with subsites. The Pathway Live Data Framework supports reading data from subsites as well. To read data from a subsite, specify its URL in the url parameter. For example, if you read the dataset from the vendor subsite, you can configure the connector this way:

t = pw.xpacks.connectors.sharepoint.read(  
    url="https://company.sharepoint.com/sites/Datasets/vendor",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Datasets/Animals/2023",
    recursive=False,
)
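The calls above repeat the same credentials. One possible way to avoid the repetition, shown here only as a sketch, is to keep the shared keyword arguments in a dictionary and set url, root_path and any other options per source. The names COMMON_CREDENTIALS, sharepoint_kwargs and read_sharepoint are hypothetical. read_sharepoint assumes that pathway is installed with a valid license and is imported as pw, as in the examples above.

```python
# Credentials shared by all reads; the values are the placeholders used above.
COMMON_CREDENTIALS = {
    "tenant": "c2efaf1f-8add-4334-b1ca-32776acb61ea",
    "client_id": "f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    "cert_path": "certificate.pem",
    "thumbprint": "33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
}


def sharepoint_kwargs(url: str, root_path: str, **overrides) -> dict:
    """Combine the shared credentials with per-source settings."""
    return {**COMMON_CREDENTIALS, "url": url, "root_path": root_path, **overrides}


def read_sharepoint(url: str, root_path: str, **overrides):
    """Call the connector; pathway is imported lazily since it needs a license."""
    import pathway as pw

    return pw.xpacks.connectors.sharepoint.read(
        **sharepoint_kwargs(url, root_path, **overrides)
    )


# Keyword arguments equivalent to the subsite example above.
vendor_kwargs = sharepoint_kwargs(
    "https://company.sharepoint.com/sites/Datasets/vendor",
    "Datasets/Animals/2023",
    recursive=False,
)
```

With this layout, the subsite read becomes read_sharepoint("https://company.sharepoint.com/sites/Datasets/vendor", "Datasets/Animals/2023", recursive=False), and the credentials are changed in a single place.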