pw.xpacks.connectors

This page documents the Pathway connectors that are available as an xpack. This module is available only with one of the following licenses: Pathway Scale or Pathway Enterprise.

pw.xpacks.connectors.sharepoint.read(url, *, tenant, client_id, cert_path, thumbprint, root_path, mode='streaming', recursive=True, object_size_limit=None, with_metadata=False, refresh_interval=30)

Reads a table from a directory or a file in a Microsoft SharePoint site. Requires a valid Pathway Scale license key.

It returns a table with a single column, data, containing each file in binary format.

  • Parameters
    • url (str) – URL of the SharePoint site including the path to the site. For example: https://company.sharepoint.com/sites/MySite;
    • tenant (str) – ID of the SharePoint tenant. It is normally a GUID;
    • client_id (str) – ClientID of the SharePoint application that has the required grants and will be used to access the data;
    • cert_path (str) – Path to the certificate, normally a .pem file, added to the application specified above and used to authenticate;
    • thumbprint (str) – Thumbprint for the specified certificate;
    • root_path (str) – The path to a directory or a file within the SharePoint space to be read;
    • mode (str) – Denotes how the engine polls the source for new data. Currently “streaming” and “static” are supported. If set to “streaming”, the connector checks for updates, deletions and new files every refresh_interval seconds. In “static” mode, it only considers the data available at start and ingests all of it in one commit. The default value is “streaming”;
    • recursive (bool) – If set to True, the connector will scan the nested directories. Otherwise it will only process files that are placed in the specified directory;
    • object_size_limit (int | None) – Maximum size (in bytes) of a file that will be processed by this connector or None if no filtering by size should be made;
    • with_metadata (bool) – When set to True, the connector adds a column named _metadata to the table. This column contains file metadata, such as path, modified_at and created_at. The creation and modification times are given as UNIX timestamps;
    • refresh_interval (int) – Time in seconds between scans. Applicable if mode is set to “streaming”.
  • Returns
    The table read.
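Since with_metadata reports creation and modification times as UNIX timestamps, it can help to convert them into timezone-aware datetimes before displaying or comparing them. Below is a minimal sketch that assumes a metadata record has already been extracted into a plain Python dict; the helper name parse_file_metadata and the sample values are hypothetical and are not part of the Pathway API.

```python
from datetime import datetime, timezone


def parse_file_metadata(metadata: dict) -> dict:
    """Return a copy of a metadata record with UNIX timestamps as UTC datetimes."""
    parsed = dict(metadata)
    for key in ("created_at", "modified_at"):
        value = parsed.get(key)
        if value is not None:
            # UNIX timestamps count seconds since 1970-01-01 00:00:00 UTC.
            parsed[key] = datetime.fromtimestamp(value, tz=timezone.utc)
    return parsed


# Hypothetical record: the path and timestamps are invented for illustration.
sample = {
    "path": "Shared Documents/Data/report.csv",
    "created_at": 1700000000,
    "modified_at": 1700003600,
}

parsed = parse_file_metadata(sample)
print(parsed["path"], parsed["modified_at"].isoformat())
```

The original dict is left unchanged, so the raw timestamps remain available if they are needed later.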

Example:

Suppose a dataset is stored in the SharePoint site Datasets. The example below reads this dataset in streaming mode. You can also use it as a reference for what the parameter values should look like:

import pathway as pw
t = pw.xpacks.connectors.sharepoint.read(
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Shared Documents/Data",
)

The example above assumes that the dataset is located at the path Shared Documents/Data. Because recursive defaults to True, this code also scans the subdirectories of the given directory.

We can change it a little. Suppose we need to read the dataset from the directory Datasets/Animals/2023 without taking the nested subdirectories into consideration. That leads to the following snippet:

t = pw.xpacks.connectors.sharepoint.read(  
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Datasets/Animals/2023",
    recursive=False,
)
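To make the effect of recursive concrete, here is a small illustration of which files would be picked up under each setting, following the parameter description above. It is only a sketch of the documented behavior, not the connector's implementation: the function select_files and the file paths are hypothetical, and the case where root_path points to a single file is ignored.

```python
from pathlib import PurePosixPath


def select_files(paths, root_path, recursive=True):
    """Pick the paths that a scan of root_path would cover (illustration only)."""
    root = PurePosixPath(root_path)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        if recursive:
            # Any depth below the root directory qualifies.
            keep = root in path.parents
        else:
            # Only files placed directly in the root directory qualify.
            keep = path.parent == root
        if keep:
            selected.append(raw)
    return selected


# Hypothetical directory listing.
files = [
    "Datasets/Animals/2023/cats.csv",
    "Datasets/Animals/2023/dogs.csv",
    "Datasets/Animals/2023/archive/birds.csv",
    "Datasets/Animals/2022/cats.csv",
]

flat = select_files(files, "Datasets/Animals/2023", recursive=False)
nested = select_files(files, "Datasets/Animals/2023", recursive=True)
print(flat)
print(nested)
```

With recursive=False, the file under archive is skipped; with recursive=True, it is included. The file from the 2022 directory is outside the root in both cases.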

SharePoint sites are often used with subsites. Pathway supports reading data from subsites as well. To read data from a subsite, specify its URL in the url parameter. For example, to read the dataset from the vendor subsite, you can configure the connector this way:

t = pw.xpacks.connectors.sharepoint.read(  
    url="https://company.sharepoint.com/sites/Datasets/vendor",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Datasets/Animals/2023",
    recursive=False,
)
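The examples on this page repeat the same credentials and differ only in url, root_path and recursive. One possible way to avoid the repetition is to build the keyword arguments in a small helper and unpack them into the connector call. The sketch below is an assumption about how you might organize your own code: sharepoint_read_kwargs and CREDENTIALS are hypothetical names, and the helper only assembles a dictionary without contacting SharePoint.

```python
# Placeholder credentials copied from the examples on this page.
CREDENTIALS = {
    "tenant": "c2efaf1f-8add-4334-b1ca-32776acb61ea",
    "client_id": "f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    "cert_path": "certificate.pem",
    "thumbprint": "33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
}


def sharepoint_read_kwargs(site_url, root_path, *, subsite=None, **options):
    """Build keyword arguments for the connector call (hypothetical helper)."""
    url = site_url.rstrip("/")
    if subsite:
        # A subsite is addressed by appending its name to the site URL.
        url = f"{url}/{subsite.strip('/')}"
    return {"url": url, "root_path": root_path, **CREDENTIALS, **options}


site_kwargs = sharepoint_read_kwargs(
    "https://company.sharepoint.com/sites/Datasets",
    "Shared Documents/Data",
)
vendor_kwargs = sharepoint_read_kwargs(
    "https://company.sharepoint.com/sites/Datasets",
    "Datasets/Animals/2023",
    subsite="vendor",
    recursive=False,
)
print(vendor_kwargs["url"])

# With Pathway installed and licensed, the result could be passed on as:
# t = pw.xpacks.connectors.sharepoint.read(**vendor_kwargs)
```

Options that are not passed, such as recursive in the first call, are left out of the dictionary so that the connector's own defaults apply.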