pw.io.pyfilesystem

read(source, *, path='', refresh_interval=30, mode='streaming', with_metadata=False)

Reads a table from a PyFilesystem (https://docs.pyfilesystem.org/en/latest/introduction.html) source.

It returns a table with a single column, data, containing the contents of each file in binary format. If the with_metadata option is set to True, the table also gets a column _metadata containing the metadata of the objects read.

  • Parameters
    • source (FS) – PyFilesystem source.
    • path (str) – Path inside the PyFilesystem source to process. All files within this path will be processed recursively. If unspecified, the root of the source is taken.
    • mode (str) – denotes how the engine polls the source for new data. Currently “streaming” and “static” are supported. In “streaming” mode, the connector checks for updates, deletions, and new files every refresh_interval seconds. In “static” mode, it only considers the data available at start and ingests all of it in one commit. The default value is “streaming”.
    • refresh_interval (float) – time in seconds between scans. Applicable if the mode is set to “streaming”.
    • with_metadata (bool) – when set to True, the connector will add a column named _metadata to the table. This column will contain file metadata, such as: path, name, owner, created_at, modified_at, accessed_at, size.
  • Returns
    The table read.
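
The streaming behavior described for the mode parameter (detecting updates, deletions, and new files on each scan) can be pictured as comparing two snapshots of the file tree. The following is a minimal sketch of that idea using only the Python standard library on a temporary local directory. It is not Pathway's actual implementation; the helper names snapshot and diff are hypothetical, and using modification time plus size as the change signature is an assumption made for illustration.

```python
import os
import tempfile


def snapshot(root):
    """Map each file path under root to its (mtime_ns, size) signature."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            state[os.path.relpath(full, root)] = (info.st_mtime_ns, info.st_size)
    return state


def diff(previous, current):
    """Classify paths as added, updated, or deleted between two scans."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    updated = sorted(
        p for p in current.keys() & previous.keys() if current[p] != previous[p]
    )
    return added, updated, deleted


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("first")
    with open(os.path.join(root, "b.txt"), "w") as f:
        f.write("second")
    before = snapshot(root)

    # Between two scans: one file changes, one disappears, one appears.
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("first, edited")
    os.remove(os.path.join(root, "b.txt"))
    with open(os.path.join(root, "c.txt"), "w") as f:
        f.write("third")
    after = snapshot(root)

changes = diff(before, after)
print(changes)
```

In this picture, a streaming read repeats the scan every refresh_interval seconds and emits the differences, while a static read corresponds to taking a single snapshot and stopping.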

Example:

Suppose that you want to read files from a ZIP archive projects.zip using PyFilesystem. To do that, first import the fs library, or just its open_fs function, and create the data source:

from fs import open_fs
source = open_fs("zip://projects.zip")

Then you can use the connector as follows:

import pathway as pw

table = pw.io.pyfilesystem.read(source)

This command reads every file in the archive in full. If the data is not expected to change, it makes sense to run the read in static mode by specifying the mode parameter:

table = pw.io.pyfilesystem.read(source, mode="static")
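
To try the static read end to end, you need a projects.zip to point the connector at. The sketch below builds a small sample archive with the standard zipfile module and wraps the read in a function. The file names inside the archive and the helper names make_sample_archive and read_archive are hypothetical, and passing an absolute path after zip:// is an assumption about how open_fs resolves the archive location.

```python
import os
import tempfile
import zipfile


def make_sample_archive(directory):
    """Create projects.zip with two small hypothetical files; return its path."""
    archive_path = os.path.join(directory, "projects.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("alpha/README.txt", "Project alpha")
        archive.writestr("beta/notes.txt", "Project beta")
    return archive_path


def read_archive(archive_path):
    """Read every file of the archive once, in static mode."""
    # Imported here so that make_sample_archive works without these packages.
    import pathway as pw
    from fs import open_fs

    source = open_fs("zip://" + archive_path)
    return pw.io.pyfilesystem.read(source, mode="static")


workdir = tempfile.mkdtemp()
archive_path = make_sample_archive(workdir)
with zipfile.ZipFile(archive_path) as archive:
    names = sorted(archive.namelist())
print(names)
```

With Pathway and PyFilesystem installed, calling read_archive(archive_path) returns the table, and pw.debug.compute_and_print is one way to materialize and inspect it.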

Please note that PyFilesystem supports a wide variety of sources. Refer to the “Index of Filesystems” web page (https://www.pyfilesystem.org/page/index-of-filesystems/) for the list and the respective documentation.

For instance, you can also read a dataset from a remote FTP server with this connector. To do that, create an FTP file source as follows:

source = open_fs('ftp://login:password@ftp.example.com/datasets')
table = pw.io.pyfilesystem.read(source)