SharePoint Data Extraction

Retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDF and DOCX files), and outputs a DataFrame containing metadata and the extracted content. Designed as the starting node of a PySpark workflow, so it requires no input DataFrame.

Output

A DataFrame with columns: file_name (string), file_type (string), share_point_url (string), content (string).

Type

pyspark

Class

fire.nodes.gai.NodeSharepointLoader

Fields

| Name | Title | Description |
| --- | --- | --- |
| sharePointConnection | SharePoint Connection | Connection details for the SharePoint API, including tenant_id, client_id, client_secret, and sharepoint_url (e.g., https://example.sharepoint.com/sites/SparkflowsProject). |
| performance | Advanced | |
| start_time | Start Time | Optional start time for filtering files/pages by lastModifiedDateTime (format: ‘YYYY-MM-DD HH:MM:SS’ or ISO 8601, e.g., ‘2025-07-03 02:05:43’ or ‘2025-07-03T02:05:43Z’). Assumes UTC if no timezone is specified. |
| end_time | End Time | Optional end time for filtering files/pages by lastModifiedDateTime (format: ‘YYYY-MM-DD HH:MM:SS’ or ISO 8601, e.g., ‘2025-07-04 23:59:59’ or ‘2025-07-04T23:59:59Z’). Assumes UTC if no timezone is specified. |

Details

SharePoint Data Extraction Node Details

The SharePoint Data Extraction node retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDFs, DOCX files, or SharePoint pages), and outputs the data as a DataFrame. The node connects to the SharePoint instance through the Microsoft Graph API and fetches the files and pages modified within a specified time range. Each output row carries the item's metadata together with its extracted content, which makes the node a convenient entry point for bringing SharePoint data into data processing pipelines.

SharePoint Connection:

Specifies the connection details for the SharePoint API, including:

tenant_id: The Azure AD tenant ID for authentication.

client_id: The client ID of the registered application in Azure AD.

client_secret: The client secret for the registered application.

sharepoint_url: The URL of the SharePoint site (e.g., https://example.sharepoint.com/sites/SparkflowsProject).

These parameters are required to authenticate and access the SharePoint instance.
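As a rough illustration of how these four values are typically used with Microsoft Graph, the sketch below builds (but does not send) an OAuth 2.0 client-credentials token request and the Graph URL that resolves a site from its address. It assumes the node uses the client-credentials flow, which this page does not state; the function names token_request and site_lookup_url are hypothetical and are not part of the node.

```python
from urllib.parse import urlencode, urlparse

# Default scope for app-only access to Microsoft Graph.
GRAPH_SCOPE = "https://graph.microsoft.com/.default"


def token_request(tenant_id: str, client_id: str, client_secret: str):
    """Return (url, form_body) for a client-credentials token request.

    Assumption: the node authenticates with this flow; the page only
    lists the credentials, not the flow itself.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": GRAPH_SCOPE,
    })
    return url, body


def site_lookup_url(sharepoint_url: str) -> str:
    """Map a site URL to the Graph endpoint that resolves the site by path."""
    parsed = urlparse(sharepoint_url)
    path = parsed.path.rstrip("/")  # e.g. /sites/SparkflowsProject
    return f"https://graph.microsoft.com/v1.0/sites/{parsed.netloc}:{path}"


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    url, body = token_request("my-tenant-id", "my-client-id", "my-secret")
    print(url)
    print(site_lookup_url("https://example.sharepoint.com/sites/SparkflowsProject"))
```

The token request would be sent as an HTTP POST with a form-encoded body, and the returned access token would then be passed as a Bearer token on Graph calls such as the site lookup.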

Start Time:

Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified after this time are retrieved. Supported formats are:

YYYY-MM-DD HH:MM:SS (e.g., “2025-07-03 02:05:43”, assumed UTC).

ISO 8601 (e.g., “2025-07-03T02:05:43Z” or “2025-07-03T02:05:43+00:00”).

If not specified, the first run applies no lower bound, and each subsequent run uses the previous run’s timestamp as the lower bound (Change Data Capture).
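The accepted timestamp formats can be normalized with a few lines of standard-library Python. This is a minimal sketch of one way to do it, not the node's actual parser; parse_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' or ISO 8601 into an aware UTC datetime."""
    text = value.strip()
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset for older versions.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    # No timezone in the input: assume UTC, as the field description says.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    for sample in ("2025-07-03 02:05:43",
                   "2025-07-03T02:05:43Z",
                   "2025-07-03T02:05:43+00:00"):
        print(sample, "->", parse_timestamp(sample).isoformat())
```

All three sample strings describe the same instant, so they normalize to the same UTC value.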

End Time:

Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified before or at this time are retrieved. Supported formats are:

YYYY-MM-DD HH:MM:SS (e.g., “2025-07-04 23:59:59”, assumed UTC).

ISO 8601 (e.g., “2025-07-04T23:59:59Z” or “2025-07-04T23:59:59+00:00”).

If not specified, no upper time bound is applied.
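Taken together, the Start Time and End Time rules describe a window that is open at the start ("after") and closed at the end ("before or at"). The sketch below expresses that reading, including the fallback to the last run's timestamp when no start time is given. The names effective_start and in_window are hypothetical, and the exact boundary handling inside the node is an assumption based on the wording above.

```python
from datetime import datetime, timezone
from typing import Optional


def effective_start(start_time: Optional[datetime],
                    last_run: Optional[datetime]) -> Optional[datetime]:
    """Use the explicit start time if given, else the previous run's timestamp.

    On a first run with no start time both are None, meaning no lower bound.
    """
    return start_time if start_time is not None else last_run


def in_window(modified: datetime,
              start: Optional[datetime],
              end: Optional[datetime]) -> bool:
    """True if modified is strictly after start and at or before end."""
    if start is not None and modified <= start:
        return False
    if end is not None and modified > end:
        return False
    return True


if __name__ == "__main__":
    start = datetime(2025, 7, 3, 2, 5, 43, tzinfo=timezone.utc)
    end = datetime(2025, 7, 4, 23, 59, 59, tzinfo=timezone.utc)
    modified = datetime(2025, 7, 4, 12, 0, 0, tzinfo=timezone.utc)
    print(in_window(modified, effective_start(start, None), end))
```

Comparisons here assume timezone-aware datetimes throughout; mixing naive and aware values raises a TypeError in Python.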

Output:

The node outputs a DataFrame with the following columns:

file_name: The name of the file or page retrieved from SharePoint (string).

file_type: The type of the item, e.g., “pdf”, “docx”, or “page” for SharePoint pages (string).

share_point_url: The URL of the file or page in the SharePoint instance (string).

content: The extracted content from the file or page (e.g., text from PDFs, DOCX, or SharePoint page web parts) (string).
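For downstream nodes or tests, the four columns above can be written as a schema string that PySpark accepts. The following sketch builds a one-row DataFrame with that shape and selects the PDF rows. The sample values (file name, URL, and text) are made up for illustration, and sample_output and pdf_rows are hypothetical helpers, not part of the node.

```python
# Column names as listed above; every column is a string.
OUTPUT_COLUMNS = ["file_name", "file_type", "share_point_url", "content"]

# DDL-style schema string, usable as the schema argument of createDataFrame.
OUTPUT_SCHEMA_DDL = ", ".join(f"{name} STRING" for name in OUTPUT_COLUMNS)


def sample_output(spark):
    """Build a one-row DataFrame shaped like the node's output.

    `spark` is an existing SparkSession; the row values are illustrative.
    """
    rows = [(
        "handbook.pdf",
        "pdf",
        "https://example.sharepoint.com/sites/SparkflowsProject/handbook.pdf",
        "Extracted text of the document.",
    )]
    return spark.createDataFrame(rows, schema=OUTPUT_SCHEMA_DDL)


def pdf_rows(df):
    """Keep only rows extracted from PDF files, with name and content."""
    return df.filter(df.file_type == "pdf").select("file_name", "content")


if __name__ == "__main__":
    # Requires a PySpark installation; imported here so the definitions
    # above stay importable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    pdf_rows(sample_output(spark)).show(truncate=False)
    spark.stop()
```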