SharePoint Data Extraction
Retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDF and DOCX files), and outputs a DataFrame with metadata and content. Designed as the starting node in a PySpark workflow, so it requires no input DataFrame.
Output
A DataFrame with columns: file_name (string), file_type (string), share_point_url (string), content (string).
Type
pyspark
Class
fire.nodes.gai.NodeSharepointLoader
Fields
| Name | Title | Description |
|---|---|---|
| sharePointConnection | SharePoint Connection | Connection details for the SharePoint API, including tenant_id, client_id, client_secret, and sharepoint_url (e.g., https://example.sharepoint.com/sites/SparkflowsProject). |
| performance | Advanced | |
| start_time | Start Time | Optional start time for filtering files/pages by lastModifiedDateTime (format: ‘YYYY-MM-DD HH:MM:SS’ or ISO 8601, e.g., ‘2025-07-03 02:05:43’ or ‘2025-07-03T02:05:43Z’). Assumes UTC if no timezone is specified. |
| end_time | End Time | Optional end time for filtering files/pages by lastModifiedDateTime (format: ‘YYYY-MM-DD HH:MM:SS’ or ISO 8601, e.g., ‘2025-07-04 23:59:59’ or ‘2025-07-04T23:59:59Z’). Assumes UTC if no timezone is specified. |
Details
SharePoint Data Extraction Node Details
The SharePoint Data Extraction node retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDFs, DOCX files, or SharePoint pages), and outputs the data as a DataFrame. The node connects to a SharePoint instance through the Microsoft Graph API, fetches files and pages modified within a specified time range, and returns their metadata and extracted content, so SharePoint data can feed directly into downstream data processing pipelines.
SharePoint Connection:
Specifies the connection details for the SharePoint API, including:
tenant_id: The Azure AD tenant ID for authentication.
client_id: The client ID of the registered application in Azure AD.
client_secret: The client secret for the registered application.
sharepoint_url: The URL of the SharePoint site (e.g., https://example.sharepoint.com/sites/SparkflowsProject).
These parameters are required to authenticate and access the SharePoint instance.
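As a rough illustration of how these four values are typically used with Microsoft Graph, the sketch below builds (but does not send) an OAuth 2.0 client-credentials token request and the Graph URL that resolves a site from its address. The node's internal implementation is not documented here, so the helper names `build_token_request` and `build_site_lookup_url` are hypothetical, and the assumption is that the node follows the standard Microsoft identity platform flow.

```python
from urllib.parse import urlparse

# Standard Microsoft identity platform token endpoint and Graph default scope.
TOKEN_URL_TEMPLATE = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
GRAPH_SCOPE = "https://graph.microsoft.com/.default"
GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_fields) for a client-credentials token request.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    url = TOKEN_URL_TEMPLATE.format(tenant_id=tenant_id)
    form_fields = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": GRAPH_SCOPE,
    }
    return url, form_fields


def build_site_lookup_url(sharepoint_url):
    """Map a SharePoint site URL to the Graph 'get site by path' URL.

    Graph addresses a site as /sites/{hostname}:/{server-relative-path}.
    """
    parsed = urlparse(sharepoint_url)
    path = parsed.path.rstrip("/")
    return f"{GRAPH_BASE}/sites/{parsed.netloc}:{path}"


if __name__ == "__main__":
    # Placeholder credentials; real values come from the Azure AD app registration.
    url, form = build_token_request("my-tenant-id", "my-client-id", "my-secret")
    print(url)
    print(build_site_lookup_url("https://example.sharepoint.com/sites/SparkflowsProject"))
```

The token request would be sent as a form-encoded POST, and the returned access token would then be passed as a Bearer token on calls such as the site lookup.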
Start Time:
Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified after this time are retrieved. Supported formats are:
YYYY-MM-DD HH:MM:SS (e.g., “2025-07-03 02:05:43”, assumed UTC).
ISO 8601 (e.g., “2025-07-03T02:05:43Z” or “2025-07-03T02:05:43+00:00”).
If not specified, the first run applies no lower bound, and each subsequent run uses the previous run’s timestamp as the lower bound (Change Data Capture).
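The two accepted formats and the fallback to the last run can be sketched in plain Python as follows. This is an illustrative approximation rather than the node's actual code: `parse_timestamp` and `resolve_lower_bound` are hypothetical names, and how the node stores the last run's timestamp is not described here, so it is passed in as an argument.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' or ISO 8601 text into an aware UTC datetime."""
    text = value.strip()
    # datetime.fromisoformat accepts a trailing 'Z' only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    # No timezone in the input: assume UTC, as the field description states.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def resolve_lower_bound(start_time: Optional[str],
                        last_run: Optional[datetime]) -> Optional[datetime]:
    """Pick the lower bound: explicit Start Time, else last run, else none."""
    if start_time:
        return parse_timestamp(start_time)
    # None on the first run means no lower bound is applied.
    return last_run


if __name__ == "__main__":
    print(parse_timestamp("2025-07-03 02:05:43"))
    print(parse_timestamp("2025-07-03T02:05:43Z"))
    print(resolve_lower_bound(None, None))
```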
End Time:
Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified before or at this time are retrieved. Supported formats are:
YYYY-MM-DD HH:MM:SS (e.g., “2025-07-04 23:59:59”, assumed UTC).
ISO 8601 (e.g., “2025-07-04T23:59:59Z” or “2025-07-04T23:59:59+00:00”).
If not specified, no upper time bound is applied.
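Taken together, Start Time and End Time describe a window over lastModifiedDateTime. The sketch below shows one way to express that check, assuming "after" means strictly later than the start and that the end is inclusive, as the descriptions above suggest. The function name `in_window` and the sample items are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def in_window(modified: datetime,
              start: Optional[datetime] = None,
              end: Optional[datetime] = None) -> bool:
    """Return True if `modified` falls in the window (start, end].

    A bound of None means that side of the window is open.
    """
    if start is not None and modified <= start:
        return False  # not modified after the start time
    if end is not None and modified > end:
        return False  # modified later than the end time
    return True


if __name__ == "__main__":
    utc = timezone.utc
    start = datetime(2025, 7, 3, 2, 5, 43, tzinfo=utc)
    end = datetime(2025, 7, 4, 23, 59, 59, tzinfo=utc)
    # Hypothetical items with their last-modified timestamps.
    items = [
        ("report.pdf", datetime(2025, 7, 3, 2, 5, 43, tzinfo=utc)),
        ("notes.docx", datetime(2025, 7, 4, 12, 0, 0, tzinfo=utc)),
        ("Home page", datetime(2025, 7, 5, 0, 0, 0, tzinfo=utc)),
    ]
    kept = [name for name, modified in items if in_window(modified, start, end)]
    print(kept)
```

All datetimes passed in should be timezone-aware and in the same zone (UTC here), since Python refuses to compare aware and naive values.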
Output:
The node outputs a DataFrame with the following columns:
file_name: The name of the file or page retrieved from SharePoint (string).
file_type: The type of the item (e.g., “pdf”, “docx”, or “page” for SharePoint pages) (string).
share_point_url: The URL of the file or page in the SharePoint instance (string).
content: The extracted content from the file or page (e.g., text from PDFs, DOCX, or SharePoint page web parts) (string).
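For downstream nodes or tests, it can help to have the output shape written down in code. The sketch below expresses the four documented columns as a Spark DDL schema string and shows how a small stand-in DataFrame could be built and filtered. The helper names (`make_row`, `to_dataframe`, `pdf_rows_only`) and the sample values are hypothetical, and the code assumes an existing SparkSession is passed in as `spark`.

```python
# DDL-style schema string matching the four documented output columns.
OUTPUT_SCHEMA_DDL = (
    "file_name STRING, file_type STRING, share_point_url STRING, content STRING"
)
OUTPUT_COLUMNS = ["file_name", "file_type", "share_point_url", "content"]


def make_row(file_name, file_type, share_point_url, content):
    """Build one row tuple in the documented column order (hypothetical helper)."""
    return (str(file_name), str(file_type), str(share_point_url), str(content))


def to_dataframe(spark, rows):
    """Create a DataFrame from row tuples using an existing SparkSession."""
    return spark.createDataFrame(rows, schema=OUTPUT_SCHEMA_DDL)


def pdf_rows_only(df):
    """Example downstream step: keep the name and text of PDF items only."""
    return df.filter(df.file_type == "pdf").select("file_name", "content")


if __name__ == "__main__":
    # Made-up sample row; real rows come from the node's output.
    sample = make_row(
        "handbook.pdf",
        "pdf",
        "https://example.sharepoint.com/sites/SparkflowsProject/handbook.pdf",
        "Extracted text goes here.",
    )
    print(dict(zip(OUTPUT_COLUMNS, sample)))
```

A stand-in DataFrame of this shape makes it possible to develop and test downstream steps without connecting to SharePoint.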