Sharepoint Data Extraction
==========================

Retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDFs, DOCX), and outputs a DataFrame with metadata and content. Designed as the starting node in a PySpark workflow, requiring no input DataFrame.

Output
------

A DataFrame with columns: file_name (string), file_type (string), share_point_url (string), content (string).

Type
----

pyspark

Class
-----

fire.nodes.gai.NodeSharepointLoader

Fields
------

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - sharePointConnection
        - SharePoint Connection
        - Connection details for the SharePoint API, including tenant_id, client_id, client_secret, and sharepoint_url (e.g., https://example.sharepoint.com/sites/SparkflowsProject).
      * - performance
        - Advanced
        -
      * - start_time
        - Start Time
        - Optional start time for filtering files/pages by lastModifiedDateTime (format: 'YYYY-MM-DD HH:MM:SS' or ISO 8601, e.g., '2025-07-03 02:05:43' or '2025-07-03T02:05:43Z'). Assumes UTC if no timezone is specified.
      * - end_time
        - End Time
        - Optional end time for filtering files/pages by lastModifiedDateTime (format: 'YYYY-MM-DD HH:MM:SS' or ISO 8601, e.g., '2025-07-04 23:59:59' or '2025-07-04T23:59:59Z'). Assumes UTC if no timezone is specified.

Details
-------

SharePoint Data Extraction Node Details
+++++++++++++++++++++++++++++++++++++++

The SharePoint Data Extraction node retrieves files and pages from a SharePoint site, extracts their content (e.g., text from PDFs, DOCX, or SharePoint pages), and outputs the data as a DataFrame. The node connects to a SharePoint instance using Microsoft Graph API, fetches files and pages modified within a specified time range, and provides metadata and extracted content, making it ideal for integrating SharePoint data into data processing pipelines.

SharePoint Connection: Specifies the connection details for the SharePoint API, including:

* tenant_id: The Azure AD tenant ID for authentication.
* client_id: The client ID of the registered application in Azure AD.
* client_secret: The client secret for the registered application.
* sharepoint_url: The URL of the SharePoint site (e.g., https://example.sharepoint.com/sites/SparkflowsProject).

These parameters are required to authenticate and access the SharePoint instance.

Start Time: Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified after this time are retrieved. Supported formats are:

* YYYY-MM-DD HH:MM:SS (e.g., "2025-07-03 02:05:43", assumed UTC).
* ISO 8601 (e.g., "2025-07-03T02:05:43Z" or "2025-07-03T02:05:43+00:00").

If not specified, the node uses the last run’s timestamp for subsequent runs (Change Data Capture) or no lower bound for the first run.

End Time: Optional timestamp to filter files and pages by their lastModifiedDateTime. Only items modified before or at this time are retrieved. Supported formats are:

* YYYY-MM-DD HH:MM:SS (e.g., "2025-07-04 23:59:59", assumed UTC).
* ISO 8601 (e.g., "2025-07-04T23:59:59Z" or "2025-07-04T23:59:59+00:00").

If not specified, no upper time bound is applied.

Output: The node outputs a DataFrame with the following columns:

* file_name: The name of the file or page retrieved from SharePoint (string).
* file_type: The type of the item (e.g., "pdf", "docx", "page" for SharePoint pages) (string).
* share_point_url: The URL of the file or page in the SharePoint instance (string).
* content: The extracted content from the file or page (e.g., text from PDFs, DOCX, or SharePoint page web parts) (string).
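
The Start Time and End Time rules described above can be illustrated with a small Python sketch. It is not the node's actual implementation: the helper names ``parse_utc`` and ``in_window`` are hypothetical, and the sketch only assumes the documented behavior, namely that both timestamp formats are accepted, that a value without a timezone is treated as UTC, that the start bound is exclusive ("modified after") and that the end bound is inclusive ("before or at").

```python
from datetime import datetime, timezone
from typing import Optional


def parse_utc(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' or ISO 8601 text into an aware UTC datetime.

    Hypothetical helper for illustration; not part of the node's API.
    """
    text = value.strip()
    # datetime.fromisoformat accepts a trailing 'Z' only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset for older interpreters.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    # Documented rule: assume UTC when no timezone is specified.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def in_window(
    last_modified: str,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> bool:
    """Return True if last_modified falls in the (start_time, end_time] window.

    A missing bound means that side of the window is open.
    """
    modified = parse_utc(last_modified)
    # Start bound is exclusive: only items modified after start_time pass.
    if start_time is not None and modified <= parse_utc(start_time):
        return False
    # End bound is inclusive: items modified before or at end_time pass.
    if end_time is not None and modified > parse_utc(end_time):
        return False
    return True
```

For example, under these assumptions ``in_window("2025-07-04T23:59:59Z", "2025-07-03 02:05:43", "2025-07-04 23:59:59")`` keeps the item because the end bound is inclusive, while an item modified exactly at the start time is skipped.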