Web Scraper

Scrapes Webpages

Input

Takes a DataFrame as input.

Type

transform

Class

fire.nodes.gai.NodeWebScraper

Fields

Name | Title | Description

site | Webpage to Crawl | Enter the website URL.

siteCol | Webpage to Crawl | Select the website URL.

scrape_recursively | Scrape Recursively | When true, the node also crawls pages linked from the starting page within the same domain. Otherwise, only the starting page is scraped.

limit | Depth | Maximum number of pages to scrape when Scrape Recursively is true.

Details

Web Scraper Node Details

The Web Scraper node is designed to extract content from webpages, starting from a specified URL. It collects text from a wide range of HTML elements, cleans it to remove unwanted characters (e.g., newlines, tabs) and boilerplate phrases (e.g., “top of page,” “bottom of page”), and can optionally crawl linked pages within the same domain up to a specified limit. The node outputs a DataFrame containing the cleaned text and corresponding URLs for each scraped page.
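The cleaning step described above can be illustrated with a minimal Python sketch. This is not the node's actual implementation: the function name `clean_text` and the `BOILERPLATE` list are hypothetical, and the list contains only the two phrases this page mentions.

```python
import re

# Assumed boilerplate phrases (hypothetical list); only the two phrases
# named in this document are included.
BOILERPLATE = ("top of page", "bottom of page")


def clean_text(raw: str) -> str:
    """Drop boilerplate phrases, then collapse newlines, tabs, and repeated spaces."""
    text = raw
    for phrase in BOILERPLATE:
        # Case-insensitive removal; the phrase is replaced by a space so
        # neighbouring words do not get glued together.
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    # \s+ matches any run of whitespace, including newlines and tabs.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("Top of Page\nWelcome\t\tto   our site\nbottom of page"))
```

Replacing each phrase with a space before collapsing whitespace keeps the logic to two passes and avoids special handling for phrases at the start or end of the text.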

Webpage to Crawl: Specifies the starting URL for the web scraping process. This is the initial webpage from which content is extracted and, if recursive scraping is enabled, the starting point for discovering linked pages.

Required: Yes
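One way to picture how linked pages within the same domain might be discovered from the starting URL is the sketch below. It uses only the Python standard library. The names `LinkCollector` and `same_domain_links` are hypothetical, and comparing hosts by exact match is an assumption, since the node's actual domain check is not documented here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html: str, base_url: str) -> list[str]:
    """Return absolute http(s) links that share the host of base_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Assumption: "same domain" means an identical host component.
        if (parsed.scheme in ("http", "https")
                and parsed.netloc == base_host
                and absolute not in links):
            links.append(absolute)
    return links


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="https://other.org/x">Other</a>'
    print(same_domain_links(page, "https://example.com/index.html"))
```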

Depth: Defines the maximum number of pages to scrape when Scrape Recursively is set to true. The node stops once this limit is reached.

Default: 5

Example: 3
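The interaction between Scrape Recursively and Depth can be sketched as a crawl loop that stops after a fixed number of pages. This is an illustration under stated assumptions, not the node's code: `crawl` and the injected `fetch` callable are hypothetical, and breadth-first visiting order is assumed because the actual order is not documented here.

```python
from collections import deque


def crawl(start_url, fetch, recursive=False, limit=5):
    """Scrape start_url and, if recursive, linked pages up to `limit` pages.

    `fetch` is a hypothetical callable: fetch(url) -> (text, links), where
    links are the same-domain URLs found on that page.
    """
    rows = []
    seen = {start_url}
    queue = deque([start_url])
    # Without recursion only the starting page is scraped.
    max_pages = limit if recursive else 1
    while queue and len(rows) < max_pages:
        url = queue.popleft()  # assumed breadth-first order
        text, links = fetch(url)
        # One row per page, using the output column names from this page.
        rows.append({"text": text, "metadata": url})
        if recursive:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return rows


if __name__ == "__main__":
    # A tiny in-memory "site" stands in for real HTTP requests.
    site = {
        "a": ("Page A", ["b", "c"]),
        "b": ("Page B", ["a", "d"]),
        "c": ("Page C", []),
        "d": ("Page D", []),
    }
    for row in crawl("a", site.__getitem__, recursive=True, limit=3):
        print(row)
```

Passing `fetch` in as a parameter keeps the page limit logic separate from networking, so the stopping behaviour can be checked without touching a real website.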

Output: The node outputs a DataFrame with the following columns:

text: The cleaned text content extracted from the webpage, free of newlines, tabs, excessive spaces, and boilerplate phrases (e.g., “top of page,” “bottom of page”).

metadata: The URL of the scraped webpage.

Each row represents a single scraped page, with up to Depth pages included in the output.