Read CSV¶

It reads CSV files and creates a DataFrame from them.

Input¶

It reads CSV text files.

Output¶

It creates a DataFrame from the data read and sends it to its output.

Type¶

dataset

Class¶

fire.nodes.dataset.NodeDatasetCSV

Fields¶

Name	Title	Description
path	Path	Path of the file/directory
separator	Separator	CSV Separator
header	Header	Whether the file has a header row
dropSpecialCharacterInColumnName	Drop Special Character In ColumnName	Whether to drop special characters and spaces in column names.
mode	Mode	Mode for dealing with corrupt records during parsing.
enforceSchema	Enforce Schema	If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true.
addInputFileName	Whether to Add Input File Name as Column in the Dataframe	Adds a new field: input_file_name
encoding	Encoding	Decodes the CSV files by the given encoding type
quote	Quote	Sets a single character used for escaping quoted values where the separator can be part of the value
escape	Escape	Sets a single character used for escaping quotes inside an already quoted value.
schema	InferSchema
outputColNames	Column Names for the CSV	New Columns from CSV
outputColTypes	Column Types for the CSV	Data Type of the Output Columns
outputColFormats	Column Formats for the CSV	Format of the Output Columns
properties	Properties
extraOptionsKeys	Properties Name	Extra options/properties available while executing Read CSV.
extraOptionsValues	Properties Value	Config values for the corresponding property names

Details¶

Read CSV Details¶

This node reads CSV files and creates a DataFrame from them. It can read either a single file or a directory containing multiple files. The user can configure the fields below to control how the file is parsed.

The user can choose the Output storage level from the dropdown. The available options are:

MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk, and read them from there when they are needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that do not fit in memory to disk instead of recomputing them on the fly each time they’re needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.

The user needs to provide a data file Path to read the data from. This is a required field.

The user can choose the Separator used in the data file to parse it. The default separator is a comma ( , ).

In the Header field, one can choose:

true If the data file has a header row.
false Otherwise.

In the Drop special character in column name field, one can choose:

true If you want to remove the special characters from column names.
false Otherwise.
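
The exact cleaning rule the node applies is not stated here. As a rough illustration only, the sketch below assumes that every character other than a letter, digit, or underscore is dropped. The function name `drop_special_characters` and the sample column names are hypothetical.

```python
import re


def drop_special_characters(column_name: str) -> str:
    """Remove every character that is not a letter, digit, or underscore.

    Assumption: this rule is illustrative, not necessarily the node's own.
    """
    return re.sub(r"[^0-9A-Za-z_]", "", column_name)


# Hypothetical column names containing spaces and special characters.
columns = ["Supplier Name", "Lead-Time (days)", "Price$Index"]
cleaned = [drop_special_characters(c) for c in columns]
print(cleaned)
```

With the option set to false, the column names would be kept exactly as they appear in the header row.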

In the Mode field, one can choose from the below options in the dropdown:

PERMISSIVE When the parser meets a corrupt field in a record, it sets the value of the field to NULL and continues to the next record.
DROPMALFORMED Ignores the whole corrupted record.
FAILFAST Throws an exception when it meets a corrupted record.
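
The difference between the three modes can be sketched in plain Python. This is a simplified model, not the node's parser: it assumes a field is "corrupt" when it cannot be converted to its expected type, and the helper `parse_rows` and the sample data are hypothetical.

```python
import csv
import io


def parse_rows(text, types, mode="PERMISSIVE"):
    """Parse CSV text, converting each field with the matching entry of `types`.

    Assumption: a field that fails conversion counts as corrupt.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        parsed, corrupt = [], False
        for raw, convert in zip(record, types):
            try:
                parsed.append(convert(raw))
            except ValueError:
                corrupt = True
                parsed.append(None)  # PERMISSIVE keeps the record with a NULL field
        if corrupt and mode == "FAILFAST":
            raise ValueError(f"Malformed record: {record}")
        if corrupt and mode == "DROPMALFORMED":
            continue  # skip the whole record
        rows.append(parsed)
    return rows


# The second record has a non-numeric value in an integer column.
data = "S1,10\nS2,ten\nS3,15\n"
permissive = parse_rows(data, [str, int], "PERMISSIVE")
dropped = parse_rows(data, [str, int], "DROPMALFORMED")
print(permissive)
print(dropped)

try:
    parse_rows(data, [str, int], "FAILFAST")
except ValueError as err:
    print("FAILFAST:", err)
```

PERMISSIVE is the most forgiving choice for exploratory work, while FAILFAST is useful when a malformed record should stop the workflow.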

In the Enforce Schema field, one can choose:

true The specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored.
false The schema will be validated against all headers in CSV files when the header option is set to true.

In the Whether to add input file as a column in DataFrame field, one can choose:

true There will be a new column added to the DataFrame at the end, which can be seen in the schema columns. One can enter the name of this column.
false This functionality is disabled, and the DataFrame consists of only the columns read from the data file.

In the Encoding field, one can specify the encoding used to read the files. The default is UTF-8.

The Quote field sets a single character used for escaping quoted values where the separator can be part of the value. The default value is a double quote ( " ).

The Escape field sets a single character used for escaping quotes inside an already quoted value. The default value is a backslash ( \ ).
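
To see how the quote and escape characters interact, the sketch below parses one line with Python's standard `csv` module, used here only as a stand-in for the node's parser. The sample line is made up for illustration.

```python
import csv
import io

# One record: the second field contains the separator, the third an escaped quote.
line = 'S1,"Supplier A, Inc.","5\\" bolt"\n'

reader = csv.reader(
    io.StringIO(line),
    delimiter=",",      # Separator
    quotechar='"',      # Quote
    escapechar="\\",    # Escape
    doublequote=False,  # quotes inside a value are escaped, not doubled
)
fields = next(reader)
print(fields)
```

The comma inside the quoted second field is kept as data rather than treated as a separator, and the escaped quote in the third field does not end the quoted value.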

After the above options are set, one can click on InferSchema to see the final columns.

Users can still add or delete columns using the + button next to InferSchema and the - button next to the column names.
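
The field descriptions above closely match the options of Apache Spark's CSV reader. Assuming the node is built on that reader (which this page does not confirm), the field values might translate into reader options roughly as sketched below. The helpers `build_reader_options` and `read_csv` and the mapping itself are hypothetical; `sep`, `header`, `mode`, `enforceSchema`, `encoding`, `quote`, and `escape` are Spark CSV option names.

```python
def build_reader_options(fields: dict) -> dict:
    """Translate node field values into Spark CSV reader option names.

    Assumption: this mapping is illustrative, not the node's actual code.
    """
    mapping = {
        "separator": "sep",
        "header": "header",
        "mode": "mode",
        "enforceSchema": "enforceSchema",
        "encoding": "encoding",
        "quote": "quote",
        "escape": "escape",
    }
    return {mapping[name]: value for name, value in fields.items() if name in mapping}


def read_csv(spark, fields: dict):
    """Read the configured path with the translated options (needs PySpark)."""
    from pyspark.sql.functions import input_file_name  # imported lazily

    df = spark.read.options(**build_reader_options(fields)).csv(fields["path"])
    if fields.get("addInputFileName"):
        # Append the source file of each row as an extra column.
        df = df.withColumn("input_file_name", input_file_name())
    return df


node_fields = {
    "path": "/path/to/your/file.csv",
    "separator": ",",
    "header": True,
    "mode": "PERMISSIVE",
}
options = build_reader_options(node_fields)
print(options)
```

Only `build_reader_options` runs here; `read_csv` would additionally require a Spark session and is shown to indicate where the options and the input file name column could be applied.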

Examples¶

Read CSV Node Example¶

Given a CSV file with the following data:

SupplierID,SupplierName,Region,YearsInBusiness,LeadTime,PriceIndex,OrderFulfillmentTime,OverallCost,OrderCancellations,CustomerRating
S1,Supplier A,Region 1,10,5,1.2,3,100,2,4
S2,Supplier B,Region 2,15,7,1.1,4,120,1,5

If you configure the Read CSV node as follows:

Path: /path/to/your/file.csv

Separator: ,

Header: true

The output would be a DataFrame with the following schema:

Column Name	Data Type
SupplierID	String
SupplierName	String
Region	String
YearsInBusiness	Integer
LeadTime	Integer
PriceIndex	Double
OrderFulfillmentTime	Integer
OverallCost	Integer
OrderCancellations	Integer
CustomerRating	Integer
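
The column types in this schema can be reproduced with a small type-inference sketch in plain Python. It is a simplification that only distinguishes Integer, Double, and String, and it is not the node's actual inference logic; the helper `infer_type` is hypothetical.

```python
import csv
import io

CSV_TEXT = """SupplierID,SupplierName,Region,YearsInBusiness,LeadTime,PriceIndex,OrderFulfillmentTime,OverallCost,OrderCancellations,CustomerRating
S1,Supplier A,Region 1,10,5,1.2,3,100,2,4
S2,Supplier B,Region 2,15,7,1.1,4,120,1,5
"""


def infer_type(values):
    """Return the narrowest of Integer, Double, String that fits all values."""
    for name, convert in (("Integer", int), ("Double", float)):
        try:
            for value in values:
                convert(value)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "String"


rows = list(csv.reader(io.StringIO(CSV_TEXT)))
header, records = rows[0], rows[1:]

# Infer one type per column from the values in that column.
schema = {
    column: infer_type([record[i] for record in records])
    for i, column in enumerate(header)
}
for column, data_type in schema.items():
    print(column, data_type)
```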