Read CSV¶
Reads CSV files and creates a DataFrame from them.
Input¶
Reads CSV text files.
Output¶
Creates a DataFrame from the data read and sends it to the node's output.
Type¶
dataset
Class¶
fire.nodes.dataset.NodeDatasetCSV
Fields¶
| Name | Title | Description |
|---|---|---|
| path | Path | Path of the file/directory |
| separator | Separator | CSV Separator |
| header | Header | Whether the file has a header row |
| dropSpecialCharacterInColumnName | Drop Special Character In ColumnName | Whether to drop the Special Characters and Spaces in Column Name. |
| mode | Mode | Mode for dealing with corrupt records during parsing. |
| enforceSchema | Enforce Schema | If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. |
| addInputFileName | Whether to Add Input File Name as Column in the Dataframe | Add the new field: input_file_name |
| encoding | Encoding | Decodes the CSV files by the given encoding type |
| quote | Quote | Sets a single character used for escaping quoted values where the separator can be part of the value |
| escape | Escape | Sets a single character used for escaping quotes inside an already quoted value. |
| schema | InferSchema | |
| outputColNames | Column Names for the CSV | New Columns from CSV |
| outputColTypes | Column Types for the CSV | Data Type of the Output Columns |
| outputColFormats | Column Formats for the CSV | Format of the Output Columns |
| properties | Properties | |
| extraOptionsKeys | Properties Name | Extra options/properties available while executing in Read CSV. |
| extraOptionsValues | Properties Value | Config Values for the Corresponding properties name |
Details¶
Read CSV Details¶
This node reads CSV files and creates a DataFrame from them. It can read either a single file or a directory containing multiple files. The user can configure the below fields to parse the file.
The user can choose the Output storage level from the drop down. The options in the dropdown can be one of the following:
MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk, and read them from there when they are needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that do not fit in memory to disk instead of recomputing them on the fly each time they’re needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
The user needs to provide a data file Path to read the data from. This is a required field.
The user can choose the Separator used in the data file to parse it. The default separator is a comma ( , ).
In the Header field, one can choose:
true if the data file has a header.
false Otherwise.
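The effect of the Separator and Header settings can be sketched with Python's standard `csv` module. This is only an illustration of the behavior described above, not the node's implementation; the function `read_rows` and the positional names `_c0`, `_c1` used when there is no header are assumptions made for this example.

```python
import csv
import io


def read_rows(text, separator=",", header=True):
    """Parse CSV text and return (column_names, data_rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=separator))
    if not rows:
        return [], []
    if header:
        # The first row supplies the column names and is not data.
        return rows[0], rows[1:]
    # No header row: every row is data, so generate positional names.
    # The _c0, _c1, ... naming is an assumption for this sketch.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows


sample = "SupplierID;Region\nS1;Region 1\nS2;Region 2\n"

cols, data = read_rows(sample, separator=";", header=True)
cols_nohdr, data_nohdr = read_rows(sample, separator=";", header=False)
```

With Header set to true the first line becomes the column names; with false it is treated as an ordinary data row, so the second call returns three rows instead of two.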
In the Drop special character in column name field, one can choose:
true If you want to remove the special characters from column names.
false Otherwise.
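As a rough sketch of what dropping special characters from column names could look like, the snippet below keeps only letters, digits, and underscores. The exact set of characters the node removes is not stated here, so this rule and the helper name `clean_column_name` are assumptions.

```python
import re


def clean_column_name(name):
    """Remove spaces and special characters from a column name.

    Assumed rule for this sketch: keep only ASCII letters, digits
    and underscores.
    """
    return re.sub(r"[^0-9A-Za-z_]", "", name)


raw_names = ["Supplier Name", "Lead-Time (days)", "Price_Index"]
cleaned = [clean_column_name(n) for n in raw_names]
```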
In the Mode field, one can choose from the below options in the dropdown:
PERMISSIVE When the parser meets a corrupt field in a record, it sets the value of the field to NULL and continues to the next record.
DROPMALFORMED Ignores the whole corrupted record.
FAILFAST Throws an exception when it meets corrupted records.
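The three modes can be illustrated with a small plain-Python parser that expects every field to be an integer. It is a simplified sketch of the behavior described above, not the node's parser; `parse_ints` and `to_int_or_none` are hypothetical helpers.

```python
import csv
import io


def to_int_or_none(value):
    """Return int(value), or None when the field cannot be parsed."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ints(text, mode="PERMISSIVE"):
    """Parse rows of integers, handling corrupt rows according to mode."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        parsed = [to_int_or_none(v) for v in raw]
        corrupt = any(p is None for p in parsed)
        if corrupt and mode == "FAILFAST":
            # Stop at the first corrupt record.
            raise ValueError(f"malformed record: {raw}")
        if corrupt and mode == "DROPMALFORMED":
            # Skip the whole record.
            continue
        # PERMISSIVE (or a clean row): keep the row, bad fields stay None.
        rows.append(parsed)
    return rows


text = "1,2\n3,oops\n5,6\n"
permissive = parse_ints(text, "PERMISSIVE")
dropped = parse_ints(text, "DROPMALFORMED")
```

Here `permissive` keeps all three rows with `None` in place of `oops`, `dropped` keeps only the two clean rows, and calling `parse_ints(text, "FAILFAST")` raises on the second row.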
In the Enforce Schema field, one can choose:
true The specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored.
false The schema will be validated against all headers in CSV files when the header option is set to true.
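The difference between the two Enforce Schema settings, for a file that has a header row, can be sketched as follows. The exact-match comparison and the helper `resolve_columns` are simplifying assumptions for illustration; the node's actual validation rules may differ (for example in case sensitivity).

```python
def resolve_columns(schema_names, header_names, enforce_schema=True):
    """Return the column names to use for the resulting DataFrame.

    enforce_schema=True:  apply the schema and ignore the header row.
    enforce_schema=False: require the header row to match the schema.
    """
    if not enforce_schema and list(header_names) != list(schema_names):
        raise ValueError(
            f"header {list(header_names)} does not match "
            f"schema {list(schema_names)}"
        )
    return list(schema_names)


schema = ["id", "amount"]
header = ["id", "total"]  # differs from the schema in the second column

# Enforced: the mismatch is ignored and the schema names win.
forced = resolve_columns(schema, header, enforce_schema=True)
```

Calling `resolve_columns(schema, header, enforce_schema=False)` with the same inputs raises an error, because the header is checked against the schema.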
In the Whether to add input file as a column in DataFrame field, one can choose:
true There will be a new column added to the DataFrame at the end, which can be seen in the schema columns. One can enter the name of this column.
false This functionality is disabled, and the DataFrame consists of only the columns read from the data file.
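A minimal sketch of the idea, using only the Python standard library: each record read from a directory of CSV files gets one extra trailing column holding the path of the file it came from. The function name `read_dir_with_file_name` and the sample files are made up for this example; `input_file_name` is the column name listed in the Fields table.

```python
import csv
import os
import tempfile


def read_dir_with_file_name(directory, column="input_file_name"):
    """Read every CSV file in a directory and tag rows with their source."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, newline="", encoding="utf-8") as fh:
            for record in csv.DictReader(fh):
                # Added last, so the new column appears at the end.
                record[column] = path
                records.append(record)
    return records


# Write two small files to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    for fname, body in [("a.csv", "id,v\n1,x\n"), ("b.csv", "id,v\n2,y\n")]:
        with open(os.path.join(tmp, fname), "w", encoding="utf-8") as fh:
            fh.write(body)
    records = read_dir_with_file_name(tmp)
```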
In the ENCODING field, one can specify the encoding type to be used for reading the files. By default, it is set to UTF-8.
The QUOTE field sets a single character used for escaping quoted values where the separator can be part of the value. The default value for this is ( " ), a double quote.
The ESCAPE field sets a single character used for escaping quotes inside an already quoted value. The default value for this is ( \ ), a backslash.
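How encoding, quote, and escape interact can be seen in a short example using Python's standard `csv` module with the defaults mentioned above (UTF-8, double quote, backslash). It illustrates the concepts only; the sample bytes are invented and the node's parser may differ in details.

```python
import csv
import io

# Raw file content as bytes. The first field contains the separator,
# the second contains quotes escaped with a backslash.
raw_bytes = 'name,comment\n"Müller, A","said \\"hi\\""\n'.encode("utf-8")

# Encoding: turn the bytes into text before parsing.
text = raw_bytes.decode("utf-8")

reader = csv.reader(
    io.StringIO(text),
    delimiter=",",
    quotechar='"',      # Quote: wraps values that contain the separator
    escapechar="\\",    # Escape: marks a literal quote inside a quoted value
    doublequote=False,  # quotes are escaped, not doubled
)
rows = list(reader)
```

The comma inside `"Müller, A"` does not split the field because the value is quoted, and the escaped quotes around `hi` are kept as literal characters.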
After the above options are set, one can click on InferSchema to see the final columns.
Users can still add or delete columns using the + button next to InferSchema and the - button next to the column names.
Examples¶
Read CSV Node Example¶
Given a CSV file with the following data:
SupplierID,SupplierName,Region,YearsInBusiness,LeadTime,PriceIndex,OrderFulfillmentTime,OverallCost,OrderCancellations,CustomerRating
S1,Supplier A,Region 1,10,5,1.2,3,100,2,4
S2,Supplier B,Region 2,15,7,1.1,4,120,1,5
If you configure the Read CSV node as follows:
Path: /path/to/your/file.csv
Separator: ,
Header: true
The output would be a DataFrame with the following schema:
| Column Name | Data Type |
|---|---|
| SupplierID | String |
| SupplierName | String |
| Region | String |
| YearsInBusiness | Integer |
| LeadTime | Integer |
| PriceIndex | Double |
| OrderFulfillmentTime | Integer |
| OverallCost | Integer |
| OrderCancellations | Integer |
| CustomerRating | Integer |
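The schema above can be reproduced with a simple type-inference sketch in plain Python. It assumes a naive rule (try Integer, then Double, otherwise String) that is adequate for this sample; the node's InferSchema logic is not shown in this document and may be more elaborate. `infer_type` is a hypothetical helper.

```python
import csv
import io

DATA = """SupplierID,SupplierName,Region,YearsInBusiness,LeadTime,PriceIndex,OrderFulfillmentTime,OverallCost,OrderCancellations,CustomerRating
S1,Supplier A,Region 1,10,5,1.2,3,100,2,4
S2,Supplier B,Region 2,15,7,1.1,4,120,1,5
"""


def infer_type(values):
    """Return the narrowest of Integer, Double, String fitting all values."""
    for label, cast in (("Integer", int), ("Double", float)):
        try:
            for value in values:
                cast(value)
            return label
        except ValueError:
            continue
    return "String"


rows = list(csv.reader(io.StringIO(DATA)))
header, body = rows[0], rows[1:]

# Map each column name to the type inferred from its values.
schema = {
    name: infer_type([row[i] for row in body])
    for i, name in enumerate(header)
}
```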