Split With Stratified Sampling

This node splits the incoming DataFrame into two DataFrames using stratified sampling. It takes the sampling fraction to apply to each stratum.

Input

It takes a DataFrame as input.

Output

The input DataFrame is split into two DataFrames, both of which are output.

Type

transform

Class

fire.nodes.util.SplitWithStratifiedSampling

Fields

Name          Title         Description

keyInputCol   Column Name   column that defines strata

fraction      Fraction      sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero

seed          Seed          random seed

Details

This node uses stratified sampling, which is the preferred way to sample from populations with varying subpopulation sizes.

It returns a stratified sample, drawn without replacement, based on the fraction given for each stratum.

More details are available at: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sampleBy
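
The idea of a stratified split can be illustrated with a minimal pure-Python sketch. This is not the node's implementation: the function name split_with_stratified_sampling, the list-of-dicts row format, and the "label" column are hypothetical and chosen only for illustration. The sketch also draws an exact count from each stratum, whereas Spark's DataFrame.sampleBy (linked above) samples each row independently, so its sample sizes are only approximately proportional to the fraction.

```python
import random
from collections import defaultdict


def split_with_stratified_sampling(rows, key, fraction, seed=None):
    """Split rows into (sample, remainder), sampling each stratum separately.

    rows     -- list of dicts (a stand-in for a DataFrame; an assumption)
    key      -- name of the column that defines the strata
    fraction -- sampling fraction applied to every stratum
    seed     -- random seed, for reproducible splits
    """
    rng = random.Random(seed)

    # Group the rows by the value of the strata column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample, remainder = [], []
    for members in strata.values():
        # Number of rows to draw from this stratum, without replacement.
        k = round(len(members) * fraction)
        chosen = set(rng.sample(range(len(members)), k))
        for i, row in enumerate(members):
            (sample if i in chosen else remainder).append(row)
    return sample, remainder


# A population with unequal subpopulation sizes: 80 rows of "a", 20 of "b".
rows = [{"label": "a", "id": i} for i in range(80)]
rows += [{"label": "b", "id": i} for i in range(20)]

sample, rest = split_with_stratified_sampling(rows, "label", 0.25, seed=42)
print(len(sample), len(rest))  # 25 75
```

Because each stratum is sampled separately, the 4:1 ratio between "a" and "b" is preserved in both outputs, which a simple random split of 100 rows would not guarantee.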