K-Means

K-means clustering with support for the k-means|| initialization proposed by Bahmani et al.

Input

Takes a DataFrame as input and performs K-Means clustering on it.

Output

The input DataFrame is passed along to the next processors.

Type

ml-estimator

Class

fire.nodes.ml.NodeKMeans

Fields

Name	Title	Description
modelIdentifier	Model Identifier	Starts with $loop, followed by column names separated by underscores. Example: $loop_columnName1_columnName2.
featuresCol	Features Column	Features column of type VectorUDT used for model fitting.
k	K	The number of clusters to create.
maxIter	Max Iterations	The maximum number of iterations.
predictionCol	Prediction Column	The prediction column created during model scoring.
seed	Seed	Random seed.
tol	Tolerance	The convergence tolerance for iterative algorithms.
initMode	initMode	The initialization algorithm mode. Spark ML supports 'random' and 'k-means||'.
initSteps	initSteps	The number of steps for the k-means|| initialization mode. It is ignored when another initialization mode is chosen.
distanceMeasure	distanceMeasure	The distance measure used for clustering. Spark ML supports 'euclidean' and 'cosine'.
weightCol	Weight Column	Column containing per-row weights. If not set, Spark ML treats every row as having weight 1.0.
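
As a rough illustration of how these fields might correspond to Spark ML parameters, the sketch below sets each one through the `KMeans` setter of the same name. The field-to-setter mapping is an assumption based on the shared names, not the node's confirmed implementation. The local SparkSession, the column names (`x`, `y`, `weight`) and the sample values are hypothetical, and `setWeightCol` requires Spark 3.0 or later.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("kmeans-fields").getOrCreate()
import spark.implicits._

val raw = Seq(
  (0.0, 0.1, 1.0),
  (0.2, 0.0, 1.0),
  (9.0, 9.1, 1.0),
  (9.2, 8.9, 1.0)
).toDF("x", "y", "weight")

// featuresCol must be a vector column, so assemble the numeric columns first.
val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(raw)

// Each setter is assumed to correspond to the field of the same name.
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
  .setMaxIter(20)
  .setPredictionCol("prediction")
  .setSeed(1L)
  .setTol(1e-4)
  .setInitMode("k-means||")        // or "random"
  .setInitSteps(2)                 // only used with "k-means||"
  .setDistanceMeasure("euclidean") // or "cosine"
  .setWeightCol("weight")          // Spark 3.0+

val model = kmeans.fit(assembled)
model.transform(assembled).select("x", "y", "prediction").show()
```

The `VectorAssembler` step is only needed when the input DataFrame does not already carry a vector-typed features column.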

Details

k-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.
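
To make the iteration concrete, here is a minimal single-machine sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Scala. It is not the MLlib implementation: it assumes the first k points as initial centers instead of k-means||, uses squared Euclidean distance, and all names (`lloyd`, `nearest`, `sqDist`) are hypothetical. The `maxIter` and `tol` arguments play the same role as the fields of the same name.

```scala
type Point = Vector[Double]

// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to p.
def nearest(p: Point, centers: Vector[Point]): Int =
  centers.indices.minBy(i => sqDist(p, centers(i)))

// Component-wise mean of a non-empty group of points.
def mean(ps: Seq[Point]): Point =
  ps.transpose.map(col => col.sum / ps.size).toVector

def lloyd(data: Vector[Point], k: Int, maxIter: Int, tol: Double): Vector[Point] = {
  require(k > 0 && k <= data.size, "k must be between 1 and the number of points")
  var centers = data.take(k) // naive initialization, an assumption of this sketch
  var iter = 0
  var converged = false
  while (iter < maxIter && !converged) {
    // Assignment step: group points by their nearest center.
    val groups = data.groupBy(p => nearest(p, centers))
    // Update step: move each center to the mean of its group (keep it if the group is empty).
    val updated = centers.indices
      .map(i => groups.get(i).map(g => mean(g)).getOrElse(centers(i)))
      .toVector
    // Converged when no center moved farther than tol.
    converged = centers.zip(updated).forall { case (c, u) => math.sqrt(sqDist(c, u)) <= tol }
    centers = updated
    iter += 1
  }
  centers
}

val data = Vector(Vector(0.0, 0.0), Vector(10.0, 10.0), Vector(0.0, 2.0), Vector(10.0, 12.0))
val centers = lloyd(data, k = 2, maxIter = 20, tol = 1e-4)
println(centers) // Vector(Vector(0.0, 1.0), Vector(10.0, 11.0))
```

Because the result depends on the starting centers, a fixed seed and a careful initialization mode matter in practice.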

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

More details are available on the Apache Spark ML docs page:

https://spark.apache.org/docs/latest/ml-clustering.html#k-means

Examples

The example below is available at: https://spark.apache.org/docs/latest/ml-clustering.html#k-means

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
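
For readers unfamiliar with the metric printed above, the sketch below computes a silhouette score by hand in plain Scala using squared Euclidean distance. It is a direct O(n²) illustration with hypothetical names (`silhouetteScore`, `sqDist`) and made-up sample points, not a description of how `ClusteringEvaluator` computes the value internally.

```scala
type Point = Vector[Double]

// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Mean silhouette over all points; labels(i) is the cluster of points(i).
def silhouetteScore(points: Vector[Point], labels: Vector[Int]): Double = {
  require(points.size == labels.size, "one label per point")
  val clusters: Map[Int, Vector[Int]] = points.indices.toVector.groupBy(i => labels(i))
  require(clusters.size >= 2, "silhouette needs at least two clusters")
  val scores = points.indices.map { i =>
    val own = clusters(labels(i)).filter(_ != i)
    if (own.isEmpty) 0.0 // convention for singleton clusters
    else {
      // a: mean distance to the other members of the point's own cluster
      val a = own.map(j => sqDist(points(i), points(j))).sum / own.size
      // b: smallest mean distance to the members of any other cluster
      val b = (clusters - labels(i)).values
        .map(idx => idx.map(j => sqDist(points(i), points(j))).sum / idx.size)
        .min
      val denom = math.max(a, b)
      if (denom == 0.0) 0.0 else (b - a) / denom
    }
  }
  scores.sum / scores.size
}

val pts = Vector(Vector(0.0, 0.0), Vector(0.0, 1.0), Vector(10.0, 0.0), Vector(10.0, 1.0))
val score = silhouetteScore(pts, Vector(0, 0, 1, 1))
println(f"silhouette = $score%.4f")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points.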