LDA¶

LDA (Latent Dirichlet Allocation) takes a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document.
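One way to produce such count vectors is sketched below, assuming the raw documents are plain text and using Spark ML's Tokenizer and CountVectorizer. The two sample documents and the column names id, text, words and features are made up for illustration. In spark-shell the SparkSession already exists and the builder call can be dropped.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession

// Local session for the sketch; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .appName("LdaInputSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two made-up documents (illustrative data only).
val docs = Seq(
  (0L, "spark lda topic model"),
  (1L, "topic model topic")
).toDF("id", "text")

// Split each document into lowercase, whitespace-separated words.
val tokens = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .transform(docs)

// Learn the vocabulary, then turn each document into a term-count vector.
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(tokens)
val features = cvModel.transform(tokens)

// Each vector has one entry per vocabulary term.
println(cvModel.vocabulary.mkString(", "))
features.select("id", "features").show(false)
```

The resulting features column can then be named in featuresCol.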

Input¶

Takes a DataFrame as input and fits an LDA model on the column named by featuresCol.

Output¶

The fitted LDA model is passed to the next node for prediction or storage.

Type¶

ml-estimator

Class¶

fire.nodes.ml.NodeLDA

Fields¶

Name	Title	Description
featuresCol	Features Column	Features column of type VectorUDT used for model fitting.
k	K	The number of topics to create.
maxIter	Max Iterations	The maximum number of iterations.
optimizer	Optimizer	Optimizer or inference algorithm used to estimate the LDA model. Spark supports "online" and "em".
topicDistributionCol	TopicDistributionColumn	Output column with estimates of the topic mixture distribution for each document.
docConcentration	Doc Concentration	Dirichlet parameter for the prior over document-topic distributions. Smaller values encourage documents to be dominated by a few topics, while larger values lead to more evenly mixed topics.
checkpointInterval	checkpointInterval	Checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means the cache is checkpointed every 10 iterations.
subsamplingRate	subsamplingRate	Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
seed	Seed	Random seed.
maxTermsPerTopic	MaxTermsPerTopic	Number of terms to include per topic.
keepLastCheckpoint	Keep Last Checkpoint	Indicates whether to keep the last checkpoint.
optimizeDocConcentration	Optimize Doc Concentration	Indicates whether docConcentration will be optimized during training.
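As a rough illustration of what these fields mean on the Spark side, the sketch below sets each one on a Spark ML LDA estimator. It assumes the node fields map one-to-one onto the Spark setters of the same name, which this page does not state. All values are arbitrary examples, not recommended defaults. maxTermsPerTopic is not an LDA parameter in Spark, so the sketch only assumes it would be passed to describeTopics once a model is fitted.

```scala
import org.apache.spark.ml.clustering.LDA

// Assumption: the node would pass this to model.describeTopics(...) after fitting.
val maxTermsPerTopic = 5

// Example values only; tune them for your own corpus.
val lda = new LDA()
  .setFeaturesCol("features")                    // featuresCol
  .setK(5)                                       // k
  .setMaxIter(20)                                // maxIter
  .setOptimizer("online")                        // optimizer: "online" or "em"
  .setTopicDistributionCol("topicDistribution")  // topicDistributionCol
  .setDocConcentration(0.1)                      // docConcentration (same value for every topic)
  .setCheckpointInterval(10)                     // checkpointInterval
  .setSubsamplingRate(0.05)                      // subsamplingRate
  .setSeed(42L)                                  // seed
  .setKeepLastCheckpoint(true)                   // keepLastCheckpoint
  .setOptimizeDocConcentration(true)             // optimizeDocConcentration

// Print every parameter with its documentation and current value.
println(lda.explainParams())
```

Building the estimator does not touch any data; the parameters are validated when fit is called.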

Details¶

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
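The cast mentioned above might look like the following sketch. It uses a pattern match rather than asInstanceOf, so a model trained with the online optimizer does not cause a ClassCastException. The three-document corpus and the column names are made up for illustration, and the SparkSession setup can be dropped in spark-shell.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel, LocalLDAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LdaCastSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny made-up corpus: term-count vectors over a 4-term vocabulary.
val corpus = Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
  (1L, Vectors.dense(0.0, 0.0, 3.0, 1.0)),
  (2L, Vectors.dense(1.0, 2.0, 0.0, 1.0))
).toDF("id", "features")

// Train with the EM optimizer so the fitted model is distributed.
val model: LDAModel = new LDA()
  .setK(2)
  .setMaxIter(5)
  .setOptimizer("em")
  .setSeed(1L)
  .fit(corpus)

model match {
  case dist: DistributedLDAModel =>
    // Members that exist only on the distributed model.
    println(s"Training log likelihood: ${dist.trainingLogLikelihood}")
    val local: LocalLDAModel = dist.toLocal
    println(s"Local copy is distributed: ${local.isDistributed}")
  case _: LocalLDAModel =>
    println("Model was trained with the online optimizer; no cast is possible.")
}
```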

More details are available on the Apache Spark ML docs page:

http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

Examples¶

The example below is taken from the Apache Spark ML docs: https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

import org.apache.spark.ml.clustering.LDA

// Assumes an active SparkSession named `spark`, as in spark-shell.

// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")

// Trains an LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)