LDA
LDA (Latent Dirichlet Allocation) takes a collection of documents as input, specified via the featuresCol parameter. Each document is a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document.
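As a rough illustration of that input format, the sketch below builds a term-count array for one tokenized document in plain Scala. The `TermCounts.countVector` helper, the three-word vocabulary, and the sample tokens are made up for illustration; in a Spark pipeline a feature transformer such as CountVectorizer typically produces these vectors.

```scala
// Hypothetical helper (not part of Spark or of this node): turns a tokenized
// document into a term-count array of length vocabSize.
object TermCounts {
  // Assumes vocab maps each term to a distinct index in 0 until vocab.size.
  def countVector(tokens: Seq[String], vocab: Map[String, Int]): Array[Double] = {
    val counts = Array.fill(vocab.size)(0.0)
    // Terms that are not in the vocabulary are ignored.
    tokens.foreach(t => vocab.get(t).foreach(i => counts(i) += 1.0))
    counts
  }
}

// Made-up vocabulary and document, for illustration only.
val vocab = Map("spark" -> 0, "topic" -> 1, "model" -> 2)
val doc = Seq("topic", "model", "topic", "unknown")

val vec = TermCounts.countVector(doc, vocab)
println(vec.mkString("[", ", ", "]")) // prints [0.0, 2.0, 1.0]
```

Here "topic" appears twice and "model" once, so the entries at indices 1 and 2 are 2.0 and 1.0, and the out-of-vocabulary token is dropped.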
Input
It takes a DataFrame as input and fits an LDA model on it.
Output
The fitted LDA model is passed to the next node for prediction or storage.
Type
ml-estimator
Class
fire.nodes.ml.NodeLDA
Fields
| Name | Title | Description |
|---|---|---|
| featuresCol | Features Column | Features column of type VectorUDT used for model fitting. |
| k | K | The number of topics to create. |
| maxIter | Max Iterations | The maximum number of iterations. |
| optimizer | Optimizer | Optimizer or inference algorithm used to estimate the LDA model. |
| topicDistributionCol | TopicDistributionColumn | Output column with estimates of the topic mixture distribution for each document. |
| docConcentration | Doc Concentration | Dirichlet parameter for the prior over document-topic distributions. Smaller values encourage documents to be dominated by a few topics, while larger values lead to more evenly mixed topics. |
| checkpointInterval | checkpointInterval | Checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means that the cache is checkpointed every 10 iterations. |
| subsamplingRate | subsamplingRate | Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1]. |
| seed | Seed | Random seed. |
| maxTermsPerTopic | MaxTermsPerTopic | Maximum number of terms per topic. |
| keepLastCheckpoint | Keep Last Checkpoint | Indicates whether to keep the last checkpoint. |
| optimizeDocConcentration | Optimize Doc Concentration | Indicates whether docConcentration will be optimized during training. |
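For orientation, most of these fields share their names with parameters of Spark ML's LDA estimator. The sketch below sets them through the corresponding Spark setters. It assumes the node forwards each field to the setter of the same name, which this page does not confirm, and every value shown is illustrative, not a recommended default.

```scala
import org.apache.spark.ml.clustering.LDA

// Illustrative values only; the field-to-setter mapping is an assumption.
val lda = new LDA()
  .setFeaturesCol("features")                   // featuresCol
  .setK(10)                                     // k
  .setMaxIter(20)                               // maxIter
  .setOptimizer("online")                       // optimizer: "online" or "em"
  .setTopicDistributionCol("topicDistribution") // topicDistributionCol
  .setDocConcentration(0.1)                     // docConcentration
  .setCheckpointInterval(10)                    // checkpointInterval
  .setSubsamplingRate(0.05)                     // subsamplingRate (online optimizer)
  .setSeed(42L)                                 // seed
  .setKeepLastCheckpoint(true)                  // keepLastCheckpoint (EM optimizer)
  .setOptimizeDocConcentration(true)            // optimizeDocConcentration (online optimizer)

// Print the configured parameters and their documentation.
println(lda.explainParams())
```

maxTermsPerTopic has no setter on the Spark estimator. It presumably corresponds to the argument of `describeTopics` on the fitted model, but that is also an assumption.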
Details
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
More details are available on the Apache Spark ML docs page:
http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda
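The optimizer choice and the cast to DistributedLDAModel can be sketched as follows, using Spark ML directly instead of the node. The tiny in-memory corpus, the application name, and the parameter values are made up for illustration. A pattern match is used in place of a bare cast so the code still behaves sensibly if the model turns out to be local.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("LdaOptimizerSketch")
  .master("local[*]")
  .getOrCreate()

// Made-up corpus: each row is a term-count vector over a 4-term vocabulary.
val dataset = spark.createDataFrame(Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
  (1L, Vectors.dense(0.0, 0.0, 3.0, 1.0)),
  (2L, Vectors.dense(1.0, 2.0, 0.0, 1.0)),
  (3L, Vectors.dense(0.0, 1.0, 2.0, 2.0))
)).toDF("id", "features")

// Select the EM optimizer; "online" would select the online optimizer instead.
val model: LDAModel = new LDA()
  .setK(2)
  .setMaxIter(5)
  .setSeed(1L)
  .setOptimizer("em")
  .fit(dataset)

model match {
  case distributed: DistributedLDAModel =>
    // Members available only on the distributed model.
    println(s"Training log likelihood: ${distributed.trainingLogLikelihood}")
    val local = distributed.toLocal
    println(s"Local copy is distributed: ${local.isDistributed}")
  case _ =>
    println("Model is local; no distributed-only members are available.")
}
```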
Examples
The example below is available at: https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda
import org.apache.spark.ml.clustering.LDA
// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")
// Trains a LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)
val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")
// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)
// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)