LDA
===========

LDA takes a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document.

Input
--------------

It takes in a DataFrame as input and fits an LDA model on it.

Output
--------------

The fitted LDA model is passed to the next node for prediction or storage.

Type
---------

ml-estimator

Class
---------

fire.nodes.ml.NodeLDA

Fields
---------

.. list-table::
   :widths: 10 5 10
   :header-rows: 1

   * - Name
     - Title
     - Description
   * - featuresCol
     - Features Column
     - Features column of type VectorUDT used for model fitting.
   * - k
     - K
     - The number of topics to create.
   * - maxIter
     - Max Iterations
     - The maximum number of iterations.
   * - optimizer
     - Optimizer
     - Optimizer or inference algorithm used to estimate the LDA model.
   * - topicDistributionCol
     - TopicDistributionColumn
     - Output column with estimates of the topic mixture distribution for each document.
   * - docConcentration
     - Doc Concentration
     - Dirichlet parameter for the prior over document-topic distributions. Smaller values encourage documents to be dominated by a few topics, while larger values lead to more evenly mixed topics.
   * - checkpointInterval
     - checkpointInterval
     - The checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means that the cache is checkpointed every 10 iterations.
   * - subsamplingRate
     - subsamplingRate
     - Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1].
   * - seed
     - Seed
     - Random seed.
   * - maxTermsPerTopic
     - MaxTermsPerTopic
     - Number of terms per topic.
   * - keepLastCheckpoint
     - Keep Last Checkpoint
     - Indicates whether to keep the last checkpoint.
   * - optimizeDocConcentration
     - Optimize Doc Concentration
     - Indicates whether docConcentration will be optimized during training.

Details
-------

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

More details are available on the Apache Spark ML docs page:

http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

Examples
--------

The example below is taken from the Apache Spark ML docs:

https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

.. code-block:: scala

   import org.apache.spark.ml.clustering.LDA

   // Assumes an active SparkSession named `spark`.
   // Loads data.
   val dataset = spark.read.format("libsvm")
     .load("data/mllib/sample_lda_libsvm_data.txt")

   // Trains an LDA model.
   val lda = new LDA().setK(10).setMaxIter(10)
   val model = lda.fit(dataset)

   val ll = model.logLikelihood(dataset)
   val lp = model.logPerplexity(dataset)
   println(s"The lower bound on the log likelihood of the entire corpus: $ll")
   println(s"The upper bound on perplexity: $lp")

   // Describe topics.
   val topics = model.describeTopics(3)
   println("The topics described by their top-weighted terms:")
   topics.show(false)

   // Shows the result.
   val transformed = model.transform(dataset)
   transformed.show(false)
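
The input format described at the top of this page (one count vector of length vocabSize per document) can be illustrated with a minimal plain-Scala sketch. The ``TermCounts`` object and its ``countVector`` method are hypothetical names introduced only for this illustration; they are not part of Fire Insights or Spark, and the sketch returns a plain array rather than a Spark Vector.

```scala
// Hypothetical helper for illustration only; not part of Fire Insights or Spark.
object TermCounts {

  /** Builds a dense count vector of length vocab.size for one tokenized document.
    * Entry i holds the number of times vocab(i) occurs in `tokens`.
    * Tokens that are not in the vocabulary are ignored.
    */
  def countVector(vocab: Seq[String], tokens: Seq[String]): Array[Double] = {
    // Map each vocabulary term to its position in the output vector.
    val index: Map[String, Int] = vocab.zipWithIndex.toMap
    val counts = Array.fill(vocab.size)(0.0)
    // Increment the slot of every token that appears in the vocabulary.
    tokens.foreach { token =>
      index.get(token).foreach(i => counts(i) += 1.0)
    }
    counts
  }
}

// A vocabulary of three terms (vocabSize = 3) and one tokenized document.
val vocab = Seq("spark", "topic", "model")
val doc = Seq("topic", "model", "topic", "lda")

// "spark" occurs 0 times, "topic" twice, "model" once; "lda" is out of vocabulary.
val vector = TermCounts.countVector(vocab, doc)
println(vector.mkString("[", ", ", "]")) // prints [0.0, 2.0, 1.0]
```

In a Spark pipeline, the ``CountVectorizer`` transformer in ``org.apache.spark.ml.feature`` typically produces this kind of term-count vector column from tokenized text, and that column can then be selected as featuresCol.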