LDA

LDA (Latent Dirichlet Allocation) takes a collection of documents as input, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document.
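To make the input format concrete, the following is a minimal plain-Scala sketch of how one tokenized document maps onto a term-count vector of length vocabSize. The object name TermCountSketch, the helper termCounts, and the sample vocabulary are hypothetical and used only for illustration; in a Spark pipeline this step is typically handled by a feature transformer such as CountVectorizer rather than by hand.

```scala
object TermCountSketch {
  // Hypothetical helper: counts how often each vocabulary term occurs in a document.
  // The result has one entry per vocabulary term, in vocabulary order.
  def termCounts(vocabulary: Seq[String], tokens: Seq[String]): Array[Double] = {
    val index = vocabulary.zipWithIndex.toMap
    val counts = Array.fill(vocabulary.size)(0.0)
    // Tokens that are not part of the vocabulary are ignored.
    tokens.foreach(token => index.get(token).foreach(i => counts(i) += 1.0))
    counts
  }

  def main(args: Array[String]): Unit = {
    val vocabulary = Seq("spark", "topic", "model") // vocabSize = 3
    val document = Seq("spark", "model", "spark", "lda")
    // prints [2.0, 0.0, 1.0]
    println(termCounts(vocabulary, document).mkString("[", ", ", "]"))
  }
}
```

Here "spark" occurs twice, "topic" never, and "model" once; "lda" is dropped because it is not in the vocabulary.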

Input

Takes a DataFrame as input and fits an LDA model on it.

Output

The fitted LDA model is passed to the next node for prediction or storage.

Type

ml-estimator

Class

fire.nodes.ml.NodeLDA

Fields

Name

Title

Description

featuresCol

Features Column

Features column of type VectorUDT used for model fitting.

k

K

The number of topics to create.

maxIter

Max Iterations

The maximum number of iterations.

optimizer

Optimizer

Optimizer or inference algorithm used to estimate the LDA model. Spark ML supports "online" and "em".

topicDistributionCol

TopicDistributionColumn

Output column with estimates of the topic mixture distribution for each document.

docConcentration

Doc Concentration

Dirichlet parameter for the prior over document-topic distributions. Smaller values encourage documents to be dominated by a few topics, while larger values lead to more evenly mixed topics.

checkpointInterval

checkpointInterval

Checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means the cache is checkpointed every 10 iterations.

subsamplingRate

subsamplingRate

Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].

seed

Seed

Random seed.

maxTermsPerTopic

MaxTermsPerTopic

Maximum number of terms to show for each topic.

keepLastCheckpoint

Keep Last Checkpoint

Indicates whether to keep the last checkpoint when checkpointing is used (EM optimizer only).

optimizeDocConcentration

Optimize Doc Concentration

Indicates whether docConcentration is optimized during training (online optimizer only).
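As a rough guide to how these fields relate to the underlying Spark ML estimator, the sketch below sets each one through the corresponding LDA setter. The object name LdaConfigSketch and all values are illustrative assumptions, not recommended settings, and how the node applies the fields internally is not documented here. Note that maxTermsPerTopic has no setter on the Spark estimator; it is assumed to correspond to the argument of describeTopics on the fitted model.

```scala
import org.apache.spark.ml.clustering.LDA

object LdaConfigSketch {
  // Builds an LDA estimator with illustrative values for each field.
  def configuredLda(): LDA = new LDA()
    .setFeaturesCol("features")                    // featuresCol
    .setK(10)                                      // k: number of topics
    .setMaxIter(20)                                // maxIter
    .setOptimizer("online")                        // optimizer: "online" or "em"
    .setTopicDistributionCol("topicDistribution")  // topicDistributionCol
    .setDocConcentration(0.1)                      // docConcentration (symmetric prior)
    .setCheckpointInterval(10)                     // checkpointInterval
    .setSubsamplingRate(0.05)                      // subsamplingRate, in (0, 1]
    .setSeed(42L)                                  // seed
    .setKeepLastCheckpoint(true)                   // keepLastCheckpoint (used by "em" only)
    .setOptimizeDocConcentration(true)             // optimizeDocConcentration (used by "online" only)

  // Assumption: maxTermsPerTopic is applied after fitting, for example
  //   model.describeTopics(maxTermsPerTopic)
}
```

Constructing and configuring the estimator does not start a Spark job; training only happens when fit is called on a DataFrame.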

Details

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
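The cast mentioned above might look like the following sketch. The object name LdaModelCastSketch and the helpers fitWithEm and toLocalModel are hypothetical; the sketch assumes a DataFrame with a "features" vector column and uses a pattern match instead of an unchecked cast, so a model produced by the online optimizer is handled as well.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel, LocalLDAModel}
import org.apache.spark.sql.DataFrame

object LdaModelCastSketch {
  // Fits LDA with the EM optimizer, which produces a DistributedLDAModel.
  def fitWithEm(dataset: DataFrame, k: Int): LDAModel =
    new LDA().setK(k).setMaxIter(10).setOptimizer("em").fit(dataset)

  // Converts a distributed model to a local one; a local model is returned unchanged.
  def toLocalModel(model: LDAModel): LocalLDAModel = model match {
    case distributed: DistributedLDAModel => distributed.toLocal
    case local: LocalLDAModel             => local
  }
}
```

A DistributedLDAModel keeps the training data's inferred topic assignments, so converting it with toLocal discards that information in exchange for a smaller model.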

More details are available on the Apache Spark ML documentation page:

http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

Examples

The example below is taken from the Apache Spark documentation: https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda

import org.apache.spark.ml.clustering.LDA

// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")

// Trains a LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)