LDA
===========

LDA (Latent Dirichlet Allocation) takes a collection of documents as input, specified through the featuresCol parameter. Each document is represented as a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document.
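
The term-count vectors described above are commonly produced with Spark's ``RegexTokenizer`` and ``CountVectorizer``. The following is a minimal sketch written in spark-shell style. The toy documents, the column names ``id``, ``text``, ``words`` and ``features``, and the local ``SparkSession`` are assumptions made for illustration and are not part of this node.

.. code-block:: scala

   import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}
   import org.apache.spark.sql.SparkSession

   // Assumption: a local session; in spark-shell, `spark` already exists.
   val spark = SparkSession.builder()
     .appName("lda-input-sketch")
     .master("local[*]")
     .getOrCreate()
   import spark.implicits._

   // Two toy documents (illustrative data only).
   val docs = Seq(
     (0L, "spark makes big data simple"),
     (1L, "topic models describe documents as mixtures of topics")
   ).toDF("id", "text")

   // Split each document into lowercase word tokens.
   val words = new RegexTokenizer()
     .setInputCol("text")
     .setOutputCol("words")
     .setPattern("\\W+")
     .transform(docs)

   // Learn the vocabulary, then map each token list to a term-count Vector.
   val cvModel = new CountVectorizer()
     .setInputCol("words")
     .setOutputCol("features")
     .fit(words)

   val features = cvModel.transform(words)
   features.select("id", "features").show(false)

The resulting ``features`` column has one entry per vocabulary term, so its length equals ``cvModel.vocabulary.length``, and it can be supplied as featuresCol.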

Input
--------------
The node takes a DataFrame as input and fits an LDA model on it.

Output
--------------
The fitted LDA model is passed to the next node for prediction or storage.
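
How the downstream node stores the model is not specified here. As a rough sketch of what storing and reusing a fitted model looks like in plain Spark ML, the snippet below saves a model and loads it back for prediction. The toy corpus, the temporary path, and the local ``SparkSession`` are assumptions made for illustration.

.. code-block:: scala

   import java.nio.file.Files

   import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}
   import org.apache.spark.ml.linalg.Vectors
   import org.apache.spark.sql.SparkSession

   // Assumption: a local session; in spark-shell, `spark` already exists.
   val spark = SparkSession.builder()
     .appName("lda-persist-sketch")
     .master("local[*]")
     .getOrCreate()
   import spark.implicits._

   // Tiny illustrative corpus: term counts over a 4-term vocabulary.
   val corpus = Seq(
     Vectors.dense(2.0, 1.0, 0.0, 0.0),
     Vectors.dense(3.0, 0.0, 1.0, 0.0),
     Vectors.dense(0.0, 0.0, 2.0, 3.0),
     Vectors.dense(0.0, 1.0, 1.0, 4.0)
   ).map(Tuple1.apply).toDF("features")

   // The default optimizer is "online", which yields a LocalLDAModel.
   val model = new LDA().setK(2).setMaxIter(5).setSeed(1L).fit(corpus)

   // Hypothetical storage location, chosen only for this sketch.
   val path = Files.createTempDirectory("lda-model").resolve("model").toString
   model.write.overwrite().save(path)

   // Load the stored model and use it for prediction.
   val restored = LocalLDAModel.load(path)
   val predictions = restored.transform(corpus)
   predictions.select("topicDistribution").show(false)
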

Type
--------- 

ml-estimator

Class
--------- 

fire.nodes.ml.NodeLDA

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - featuresCol
        - Features Column
        - Features column of type VectorUDT used for model fitting.
      * - k
        - K
        - The number of topics to create.
      * - maxIter
        - Max Iterations
        - The maximum number of iterations.
      * - optimizer
        - Optimizer
        - Optimizer or inference algorithm used to estimate the LDA model. Spark supports "online" and "em".
      * - topicDistributionCol
        - TopicDistributionColumn
        - Output column with estimates of the topic mixture distribution for each document.
      * - docConcentration
        - Doc Concentration
        - Dirichlet parameter for the prior over document-topic distributions. Smaller values encourage documents to be dominated by a few topics, while larger values lead to more evenly mixed topics.
      * - checkpointInterval
        - checkpointInterval
        - The checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations.
      * - subsamplingRate
        - subsamplingRate
        - Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
      * - seed
        - Seed
        - Random seed.
      * - maxTermsPerTopic
        - MaxTermsPerTopic
        - Maximum number of terms listed for each topic.
      * - keepLastCheckpoint
        - Keep Last Checkpoint
        - Indicates whether to keep the last checkpoint when checkpointing is used (applies to the EM optimizer).
      * - optimizeDocConcentration
        - Optimize Doc Concentration
        - Indicates whether docConcentration will be optimized during training (applies to the online optimizer).
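
Most of the fields above correspond to setters on Spark's ``org.apache.spark.ml.clustering.LDA`` estimator. The sketch below shows that correspondence. The values are hypothetical and are not the node's defaults, and how the node treats maxTermsPerTopic internally is an assumption.

.. code-block:: scala

   import org.apache.spark.ml.clustering.LDA

   // Hypothetical values for illustration; they are not the node's defaults.
   val lda = new LDA()
     .setFeaturesCol("features")
     .setK(10)
     .setMaxIter(20)
     .setOptimizer("online")                      // "online" or "em"
     .setTopicDistributionCol("topicDistribution")
     .setDocConcentration(0.1)                    // smaller values favor fewer topics per document
     .setCheckpointInterval(10)
     .setSubsamplingRate(0.05)                    // used by the online optimizer
     .setSeed(42L)
     .setKeepLastCheckpoint(true)                 // used by the EM optimizer
     .setOptimizeDocConcentration(true)           // used by the online optimizer

   // Assumption: maxTermsPerTopic is not a parameter of Spark's LDA estimator.
   // It most likely ends up as the argument of describeTopics on the fitted model,
   // for example model.describeTopics(maxTermsPerTopic).
   val maxTermsPerTopic = 5

   // Print every parameter with its documentation and current value.
   println(lda.explainParams())
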


Details
-------
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
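
The cast mentioned above can be sketched as follows, assuming a small in-memory corpus and a local ``SparkSession`` (both are illustrative assumptions). A pattern match is used instead of a direct ``asInstanceOf`` cast so that the code also handles a model trained with the online optimizer.

.. code-block:: scala

   import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel, LocalLDAModel}
   import org.apache.spark.ml.linalg.Vectors
   import org.apache.spark.sql.SparkSession

   // Assumption: a local session; in spark-shell, `spark` already exists.
   val spark = SparkSession.builder()
     .appName("lda-em-sketch")
     .master("local[*]")
     .getOrCreate()
   import spark.implicits._

   // Tiny illustrative corpus: term counts over a 4-term vocabulary.
   val corpus = Seq(
     Vectors.dense(2.0, 1.0, 0.0, 0.0),
     Vectors.dense(3.0, 0.0, 1.0, 0.0),
     Vectors.dense(0.0, 0.0, 2.0, 3.0),
     Vectors.dense(0.0, 1.0, 1.0, 4.0)
   ).map(Tuple1.apply).toDF("features")

   // Training with the EM optimizer produces a DistributedLDAModel.
   val model: LDAModel = new LDA()
     .setK(2)
     .setMaxIter(5)
     .setOptimizer("em")
     .setSeed(1L)
     .fit(corpus)

   model match {
     case distributed: DistributedLDAModel =>
       // Members available only on the distributed model.
       println(s"Training log likelihood: ${distributed.trainingLogLikelihood}")
       // Convert to a local model, which drops the training-set information.
       val local: LocalLDAModel = distributed.toLocal
       println(s"Local copy is distributed: ${local.isDistributed}")
     case _ =>
       println("The model is already local.")
   }
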


More details are available on the Apache Spark ML documentation page:


http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda


Examples
--------
The example below is taken from the Apache Spark documentation: https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda


.. code-block:: scala

   import org.apache.spark.ml.clustering.LDA

   // Loads data.
   val dataset = spark.read.format("libsvm")
     .load("data/mllib/sample_lda_libsvm_data.txt")

   // Trains a LDA model.
   val lda = new LDA().setK(10).setMaxIter(10)
   val model = lda.fit(dataset)

   val ll = model.logLikelihood(dataset)
   val lp = model.logPerplexity(dataset)
   println(s"The lower bound on the log likelihood of the entire corpus: $ll")
   println(s"The upper bound on perplexity: $lp")

   // Describe topics.
   val topics = model.describeTopics(3)
   println("The topics described by their top-weighted terms:")
   topics.show(false)

   // Shows the result.
   val transformed = model.transform(dataset)
   transformed.show(false)