Save to Pinecone
================

This process involves storing document embeddings in a Pinecone vector database for efficient similarity search and retrieval.

Input
--------------
Takes a DataFrame as input.

Output
--------------
Saves the data to a Pinecone vector database.

Type
--------- 

pyspark

Class
--------- 

fire.nodes.gai.NodeSaveToPineconeDB

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
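      * - contentCol
        - Select Content Column
        - DataFrame column containing the text content to store.
      * - embeddingsCol
        - Select Embeddings Column
        - DataFrame column containing the embedding vectors.
      * - pineconeConnection
        - Select Pinecone Connection
        - Connection used to authenticate with the Pinecone API.
      * - pineconeIndexName
        - Index Name
        - Name of the Pinecone index to write to. The index is created if it does not exist.
      * - indexNameSpace
        - Index Namespace
        - Namespace within the Pinecone index, used to group related vectors.
      * - dimension
        - Dimension
        - Dimensionality of the embeddings. For text-embedding-ada-002 and amazon.titan-embed-text-v1, the dimension is 1536; for multilingual-e5-large, it is 1024.
      * - metric
        - Metric
        - Distance metric the vector index uses when querying (cosine, dotproduct, or euclidean).
      * - metadataCols
        - Metadata Columns
        - DataFrame columns to store as metadata alongside each vector (optional).
      * - uniqueIdCols
        - Unique ID Columns
        - DataFrame columns used as unique identifiers for the stored vectors (optional).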


Details
-------
Save to Pinecone Node Details
+++++++++++++++++++++++++++++

The Save to Pinecone node stores document embeddings in a Pinecone vector database, enabling efficient similarity search and retrieval. It takes a DataFrame as input, containing text content and corresponding embeddings, and saves them to a specified Pinecone index. This node is designed for PySpark-based workflows, making it ideal for integrating vector-based storage and search into data pipelines.


General:
+++++++++++++++

Select Content Column: Specifies the DataFrame column containing the text content to be stored in the Pinecone vector database. This is required to associate text with the embeddings.


Select Embeddings Column: Specifies the DataFrame column containing the embeddings (vector representations) of the text content. This is required for storage in the vector database.


Select Pinecone Connection: Specifies the connection details for the Pinecone API (e.g., API key, environment). This is required to authenticate and access the Pinecone service.


Index Name: Specifies the name of the Pinecone index where the embeddings will be stored. Default is 'document-index'. If the index does not exist, it will be created.


Index Namespace: Specifies the namespace within the Pinecone index to organize the stored embeddings. Default is 'document-namespace'. This is optional and used to group related vectors.


Dimension: Specifies the dimensionality of the embeddings. Default is 1024 (suitable for models like multilingual-e5-large). For other models, such as text-embedding-ada-002 or amazon.titan-embed-text-v1, the dimension is typically 1536. This must match the embedding model used.
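
Because the Dimension value must match the embedding model, it can help to check vector lengths before the save step. The following is a minimal sketch in plain Python; ``check_dimension`` is a hypothetical helper and not part of the node.

.. code-block:: python

    def check_dimension(embedding, expected_dimension):
        """Return the embedding unchanged, or raise if its length is wrong."""
        actual = len(embedding)
        if actual != expected_dimension:
            raise ValueError(
                f"embedding has {actual} values, index expects {expected_dimension}"
            )
        return embedding

    # A 1024-value vector passes a check against Dimension = 1024.
    check_dimension([0.0] * 1024, 1024)

A 1536-value vector (for example, one produced by text-embedding-ada-002) would raise ``ValueError`` in this check when Dimension is left at 1024.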


Metric: Specifies the distance metric used for similarity search in the Pinecone index. Options are:


* cosine: Cosine similarity metric.
* dotproduct: Dot product metric.
* euclidean: Euclidean distance metric (default).
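
For reference, the three metrics can be written out in plain Python as below. These are the textbook definitions, shown only to illustrate what each option measures; how Pinecone computes and ranks scores internally may differ.

.. code-block:: python

    import math

    def dot_product(a, b):
        """Sum of element-wise products; larger means more similar."""
        return sum(x * y for x, y in zip(a, b))

    def cosine_similarity(a, b):
        """Dot product of the two vectors after normalizing their lengths."""
        norm_a = math.sqrt(dot_product(a, a))
        norm_b = math.sqrt(dot_product(b, b))
        return dot_product(a, b) / (norm_a * norm_b)

    def euclidean_distance(a, b):
        """Straight-line distance; smaller means more similar."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

Cosine similarity ignores vector length, so ``[1, 2]`` and ``[2, 4]`` count as identical in direction, while the dot product and Euclidean distance are both sensitive to magnitude.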


Metadata Columns: Specifies one or more DataFrame columns to include as metadata for the stored embeddings. This is optional and allows additional context to be stored alongside the vectors.


Unique ID Columns: Specifies one or more DataFrame columns to use as unique identifiers for the stored embeddings. This is optional; if provided, these IDs are used to reference the vectors in the Pinecone index.
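
The fields above can be pictured as a mapping from one DataFrame row to one Pinecone record with an ``id``, ``values``, and ``metadata``. The sketch below is illustrative only: ``build_record`` is a hypothetical helper, and three details are assumptions rather than documented node behavior: multiple ID columns are joined with ``-``, the text content is kept under a metadata key named after the content column, and a random UUID stands in for the default ID.

.. code-block:: python

    import uuid

    def build_record(row, content_col, embeddings_col,
                     metadata_cols=(), unique_id_cols=()):
        """Turn a row (a dict of column name to value) into a record dict."""
        metadata = {col: row[col] for col in metadata_cols}
        # Assumption: the text content is stored alongside the other metadata.
        metadata[content_col] = row[content_col]
        if unique_id_cols:
            # Assumption: several ID columns are combined into a single string.
            record_id = "-".join(str(row[col]) for col in unique_id_cols)
        else:
            # Assumption: a random UUID is used when no ID columns are given.
            record_id = str(uuid.uuid4())
        return {
            "id": record_id,
            "values": list(row[embeddings_col]),
            "metadata": metadata,
        }

    row = {"doc_id": "doc1", "content": "A short report", "embeddings": [0.12, 0.45]}
    record = build_record(row, "content", "embeddings", unique_id_cols=["doc_id"])

Here ``record["id"]`` is ``"doc1"`` and ``record["values"]`` is the embedding list taken from the row.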


Output:
+++++++++++++++

The node saves the embeddings, content, and optional metadata to the specified Pinecone index and namespace. It does not produce a DataFrame output but confirms the successful storage of data in the Pinecone vector database, ready for similarity search and retrieval.


Examples
--------
Example: Save to Pinecone Node
++++++++++++++++++++++++++++++


Input:
+++++++++++++++

The input DataFrame contains the following columns:


* doc_id: ["doc1", "doc2"]
* content: ["This is a report about climate change...", "A study on AI advancements..."]
* embeddings: [[0.12, 0.45, ...], [0.23, 0.67, ...]] (1024-dimensional vectors)
* category: ["Environment", "Technology"]


The Save to Pinecone node is configured as follows:


* Select Content Column: content
* Select Embeddings Column: embeddings
* Select Pinecone Connection: Configured with a valid Pinecone API key and environment
* Index Name: document-index
* Index Namespace: document-namespace
* Dimension: 1024
* Metric: cosine
* Metadata Columns: [category]
* Unique ID Columns: [doc_id]
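
The same configuration can be written against the field names from the Fields table, which makes it easy to check that every referenced column exists in the input DataFrame. This is a sketch only: the dictionary layout and the ``missing_columns`` helper are hypothetical, not the node's actual storage format, and ``pineconeConnection`` is omitted because it refers to a saved connection rather than a column.

.. code-block:: python

    node_config = {
        "contentCol": "content",
        "embeddingsCol": "embeddings",
        "pineconeIndexName": "document-index",
        "indexNameSpace": "document-namespace",
        "dimension": 1024,
        "metric": "cosine",
        "metadataCols": ["category"],
        "uniqueIdCols": ["doc_id"],
    }

    def missing_columns(config, df_columns):
        """List configured column names that are absent from the DataFrame."""
        referenced = [config["contentCol"], config["embeddingsCol"]]
        referenced += config["metadataCols"] + config["uniqueIdCols"]
        return [col for col in referenced if col not in df_columns]

    # With a PySpark DataFrame, df.columns supplies the second argument.
    missing = missing_columns(
        node_config, ["doc_id", "content", "embeddings", "category"]
    )

For the example DataFrame, ``missing`` is an empty list because all four referenced columns are present.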


Output:
+++++++++++++++


The node stores the embeddings in the Pinecone vector database under the 'document-index' index and 'document-namespace' namespace. The stored data includes:


* Vectors from the embeddings column (1024-dimensional).
* Corresponding text from the content column.
* Metadata from the category column (e.g., "Environment", "Technology").
* Unique IDs from the doc_id column (e.g., "doc1", "doc2").


The Pinecone index is now ready for similarity search queries using the cosine metric.


Explanation:
+++++++++++++++


* The node processes the DataFrame, extracting the content and embeddings columns to store in the Pinecone vector database.
* The doc_id column supplies the unique identifier for each vector, so individual vectors can be retrieved by ID.
* The category column is stored as metadata, providing additional context for each vector (e.g., "Environment" for doc1).
* The Index Name ('document-index') and Index Namespace ('document-namespace') organize the stored vectors in the Pinecone database.
* The Dimension is set to 1024 to match the embedding model (e.g., multilingual-e5-large), and the Metric is set to cosine for similarity searches.
* If Metadata Columns were left empty, no metadata would be stored. If Unique ID Columns were left empty, default IDs would be generated for the vectors instead of custom ones.