Save to Pinecone

This process involves storing document embeddings in a Pinecone vector database for efficient similarity search and retrieval.

Input

Takes a DataFrame as input.

Output

Saves the data to the Pinecone vector database.

Type

pyspark

Class

fire.nodes.gai.NodeSaveToPineconeDB

Fields

| Name | Title | Description |
|---|---|---|
| contentCol | Select Content Column | Select Content Column |
| embeddingsCol | Select Embeddings Column | Select Embeddings Column |
| pineconeConnection | Select Pinecone Connection | Select Pinecone Connection |
| pineconeIndexName | Index Name | The name of the Pinecone index to create. |
| indexNameSpace | Index Namespace | Pinecone Index Namespace |
| dimension | Dimension | For text-embedding-ada-002 and amazon.titan-embed-text-v1, dimension is 1536; for multilingual-e5-large, dimension is 1024. |
| metric | Metric | Type of metric used in the vector index when querying. |
| metadataCols | Metadata Columns | Metadata Columns |
| uniqueIdCols | Unique ID Columns | Unique ID Columns |

Details

Save to Pinecone Node Details

The Save to Pinecone node stores document embeddings in a Pinecone vector database, enabling efficient similarity search and retrieval. It takes a DataFrame as input, containing text content and corresponding embeddings, and saves them to a specified Pinecone index. This node is designed for PySpark-based workflows, making it ideal for integrating vector-based storage and search into data pipelines.

General:

Select Content Column: Specifies the DataFrame column containing the text content to be stored in the Pinecone vector database. This is required to associate text with the embeddings.

Select Embeddings Column: Specifies the DataFrame column containing the embeddings (vector representations) of the text content. This is required for storage in the vector database.

Select Pinecone Connection: Specifies the connection details for the Pinecone API (e.g., API key, environment). This is required to authenticate and access the Pinecone service.

Index Name: Specifies the name of the Pinecone index where the embeddings will be stored. Default is ‘document-index’. If the index does not exist, it will be created.

Index Namespace: Specifies the namespace within the Pinecone index to organize the stored embeddings. Default is ‘document-namespace’. This is optional and used to group related vectors.

Dimension: Specifies the dimensionality of the embeddings. Default is 1024 (suitable for models like multilingual-e5-large). For text-embedding-ada-002 and amazon.titan-embed-text-v1, the dimension is 1536. This value must match the output size of the embedding model used, because a Pinecone index only accepts vectors of its configured dimension.

Metric: Specifies the distance metric used for similarity search in the Pinecone index. Options are:

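  • cosine: Cosine similarity, which compares the direction of two vectors and ignores their magnitude; higher scores mean closer matches.

  • dotproduct: Dot product, which reflects both direction and magnitude; higher scores mean closer matches.

  • euclidean: Euclidean distance, the straight-line distance between two vectors; lower scores mean closer matches (default).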
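To make the difference between the three metrics concrete, the following minimal sketch computes each one in plain Python using the textbook definitions. The function names are illustrative and not part of the node, and the scores Pinecone reports internally may be scaled differently.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy 2-dimensional vectors (real embeddings would have 1024 or 1536 values).
a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot_product(a, b))         # 24.0
print(cosine_similarity(a, b))   # approximately 0.96
print(euclidean_distance(a, b))  # approximately 1.414
```

For embeddings normalized to unit length, cosine and dot product give the same ranking, so the choice mostly matters when vector magnitudes vary.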

Metadata Columns: Specifies one or more DataFrame columns to include as metadata for the stored embeddings. This is optional and allows additional context to be stored alongside the vectors.

Unique ID Columns: Specifies one or more DataFrame columns to use as unique identifiers for the stored embeddings. This is optional; if provided, these IDs are used to reference the vectors in the Pinecone index.
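The documentation does not say how several Unique ID Columns are combined into a single vector ID. The sketch below shows one plausible approach, joining the column values with a separator and falling back to a random UUID when no ID columns are given. The function name, the separator, and the fallback are all assumptions made for illustration, not the node's confirmed behavior.

```python
import uuid
from typing import Any, Mapping, Sequence


def build_vector_id(
    row: Mapping[str, Any],
    id_cols: Sequence[str],
    separator: str = "_",  # hypothetical separator, not taken from the node
) -> str:
    """Build a string ID for one row from the selected ID columns."""
    if not id_cols:
        # Assumed fallback: a random UUID when no ID columns are configured.
        return str(uuid.uuid4())
    return separator.join(str(row[col]) for col in id_cols)


row = {"doc_id": "doc1", "category": "Environment"}

print(build_vector_id(row, ["doc_id"]))              # doc1
print(build_vector_id(row, ["doc_id", "category"]))  # doc1_Environment
```

Whatever scheme is used, the resulting IDs should be stable across runs, since writing a vector under an existing ID overwrites the earlier entry in Pinecone.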

Output:

The node saves the embeddings, content, and optional metadata to the specified Pinecone index and namespace. It does not produce a DataFrame output but confirms the successful storage of data in the Pinecone vector database, ready for similarity search and retrieval.

Examples

Example: Save to Pinecone Node

Input:

The input DataFrame contains the following columns:

  • doc_id: [“doc1”, “doc2”]

  • content: [“This is a report about climate change…”, “A study on AI advancements…”]

  • embeddings: [[0.12, 0.45, …], [0.23, 0.67, …]] (1024-dimensional vectors)

  • category: [“Environment”, “Technology”]

The Save to Pinecone node is configured as follows:

  • Select Content Column: content

  • Select Embeddings Column: embeddings

  • Select Pinecone Connection: Configured with a valid Pinecone API key and environment

  • Index Name: document-index

  • Index Namespace: document-namespace

  • Dimension: 1024

  • Metric: cosine

  • Metadata Columns: [category]

  • Unique ID Columns: [doc_id]

Output:

The node stores the embeddings in the Pinecone vector database under the ‘document-index’ index and ‘document-namespace’ namespace. The stored data includes:

  • Vectors from the embeddings column (1024-dimensional).

  • Corresponding text from the content column.

  • Metadata from the category column (e.g., “Environment”, “Technology”).

  • Unique IDs from the doc_id column (e.g., “doc1”, “doc2”).

The Pinecone index is now ready for similarity search queries using the cosine metric.
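As a rough illustration of what gets stored, the sketch below converts the example rows into the record shape that Pinecone upserts accept (a dict with id, values, and metadata). Rows are modeled as plain dictionaries rather than a Spark DataFrame so the snippet runs on its own, and the vectors are shortened to two values as stand-ins for the 1024-dimensional embeddings. The helper name to_pinecone_record and the choice to store the content text inside the metadata are assumptions, not the node's confirmed implementation.

```python
from typing import Any, Dict, List, Mapping, Sequence

# Toy stand-ins for the example DataFrame rows (vectors shortened to 2 values).
rows: List[Dict[str, Any]] = [
    {"doc_id": "doc1", "content": "This is a report about climate change...",
     "embeddings": [0.12, 0.45], "category": "Environment"},
    {"doc_id": "doc2", "content": "A study on AI advancements...",
     "embeddings": [0.23, 0.67], "category": "Technology"},
]


def to_pinecone_record(
    row: Mapping[str, Any],
    content_col: str,
    embeddings_col: str,
    id_cols: Sequence[str],
    metadata_cols: Sequence[str],
) -> Dict[str, Any]:
    """Convert one row into an id / values / metadata record."""
    metadata = {col: row[col] for col in metadata_cols}
    # Assumption: the text content is kept alongside the other metadata.
    metadata[content_col] = row[content_col]
    return {
        "id": "_".join(str(row[col]) for col in id_cols),
        "values": list(row[embeddings_col]),
        "metadata": metadata,
    }


records = [
    to_pinecone_record(r, "content", "embeddings", ["doc_id"], ["category"])
    for r in rows
]

for record in records:
    print(record["id"], record["values"], record["metadata"]["category"])

# With the Pinecone Python client, records like these would typically be
# written with something along the lines of:
#   index.upsert(vectors=records, namespace="document-namespace")
```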

Explanation:

  • The node processes the DataFrame, extracting the content and embeddings columns to store in the Pinecone vector database.

  • The doc_id column supplies the unique identifier for each vector, so individual vectors can be retrieved or updated by ID.

  • The category column is stored as metadata, providing additional context for each vector (e.g., “Environment” for doc1).

  • The Index Name (‘document-index’) and Index Namespace (‘document-namespace’) organize the stored vectors in the Pinecone database.

  • The Dimension is set to 1024 to match the embedding model (e.g., multilingual-e5-large), and the Metric is set to cosine for similarity searches.

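  • If Metadata Columns were left empty, no metadata would be stored. If Unique ID Columns were left empty, no custom IDs would be used and the vectors would be stored under generated default IDs instead, since Pinecone requires an ID for every vector it stores.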
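Because the Dimension setting has to match the embedding model, it can help to check vector lengths before running the node. The following is a minimal sketch of such a pre-check on plain Python lists; the function name find_dimension_mismatches is hypothetical and not part of the node.

```python
from typing import List, Sequence


def find_dimension_mismatches(
    embeddings: Sequence[Sequence[float]], dimension: int
) -> List[int]:
    """Return the positions of embeddings whose length differs from `dimension`."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != dimension]


# Toy data: the first vector has 4 values, the second only 3.
vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]

bad_rows = find_dimension_mismatches(vectors, dimension=4)
print(bad_rows)  # [1]
```

An empty result means every vector has the configured length; any listed positions point to rows that would not fit an index of that dimension.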