Save to Pinecone¶
This node stores document embeddings in a Pinecone vector database for efficient similarity search and retrieval.
Input¶
It takes a DataFrame as input.
Output¶
Saves the data to the Pinecone vector database.
Type¶
pyspark
Class¶
fire.nodes.gai.NodeSaveToPineconeDB
Fields¶
| Name | Title | Description |
|---|---|---|
| contentCol | Select Content Column | Select Content Column |
| embeddingsCol | Select Embeddings Column | Select Embeddings Column |
| pineconeConnection | Select Pinecone Connection | Select Pinecone Connection |
| pineconeIndexName | Index Name | The name of the Pinecone index to create. |
| indexNameSpace | Index Namespace | Pinecone Index Namespace |
| dimension | Dimension | For text-embedding-ada-002 and amazon.titan-embed-text-v1, dimension is 1536; for multilingual-e5-large, dimension is 1024. |
| metric | Metric | Type of metric used in the vector index when querying. |
| metadataCols | Metadata Columns | Metadata Columns |
| uniqueIdCols | Unique ID Columns | Unique ID Columns |
Details¶
Save to Pinecone Node Details¶
The Save to Pinecone node stores document embeddings in a Pinecone vector database, enabling efficient similarity search and retrieval. It takes a DataFrame as input, containing text content and corresponding embeddings, and saves them to a specified Pinecone index. This node is designed for PySpark-based workflows, making it ideal for integrating vector-based storage and search into data pipelines.
General:¶
Select Content Column: Specifies the DataFrame column containing the text content to be stored in the Pinecone vector database. This is required to associate text with the embeddings.
Select Embeddings Column: Specifies the DataFrame column containing the embeddings (vector representations) of the text content. This is required for storage in the vector database.
Select Pinecone Connection: Specifies the connection details for the Pinecone API (e.g., API key, environment). This is required to authenticate and access the Pinecone service.
Index Name: Specifies the name of the Pinecone index where the embeddings will be stored. Default is ‘document-index’. If the index does not exist, it will be created.
Index Namespace: Specifies the namespace within the Pinecone index to organize the stored embeddings. Default is ‘document-namespace’. This is optional and used to group related vectors.
Dimension: Specifies the dimensionality of the embeddings. Default is 1024 (suitable for models like multilingual-e5-large). For other models, such as text-embedding-ada-002 or amazon.titan-embed-text-v1, the dimension is typically 1536. This must match the embedding model used.
Metric: Specifies the distance metric used for similarity search in the Pinecone index. Options are:
cosine: Cosine similarity metric.
dotproduct: Dot product metric.
euclidean: Euclidean distance metric (default).
Metadata Columns: Specifies one or more DataFrame columns to include as metadata for the stored embeddings. This is optional and allows additional context to be stored alongside the vectors.
Unique ID Columns: Specifies one or more DataFrame columns to use as unique identifiers for the stored embeddings. This is optional; if provided, these IDs are used to reference the vectors in the Pinecone index.
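The three Metric options correspond to standard vector comparisons. The following is a minimal sketch of the underlying math in plain Python, not the node's or Pinecone's implementation; the function names and sample vectors are illustrative assumptions.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalised by vector lengths; ignores magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 2-dimensional vectors, kept tiny for readability.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as query, twice the length
orthogonal = [0.0, 1.0]      # unrelated direction

print(cosine_similarity(query, same_direction))   # 1.0
print(dot_product(query, same_direction))         # 2.0
print(euclidean_distance(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))       # 0.0
```

Cosine treats `same_direction` as a perfect match because it compares direction only, while dot product and Euclidean distance are both affected by vector length. This is why the metric should be chosen to suit how the embedding model's vectors are meant to be compared.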
Output:¶
The node saves the embeddings, content, and optional metadata to the specified Pinecone index and namespace. It does not produce a DataFrame output but confirms the successful storage of data in the Pinecone vector database, ready for similarity search and retrieval.
Examples¶
Example: Save to Pinecone Node¶
Input:¶
A DataFrame contains the following data:
doc_id: [“doc1”, “doc2”]
content: [“This is a report about climate change…”, “A study on AI advancements…”]
embeddings: [[0.12, 0.45, …], [0.23, 0.67, …]] (1024-dimensional vectors)
category: [“Environment”, “Technology”]
The Save to Pinecone node is configured as follows:
Select Content Column: content
Select Embeddings Column: embeddings
Select Pinecone Connection: Configured with a valid Pinecone API key and environment
Index Name: document-index
Index Namespace: document-namespace
Dimension: 1024
Metric: cosine
Metadata Columns: [category]
Unique ID Columns: [doc_id]
Output:¶
The node stores the embeddings in the Pinecone vector database under the ‘document-index’ index and ‘document-namespace’ namespace. The stored data includes:
Vectors from the embeddings column (1024-dimensional).
Corresponding text from the content column.
Metadata from the category column (e.g., “Environment”, “Technology”).
Unique IDs from the doc_id column (e.g., “doc1”, “doc2”).
The Pinecone index is now ready for similarity search queries using the cosine metric.
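To make the stored data concrete, the sketch below turns rows like those in the example into records in the `id` / `values` / `metadata` shape that Pinecone upserts accept. It illustrates the mapping only and is not the node's actual code. The `build_records` helper, the choice to keep the text under a metadata key named after the content column, and joining multiple ID columns with `_` are all assumptions. The vectors are shortened to two values for readability; real ones would have 1024.

```python
def build_records(rows, content_col, embeddings_col, metadata_cols, unique_id_cols):
    """Map DataFrame-like rows (dicts) to Pinecone-style vector records."""
    records = []
    for row in rows:
        # Assumption: the text content is stored alongside the vector as metadata.
        metadata = {content_col: row[content_col]}
        for col in metadata_cols:
            metadata[col] = row[col]
        # Assumption: several ID columns are joined with "_" into one string ID.
        record_id = "_".join(str(row[col]) for col in unique_id_cols)
        records.append(
            {"id": record_id, "values": list(row[embeddings_col]), "metadata": metadata}
        )
    return records


# Rows mirroring the example DataFrame (vectors truncated for illustration).
rows = [
    {"doc_id": "doc1", "content": "This is a report about climate change...",
     "embeddings": [0.12, 0.45], "category": "Environment"},
    {"doc_id": "doc2", "content": "A study on AI advancements...",
     "embeddings": [0.23, 0.67], "category": "Technology"},
]

records = build_records(
    rows,
    content_col="content",
    embeddings_col="embeddings",
    metadata_cols=["category"],
    unique_id_cols=["doc_id"],
)

for record in records:
    print(record["id"], record["metadata"]["category"])
```

With the Pinecone Python client, a list like `records` would typically be written with `index.upsert(vectors=records, namespace="document-namespace")`; whether the node issues exactly this call is not stated in this document.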
Explanation:¶
The node processes the DataFrame, extracting the content and embeddings columns to store in the Pinecone vector database.
The doc_id column supplies the unique identifier for each vector, allowing for easy retrieval.
The category column is stored as metadata, providing additional context for each vector (e.g., “Environment” for doc1).
The Index Name (‘document-index’) and Index Namespace (‘document-namespace’) organize the stored vectors in the Pinecone database.
The Dimension is set to 1024 to match the embedding model (e.g., multilingual-e5-large), and the Metric is set to cosine for similarity searches.
If Metadata Columns or Unique ID Columns were left empty, no metadata or custom IDs would be stored, and default IDs would be generated for the vectors.
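Two of the points above lend themselves to a quick pre-flight check: the Dimension setting must match the length of every embedding, and IDs have to come from somewhere when no Unique ID Columns are configured. The sketch below shows one way such checks could look. The helper names and the use of random UUIDs as fallback IDs are assumptions; this document does not describe how the node handles either case internally.

```python
import uuid


def check_dimension(vectors, expected_dimension):
    """Return positions of vectors whose length differs from the index dimension."""
    return [i for i, vector in enumerate(vectors) if len(vector) != expected_dimension]


def resolve_ids(rows, unique_id_cols):
    """Use the configured ID columns if any; otherwise fall back to random UUIDs."""
    if unique_id_cols:
        return ["_".join(str(row[col]) for col in unique_id_cols) for row in rows]
    return [str(uuid.uuid4()) for _ in rows]


rows = [{"doc_id": "doc1"}, {"doc_id": "doc2"}]
# The second vector has the wrong length for a 1024-dimensional index.
vectors = [[0.0] * 1024, [0.0] * 1536]

mismatched = check_dimension(vectors, expected_dimension=1024)
configured_ids = resolve_ids(rows, ["doc_id"])
fallback_ids = resolve_ids(rows, [])

print(mismatched)      # [1]
print(configured_ids)  # ['doc1', 'doc2']
```

Running the dimension check before writing makes a mismatch visible as a clear list of offending rows rather than as a rejected write to the index.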