Save Faiss DB

Saves vector embeddings to a FAISS database.

Input

Takes a DataFrame as input.

Type

pyspark

Class

fire.nodes.gai.NodeSaveToFaissDB

Fields

Name             | Title                      | Description
fileNameCol      | File Name Column           | Column name for file names.
pageNumberCol    | Page Number Column         | Column name for page numbers.
contentCol       | Content Column             | Column name for text content.
directoryPathCol | Directory Path Column      | Column name for directory paths.
embeddingsCol    | Embeddings Column          | Column name for embeddings.
base64ImageCol   | Base64 Image Column        | Column name for Base64 encoded images.
faissIndexName   | FAISS Index Name           | Name of the FAISS index.
faissIndexDir    | FAISS Index Directory Path | Directory path for the FAISS index.
dimension        | Dimension                  | Dimension for embeddings.

Details

Save Faiss DB Node Details

The Save Faiss DB node writes vector embeddings and their associated metadata from a DataFrame into a FAISS vector database for efficient similarity search and retrieval. It is designed for PySpark-based workflows, so vector storage can be added as a step in an existing data pipeline. You specify which columns hold the text content, the embeddings, and the metadata; the node saves them to a FAISS index for later querying.

General:

File Name Column: Specifies the DataFrame column containing file names associated with the content. Default is ‘fileName’. This is optional and used for metadata tracking.

Page Number Column: Specifies the DataFrame column containing page numbers for multi-page documents. Default is ‘pageNumber’. This is optional and relevant for documents like PDFs.

Content Column: Specifies the DataFrame column containing the text content to be stored. Default is ‘content’. This is required to associate text with the embeddings.

Directory Path Column: Specifies the DataFrame column containing directory paths for the source files. Default is ‘directoryPath’. This is optional and used for metadata tracking.

Embeddings Column: Specifies the DataFrame column containing the embeddings (vector representations) of the text content. Default is ‘embeddings’. This is required for storage in the FAISS database.

Base64 Image Column: Specifies the DataFrame column containing base64-encoded images associated with the content. Default is ‘base64Image’. This is optional and used for storing image-related metadata.

FAISS Index Name: Specifies the name of the FAISS index where the embeddings will be stored. Default is ‘faiss_index’. This is required to identify the index.

FAISS Index Directory Path: Specifies the directory path (local or distributed filesystem) where the FAISS index will be saved. This is required to store the index.

Dimension: Specifies the dimensionality of the embeddings. Default is 0, which must be overridden to match the embedding model (e.g., 1536 for text-embedding-ada-002, 1024 for multilingual-e5-large). This is required for proper index creation.
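Because the Dimension default of 0 is never valid for a real index, it can help to check the embeddings before the node runs. The following is a minimal sketch in plain Python and is not part of the node: the function name `check_embedding_dimension` is hypothetical, and the rows are assumed to have already been collected from the DataFrame as dictionaries.

```python
from collections.abc import Iterable, Mapping, Sequence


def check_embedding_dimension(
    rows: Iterable[Mapping[str, object]],
    embeddings_col: str = "embeddings",
    dimension: int = 0,
) -> int:
    """Return the number of rows checked; raise ValueError on a mismatch.

    Hypothetical helper for illustration, not part of the node.
    """
    if dimension <= 0:
        raise ValueError("Dimension must be set to the embedding model's output size")
    checked = 0
    for i, row in enumerate(rows):
        vector = row.get(embeddings_col)
        if not isinstance(vector, Sequence) or len(vector) != dimension:
            raise ValueError(f"row {i}: expected a vector of length {dimension}")
        checked += 1
    return checked


# Two toy rows with 4-dimensional vectors; real models produce e.g. 1024 or 1536 values.
rows = [
    {"content": "first", "embeddings": [0.1, 0.2, 0.3, 0.4]},
    {"content": "second", "embeddings": [0.5, 0.6, 0.7, 0.8]},
]
checked = check_embedding_dimension(rows, dimension=4)
```

Leaving `dimension` at 0 raises an error immediately, which mirrors the requirement that the default be overridden.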

Output:

The node saves the embeddings, content, and optional metadata (file names, page numbers, directory paths, base64 images) to the specified FAISS index in the designated directory. It does not produce a DataFrame output; instead, it confirms that the data was stored in the FAISS vector database, where it is ready for similarity search and retrieval.

Examples

Example: Save Faiss DB Node

Input:

The input DataFrame contains the following columns:

  • fileName: [“report.pdf”, “study.docx”]

  • pageNumber: [1, 1]

  • content: [“Climate change impacts ecosystems…”, “AI advancements in 2025…”]

  • directoryPath: [“/data/docs/”, “/data/docs/”]

  • embeddings: [[0.12, 0.45, …], [0.23, 0.67, …]] (1024-dimensional vectors)

  • base64Image: [“iVBORw0KGgoAAAANSUhEUg…”, null]

The Save Faiss DB node is configured as follows:

  • File Name Column: fileName

  • Page Number Column: pageNumber

  • Content Column: content

  • Directory Path Column: directoryPath

  • Embeddings Column: embeddings

  • Base64 Image Column: base64Image

  • FAISS Index Name: faiss_index

  • FAISS Index Directory Path: /data/faiss_indices/

  • Dimension: 1024
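For reference, the same configuration can be written as key-value pairs using the field names from the Fields table. This is an illustrative sketch only: the source does not show how the node serializes its settings, so the dictionary layout and the `missing_required` helper are assumptions.

```python
# Keys are the field names from the Fields table; the dict layout is an assumption.
node_config = {
    "fileNameCol": "fileName",
    "pageNumberCol": "pageNumber",
    "contentCol": "content",
    "directoryPathCol": "directoryPath",
    "embeddingsCol": "embeddings",
    "base64ImageCol": "base64Image",
    "faissIndexName": "faiss_index",
    "faissIndexDir": "/data/faiss_indices/",
    "dimension": 1024,
}

# Fields that the Details section describes as required.
REQUIRED_FIELDS = (
    "contentCol",
    "embeddingsCol",
    "faissIndexName",
    "faissIndexDir",
    "dimension",
)


def missing_required(config):
    """Return required fields that are absent, empty, or (for dimension) left at 0."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


missing = missing_required(node_config)
```

With the values above, `missing` is an empty list. Resetting `dimension` to its default of 0 would cause it to be reported.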

Output:

The node stores the embeddings and associated data in the FAISS vector database under the index named ‘faiss_index’ in the directory ‘/data/faiss_indices/’. The stored data includes:

  • Vectors from the embeddings column (1024-dimensional).

  • Text from the content column.

  • Metadata from the fileName, pageNumber, directoryPath, and base64Image columns.

The FAISS index is now ready for similarity search queries.

Explanation:

  • The node processes the DataFrame, extracting the embeddings and content columns to store in the FAISS index named ‘faiss_index’.

  • The fileName, pageNumber, directoryPath, and base64Image columns are stored as metadata, providing additional context for each vector (e.g., fileName: “report.pdf”, base64Image: base64-encoded image for the first row, null for the second).

  • The Dimension is set to 1024 to match the embedding model (e.g., multilingual-e5-large).

  • The FAISS Index Directory Path (‘/data/faiss_indices/’) specifies where the index is saved.

  • If optional columns like Base64 Image Column or Directory Path Column were not provided, they would not be included as metadata in the FAISS index.

  • The node ensures the FAISS index is created or updated with the provided data, making it available for efficient similarity searches.
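To make the stored structure concrete, the sketch below mimics it in plain Python. A FAISS index itself holds vectors (optionally with integer ids), so text and metadata are typically kept in a side structure keyed by position; how this node persists them is not described here. The code is a stand-in for illustration, not the node's implementation and not the FAISS API: it answers a query by brute-force squared-L2 comparison, the helper names `build_records` and `nearest` are hypothetical, and the vectors are cut down to the two values shown in the example input.

```python
from math import inf

# Rows mirroring the example input; vectors truncated to the two values shown.
rows = [
    {"fileName": "report.pdf", "pageNumber": 1,
     "content": "Climate change impacts ecosystems…", "embeddings": [0.12, 0.45]},
    {"fileName": "study.docx", "pageNumber": 1,
     "content": "AI advancements in 2025…", "embeddings": [0.23, 0.67]},
]


def build_records(rows, embeddings_col, metadata_cols):
    """Split rows into a vector list and a parallel list of metadata dicts."""
    vectors, metadata = [], []
    for row in rows:
        vectors.append(list(row[embeddings_col]))
        # Only configured columns are kept; unset optional columns are left out.
        metadata.append({col: row.get(col) for col in metadata_cols})
    return vectors, metadata


def nearest(query, vectors):
    """Return the position of the vector with the smallest squared L2 distance."""
    best_pos, best_dist = -1, inf
    for pos, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


# directoryPath and base64Image are not configured here, so they are not stored.
vectors, metadata = build_records(rows, "embeddings", ["fileName", "pageNumber", "content"])
hit = nearest([0.2, 0.7], vectors)
match = metadata[hit]
```

Here the query vector is closest to the second row, so `match` holds the metadata for study.docx, and the unconfigured optional columns do not appear in it.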