Save Faiss DB
=============

Saves vector embeddings to a FAISS database.

Input
--------------
This node takes a DataFrame as input.

Type
--------- 

pyspark

Class
--------- 

fire.nodes.gai.NodeSaveToFaissDB

Fields
--------- 

.. list-table::
      :widths: 10 5 10
      :header-rows: 1

      * - Name
        - Title
        - Description
      * - fileNameCol
        - File Name Column
        - Column name for file names.
      * - pageNumberCol
        - Page Number Column
        - Column name for page numbers.
      * - contentCol
        - Content Column
        - Column name for text content.
      * - directoryPathCol
        - Directory Path Column
        - Column name for directory paths.
      * - embeddingsCol
        - Embeddings Column
        - Column name for embeddings.
      * - base64ImageCol
        - Base64 Image Column
        - Column name for Base64 encoded images.
      * - faissIndexName
        - FAISS Index Name
        - Name of the FAISS index.
      * - faissIndexDir
        - FAISS Index Directory Path
        - Directory path for the FAISS index.
      * - dimension
        - Dimension
        - Dimension for embeddings.


Details
-------
Save Faiss DB Node Details
++++++++++++++++++++++++++

The Save Faiss DB node stores vector embeddings and associated metadata from a DataFrame into a FAISS vector database, enabling efficient similarity search and retrieval. It is designed for PySpark-based workflows, making it suitable for integrating vector-based storage into data pipelines. The node allows specification of columns for text content, embeddings, and metadata, which are saved to a FAISS index for later querying.


General:
+++++++++++++++


File Name Column: Specifies the DataFrame column containing file names associated with the content. Default is 'fileName'. This is optional and used for metadata tracking.


Page Number Column: Specifies the DataFrame column containing page numbers for multi-page documents. Default is 'pageNumber'. This is optional and relevant for documents like PDFs.


Content Column: Specifies the DataFrame column containing the text content to be stored. Default is 'content'. This is required to associate text with the embeddings.


Directory Path Column: Specifies the DataFrame column containing directory paths for the source files. Default is 'directoryPath'. This is optional and used for metadata tracking.


Embeddings Column: Specifies the DataFrame column containing the embeddings (vector representations) of the text content. Default is 'embeddings'. This is required for storage in the FAISS database.


Base64 Image Column: Specifies the DataFrame column containing base64-encoded images associated with the content. Default is 'base64Image'. This is optional and used for storing image-related metadata.


FAISS Index Name: Specifies the name of the FAISS index where the embeddings will be stored. Default is 'faiss_index'. This is required to identify the index.


FAISS Index Directory Path: Specifies the directory path (local or distributed filesystem) where the FAISS index will be saved. This is required to store the index.


Dimension: Specifies the dimensionality of the embeddings. Default is 0, which must be overridden to match the embedding model (e.g., 1536 for text-embedding-ada-002, 1024 for multilingual-e5-large). This is required for proper index creation.
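
The required and optional settings above can be checked before a run. The following is a minimal sketch, not the node's own validation logic: the field names and defaults come from the table and descriptions above, while ``DEFAULTS``, ``REQUIRED``, and ``validate_config`` are hypothetical names introduced for illustration.

```python
# Hypothetical pre-flight check for a Save Faiss DB configuration.
# Field names and defaults follow the documentation above; the helper itself
# is an illustration and not part of the node's implementation.

DEFAULTS = {
    "fileNameCol": "fileName",
    "pageNumberCol": "pageNumber",
    "contentCol": "content",
    "directoryPathCol": "directoryPath",
    "embeddingsCol": "embeddings",
    "base64ImageCol": "base64Image",
    "faissIndexName": "faiss_index",
    "faissIndexDir": "",  # no default is documented; must be supplied
    "dimension": 0,       # documented default; must be overridden
}

REQUIRED = ("contentCol", "embeddingsCol", "faissIndexName", "faissIndexDir")


def validate_config(user_config):
    """Merge user settings over the defaults; return (config, problems)."""
    config = {**DEFAULTS, **user_config}
    problems = []
    for field in REQUIRED:
        if not str(config[field]).strip():
            problems.append(f"{field} is required")
    dim = config["dimension"]
    if not isinstance(dim, int) or dim <= 0:
        problems.append("dimension must be a positive integer matching the embedding model")
    return config, problems


config, problems = validate_config(
    {"faissIndexDir": "/data/faiss_indices/", "dimension": 1024}
)
print(problems)  # an empty list means the settings pass these checks
```

Leaving Dimension at its default of 0, or omitting the index directory, would each add an entry to ``problems``.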


Output:
+++++++++++++++

The node saves the embeddings, content, and optional metadata (file names, page numbers, directory paths, base64 images) to the specified FAISS index in the designated directory. It does not produce a DataFrame output but confirms the successful storage of data in the FAISS vector database, ready for similarity search and retrieval.


Examples
--------
Example: Save Faiss DB Node
+++++++++++++++++++++++++++


Input:
+++++++++++++++

The input DataFrame contains the following columns and values:


* fileName: ["report.pdf", "study.docx"]
* pageNumber: [1, 1]
* content: ["Climate change impacts ecosystems...", "AI advancements in 2025..."]
* directoryPath: ["/data/docs/", "/data/docs/"]
* embeddings: [[0.12, 0.45, ...], [0.23, 0.67, ...]] (1024-dimensional vectors)
* base64Image: ["iVBORw0KGgoAAAANSUhEUg...", null]


The Save Faiss DB node is configured as follows:


* File Name Column: fileName
* Page Number Column: pageNumber
* Content Column: content
* Directory Path Column: directoryPath
* Embeddings Column: embeddings
* Base64 Image Column: base64Image
* FAISS Index Name: faiss_index
* FAISS Index Directory Path: /data/faiss_indices/
* Dimension: 1024
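
Because the Dimension setting must match the length of every vector in the Embeddings Column, it can help to check the data against the configuration before saving. The sketch below uses plain Python dictionaries in place of DataFrame rows. The 1024-element vectors are placeholder values (the real values are elided in the example input), and ``find_dimension_mismatches`` is a hypothetical helper, not part of the node.

```python
def find_dimension_mismatches(rows, embeddings_col, dimension):
    """Return the positions of rows whose embedding length differs from dimension."""
    return [
        i for i, row in enumerate(rows)
        if row.get(embeddings_col) is None or len(row[embeddings_col]) != dimension
    ]


# Placeholder rows mirroring the example input; the vector values are made up.
rows = [
    {"fileName": "report.pdf", "pageNumber": 1, "embeddings": [0.0] * 1024},
    {"fileName": "study.docx", "pageNumber": 1, "embeddings": [0.0] * 1024},
]

print(find_dimension_mismatches(rows, "embeddings", 1024))  # [] : every row matches
print(find_dimension_mismatches(rows, "embeddings", 1536))  # [0, 1] : no row matches
```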


Output:
+++++++++++++++


The node stores the embeddings and associated data in the FAISS vector database under the index named 'faiss_index' in the directory '/data/faiss_indices/'. The stored data includes:


* Vectors from the embeddings column (1024-dimensional).
* Text from the content column.
* Metadata from the fileName, pageNumber, directoryPath, and base64Image columns.


The FAISS index is now ready for similarity search queries.
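
To make "similarity search" concrete, the sketch below does by brute force what an exact (flat) L2 index computes: it ranks stored vectors by squared Euclidean distance to a query vector. It is plain Python with tiny made-up 3-dimensional vectors and assumes an L2 metric; the metric and index type this node actually uses are not stated here, and ``nearest`` is a hypothetical helper.

```python
def nearest(query, vectors, k=1):
    """Return (position, squared L2 distance) for the k vectors closest to query."""
    scored = [
        (i, sum((q - v) ** 2 for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    # Smallest distance first; keep the top k.
    return sorted(scored, key=lambda pair: pair[1])[:k]


stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # made-up vectors
print(nearest([2.0, 1.0, 0.0], stored, k=2))  # [(0, 2.0), (1, 4.0)]
```

The returned positions would then be used to look up the content and metadata stored for each matching vector.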


Explanation:
+++++++++++++++


* The node processes the DataFrame, extracting the embeddings and content columns to store in the FAISS index named 'faiss_index'.
* The fileName, pageNumber, directoryPath, and base64Image columns are stored as metadata, providing additional context for each vector (e.g., fileName: "report.pdf", base64Image: base64-encoded image for the first row, null for the second).
* The Dimension is set to 1024 to match the embedding model (e.g., multilingual-e5-large).
* The FAISS Index Directory Path ('/data/faiss_indices/') specifies where the index is saved.
* If optional columns like Base64 Image Column or Directory Path Column were not provided, they would not be included as metadata in the FAISS index.
* The node ensures the FAISS index is created or updated with the provided data, making it available for efficient similarity searches.
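
The split between vectors, content, and optional metadata described above can be sketched as follows. This is an illustration in plain Python, not the node's implementation: ``split_row`` and the shape of the returned record are hypothetical, and how the node lays metadata out on disk is not specified here.

```python
def split_row(row, embeddings_col, content_col, metadata_cols):
    """Separate one row into its vector and a record of content plus metadata.

    metadata_cols lists only the optional columns that were configured;
    columns left out of it do not appear in the record.
    """
    vector = row[embeddings_col]
    record = {"content": row[content_col]}
    for col in metadata_cols:
        record[col] = row.get(col)  # a missing or null value stays None
    return vector, record


row = {
    "fileName": "study.docx",
    "pageNumber": 1,
    "content": "AI advancements in 2025...",
    "directoryPath": "/data/docs/",
    "embeddings": [0.0, 0.0, 0.0],  # placeholder; real vectors are 1024-dimensional
    "base64Image": None,
}

# Only fileName and pageNumber are configured as metadata in this call.
vector, record = split_row(row, "embeddings", "content", ["fileName", "pageNumber"])
print(record)
```

Here ``record`` holds the content, file name, and page number only; the directory path and base64 image are absent because their columns were not passed in.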