Audio Diarization

This node processes audio files to transcribe and diarize speech, identifying different speakers in the audio. It generates a structured output as a DataFrame with two columns: speaker and dialogue.

Input

Takes a directory or a file path as input.

Output

Outputs a DataFrame with two columns: speaker and dialogue.

Type

pyspark

Class

fire.nodes.gai.NodeSpeechToText

Fields

Name           | Title                  | Description
---------------|------------------------|--------------------------------------
audioFilePath  | Directory Or File Path | Select a Pdf/Text/Docx File or Directory
numSpeakers    | Number of Speakers     | Provide the number of speakers expected in the conversation.
diarization    | Diarization            | Diarise the transcription.
saveOutputPath | Output Save Path       | Specify the file path to save the transcription output as a .txt file.
context        | Additional Context     | Add any relevant context or details about the conversation to help improve the diarization.
openai         | OpenAI                 |
llmConnection  | Select Connection      | Select Connection
openaiModel    | OpenAI Model           | OpenAI Model to be Used

Details

Audio Diarization Node Details

The Audio Diarization node processes audio files to transcribe speech and, optionally, diarize it by identifying different speakers. It uses OpenAI’s Whisper model (or other specified models) to generate a structured DataFrame output with two columns: speaker and dialogue. This node is ideal for extracting and organizing spoken content from audio files in PySpark-based data pipelines.

General:

Directory Or File Path: Specifies the path to a single audio file or a directory containing multiple audio files. This field is required and must be accessible to the PySpark engine.

Number of Speakers: Specifies the expected number of speakers in the audio. Default is 1. If set to 1, diarization is not applied, and all dialogue is attributed to a single speaker. Must be an integer.

Diarization: Controls whether speaker diarization is performed. Options are:

  • true: Enables diarization to identify and label different speakers in the audio.

  • false: Disables diarization, treating all dialogue as coming from a single speaker (default).

Output Save Path: Specifies the file path to save the transcription output as a .txt file. This is optional; if provided, the transcribed text is saved to the specified location.

Additional Context: Allows users to provide additional context or details about the conversation (e.g., speaker names, accents, or topics) to improve transcription and diarization accuracy. This is optional.
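The General settings above can be sketched as a small Python structure. This is only an illustration, not the node's implementation: the class DiarizationSettings, the helper resolve_audio_files, the list of audio extensions, and the "at least 1" check on the speaker count are all assumptions made for the example. The field names and defaults follow the descriptions above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

# Hypothetical extension list; the formats the node accepts are not documented here.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


@dataclass
class DiarizationSettings:
    """Illustrative mirror of the General fields (not the node's own class)."""
    audioFilePath: str
    numSpeakers: int = 1              # documented default
    diarization: bool = False         # documented default
    saveOutputPath: Optional[str] = None
    context: Optional[str] = None

    def validate(self) -> None:
        if not self.audioFilePath:
            raise ValueError("audioFilePath is required")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.numSpeakers, bool) or not isinstance(self.numSpeakers, int):
            raise ValueError("numSpeakers must be an integer")
        # Assumption: a speaker count below 1 is treated as invalid.
        if self.numSpeakers < 1:
            raise ValueError("numSpeakers must be at least 1")

    def diarization_applies(self) -> bool:
        # Diarization is skipped when disabled or when only one speaker is expected.
        return self.diarization and self.numSpeakers > 1


def resolve_audio_files(path: str) -> List[Path]:
    """Return the file itself, or the audio files directly inside a directory."""
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        return sorted(
            f for f in p.iterdir()
            if f.is_file() and f.suffix.lower() in AUDIO_EXTENSIONS
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```

With this sketch, a configuration with diarization set to true but numSpeakers left at 1 reports diarization_applies() as False, matching the behavior described for Number of Speakers.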

OpenAI Configuration:

Select Connection: Specifies the connection details for the OpenAI API (e.g., API key). This is required to authenticate and access the OpenAI model.

OpenAI Model: Specifies the OpenAI model to use for transcription. Default is ‘whisper-1’. Other compatible models can be specified if supported by the OpenAI API.
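For orientation, a transcription request with the OpenAI Python SDK might look like the sketch below. How the node itself uses the connection is not documented here, so treat this as an assumption-laden illustration: transcribe_file is a hypothetical helper, client is assumed to be an openai.OpenAI instance created from the selected connection's API key, and passing Additional Context as the prompt argument is a guess at one plausible wiring.

```python
from typing import Any, Optional


def transcribe_file(client: Any, audio_path: str,
                    model: str = "whisper-1",
                    context: Optional[str] = None) -> str:
    """Transcribe one audio file and return the plain text.

    `client` is assumed to be an `openai.OpenAI` instance.
    """
    kwargs = {"model": model}
    if context:
        # Assumption: Additional Context is forwarded as the transcription prompt.
        kwargs["prompt"] = context
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return result.text
```

The client is passed in rather than created inside the function, so the API key stays with the connection and the helper can be exercised with a stand-in client.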

Output:

The node outputs a DataFrame with the following columns:

  • speaker: The identified speaker label (e.g., Speaker_1, Speaker_2, or ‘Default’ if diarization is disabled).

  • dialogue: The transcribed text corresponding to the speaker’s speech.

If the Output Save Path is specified, the transcription is also saved as a .txt file at the provided location.
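The output shape can be illustrated with a short sketch. The helpers build_rows and to_dataframe are hypothetical, and the (speaker_index, text) segment format is an assumption about what a diarization step might produce. Keeping one row per segment when diarization is off is also an assumption; the node may instead emit a single row.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, str]  # (speaker, dialogue)


def build_rows(segments: Iterable[Tuple[Optional[int], str]],
               diarization: bool, num_speakers: int) -> List[Row]:
    """Turn (speaker_index, text) segments into (speaker, dialogue) rows."""
    # Labels are only assigned when diarization is on and more than one speaker is expected.
    diarize = diarization and num_speakers > 1
    rows: List[Row] = []
    for index, text in segments:
        if diarize and index is not None:
            speaker = f"Speaker_{index}"
        else:
            speaker = "Default"
        rows.append((speaker, text))
    return rows


def to_dataframe(spark, rows: List[Row]):
    """Wrap the rows in a two-column PySpark DataFrame (needs an active SparkSession)."""
    return spark.createDataFrame(rows, ["speaker", "dialogue"])
```

Only to_dataframe depends on PySpark; build_rows is plain Python and can be checked without a Spark session.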

Examples

Example: Audio Diarization Node

Input:

A directory /data/audio/ contains the following file:

  • meeting_recording.wav (a 5-minute audio file with two speakers discussing a project)

The Audio Diarization node is configured as follows:

  • Directory Or File Path: /data/audio/meeting_recording.wav

  • Number of Speakers: 2

  • Diarization: true

  • Output Save Path: /data/output/transcription.txt

  • Additional Context: “Conversation between a project manager and a developer discussing project milestones.”

  • Select Connection: Configured with a valid OpenAI API key

  • OpenAI Model: whisper-1

Output:

The node processes the audio file and produces a DataFrame with the following structure:

speaker     | dialogue
------------|--------------------------------------
Speaker_1   | Let's discuss the project timeline...
Speaker_2   | Sure, we need to finalize the milestones...
Speaker_1   | I think we should prioritize the testing phase...
Speaker_2   | Agreed, but we need more resources for that...

The transcription is also saved as /data/output/transcription.txt with the content:

  • Speaker_1: Let’s discuss the project timeline…

  • Speaker_2: Sure, we need to finalize the milestones…

  • Speaker_1: I think we should prioritize the testing phase…

  • Speaker_2: Agreed, but we need more resources for that…
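A file with this content could be produced from the (speaker, dialogue) rows roughly as follows. This is a minimal sketch: format_transcript and save_transcript are hypothetical helpers, and the one-line-per-row "speaker: dialogue" layout and the creation of missing parent directories are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import Iterable, Tuple


def format_transcript(rows: Iterable[Tuple[str, str]]) -> str:
    """Render rows as one 'speaker: dialogue' line each."""
    return "\n".join(f"{speaker}: {dialogue}" for speaker, dialogue in rows)


def save_transcript(rows: Iterable[Tuple[str, str]], output_path: str) -> None:
    """Write the rendered transcript to a UTF-8 text file."""
    path = Path(output_path)
    # Assumption: missing parent directories are created rather than reported as an error.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(format_transcript(rows) + "\n", encoding="utf-8")
```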

Explanation:

  • The meeting_recording.wav file is processed using the OpenAI whisper-1 model.

  • With Diarization set to true and Number of Speakers set to 2, the node identifies two distinct speakers and labels them as Speaker_1 and Speaker_2.

  • The dialogue column contains the transcribed text for each speaker’s segment.

  • The Additional Context (“Conversation between a project manager and a developer…”) helps improve the accuracy of transcription and diarization by providing relevant information about the conversation.

  • The transcription is saved as a .txt file in /data/output/transcription.txt as specified in the Output Save Path.

  • If Diarization were set to false or Number of Speakers were set to 1, all dialogue would be attributed to a single speaker labeled ‘Default’.