Version: SyncExpress

Video to Audio Transcription

Author(s)

  • Bishwanath Jana

Submission Date

2024-09-13


Version History

| Version | Date | Changes | Author |
| --- | --- | --- | --- |
| 1.0 | 2024-09-13 | Initial draft | Bishwanath Jana |

Objective

The purpose of this POC is to evaluate and demonstrate the feasibility of extracting audio from video files, transcribing that audio into text, and distinguishing between different speakers or narrators. The POC also covers generating output in SRT (SubRip Subtitle) format to support captioning and time-coded transcripts, detecting whether speakers have an American accent, and adding speaker memorization (remembering speaker voices across inputs) to the multimodal pipeline using parallel processing.


Scope

  • Video Processing: Support for multiple video formats (MP4, AVI, MKV, MOV) to extract audio.
  • Audio Enhancement: Noise reduction and audio normalization to improve transcription accuracy.
  • Speech-to-Text: Convert processed audio into text using the Whisper model for accurate transcription.
  • Narrator Separation: Use the pyannote/speaker-diarization-3.1 model from Hugging Face to distinguish between different speakers.
  • Accent Detection: Detect if the speaker has an American accent using a suitable model.
  • Speaker Memorization: Remember speaker voices across multiple inputs using parallel processing within the multimodal system.
  • Output Format: Generate transcripts in SRT (SubRip Subtitle) format, including time-coding and speaker identification for each dialogue.
  • Performance Testing: Evaluate transcription accuracy across different audio qualities and file sizes.

  • Media Processing: Integration with video/audio tools like FFmpeg for extraction (a minimal extraction sketch follows this list).
  • Speech Recognition with Whisper Model: Using the Whisper model for high-accuracy transcription.
  • Speaker Diarization with pyannote: Using the pyannote/speaker-diarization-3.1 model for narrator separation and diarization.
  • Accent Detection: Explore and integrate models for accent detection.
  • Speaker Memorization: Implement parallel processing within the multimodal system to recognize and remember speakers.
  • Transcription Analysis: Output testing to verify transcription accuracy, format, and narrator separation.
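
The media-processing item above can be prototyped with a thin wrapper around the FFmpeg command line. The sketch below is a minimal example, assuming ffmpeg is installed and on the PATH; it converts any supported container (MP4, AVI, MKV, MOV) to 16 kHz mono WAV, the format the transcription and diarization models used in this POC typically expect. The file name in the usage line is a placeholder.

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: str, out_dir: str = ".") -> Path:
    """Extract a 16 kHz mono WAV track from a video file using the ffmpeg CLI."""
    video = Path(video_path)
    wav_path = Path(out_dir) / f"{video.stem}.wav"
    cmd = [
        "ffmpeg",
        "-y",                 # overwrite an existing output file
        "-i", str(video),     # input container (MP4, AVI, MKV, MOV, ...)
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        str(wav_path),
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    return wav_path

if __name__ == "__main__":
    print(extract_audio("sample.mp4"))  # placeholder input file
```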

Technical Approach

  • Tools & Libraries:

    • FFmpeg: For video-to-audio extraction.
    • webrtcvad: Voice activity detection used to drop silence and non-speech audio as a lightweight clean-up step before transcription (see the VAD sketch after this list).
    • Whisper Model: Use the OpenAI Whisper model for accurate transcription of audio.
    • pyannote/speaker-diarization-3.1: Use this Hugging Face model for speaker diarization to identify and separate narrators (a combined transcription and diarization sketch follows this list).
    • Accent Detection Model: Implement or integrate a model for detecting American accents.
    • Speaker Memorization: Use parallel processing within the multimodal system to recognize and remember speaker voices across inputs.
    • SRT File Generation: Convert transcript data with timestamps and speaker labels into the SRT format (see the pysrt sketch after this list).
  • Frameworks: Python for scripting and automation, leveraging Whisper, pyannote, and other models for transcription, speaker separation, accent detection, and speaker memorization.
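
The audio clean-up stage can be sketched with webrtcvad, which classifies short frames as speech or non-speech. This is voice-activity trimming rather than full noise reduction; the sketch assumes 16-bit mono PCM at 16 kHz, because webrtcvad only accepts 8/16/32/48 kHz audio in 10, 20, or 30 ms frames. File names are placeholders.

```python
import wave
import webrtcvad

def keep_speech_frames(in_wav: str, out_wav: str, aggressiveness: int = 2,
                       frame_ms: int = 30) -> None:
    """Copy only voiced frames from in_wav to out_wav (16-bit mono PCM expected)."""
    vad = webrtcvad.Vad(aggressiveness)      # 0 = least aggressive, 3 = most aggressive
    with wave.open(in_wav, "rb") as src:
        rate = src.getframerate()            # must be 8000, 16000, 32000, or 48000
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        frame_bytes = int(rate * frame_ms / 1000) * 2  # samples per frame * 2 bytes
        pcm = src.readframes(src.getnframes())

    voiced = bytearray()
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset:offset + frame_bytes]
        if vad.is_speech(frame, rate):       # True when the frame contains speech
            voiced.extend(frame)

    with wave.open(out_wav, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(bytes(voiced))

keep_speech_frames("extracted.wav", "speech_only.wav")
```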

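The transcription and narrator-separation stages combine in roughly the way sketched below. The sketch assumes the openai-whisper and pyannote.audio packages are installed, a Hugging Face access token with access to pyannote/speaker-diarization-3.1, and a 16 kHz mono WAV from the earlier steps; the model size, paths, and the simple midpoint-based alignment are placeholder choices, not the tuned logic the POC would ship.

```python
import whisper
from pyannote.audio import Pipeline

AUDIO = "speech_only.wav"          # placeholder path from the preprocessing step
HF_TOKEN = "hf_..."                # placeholder Hugging Face access token

# 1. Speech-to-text with Whisper (returns time-stamped segments).
asr_model = whisper.load_model("medium")       # model size is a placeholder choice
asr_result = asr_model.transcribe(AUDIO)

# 2. Speaker diarization with pyannote/speaker-diarization-3.1.
diar_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN
)
diarization = diar_pipeline(AUDIO)
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

def speaker_at(t: float) -> str:
    """Return the speaker whose turn contains time t (fallback label otherwise)."""
    for start, end, speaker in turns:
        if start <= t <= end:
            return speaker
    return "UNKNOWN"

# 3. Attach a speaker label to each Whisper segment via its midpoint.
labelled_segments = []
for seg in asr_result["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    labelled_segments.append(
        {"start": seg["start"], "end": seg["end"],
         "speaker": speaker_at(midpoint), "text": seg["text"].strip()}
    )

for seg in labelled_segments:
    print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg["speaker"]}: {seg["text"]}')
```

Accent detection and speaker memorization are not shown here because the POC has not yet fixed specific models for them.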

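To illustrate the SRT output, the sketch below writes labelled segments (in the shape produced by the previous sketch) to a .srt file with pysrt, the library listed as a dependency in the task table below. The segment data and output path are hypothetical.

```python
import pysrt

def to_subrip_time(seconds: float) -> pysrt.SubRipTime:
    """Convert seconds as a float into a pysrt SubRipTime value."""
    millis = int(round(seconds * 1000))
    h, rem = divmod(millis, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return pysrt.SubRipTime(hours=h, minutes=m, seconds=s, milliseconds=ms)

def write_srt(labelled_segments, out_path: str = "transcript.srt") -> None:
    """Write time-coded, speaker-labelled segments to an SRT file."""
    subs = pysrt.SubRipFile()
    for i, seg in enumerate(labelled_segments, start=1):
        subs.append(
            pysrt.SubRipItem(
                index=i,
                start=to_subrip_time(seg["start"]),
                end=to_subrip_time(seg["end"]),
                text=f'{seg["speaker"]}: {seg["text"]}',
            )
        )
    subs.save(out_path, encoding="utf-8")

# Hypothetical data in the shape produced by the transcription/diarization sketch.
write_srt([
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00", "text": "Welcome to the demo."},
    {"start": 2.6, "end": 5.1, "speaker": "SPEAKER_01", "text": "Thanks for having me."},
])
```
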
Development Tasks & Estimates

| No | Task Name | Estimate (Hours) | Dependencies | Notes |
| --- | --- | --- | --- | --- |
| 1 | Video-to-Audio Extraction (Done) | 4 | FFmpeg setup | Ensure multi-format support |
| 2 | Audio Preprocessing (Noise Reduction) | 7 | webrtcvad | Apply clean-up for better transcription accuracy |
| 3 | Speech-to-Text Transcription (Whisper Model) (Done) | 6 | Whisper model | Validate accuracy for different video/audio formats |
| 4 | Speaker Diarization (pyannote/speaker-diarization-3.1) (Done) | 6 | pyannote setup | Ensure reliable narrator separation |
| 5 | Accent Detection Implementation | 5 | Accent detection model | Detect if the speaker has an American accent |
| 6 | Speaker Memorization for Multimodal Systems | 6 | Parallel processing setup | Implement speaker memorization capabilities |
| 7 | SRT File Generation with Time-Codes | 4 | pysrt | Create time-coded subtitles with speaker labels |
| 8 | Output Formatting (TXT, SRT) | 3 | N/A | Ensure flexible output options |
| 9 | Testing and Validation | 5 | Setup complete | Test different video/audio formats and speaker separation |
| Total | | 46 | | |

Testing & Validation

  • Basic Tests:

    • Test with small, medium, and large video files to verify extraction, transcription, and narrator separation.
    • Ensure output accuracy with different audio qualities (clear vs noisy environments).
    • Validate the time-coding accuracy and narrator labels in the generated SRT files.
    • Verify accent detection functionality.
    • Confirm speaker memorization through multiple video inputs.
  • Validation Criteria:

    • Audio extracted without errors and enhanced for clarity.
    • Transcription must maintain over 90% word-level accuracy (i.e., word error rate below 10%) for high-quality audio, including proper narrator identification; a simple WER check is sketched after this list.
    • Output generated in SRT format with correct timestamps, speaker labels, and overall subtitle formatting.
    • Accurate detection of American accent.
    • Speaker memorization works effectively with multiple video inputs.
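
As a concrete way to check the accuracy criterion above, the sketch below computes word error rate (WER) between a reference transcript and the POC's output; treating accuracy as 1 − WER is a common convention rather than something specified in this document. Libraries such as jiwer provide the same metric, but the version here is self-contained, and the transcripts are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words (Levenshtein).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution / match
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical transcripts for illustration only.
reference = "welcome to the demo thanks for having me today everyone"
hypothesis = "welcome to the demo thanks for halving me today everyone"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}, accuracy: {1 - wer:.2%}")  # accuracy here is treated as 1 - WER
```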

Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation Strategy |
| --- | --- | --- | --- |
| Whisper model performance issues | High | High | Ensure use of the latest version of the Whisper model for better accuracy |
| Speaker separation inaccuracies | High | High | Use the pyannote/speaker-diarization-3.1 model with tuning to handle complex audio |
| Accent detection inaccuracies | High | High | Implement and test robust models for accurate accent detection |
| Speaker memorization issues | High | High | Use parallel processing within the multimodal system to improve memorization |
| Inconsistent time-coding in SRT files | High | High | Fine-tune the time-coding logic based on testing feedback |
| Long processing times | High | High | Optimize the pipeline for performance, using multithreading or multiprocessing if necessary |
| Audio preprocessing issues | High | High | Apply robust preprocessing to clean and prepare audio before further analysis |
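
The "long processing times" mitigation could be prototyped with Python's standard concurrent.futures module, running the per-video pipeline across worker processes. This is a minimal sketch: process_video is a hypothetical wrapper around the extraction, preprocessing, transcription, diarization, and SRT steps sketched earlier, and the worker count is an assumption to tune against the available hardware.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def process_video(video_path: str) -> str:
    """Hypothetical end-to-end wrapper: extract audio, preprocess, transcribe,
    diarize, and write the SRT file; returns the output path."""
    # ... calls to the sketches above would go here ...
    return str(Path(video_path).with_suffix(".srt"))

def process_batch(video_paths, max_workers: int = 4):
    """Run the per-video pipeline in parallel worker processes."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_video, p): p for p in video_paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

if __name__ == "__main__":
    print(process_batch(["clip1.mp4", "clip2.mkv"]))  # placeholder file names
```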

Conclusion

This POC aims to showcase a robust solution for converting video files into text via audio transcription with narrator separation using Whisper and pyannote models. By incorporating accent detection and speaker memorization capabilities, the solution will provide more detailed and useful transcription results. Generating the output in SRT format will facilitate subtitles and captioning with clear timestamps and speaker identification. If successful, this solution can be integrated into applications requiring automatic media transcription with enhanced features for accent detection and speaker memorization. The next steps would involve integrating the solution into larger systems or exploring additional features like real-time transcription.