Video to Audio Transcription
Author(s)
- Bishwanath Jana
Submission Date
2024-09-13
Version History
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2024-09-13 | Initial draft | Bishwanath Jana |
Objective
The purpose of this POC is to evaluate and demonstrate the feasibility of converting video files into audio, transcribing the extracted audio into text, and distinguishing between different speakers or narrators. The POC will also focus on generating the output in SRT (SubRip Subtitle) format to support captioning and time-coded transcripts. Additionally, the POC will aim to detect American accents and identify speakers from the audio, and to add speaker memorization capabilities for multimodal systems using parallel processing.
Scope
- Video Processing: Support for multiple video formats (MP4, AVI, MKV, MOV) to extract audio.
- Audio Enhancement: Noise reduction and audio normalization to improve transcription accuracy.
- Speech-to-Text: Convert processed audio into text using the Whisper model for accurate transcription.
- Narrator Separation: Use the pyannote/speaker-diarization-3.1 model from Hugging Face to distinguish between different speakers.
- Accent Detection: Detect if the speaker has an American accent using a suitable model.
- Speaker Memorization: Incorporate speaker memorization using multimodal system parallel processing.
- Output Format: Generate transcripts in SRT (SubRip Subtitle) format, including time-coding and speaker identification for each dialogue.
- Performance Testing: Evaluate transcription accuracy across different audio qualities and file sizes.
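As a point of reference for the video-processing item above, the sketch below shows one way to extract audio with FFmpeg from Python via the subprocess module. It is a minimal illustration that assumes the ffmpeg binary is on the PATH; the 16 kHz mono WAV output matches what Whisper and pyannote expect, and the file names are placeholders.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract a 16 kHz mono WAV track from a video file using FFmpeg.

    Works for any container FFmpeg understands (MP4, AVI, MKV, MOV, ...).
    """
    cmd = [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input video
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz (expected by Whisper/pyannote)
        "-c:a", "pcm_s16le",  # 16-bit PCM WAV
        audio_path,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    extract_audio("sample_video.mp4", "sample_audio.wav")  # placeholder paths
```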
Related Features
- Media Processing: Integration with video/audio tools like FFmpeg for extraction.
- Speech Recognition with Whisper Model: Using the Whisper model for high-accuracy transcription.
- Speaker Diarization with pyannote: Using the pyannote/speaker-diarization-3.1 model for narrator separation and diarization.
- Accent Detection: Explore and integrate models for accent detection.
- Speaker Memorization: Implement multimodal system parallel processing for speaker recognition.
- Transcription Analysis: Output testing to verify transcription accuracy, format, and narrator separation.
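The speaker memorization feature listed above has not been pinned to a specific technique yet; one plausible direction is to enroll a reference voice embedding per speaker and match new audio against the stored embeddings by cosine similarity. The sketch below illustrates that idea with the pyannote/embedding model as an example (an assumption, not a chosen dependency); the threshold, file paths, and in-memory store are placeholders.

```python
import numpy as np
from pyannote.audio import Model, Inference

# Pretrained speaker-embedding model (requires a Hugging Face access token).
model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
embed = Inference(model, window="whole")    # one embedding per audio file

speaker_db: dict[str, np.ndarray] = {}      # in-memory "memory" of known speakers

def enroll(name: str, wav_path: str) -> None:
    """Store a reference embedding for a known speaker."""
    speaker_db[name] = embed(wav_path)

def identify(wav_path: str, threshold: float = 0.7) -> str:
    """Match new audio against enrolled speakers by cosine similarity."""
    query = embed(wav_path)
    best_name, best_score = "unknown", threshold  # threshold is illustrative
    for name, ref in speaker_db.items():
        score = float(np.dot(query, ref) / (np.linalg.norm(query) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical usage:
# enroll("narrator_a", "narrator_a_sample.wav")
# print(identify("new_clip_audio.wav"))
```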
Technical Approach
- Tools & Libraries:
  - FFmpeg: For video-to-audio extraction.
  - webrtcvad: For voice activity detection, trimming silence and non-speech segments to improve transcription accuracy.
  - Whisper Model: Use the OpenAI Whisper model for accurate transcription of audio.
  - pyannote/speaker-diarization-3.1: Use this Hugging Face model for speaker diarization to identify and separate narrators.
  - Accent Detection Model: Implement or integrate a model for detecting American accents.
  - Speaker Memorization: Use multimodal system parallel processing to recognize and remember speaker voices.
  - SRT File Generation: Convert transcript data with timestamps and speaker labels into the SRT format.
- Frameworks: Python for scripting and automation, leveraging Whisper, pyannote, and other models for transcription, speaker separation, accent detection, and speaker memorization.
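To make the approach concrete, the sketch below chains the two core components: Whisper for time-stamped transcription and pyannote for diarization, then labels each transcript segment with the speaker whose turn overlaps it most. It is a simplified illustration, assuming the audio has already been extracted to a WAV file; the Whisper model size, the Hugging Face token, and the overlap heuristic are placeholders to be tuned during the POC.

```python
import whisper
from pyannote.audio import Pipeline

def transcribe_with_speakers(audio_path: str, hf_token: str) -> list[dict]:
    """Transcribe audio with Whisper and label each segment with a speaker."""
    # 1. Speech-to-text: Whisper returns segments with start/end timestamps.
    asr_model = whisper.load_model("medium")          # model size is a tunable choice
    asr_result = asr_model.transcribe(audio_path)

    # 2. Diarization: pyannote returns speaker turns over the same timeline.
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = diarizer(audio_path)
    turns = [(turn.start, turn.end, spk)
             for turn, _, spk in diarization.itertracks(yield_label=True)]

    # 3. Alignment: assign each ASR segment the speaker with maximum time overlap.
    labelled = []
    for seg in asr_result["segments"]:
        best_spk, best_overlap = "UNKNOWN", 0.0
        for start, end, spk in turns:
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best_spk, best_overlap = spk, overlap
        labelled.append({
            "start": seg["start"],
            "end": seg["end"],
            "speaker": best_spk,
            "text": seg["text"].strip(),
        })
    return labelled

# Example (hypothetical path and token):
# segments = transcribe_with_speakers("sample_audio.wav", hf_token="HF_TOKEN")
```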
Development Tasks & Estimates
| No | Task Name | Estimate (Hours) | Dependencies | Notes |
|---|---|---|---|---|
| 1 | Video-to-Audio Extraction (Done) | 4 hours | FFmpeg setup | Ensure multi-format support |
| 2 | Audio Preprocessing (Noise Reduction) | 7 hours | webrtcvad | Apply clean-up for better transcription accuracy |
| 3 | Speech-to-Text Transcription (Whisper Model) (Done) | 6 hours | Whisper model | Validate accuracy for different video/audio formats |
| 4 | Speaker Diarization (pyannote/speaker-diarization-3.1) (Done) | 6 hours | pyannote setup | Ensure reliable narrator separation |
| 5 | Accent Detection Implementation | 5 hours | Accent detection model | Detect if the speaker has an American accent |
| 6 | Speaker Memorization for Multimodal Systems | 6 hours | Parallel processing setup | Implement speaker memorization capabilities |
| 7 | SRT File Generation with Time-Codes | 4 hours | pysrt | Create time-coded subtitles with speaker labels |
| 8 | Output Formatting (TXT, SRT) | 3 hours | N/A | Ensure flexible output options |
| 9 | Testing and Validation | 5 hours | Setup complete | Test different video/audio formats and speaker separation |
| | Total | 46 hours | | |
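Task 7 lists pysrt as the dependency for SRT generation. A minimal sketch of how speaker-labelled segments (as produced by the pipeline sketched in the Technical Approach section) could be written to an SRT file is shown below; the segment dictionary layout is an assumption, not a fixed interface.

```python
import pysrt

def seconds_to_srt_time(seconds: float) -> pysrt.SubRipTime:
    """Convert float seconds into a pysrt SubRipTime (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return pysrt.SubRipTime(hours=hours, minutes=minutes, seconds=secs, milliseconds=millis)

def write_srt(segments: list[dict], srt_path: str) -> None:
    """Write segments (dicts with 'start', 'end', 'speaker', 'text') to an SRT file."""
    subs = pysrt.SubRipFile()
    for i, seg in enumerate(segments, start=1):
        subs.append(pysrt.SubRipItem(
            index=i,
            start=seconds_to_srt_time(seg["start"]),
            end=seconds_to_srt_time(seg["end"]),
            text=f"{seg['speaker']}: {seg['text']}",
        ))
    subs.save(srt_path, encoding="utf-8")

# Hypothetical usage:
# write_srt(segments, "sample_video.srt")
```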
Testing & Validation
- Basic Tests:
  - Test with small, medium, and large video files to verify extraction, transcription, and narrator separation.
  - Ensure output accuracy with different audio qualities (clear vs. noisy environments).
  - Validate the time-coding accuracy and narrator labels in the generated SRT files.
  - Verify accent detection functionality.
  - Confirm speaker memorization through multiple video inputs.
- Validation Criteria:
  - Audio extracted without errors and enhanced for clarity.
  - Transcription must maintain over 90% accuracy for high-quality audio, including proper narrator identification.
  - Output generated in SRT format with correct timestamps, speaker labels, and overall subtitle formatting.
  - Accurate detection of American accents.
  - Speaker memorization works effectively with multiple video inputs.
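For the 90% accuracy criterion, a standard measurement is word error rate (WER) against a manually prepared reference transcript, with accuracy reported as 1 − WER. The sketch below uses the jiwer library as one possible tool (an assumption; any WER implementation would serve), with placeholder file names.

```python
import jiwer

def transcription_accuracy(reference_path: str, hypothesis_path: str) -> float:
    """Compare a generated transcript against a human reference and return accuracy."""
    with open(reference_path, encoding="utf-8") as f:
        reference = f.read()
    with open(hypothesis_path, encoding="utf-8") as f:
        hypothesis = f.read()
    error_rate = jiwer.wer(reference, hypothesis)  # word error rate
    return 1.0 - error_rate

# Hypothetical usage: validation fails if accuracy drops below 0.90
# print(transcription_accuracy("reference_transcript.txt", "whisper_transcript.txt"))
```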
Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation Strategy |
|---|---|---|---|
| Whisper model performance issues | High | High | Ensure use of the latest version of the Whisper model for better accuracy |
| Speaker separation inaccuracies | High | High | Use the pyannote/speaker-diarization-3.1 model with tuning to handle complex audio |
| Accent detection inaccuracies | High | High | Implement and test robust models for accurate accent detection |
| Speaker memorization issues | High | High | Use effective multimodal system parallel processing to enhance memorization |
| Inconsistent time-coding in SRT files | High | High | Fine-tune the time-coding logic based on testing feedback |
| Long processing times | High | High | Optimize the process for performance with multithreading if necessary |
| Audio Preprocessing issues | High | High | Implement robust preprocessing techniques to clean and prepare audio data effectively before further analysis |
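For the long-processing-time risk, one straightforward mitigation is to process independent video files in parallel while keeping each file's extract, transcribe, and diarize steps sequential. The sketch below uses concurrent.futures and assumes a process_video function that wraps the full pipeline (the name and worker count are placeholders); in practice, GPU memory limits how many workers are useful, so the count would need tuning.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_video(video_path: str) -> str:
    """Placeholder for the full pipeline: extract audio, transcribe, diarize, write SRT."""
    # extract_audio(video_path, ...); transcribe_with_speakers(...); write_srt(...)
    return f"{video_path}: done"

def process_batch(video_paths: list[str], max_workers: int = 2) -> None:
    """Run the pipeline for several videos in parallel, one worker process per file."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_video, path): path for path in video_paths}
        for fut in as_completed(futures):
            print(fut.result())

# Hypothetical usage:
# process_batch(["clip_a.mp4", "clip_b.mkv"])
```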
Conclusion
This POC aims to showcase a robust solution for converting video files into text via audio transcription with narrator separation using Whisper and pyannote models. By incorporating accent detection and speaker memorization capabilities, the solution will provide more detailed and useful transcription results. Generating the output in SRT format will facilitate subtitles and captioning with clear timestamps and speaker identification. If successful, this solution can be integrated into applications requiring automatic media transcription with enhanced features for accent detection and speaker memorization. The next steps would involve integrating the solution into larger systems or exploring additional features like real-time transcription.