AI-Powered Live Video Transcriptions for Real-Time Accuracy
Client: Tanna AI
Feature: Real-time video transcription (audio + on-screen text)
Tech Stack: Flutter, Tesseract.js, OpenAI Whisper API, Google Cloud Functions, Firebase Firestore
Project Overview
Tanna AI required a feature that could deliver real-time video transcriptions by combining audio transcription with on-screen text extraction. The goal was to transcribe both the audio track and any text visible on screen seamlessly, in real time, and without the two streams overlapping. This was particularly beneficial for students using the Tanna AI platform to transcribe lectures and other educational videos.
Workflow Overview
The following diagram illustrates the end-to-end workflow for real-time video transcription.
Workflow Steps
1. Tesseract.js for On-Screen Text Extraction
The workflow begins with Tesseract.js, a JavaScript library for Optical Character Recognition (OCR). It captures and processes screenshots taken from the video content to extract any visible text on the screen. The extracted text is then sent to the Flutter-based media player for further processing.
2. Communication between Tesseract.js and Flutter App
Through JavaScript interop, Tesseract.js shares the extracted text directly with the Flutter media player. This ensures a smooth transfer of the on-screen text data for display and synchronization within the app.
3. Google Cloud Functions Handling Audio Transcription
Simultaneously, the audio track from the video is processed by a Google Cloud Function written in Python. This function uses the OpenAI Whisper API to transcribe the spoken audio in real time. The Cloud Function returns the audio transcript so it can be kept in sync with the on-screen text without overlap.
4. Storage and Real-Time Updates via Firebase Firestore
The transcripts (both from the audio and on-screen text) are saved in Firebase Firestore, providing a real-time database to store and sync the transcription data. This ensures that any updates made during the transcription process are immediately reflected on the user’s device.
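A minimal sketch of how this could look from the Flutter side with the cloud_firestore package is shown below; the collection name, document layout, and field names are illustrative assumptions, not the production schema.

import 'package:cloud_firestore/cloud_firestore.dart';

// Hypothetical layout: one transcript document per video.
Future<void> saveScreenTranscript(String videoId, String text) {
  return FirebaseFirestore.instance
      .collection('transcripts')
      .doc(videoId)
      .set(
        {'screenText': text, 'updatedAt': FieldValue.serverTimestamp()},
        SetOptions(merge: true),
      );
}

// Real-time listener: the player re-renders as soon as a transcript field changes.
Stream<Map<String, dynamic>?> watchTranscript(String videoId) {
  return FirebaseFirestore.instance
      .collection('transcripts')
      .doc(videoId)
      .snapshots()
      .map((snapshot) => snapshot.data());
}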
5. Synchronized Output Display on Flutter Media Player
Finally, the Flutter-based media player integrates the two data streams—OCR text from Tesseract.js and audio transcripts from Whisper (via Google Cloud Functions). The player ensures both are displayed accurately and in real-time without duplication or overlap.
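The snippet below is a simplified sketch of how the player might merge the two sources into one view; the widget name, constructor fields, and the de-duplication check are assumptions rather than the production implementation.

import 'package:flutter/material.dart';

class TranscriptPanel extends StatelessWidget {
  const TranscriptPanel({
    super.key,
    required this.audioText,  // transcript streamed back from the Cloud Function
    required this.screenText, // OCR text pushed in from Tesseract.js via JS interop
  });

  final ValueNotifier<String> audioText;
  final ValueNotifier<String> screenText;

  @override
  Widget build(BuildContext context) {
    // Rebuild whenever either source updates so both stay visible and in sync.
    return ValueListenableBuilder<String>(
      valueListenable: audioText,
      builder: (context, audio, _) => ValueListenableBuilder<String>(
        valueListenable: screenText,
        builder: (context, screen, _) => Column(
          crossAxisAlignment: CrossAxisAlignment.start,
          children: [
            Text(audio),
            // Skip the OCR line when it duplicates what the audio already says.
            if (screen.isNotEmpty && screen != audio) Text(screen),
          ],
        ),
      ),
    );
  }
}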
Challenges
The feature had to handle two separate input streams from the same video, in real time and without overlap:
- Audio from the video, processed using OpenAI’s Whisper API.
- Text present on-screen, extracted using Tesseract OCR.
Solution: Technical Architecture
1. Flutter Integration with JavaScript
I utilized Flutter’s JS interop to bridge Flutter and JavaScript, allowing the app to send screenshots for text transcription and receive results back in real time.
@JS()
library js_interop;

import 'package:flutter/foundation.dart';
import 'package:js/js.dart';

class JSInterop {
  static void init() {
    // Expose the Dart callback to JavaScript; allowInterop wraps it so the
    // browser can call it with each new OCR result.
    _shareScreenshotTranscript = allowInterop(_shareScreenshotTranscriptDart);
  }
}

// Setter for the global `shareScreenshotTranscript` function on the JS side.
@JS('shareScreenshotTranscript')
external set _shareScreenshotTranscript(void Function(String transcript) f);

// Notifies listening widgets whenever new on-screen text arrives from Tesseract.js.
final ValueNotifier<String> shareScreenshotTranscript = ValueNotifier('');

void _shareScreenshotTranscriptDart(String transcript) {
  shareScreenshotTranscript.value = transcript;
}
2. JavaScript Text Extraction Using Tesseract
Tesseract.js handled the OCR-based text extraction, taking cropped screenshots from the video and extracting their text in real time.
async function extractTextFromBlob(blob) {
  // Spin up an English-language Tesseract worker for this frame.
  const worker = await Tesseract.createWorker("eng");
  // psm 1: automatic page segmentation with orientation and script detection.
  const { data: { text } } = await worker.recognize(blob, { psm: 1 });
  await worker.terminate();
  // Hand the extracted text back to the Flutter app via the interop hook.
  shareScreenshotTranscript(text);
}
3. Real-Time Screenshot Processing
The app periodically captures screenshots and processes the cropped regions of the video to improve the efficiency of the transcription process. This minimizes unnecessary OCR operations.
import 'dart:html' as html;
import 'dart:js' as js;
import 'dart:typed_data';

void _cropAndTranscribeScreenshot(ByteBuffer screenshotBuffer) {
  // Wrap the raw screenshot bytes in a Blob that the JavaScript side can read.
  final blob = html.Blob([screenshotBuffer]);
  // Forward the frame and the crop/resize coordinates (the user-selected region,
  // held in croppedRegionCoordinates elsewhere in the player) to the JS OCR pipeline.
  js.context.callMethod('cropAndTransribeImage', [
    blob,
    croppedRegionCoordinates?.x,
    croppedRegionCoordinates?.y,
    croppedRegionCoordinates?.width,
    croppedRegionCoordinates?.height,
    croppedRegionCoordinates?.resizeWidth,
    croppedRegionCoordinates?.resizeHeight,
  ]);
}
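For context, a timer-driven capture loop along the following lines could feed the method above; the five-second interval and the captureVideoFrame helper are hypothetical and stand in for however the player actually grabs frames.

import 'dart:async';
import 'dart:typed_data';

Timer? _ocrTimer;

// Hypothetical frame grabber: in the real player this would read pixels from the video element.
Future<ByteBuffer?> captureVideoFrame() async => null;

void startScreenshotLoop() {
  // Capture a frame on a fixed interval and hand it to the crop/OCR pipeline above.
  _ocrTimer = Timer.periodic(const Duration(seconds: 5), (_) async {
    final frame = await captureVideoFrame();
    if (frame != null) {
      _cropAndTranscribeScreenshot(frame);
    }
  });
}

void stopScreenshotLoop() => _ocrTimer?.cancel();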
4. Audio Transcription Using Google Cloud Functions
A Google Cloud Function processes the audio extracted from the video using the Whisper model, ensuring that both streams (audio and on-screen text) stay synchronized in real time.
import os

def transcribe_video_file(json_request):
    file_location = json_request.get('file_location')
    user_id = json_request.get('user_id')

    # Download the uploaded video from the Cloud Storage bucket to a temp file.
    blob = bucket.blob(file_location)
    temp_video_file = '/tmp/temp_video' + os.path.splitext(file_location)[1]
    blob.download_to_filename(temp_video_file)

    # Pull the audio track out of the video and transcribe it
    # (helper functions defined elsewhere in the Cloud Function).
    audio_data = extract_audio_from_video(temp_video_file)
    transcriptions = transcribe_audio(audio_data)
    transcript = transcriptions["results"]["channels"][0]["alternatives"][0]["transcript"]

    return {'message': 'File transcribed successfully.', 'transcript': transcript}
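On the client side, the Cloud Function could be called over HTTPS roughly as follows; the endpoint URL is a placeholder and the request fields simply mirror the handler above.

import 'dart:convert';
import 'package:http/http.dart' as http;

Future<String?> requestAudioTranscript(String fileLocation, String userId) async {
  // Placeholder endpoint; the real URL depends on the project and region.
  final uri = Uri.parse(
      'https://REGION-PROJECT.cloudfunctions.net/transcribe_video_file');
  final response = await http.post(
    uri,
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({'file_location': fileLocation, 'user_id': userId}),
  );
  if (response.statusCode != 200) return null;
  return (jsonDecode(response.body) as Map<String, dynamic>)['transcript'] as String?;
}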
Key Features
- Real-time transcription of both audio and on-screen text.
- Multi-stream processing (audio and video text) without overlap.
- JS-Interop integration for seamless interaction between Flutter and JavaScript.
- Cloud-based audio transcription using OpenAI's Whisper API.
- Screenshot cropping functionality for focused text extraction.
Challenges and Solutions
1. Processing On-Screen Text in Real Time
Challenge: Ensuring that Tesseract could accurately process screenshots of video frames and extract text in real time.
Solution: By cropping regions of the image, Tesseract’s accuracy was improved and gibberish output was minimized. Extracted text was immediately sent back to Flutter via JS interop for further processing.
2. Synchronizing Multiple Input Streams
Challenge: Managing synchronization between the two streams (audio and on-screen text) without overlap.
Solution: The two streams were processed in parallel using cloud functions, and Firebase was used to deliver real-time updates to the app, ensuring they remained in sync.
Impact
The real-time transcription feature significantly enhanced the accessibility of video content for students, resulting in positive feedback and additional sales for Tanna AI. The seamless integration of multiple technologies ensured high accuracy and user satisfaction.