AI-Powered Live Video Transcriptions for Real-Time Accuracy
Client: Tanna AI
Feature: Real-time video transcription (audio + on-screen text)
Tech Stack: Flutter, Tesseract.js, OpenAI Whisper API, Google Cloud Functions, Firebase Firestore
Project Overview
Tanna AI required a feature that could deliver real-time video transcriptions by combining audio transcription with on-screen text extraction. The goal was to transcribe both the audio track and any text visible on screen seamlessly, in real time, and without the two streams overlapping. This was particularly beneficial for students using the Tanna AI platform to transcribe lectures and other educational videos.
Workflow Overview
The following diagram illustrates the end-to-end workflow for real-time video transcription.
Workflow Steps
1. Tesseract.js for On-Screen Text Extraction
The workflow begins with Tesseract.js, a JavaScript library for Optical Character Recognition (OCR). It captures and processes screenshots taken from the video content to extract any visible text on the screen. The extracted text is then sent to the Flutter-based media player for further processing.
2. Communication between Tesseract.js and Flutter App
Through JavaScript interop, Tesseract.js shares the extracted text directly with the Flutter media player. This ensures a smooth transfer of the on-screen text data for display and synchronization within the app.
3. Google Cloud Functions Handling Audio Transcription
Simultaneously, the audio track from the video is processed by a Google Cloud Function written in Python. This function uses the OpenAI Whisper API to transcribe the spoken audio in real time. The Cloud Function returns the audio transcript so it can be kept in sync with the on-screen text without overlap.
4. Storage and Real-Time Updates via Firebase Firestore
The transcripts (both from the audio and on-screen text) are saved in Firebase Firestore, providing a real-time database to store and sync the transcription data. This ensures that any updates made during the transcription process are immediately reflected on the user’s device.
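A minimal sketch of how this could look from the Flutter side with the cloud_firestore package is shown below; the collection name, document layout, and field names are illustrative assumptions, not the production schema.

import 'package:cloud_firestore/cloud_firestore.dart';

// Hypothetical layout: one transcript document per video.
Future<void> saveScreenTranscript(String videoId, String text) {
  return FirebaseFirestore.instance
      .collection('transcripts')
      .doc(videoId)
      .set(
        {'screenText': text, 'updatedAt': FieldValue.serverTimestamp()},
        SetOptions(merge: true),
      );
}

// Real-time listener: the player re-renders as soon as a transcript field changes.
Stream<Map<String, dynamic>?> watchTranscript(String videoId) {
  return FirebaseFirestore.instance
      .collection('transcripts')
      .doc(videoId)
      .snapshots()
      .map((snapshot) => snapshot.data());
}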
5. Synchronized Output Display on Flutter Media Player
Finally, the Flutter-based media player integrates the two data streams—OCR text from Tesseract.js and audio transcripts from Whisper (via Google Cloud Functions). The player ensures both are displayed accurately and in real-time without duplication or overlap.
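The snippet below is a simplified sketch of how the player might merge the two sources into one view; the widget name, constructor fields, and the de-duplication check are assumptions rather than the production implementation.

import 'package:flutter/material.dart';

class TranscriptPanel extends StatelessWidget {
  const TranscriptPanel({
    super.key,
    required this.audioText,  // transcript streamed back from the Cloud Function
    required this.screenText, // OCR text pushed in from Tesseract.js via JS interop
  });

  final ValueNotifier<String> audioText;
  final ValueNotifier<String> screenText;

  @override
  Widget build(BuildContext context) {
    // Rebuild whenever either source updates so both stay visible and in sync.
    return ValueListenableBuilder<String>(
      valueListenable: audioText,
      builder: (context, audio, _) => ValueListenableBuilder<String>(
        valueListenable: screenText,
        builder: (context, screen, _) => Column(
          crossAxisAlignment: CrossAxisAlignment.start,
          children: [
            Text(audio),
            // Skip the OCR line when it duplicates what the audio already says.
            if (screen.isNotEmpty && screen != audio) Text(screen),
          ],
        ),
      ),
    );
  }
}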
Challenges
The feature had to handle two separate input streams from the same video, in real time and without overlap:
- Audio from the video, processed using OpenAI’s Whisper API.
- Text present on-screen, extracted using Tesseract OCR.
Solution: Technical Architecture
1. Flutter Integration with JavaScript
I utilized Flutter’s JS interop to bridge Flutter and JavaScript, allowing the app to send screenshots for text transcription and receive results back in real time.
@JS()
library js_interop;

import 'package:flutter/foundation.dart';
import 'package:js/js.dart';

class JSInterop {
  static void init() {
    // Expose the Dart callback to JavaScript; allowInterop wraps it so the
    // browser can call it with each new OCR result.
    _shareScreenshotTranscript = allowInterop(_shareScreenshotTranscriptDart);
  }
}

// Setter for the global `shareScreenshotTranscript` function on the JS side.
@JS('shareScreenshotTranscript')
external set _shareScreenshotTranscript(void Function(String transcript) f);

// Notifies listening widgets whenever new on-screen text arrives from Tesseract.js.
final ValueNotifier<String> shareScreenshotTranscript = ValueNotifier('');

void _shareScreenshotTranscriptDart(String transcript) {
  shareScreenshotTranscript.value = transcript;
}
2. JavaScript Text Extraction Using Tesseract
Tesseract.js handled the OCR-based text extraction, taking cropped screenshots from the video and extracting their text in real time.
async function extractTextFromBlob(blob) {
  // Spin up an English-language Tesseract worker for this frame.
  const worker = await Tesseract.createWorker("eng");
  // psm 1: automatic page segmentation with orientation and script detection.
  const { data: { text } } = await worker.recognize(blob, { psm: 1 });
  await worker.terminate();
  // Hand the extracted text back to the Flutter app via the interop hook.
  shareScreenshotTranscript(text);
}
3. Real-Time Screenshot Processing
The app periodically captures screenshots and processes the cropped regions of the video to improve the efficiency of the transcription process. This minimizes unnecessary OCR operations.
import 'dart:html' as html;
import 'dart:js' as js;
import 'dart:typed_data';

void _cropAndTranscribeScreenshot(ByteBuffer screenshotBuffer) {
  // Wrap the raw screenshot bytes in a Blob that the JavaScript side can read.
  final blob = html.Blob([screenshotBuffer]);
  // Forward the frame and the crop/resize coordinates (the user-selected region,
  // held in croppedRegionCoordinates elsewhere in the player) to the JS OCR pipeline.
  js.context.callMethod('cropAndTransribeImage', [
    blob,
    croppedRegionCoordinates?.x,
    croppedRegionCoordinates?.y,
    croppedRegionCoordinates?.width,
    croppedRegionCoordinates?.height,
    croppedRegionCoordinates?.resizeWidth,
    croppedRegionCoordinates?.resizeHeight,
  ]);
}
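For context, a timer-driven capture loop along the following lines could feed the method above; the five-second interval and the captureVideoFrame helper are hypothetical and stand in for however the player actually grabs frames.

import 'dart:async';
import 'dart:typed_data';

Timer? _ocrTimer;

// Hypothetical frame grabber: in the real player this would read pixels from the video element.
Future<ByteBuffer?> captureVideoFrame() async => null;

void startScreenshotLoop() {
  // Capture a frame on a fixed interval and hand it to the crop/OCR pipeline above.
  _ocrTimer = Timer.periodic(const Duration(seconds: 5), (_) async {
    final frame = await captureVideoFrame();
    if (frame != null) {
      _cropAndTranscribeScreenshot(frame);
    }
  });
}

void stopScreenshotLoop() => _ocrTimer?.cancel();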
4. Audio Transcription Using Google Cloud Functions
A Google Cloud Function processes the audio extracted from the video using the Whisper model, ensuring that both streams (audio and on-screen text) stay synchronized in real time.
import os

def transcribe_video_file(json_request):
    file_location = json_request.get('file_location')
    user_id = json_request.get('user_id')

    # Download the uploaded video from the Cloud Storage bucket to a temp file.
    blob = bucket.blob(file_location)
    temp_video_file = '/tmp/temp_video' + os.path.splitext(file_location)[1]
    blob.download_to_filename(temp_video_file)

    # Pull the audio track out of the video and transcribe it
    # (helper functions defined elsewhere in the Cloud Function).
    audio_data = extract_audio_from_video(temp_video_file)
    transcriptions = transcribe_audio(audio_data)
    transcript = transcriptions["results"]["channels"][0]["alternatives"][0]["transcript"]

    return {'message': 'File transcribed successfully.', 'transcript': transcript}
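On the client side, the Cloud Function could be called over HTTPS roughly as follows; the endpoint URL is a placeholder and the request fields simply mirror the handler above.

import 'dart:convert';
import 'package:http/http.dart' as http;

Future<String?> requestAudioTranscript(String fileLocation, String userId) async {
  // Placeholder endpoint; the real URL depends on the project and region.
  final uri = Uri.parse(
      'https://REGION-PROJECT.cloudfunctions.net/transcribe_video_file');
  final response = await http.post(
    uri,
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({'file_location': fileLocation, 'user_id': userId}),
  );
  if (response.statusCode != 200) return null;
  return (jsonDecode(response.body) as Map<String, dynamic>)['transcript'] as String?;
}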
Key Features
- Real-time transcription of both audio and on-screen text.
- Multi-stream processing (audio and video text) without overlap.
- JS-Interop integration for seamless interaction between Flutter and JavaScript.
- Cloud-based audio transcription using OpenAI's Whisper API.
- Screenshot cropping functionality for focused text extraction.
Challenges and Solutions
1. Processing On-Screen Text in Real Time
Challenge: Ensuring that Tesseract could accurately process screenshots of video frames and extract text in real time.
Solution: By cropping regions of the image, Tesseract’s accuracy was improved and gibberish output was minimized. Extracted text was immediately sent back to Flutter via JS interop for further processing.
2. Synchronizing Multiple Input Streams
Challenge: Managing synchronization between the two streams (audio and on-screen text) without overlap.
Solution: The two streams were processed in parallel using cloud functions, and Firebase was used to deliver real-time updates to the app, ensuring they remained in sync.
Impact
The real-time transcription feature significantly enhanced the accessibility of video content for students, resulting in positive feedback and additional sales for Tanna AI. The seamless integration of multiple technologies ensured high accuracy and user satisfaction.