Senior Design Team sddec21-06 • DigiClips Media Search Engine


DigiClips is a media content analysis company that records and extracts data from diverse types of media, such as television and radio, and stores this information in a searchable format. It aims to provide its clients with a user interface for searching the database for keywords or phrases of interest uttered in audio or video clips. For example, a client may want to find whether, and how often, their company name has been mentioned on television or radio within a given time frame.


Problem Statement

The data currently being extracted from the television recordings comes from the network-provided closed captions only. This closed-caption data often misses words or phrases spoken within the broadcast, causing a disconnect between the actual content of the broadcast and the searchable content provided. In addition to missed audio, closed-caption data provides no means of searching the content of the broadcast frames themselves, which often contain visible text denoting the current news segment, breaking stories, etc. At the moment, this information is lost within hours of recordings and is extremely difficult to search.


Solution Approach

This project will investigate existing solutions and develop efficient speech-to-text and video-to-text modules that take television and radio recordings as their inputs and record the timestamped locations of keywords and phrases of interest in those recordings. The outputs of these modules will be organized into a database schema. Speech-to-text and video-to-text extraction will give the company an edge in the industry by granting access to data that is currently not being tracked, providing more opportunities for clients to find any and all mentions of their keywords. The focus will be on developing near-real-time solutions that can scale with the number of audio and video recording streams. These modules will be integrated with other components of the system, including signal processing applications, databases, query and retrieval frameworks, and user interfaces.


Current Implementation

Our current implementation of the proposed solution consists of a microservice structure detailed in the figure below.


As shown in the figure above, our application consists of a three-microservice structure: a driver microservice, a speech-to-text (stt) microservice, and a video-to-text (vtt) microservice. The driver microservice accepts a POST request with a file path embedded in the body, which it uses to locate the recorded video or audio file to analyze. From there, the driver calls both the stt and vtt microservices, which process the recording for speech-to-text and video-to-text, respectively.
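As a rough illustration, the driver's entry point might look like the Python/Flask sketch below. The framework choice, route, ports, service URLs, and JSON field names are assumptions for illustration and are not taken from the project code.

    # Hypothetical sketch of the driver microservice's entry point.
    # Service URLs, ports, and field names are illustrative only.
    import requests
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    STT_URL = "http://stt-service:5001/transcribe"    # assumed endpoint
    VTT_URL = "http://vtt-service:5002/extract-text"  # assumed endpoint

    @app.route("/analyze", methods=["POST"])
    def analyze():
        # The recorded file's location arrives in the POST body.
        body = request.get_json(silent=True) or {}
        file_path = body.get("filePath")
        if not file_path:
            return jsonify({"error": "filePath is required"}), 400

        # Fan out to the speech-to-text and video-to-text microservices.
        stt_result = requests.post(STT_URL, json={"filePath": file_path}).json()
        vtt_result = requests.post(VTT_URL, json={"filePath": file_path}).json()

        # Results are later formatted into the database schema (see below).
        return jsonify({"stt": stt_result, "vtt": vtt_result})

    if __name__ == "__main__":
        app.run(port=5000)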

In the stt microservice, the audio is stripped from the recorded video file and segmented via an algorithm that detects sections of audio where noise is occurring. Each of these audio segments is then run through Mozilla's DeepSpeech, a speech-to-text library with state-of-the-art accuracy and speed. Once the output from DeepSpeech is received, the resulting string is passed through grammar and punctuation checking and indexed by the timestamp in the recording where it occurs. This ensures that the data is readable and searchable for future querying.
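A minimal sketch of that pipeline is shown below, assuming pydub for silence-based segmentation and the DeepSpeech Python bindings; the model file, thresholds, and output layout are illustrative assumptions rather than the project's actual parameters.

    # Sketch of the stt pipeline: segment on non-silence, transcribe each
    # segment with DeepSpeech, and index results by timestamp.
    import numpy as np
    import deepspeech
    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    MODEL_PATH = "deepspeech-0.9.3-models.pbmm"  # assumed model file

    def transcribe(recording_path):
        model = deepspeech.Model(MODEL_PATH)

        # Strip the audio track and normalize to DeepSpeech's expected
        # format (16 kHz, mono, 16-bit PCM).
        audio = (AudioSegment.from_file(recording_path)
                 .set_frame_rate(16000)
                 .set_channels(1)
                 .set_sample_width(2))

        # Find spans where noise/speech occurs (thresholds are illustrative).
        spans = detect_nonsilent(audio, min_silence_len=500, silence_thresh=-40)

        results = []
        for start_ms, end_ms in spans:
            segment = audio[start_ms:end_ms]
            samples = np.frombuffer(segment.raw_data, dtype=np.int16)
            text = model.stt(samples)
            # Index each transcription by its timestamp within the recording.
            results.append({"timestamp_ms": start_ms, "text": text})
        return results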

In the vtt microservice, the video file is processed frame by frame. Each frame is passed through a series of pre-processing steps that crop and grayscale the image and make any text more distinguishable from the background. After this pre-processing, the frame is passed to Google's Tesseract OCR, an optical character recognition library that identifies text within images and returns its contents as plaintext. As in the stt microservice, the resulting text is then checked for grammar and punctuation and indexed according to the timestamp of that specific frame within the recording.
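The frame loop might look roughly like the sketch below, assuming OpenCV for decoding and pre-processing and pytesseract as the Tesseract wrapper; the crop region, frame sampling rate, and threshold values are illustrative assumptions.

    # Sketch of the vtt frame loop: sample frames, pre-process, run OCR,
    # and index any recognized text by its timestamp.
    import cv2
    import pytesseract

    def extract_text(video_path, frame_step=30):
        capture = cv2.VideoCapture(video_path)
        fps = capture.get(cv2.CAP_PROP_FPS) or 30.0

        results = []
        frame_index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if frame_index % frame_step == 0:
                # Crop to the lower third, where on-screen banners typically sit.
                height = frame.shape[0]
                cropped = frame[int(height * 2 / 3):, :]
                # Grayscale and threshold to separate text from the background.
                gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
                _, binary = cv2.threshold(gray, 0, 255,
                                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
                text = pytesseract.image_to_string(binary).strip()
                if text:
                    # Index by the timestamp of this frame within the recording.
                    results.append({"timestamp_ms": int(frame_index / fps * 1000),
                                    "text": text})
            frame_index += 1
        capture.release()
        return results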

Once both the stt and vtt microservices have reported their output, the driver microservice formats the output data into the correct database schema and stores it appropriately across a number of MySQL databases for future querying.
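The storage step might look like the following sketch, assuming mysql-connector-python and a hypothetical transcripts table, and writing to a single database for simplicity where the real system spreads data across several MySQL databases; the actual schema, table names, and connection details belong to DigiClips and are not reproduced here.

    # Sketch of the storage step with a hypothetical `transcripts` table.
    import mysql.connector

    def store_results(recording_id, stt_results, vtt_results):
        connection = mysql.connector.connect(
            host="localhost", user="digiclips",
            password="...", database="media_text"  # placeholder credentials
        )
        cursor = connection.cursor()
        insert = ("INSERT INTO transcripts "
                  "(recording_id, source, timestamp_ms, text) "
                  "VALUES (%s, %s, %s, %s)")

        # Tag each row with its source so queries can distinguish spoken
        # audio from on-screen text.
        for row in stt_results:
            cursor.execute(insert, (recording_id, "stt",
                                    row["timestamp_ms"], row["text"]))
        for row in vtt_results:
            cursor.execute(insert, (recording_id, "vtt",
                                    row["timestamp_ms"], row["text"]))

        connection.commit()
        cursor.close()
        connection.close()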