Live video captioning based on speech



The pandemic created many communication gaps in education. During the Covid-19 lockdown, classes moved online, and many students could not keep up with the pace of the speakers or teachers, so they missed many concepts. This is where video captioning plays an important role. With live captioning, students can read what the speaker is saying as it is said, which helps them follow the lecture. YouTube already provides a captioning option, and it is very beneficial. Live captioning also helps in online meetings in the corporate world, where it is important to understand what is being discussed. For this project, I used JavaScript, HTML, CSS, Python, and Flask to build a web application that transcribes live video.


The flow of the work is shown in the diagram below:

As the flow diagram shows, we first collect the live audio during a live meeting or online lecture. Second, we generate text from the speech. Third, we apply an artificial intelligence algorithm for spelling correction. After this step, we have the caption recognized from the live video. The next step is to integrate the text with the video again. Once all the steps are done, the captions appear in our web application. I hope this flow gives you an idea of how our web application works and how captions are created in such applications.

Video Part

In this part, I will show how we capture the live video using HTML, CSS, and JavaScript.

The code shown below captures the live video in the browser. We can change the height and width of the video frame by changing the corresponding parameters in the code, and we can also set the frame rate.

Code Piece:

Code for live video

Audio Part

In this part, we explain how to get the live speech from the video.

Code piece:

Code for taking live audio

In the above code, I am using the browser's SpeechRecognition interface, which is part of the Web Speech API, to recognize the voice.

Recognition is started by calling recognition.start(); setting recognition.continuous = true keeps the recognizer listening across pauses instead of ending after the first result. For the acknowledgment, I have made a function that notifies you when the recognition has started.

To stop the recognition, we call recognition.stop(); with recognition.continuous = false, the recognizer also ends by itself after a single result. I have also made a function that notifies you when the recognition stops.


This is the part where we get our text from the speech. The speech recognition interface available in JavaScript returns the recognized text directly.

For that, I have used a variable called “content” in the above code snippet. Because the text arrives continuously, I keep the variable up to date by appending each newly recognized piece of text to “content”.

To show how accurate the recognized text is before the AI algorithm is applied, we print the “content” variable in a text area created with HTML.

Code piece:

In the above code snippet, you can see how I created the text area, in which we print the text stored in the variable “content”. The snippet also includes the code for the buttons that start and stop the speech recognition.

Now we will see how I applied the algorithm for spelling correction.

Algorithm for spelling correction

For the spelling correction task, we use libraries from natural language processing (NLP), which is a sub-domain of Artificial Intelligence (AI). Many openly available libraries can handle this kind of task. Some libraries that can correct spelling in sentences are:

1. TextBlob Library

2. SparkNLP library

3. Spello Library

There are many other libraries, but these are the three I tried. Of these, I found that Spello gave the most accurate output, and it is also very easy to integrate.

About Spello Package:

Spello is a spell correction model formed by combining two models:

1. Phoneme Model

2. Symspell Model

Phoneme Model: It uses the Soundex algorithm (an algorithm for indexing names by sound, as pronounced in English) in the background and suggests correct spellings by identifying similar-sounding words.
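To make the phonetic idea concrete, here is a minimal sketch of the classic American Soundex code in Python. It only illustrates the algorithm named above and is not Spello's internal implementation; the function name soundex is my own.

```python
# Letter groups that share a Soundex digit.
SOUNDEX_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character American Soundex code of `word`."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate equal digits.
            continue
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # Vowels have no digit, so they reset `previous` and allow repeats.
        previous = digit
        if len(code) == 4:
            break
    return code.ljust(4, "0")
```

Words that sound alike, such as "Robert" and "Rupert", receive the same code, which is how a phonetic model can propose a correctly spelled word for a misspelling that sounds the same.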

Symspell Model: It uses the concept of edit distance to suggest correct spellings. Spello gets you the best of both, taking the context of the word into consideration as well. Both Soundex and SymSpell are classical algorithms; they do not rely on neural networks such as LSTMs, BLSTMs, or RNNs.
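The edit-distance idea can be sketched in a few lines of Python. This is a plain Levenshtein distance with a brute-force search over a small vocabulary, shown only as an illustration; SymSpell itself reaches the same kind of suggestion much faster by precomputing deletions. The names edit_distance and suggest are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or `word` if none is close enough."""
    candidates = [(edit_distance(word, known), known) for known in vocabulary]
    candidates = [pair for pair in candidates if pair[0] <= max_distance]
    return min(candidates)[1] if candidates else word
```

For example, suggest("plai", ["play", "want", "cricket"]) picks "play", because only one substitution separates the two words.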

This model can correct text in two languages, English and Hindi. Now let’s see how we can use this package for our project.

Step 1: Installation of library

To install the library, we run “pip install spello” in a Jupyter notebook or Colab file, or, if we want to install it on our system, we run the command in the command prompt.

Code to install the library

I used Jupyter Notebook for this process. This is how we can install the package.

Step 2: Model Initialization

To initialize the installed model, we first need to import it. The code to import the model is given in the code piece below.

Code to import library

This is how we can initialize the model; here we are using “en” for the English language.

Step 3: Model Training/Create a new model

After initializing the model, we need to train it. We can train the model by giving it a list of sentences. The code for training the model is given below.

Code to train model

We can pass any number of sentences or words in this manner to train the model.
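Steps 2 and 3 can be sketched together as follows, assuming the SpellCorrectionModel class and its train method as documented in the spello README. The helper name build_model and the sample sentences are hypothetical; a real training corpus should be much larger.

```python
# Hypothetical sample corpus, only for illustration.
TRAINING_SENTENCES = [
    "I want to play cricket",
    "this is a text corpus",
]


def build_model(sentences, language="en"):
    """Create a spello model for `language` and train it on `sentences`."""
    # Third-party import kept inside the function: needs `pip install spello`.
    from spello.model import SpellCorrectionModel

    model = SpellCorrectionModel(language=language)  # Step 2: initialize
    model.train(sentences)                           # Step 3: train
    return model
```

With spello installed, build_model(TRAINING_SENTENCES) returns a trained model ready for the next steps.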

Step 4: Save the model

After training the model, we need to save it. The code to save the model is given below.

Code to save the model

This is how we can save the model.

Step 5: Load Model

Spello provides several pre-trained models. We can download one and use it whenever required, or we can use our own model trained with the above process.

We can load the model with this code.

Code to load the model
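A minimal sketch of Steps 4 and 5, assuming the save(model_save_dir=...) and load(path) methods shown in the spello README, where the saved file is loaded back as model.pkl inside the chosen directory. The helper names save_model and load_model are my own.

```python
import os


def save_model(model, save_dir):
    """Save a trained model into `save_dir` and return the expected file path."""
    os.makedirs(save_dir, exist_ok=True)
    model.save(model_save_dir=save_dir)
    # Assumption from the spello README: the file is named model.pkl.
    return os.path.join(save_dir, "model.pkl")


def load_model(model_path, language="en"):
    """Load a saved (or downloaded pre-trained) model from `model_path`."""
    # Third-party import kept inside the function: needs `pip install spello`.
    from spello.model import SpellCorrectionModel

    model = SpellCorrectionModel(language=language)
    model.load(model_path)
    return model
```

The path returned by save_model can be passed straight to load_model, and the same load_model call works for a downloaded pre-trained model file.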

Step 6: Test the model

We can test the model using the code given below.

This is how we can test our model

Here in the output, we can see that we get three outputs:

1. Original Text

2. Spell corrected Text

3. A dictionary containing the words that were corrected.
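The three outputs can be read as in the sketch below, assuming the result keys original_text, spell_corrected_text, and correction_dict that the spello README shows for spell_correct. The helper name correct_text is hypothetical, and model is any trained or loaded spello model.

```python
def correct_text(model, text):
    """Run spell correction and return (corrected text, corrections made)."""
    result = model.spell_correct(text)
    # result["original_text"] holds the unchanged input text.
    corrected = result["spell_corrected_text"]
    corrections = result["correction_dict"]  # maps wrong word -> fixed word
    return corrected, corrections
```

Keeping the correction dictionary next to the corrected text makes it easy to check which words the model changed in a caption.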

So this is how the spello library works.

Now we have two separate parts: one is our JavaScript, HTML, and CSS part, and the other is the algorithm part.

Now we need to merge the two parts of the project.

For that, we use Flask, a Python web framework, to host our algorithm, and we send the text in the “content” variable to the Flask application through an API call.

Code piece:

Code to send URL with content

In the above code, I am sending the text in the URL, which will be received by the Flask application.

Here, by using the “” function, we call the URL that is written inside the function.
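In the project the request is made from JavaScript, but the idea of carrying the text inside the URL can be illustrated in Python. The sketch below encodes the text as a query parameter and decodes it again; the endpoint http://localhost:5000/correct and the parameter name text are assumptions made for this example, not the names used in the project.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_correction_url(base_url, content):
    """Attach `content` to `base_url` as a URL-encoded `text` parameter."""
    return f"{base_url}?{urlencode({'text': content})}"


def read_text_from_url(url):
    """Recover the original text from a URL built by build_correction_url."""
    values = parse_qs(urlsplit(url).query).get("text", [""])
    return values[0]
```

Encoding matters because recognized speech contains spaces and may contain characters such as "&" that would otherwise break the URL.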

Now we will see how we applied the algorithm in Flask. In the Flask application, we first get the text from the URL and then run the trained model on the received text. This gives us the corrected text.

Then we print the corrected text and check the output.

Code piece:

Flask code in which the text is processed

In the above code, we load the model and apply it to the text that we get from the URL.
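A minimal sketch of such a Flask application is shown below. The route /correct, the query parameter text, and the factory name create_app are assumptions for illustration; model is a trained or loaded spello model, and the result key spell_corrected_text follows the spello README.

```python
def create_app(model):
    """Build a Flask app that spell-corrects text received in the URL."""
    # Third-party import kept inside the function: needs `pip install flask`.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/correct")
    def correct():
        # Read the text sent in the URL, e.g. /correct?text=helo+wrld
        text = request.args.get("text", "")
        result = model.spell_correct(text)
        return jsonify(corrected=result["spell_corrected_text"])

    return app
```

Passing the model into the factory means it is loaded once at startup rather than on every request, which keeps each captioning call fast.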

Video without captioning

Now we will see the importance of this work.

In the above GIF, we can see that the girl is trying to say something, but we cannot tell what she is trying to say.

Video with captioning

But now we are able to understand what the girl is trying to say.

The output of our work:

Here are some screenshots of our project.

UI of our project.

The above screenshot shows our UI, in which we capture both video and audio.

There are two buttons here, “Start” and “Stop”. To start recognizing the sound, we click the “Start” button, and to stop the recognition, we click the “Stop” button.

With recognition started

Here we can see that, after the recognition starts, the recognized text is displayed in the text area.

The recognized text is then sent to Flask for processing. In Flask, the text is collected from the URL and then processed by the trained model.

Then the corrected text is shown below.

corrected text

In the above image, you can see that the text is displayed on the screen after being processed. In the future, we plan to build a much better UI in which the corrected text is displayed directly.

So this is how our project works.

You can see a detailed explanation of our project at the following video link.

Video of detailed explanation


Thanks for reading. We hope this content helps you.

For any queries, you can contact us on LinkedIn:

Harshal Faldu:

Nipun Parekh:

Hope you have enjoyed reading this blog :)
