
Matching Conversational Speech and Text Considering Connection of Sentences

School education increasingly emphasizes training students' ability to communicate in foreign languages. In our target application, a teacher searches an educational language video for the part that uses the syntax or vocabulary the teacher wants to teach, and shows that part to the students as a practical example. To realize this application, we propose a method that finds the part of a video clip corresponding to each sentence in a given textbook.

In this paper, we aim to find the audio segment corresponding to each sentence in the text. Because the audio is always synchronized with the image sequence in a given video clip, finding an audio segment is equivalent to finding the corresponding video segment. The target video clips of our method are language-learning TV programs about everyday conversation.

A previous study proposed a method for matching a drama movie with its scenario text every 0.5 seconds using DP matching based on features extracted from the image sequence, the audio, and the text. Another study proposed a method for matching the captions of a TV news program with its speech by applying syllable/phoneme HMMs constructed from the captions. These studies achieved matching in fine-grained units, such as phonetic units.

However, in these studies the matching tends to fail when a speaker pauses in the middle of a sentence, or when there is speech that is not transcribed in the text, such as a falter. Additionally, the former method assumes that there is always a pause between adjacent sentences, whereas in conversational speech it is common for two or more sentences to be spoken continuously.

For our application, it is sufficient to find the correspondence between the speech and the text at the sentence level. We therefore aim to match the text and the audio sentence by sentence while coping with the problems of the previous methods.

In this paper, we match patterns generated by concatenating two or more neighboring sentences or speech segments. If the speaker utters two or more sentences continuously without a pause, one speech segment must be matched to multiple sentences; conversely, if the speaker pauses in the middle of a sentence, one sentence must be matched to multiple speech segments. We therefore allow matches between multiple sentences and multiple speech segments. In addition, we fix the matches between sentence and speech-segment pairs in descending order of fitness, so that non-transcribed speech is not incorrectly matched to a sentence. The fitness is a weighted sum of the similarity of speech duration, the similarity of keywords, and a penalty on the number of concatenated sentences. Finally, even when multiple sentences are matched to one speech segment, the start and end times of each sentence are obtained by dividing the segment according to the ratio of the estimated speech duration of each sentence.

To investigate the effect of concatenation on matching accuracy, we compared the accuracy obtained with isolated sentences and speech segments against the accuracy obtained when concatenated patterns are generated. We regard a match as correct when the difference between the correct and estimated start and end times is within 0.5 or 1 second.
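The procedure above can be sketched in code. This is a minimal illustration, not the thesis's actual implementation: the weights, the Jaccard keyword similarity, the min/max duration similarity, the maximum run length, and all function names are assumptions made for the sketch.

```python
def runs(n, max_len=3):
    """Index ranges (i, j) covering runs of 1..max_len consecutive items."""
    return [(i, j) for i in range(n)
            for j in range(i + 1, min(i + max_len, n) + 1)]

def fitness(est_dur, seg_dur, n_sents, kw_sent, kw_seg,
            w_dur=1.0, w_kw=1.0, w_pen=0.1):
    """Weighted sum of duration similarity and keyword similarity,
    minus a penalty on the number of concatenated sentences (assumed form)."""
    dur_sim = min(est_dur, seg_dur) / max(est_dur, seg_dur)
    union = kw_sent | kw_seg
    kw_sim = len(kw_sent & kw_seg) / len(union) if union else 0.0
    return w_dur * dur_sim + w_kw * kw_sim - w_pen * (n_sents - 1)

def match(sent_durs, sent_kws, segs, seg_kws, max_len=3):
    """Greedy matching in descending order of fitness.

    sent_durs: estimated spoken duration per sentence
    sent_kws:  keyword set per sentence
    segs:      (start, end) times of detected speech segments
    seg_kws:   keyword set recognized in each segment
    Each sentence and segment is used at most once.
    """
    cands = []
    for si, sj in runs(len(sent_durs), max_len):
        for gi, gj in runs(len(segs), max_len):
            est = sum(sent_durs[si:sj])
            dur = segs[gj - 1][1] - segs[gi][0]
            kws = set().union(*sent_kws[si:sj])
            kwg = set().union(*seg_kws[gi:gj])
            f = fitness(est, dur, sj - si, kws, kwg)
            cands.append((f, (si, sj), (gi, gj)))
    cands.sort(reverse=True)
    used_s, used_g, result = set(), set(), []
    for f, (si, sj), (gi, gj) in cands:
        if any(k in used_s for k in range(si, sj)):
            continue
        if any(k in used_g for k in range(gi, gj)):
            continue
        used_s.update(range(si, sj))
        used_g.update(range(gi, gj))
        result.append(((si, sj), (gi, gj)))
    return result

def split_by_duration(seg_start, seg_end, sent_durs):
    """Divide one matched speech span among its sentences in proportion
    to their estimated durations, yielding (start, end) per sentence."""
    total = sum(sent_durs)
    spans, t = [], seg_start
    for d in sent_durs:
        t2 = t + (seg_end - seg_start) * d / total
        spans.append((t, t2))
        t = t2
    return spans
```

For example, two one-second sentences matched against a single two-second segment are matched as one concatenated pair, and `split_by_duration` then assigns each sentence half of the segment.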

We applied our method to 12 conversational video clips, each 24 to 71 seconds long and containing 15 to 28 sentences. Compared with matching single sentences to single speech segments, our method improves the total accuracy from 38.5% to 47.3% (by 8.8 points) when the acceptable error is 0.5 seconds, and from 48.3% to 59.6% (by 11.3 points) when it is 1 second.

As future work, since the order of sentences and speech segments does not change, their positions in the video clip should also be used in calculating the fitness. Moreover, in this paper we applied our method to only one English language-learning TV program, so it is necessary to evaluate the method on others.