
2. SPEECH CORPUS FUNDAMENTALS

A corpus is a collection of pieces of language text in
electronic form, selected according to external criteria to represent, as far
as possible, a language or language variety as a source of data for linguistic
research [4]. A speech corpus, or spoken corpus, is a database of speech audio
files and text transcriptions in a format that can be used to create acoustic
models, which can then be used with a speech recognition engine [5]. In a broad
sense, speech corpora fall into two types:

1. Read Speech – This includes book excerpts, broadcast
news, lists of words, and sequences of numbers.

2. Spontaneous Speech – This includes dialogs between two or
more people (including meetings); narratives, such as a person telling a story;
map tasks, in which one person explains a route on a map to another; and
appointment tasks, in which two people try to find a common meeting time based
on their individual schedules.

A special kind of speech corpus is the non-native speech
database, which contains speech with a foreign accent.

A speech corpus is the basis both for analyzing the
characteristics of the speech signal and for developing speech synthesis and
recognition systems. As computing power and speech technology have developed,
corpus content has become more complex and corpus size larger. One way to
select the speech content of a corpus is to derive it from a text corpus. For
example, WSJCAM0, a speech corpus of British English, was recorded at Cambridge
University from the Wall Street Journal text corpus [6]. Before recording a
speech corpus, careful selection of the vocabulary is important, since on
average each out-of-vocabulary word usually causes between 1.5 and 2
recognition errors [7]. The recognizer vocabulary is usually designed with the
goal of maximizing lexical coverage of the expected input. A straightforward
approach is to choose the N most frequent words in the training data, which
means that the usefulness of the vocabulary depends heavily on how
representative the training data are [8].
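
The top-N selection and its effect on coverage can be illustrated with a minimal
sketch. It assumes whitespace-separated tokens and uses toy English data; the
function names (build_vocabulary, oov_statistics, estimated_error_range) are
hypothetical and are not taken from the cited works. The error estimate simply
multiplies the out-of-vocabulary count by the 1.5 to 2 range quoted above.

```python
from collections import Counter


def build_vocabulary(training_tokens, n):
    """Return the set of the n most frequent word types in the training tokens."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(n)}


def oov_statistics(vocabulary, test_tokens):
    """Return (coverage, oov_count) for test_tokens measured against vocabulary."""
    oov_count = sum(1 for token in test_tokens if token not in vocabulary)
    coverage = 1.0 - oov_count / len(test_tokens) if test_tokens else 0.0
    return coverage, oov_count


def estimated_error_range(oov_count, low=1.5, high=2.0):
    """Rough error range, assuming 1.5 to 2 errors per out-of-vocabulary word."""
    return oov_count * low, oov_count * high


# Toy data (an assumption for illustration only).
training = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the sofa".split()

vocab = build_vocabulary(training, n=3)
coverage, oov = oov_statistics(vocab, test)
print(sorted(vocab), round(coverage, 3), oov, estimated_error_range(oov))
```

If the test text differs in topic or style from the training text, the measured
coverage drops, which is the dependence on representativeness noted above.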

Speech recognition systems can be categorized by several parameters, the most
influential being speech type, speaker dependency, and vocabulary size. The
importance of these parameters depends on the typical design considerations of
a recognition system, which may be closely tied to a specific application or
task [9].

In terms of speech type, a recognizer usually has to deal with isolated
(discrete), connected, or continuous speech. Discrete speech requires a
significant pause between words, perhaps 250 milliseconds. A single utterance
may consist of a single word or a short string of isolated words. In continuous
speech recognition, fluent speech flows with a rhythm and the words run into
each other, which makes recognition harder. Between these two, connected speech
recognizers do not require a pause between inputs but are able to detect word
boundaries within a string of connected speech. They do, however, require the
user to enunciate each word carefully, as in dictation. Although much of the
literature treats connected words and continuous words as interchangeable
terms, the wide diversity of applications makes it necessary to define
connected words separately. In a speech recognition task, the difference in
classification between “connected words” and “continuous speech” is somewhat
technical: a connected word recognizer uses words as recognition units, which
can be trained in an isolated word mode. Systems of this type are particularly
efficient in dictation and voice command recognition.

Discrete, connected, and continuous speech recognition systems can be
classified further as either speaker-dependent or speaker-independent.
Speaker-dependent systems require each speaker to enter several samples of each
word in the vocabulary to form the reference templates [10].

Another important consideration in designing a speech corpus is its vocabulary
size. The adjectives “small”, “medium” and “large” are applied to vocabulary
sizes of the order of 100, 1000 and (over) 5000 words, respectively. A typical
small vocabulary recognizer can recognize only the ten digits; a typical large
vocabulary recognition system can recognize 20000 words [9]. For dictation and
voice command recognition, a medium-sized vocabulary may be sufficient for
satisfactory performance. This is supported by the study of Gould, Conti, and
Hovanyecz [10], who examined the feasibility of a limited-capability automatic
dictation machine by simulating it in isolated and connected speech modes with
various vocabulary sizes. In their experiment, users composed and edited
letters with the simulated voice recognizer, which had either a 1000-word
vocabulary or an unlimited vocabulary. The 1000-word vocabulary was composed of
the 1000 most frequently used English words. A subsequent analysis indicated
that roughly 75% of the words used in the letter-writing task were available in
the 1000-word vocabulary.
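
The pause criterion for discrete speech described in this section can be made
concrete with a small sketch. It assumes that a voice activity decision
(speech or silence) is already available for each frame, that frames are 10 ms
long, and that 250 ms is the minimum pause; the function name split_on_pauses
and these defaults are assumptions for illustration, not part of any cited
system.

```python
import math


def split_on_pauses(is_speech, frame_ms=10, min_pause_ms=250):
    """Split frame-level speech/silence flags into word segments.

    Returns a list of (start_frame, end_frame) pairs, end exclusive. Silent
    runs shorter than min_pause_ms are kept inside the surrounding segment.
    """
    min_pause_frames = math.ceil(min_pause_ms / frame_ms)
    segments = []
    start = None        # first frame of the segment being built
    last_speech = None  # index of the most recent speech frame
    for i, flag in enumerate(is_speech):
        if not flag:
            continue
        if start is None:
            start = i
        elif i - last_speech - 1 >= min_pause_frames:
            # The silent gap was long enough to count as a word boundary.
            segments.append((start, last_speech + 1))
            start = i
        last_speech = i
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments


# 200 ms speech, 300 ms pause, 200 ms speech, 100 ms pause, 200 ms speech.
frames = [True] * 20 + [False] * 30 + [True] * 20 + [False] * 10 + [True] * 20
segments = split_on_pauses(frames)
print(segments)
```

In this toy input only the 300 ms gap is treated as a boundary; the 100 ms gap
is too short, so the last two bursts stay in one segment. A connected or
continuous speech recognizer cannot rely on such pauses and has to find word
boundaries by other means.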

 

3. BdNC01 CORPUS AND DATABASE DESIGN

The BdNC01 corpus is a text corpus collected from the web editions
of several influential Bangla newspapers during 2005-2011. BdNC01 contains a
large amount of Bangla text, including more than 11 million word tokens. For
this work, a program was developed in C to parse and sort the text in the
BdNC01 corpus; the result was a list of words with their frequency of
occurrence in the text. The objective of this processing was to select a list
of the 1000 or more most frequent words, so that the list is a good
representative of the language and can be used to construct a significant
connected speech database. Part of the list is shown in Table-1, and the 1000
most frequent words were used to find some practical Bangla sentences. From
three randomly selected issues of daily newspapers, 52 sentences were chosen
such that they include these high-frequency words. The resulting list of
sentences was accepted for a small-to-medium vocabulary speech database and
comprises 252 different words in 343 places. A special characteristic of this
list is that some words occur in multiple places in different contexts.
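
The processing described above can be sketched as follows. This is not the
authors' C program; it is a minimal Python illustration that assumes
whitespace tokenization (real newspaper text would also need punctuation
handling) and uses placeholder tokens instead of Bangla words. The function
names frequency_list and sentence_statistics are hypothetical.

```python
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs sorted by descending count, then by word."""
    counts = Counter(text.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def sentence_statistics(sentences, frequent_words):
    """Return (distinct_words, total_tokens, tokens_in_frequent_list)."""
    tokens = [word for sentence in sentences for word in sentence.split()]
    in_list = sum(1 for word in tokens if word in frequent_words)
    return len(set(tokens)), len(tokens), in_list


# Placeholder corpus and sentences (assumptions for illustration only).
corpus_text = "a b a c a b d"
ranked = frequency_list(corpus_text)
top_words = {word for word, _ in ranked[:2]}  # the paper uses the top 1000

stats = sentence_statistics(["a b a", "b d"], top_words)
print(ranked, stats)
```

The first two values returned by sentence_statistics correspond to the kind of
figures reported for the selected sentences (252 different words in 343
places); the third shows how many of those places are filled by words from the
high-frequency list.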

