From: Tony Robinson 
To:  htk-users@eng.cam.ac.uk
Subject: ANN: HTKtimit.sh - TIMIT train and test triphones with a bigram grammar

I've written a simple bash script that will train up standard HTK state 
clustered triphones on TIMIT.  The main aims of this are to serve as a 
reference implementation for standard HTK triphones and also as a basis 
for testing new acoustic parameterisations.

I chose TIMIT for a reference implementation as it is well respected, 
readily available, reasonably priced and very well established.  The HTK 
tutorial goes through all the necessary steps, but has no reference 
data.  The RM recipe requires RM and has very few errors on account of 
the artificial grammar.  In contrast, TIMIT is a phone recognition task 
so produces lots of errors yet is still very quick to train.

Some design decisions were:

1) To use the full set of 61 TIMIT symbols instead of mapping them down 
into something smaller (about 45).  The argument for the full 61 symbols 
is that it is standard and simple; the argument for a reduced set is that 
it is closer to what is normal in speech recognition based on phonetic 
dictionaries, i.e. no special closure symbols.  However, many of the 
reduced mappings kept the closure symbols, so simplicity ruled.

2) To test on the core test set (by default).  It turns out that 
triphone recognition with a bigram grammar using HVite is quite slow, so 
I made this compromise to get results out faster.  Note that there are 
significant random fluctuations in accuracy as the number of Gaussians 
per mixture is increased which can only be seen by testing the different 
model sets.

3) To provide monophone results as well as triphone.  As TIMIT is small 
and well labeled, monophone accuracy is almost as good as triphone.  All 
but the simplest monophone models may be bypassed by setting NMIXMONO=1 
in the script.

4) Not to provide "no grammar" results in addition to bigram.  I 
personally don't think that "no grammar" results are very meaningful as 
too many details provide implicit priors and affect the result.  I think 
the stronger bigram grammar is more robust to this.

5) Not to tweak HVite's -p or other parameters for best performance.  The 
aim of this script isn't the world's most accurate TIMIT recogniser, but 
to act as a reference for people wanting to build similar systems. 
Simplicity rules again.

6) To exclude h# from the results.  h# always appears at the start and 
end of utterances - so it really is nothing to do with HMM training. 
HVite and friends like utterances to be delimited by sil as a monophone, 
so it seems both natural and right to exclude h#.

7) To make the whole project a single script.  Many configuration files 
are needed but I decided that creating them on the fly as they were 
required both indicated when the configuration files were needed and 
also reduced the mystique of hidden variables buried in config files 
that were never seen.

Expect accuracies of 58% for monophones and 60% for triphones.

The homepage for this project is at 
http://www.cantabResearch.com/HTKtimit.html and the first release of the 
code is at http://cantabResearch.dnsalias.com/HTKtimit.sh



Tony
http://cantabResearch.com

---------------------------------------------------------------------------------

Date: Sun, 14 May 2006 12:08:01 +0100
From: Tony Robinson 
To:  htk-users@eng.cam.ac.uk
Subject: Re: HVite: using bigram probabilities with triphones

Thanks to expert help off-list I've now found out why HVite didn't like 
my triphone set.

I mistakenly thought that my one monophone, h#, was going to occur at 
the start and end of the sentence, hence all triphones would have a 
valid acoustic context (h#, h#-a+b, ..., y-z+h#, h#).   When I fixed 
things up so that what I thought should be happening was really true, 
HVite worked as it should.
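
To make the intended label sequence concrete, here is a minimal sketch of 
the context expansion, with h# kept as a monophone at both ends.  The 
phone sequence and the file name sample.tri are made up for 
illustration; in a real setup HTK's own tools would do this step.

```shell
# Hypothetical utterance: h# sh iy ax h#, one label per line
printf '%s\n' h# sh iy ax h# |
awk '
    { p[NR] = $1 }                                  # remember every label
    END {
        for (i = 1; i <= NR; i++) {
            if (i == 1 || i == NR)
                print p[i]                          # boundary label stays a monophone
            else
                print p[i-1] "-" p[i] "+" p[i+1]    # left-centre+right
        }
    }' > sample.tri

cat sample.tri
```

Every interior label then has both a left and a right context, and only 
the two boundary labels have none.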

I had previously searched the archives going back several years and found 
many people getting this far, asking the same question, and not getting 
an answer on the list, so let me spell it out here.

HVite can only recognise models at the start of an utterance that have 
no left context (and at the end with no right context).   The easiest 
way around this in the TIMIT task is to train up a silence phone, h#, as 
a monophone and force it to occur at the start and end.   TIMIT was 
designed so that all utterances start and end with the symbol h#, so 
nothing is lost.
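
A quick way to confirm that a transcription really is delimited by h# 
is to look at its first and last labels.  This is only a sketch: 
sample.phn is a made-up file in the TIMIT .phn layout (start sample, 
end sample, label), and the variable names are mine.

```shell
# Hypothetical transcription in TIMIT .phn layout
cat > sample.phn <<'EOF'
0 2260 h#
2260 4149 sh
4149 5349 iy
5349 7470 h#
EOF

# The label is the third field of each line
first=$(head -n 1 sample.phn | awk '{print $3}')
last=$(tail -n 1 sample.phn | awk '{print $3}')

if [ "$first" = "h#" ] && [ "$last" = "h#" ]; then
    status=ok
else
    status=bad
fi
echo "sample.phn: $status"
```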

HLStats generates a bigram lattice assuming the symbol before the "text" 
started is !ENTER and after it !EXIT.   So we can delete the initial and 
final h# from the transcription and HLStats will put !ENTER and !EXIT 
symbols in their place for us.   So if train.wrd lists all the modified 
phone labels (e.g. with egrep -v 'h#$') then:

 HLStats -T 1 -b bigfn -o -I trainMono.mlf -S train.wrd monophones
 HBuild -T 1 -n bigfn monophones-h#+ENTER+EXIT monophones-h#+ENTER+EXIT outLatFile

This produces a valid lattice file.   The file monophones-h#+ENTER+EXIT is 
the monophones file without the h# and with !ENTER and !EXIT.   
Similarly the dictionary doesn't have h# as a "word" but instead has the 
two lines:

!ENTER  [] h#
!EXIT   [] h#
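
The three inputs described above (labels without h#, the symbol list 
and the dictionary) can be prepared with standard tools.  The sketch 
below uses a made-up three-phone inventory and made-up file names 
(sample.lab, sample.nosil.lab, dict), and it assumes that each remaining 
phone is entered in the dictionary as a "word" pronounced as itself.

```shell
# Hypothetical inputs: a tiny phone inventory and one label sequence
printf '%s\n' h# sh iy ax > monophones
printf '%s\n' h# sh iy ax h# > sample.lab

# Drop the initial and final h# so that HLStats supplies !ENTER and !EXIT
grep -v 'h#$' sample.lab > sample.nosil.lab

# Symbol list: the monophones without h#, plus !ENTER and !EXIT
{ grep -v '^h#$' monophones; printf '%s\n' '!ENTER' '!EXIT'; } \
    > 'monophones-h#+ENTER+EXIT'

# Dictionary: one self-pronounced entry per phone (an assumption), then
# !ENTER and !EXIT mapped to h# with the empty output symbol []
{
    grep -v '^h#$' monophones | awk '{print $1, $1}'
    printf '%s\n' '!ENTER  [] h#' '!EXIT   [] h#'
} > dict

cat dict
```

The file name containing # is quoted here purely as a precaution; a # in 
the middle of a word does not start a shell comment.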

Now when HVite is called as:

HVite -A -D -T 1 -C hvite.config -H tri3/MODEL -S test.scp \
      -i recout_bigram.mlf -w outLatFile -p 0.0 -s 5.0 ../dict tiedlist

and hvite.config contains
FORCECXTEXP = T
ALLOWXWRDEXP = T

Then HVite starts with the !ENTER symbol, with pronunciation h#, then 
can visit any of the triphones with bigram constraints, ending up in 
!EXIT which has pronunciation h# again.   The use of [] in the dictionary 
ensures that we don't see h# in the output and then score it as 
correctly recognised.
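
The hvite.config used above is small enough to write out immediately 
before the HVite call.  The two settings are the ones quoted in this 
message; creating the file with a here-document is just one way of 
doing it.

```shell
# Write the two context-expansion settings quoted above
cat > hvite.config <<'EOF'
FORCECXTEXP = T
ALLOWXWRDEXP = T
EOF

cat hvite.config
```

HVite then picks the file up through -C hvite.config, as in the command 
line above.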

I'm happy to help people sort this out by email for the next month or so 
until I have a clean script in place that does all this automatically 
(i.e. if you are reading this after June 2006 then please start at 
http://www.cantabResearch.com/HTKtimit.html).


Tony