Voiced and unvoiced regions
Voiced and unvoiced regions are detected in the speech segments only; the pause segments are unvoiced by default. To detect voiced and unvoiced regions, each speech segment is again subdivided into a sequence of frames. These frames are longer than those used by the pause-finding algorithms, namely 20 ms, and they overlap by half of the frame length, i.e. 10 ms. We define a frame to be voiced if three conditions hold: its mean energy exceeds a certain threshold, the absolute height of the frame’s maximum or minimum peak is above a given level, and the number of zero crossings counted in the frame is lower than a certain number. A voiced region is a sequence of consecutive voiced frames; similarly, an unvoiced region contains only unvoiced frames.
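The framing and the three-part voicing criterion described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the threshold parameters (`energy_thr`, `peak_thr`, `max_zero_crossings`) and any values passed to them are hypothetical, since the text does not give concrete thresholds. The mean energy here uses the plain sum-of-squares average rather than any more elaborate estimate.

```python
from itertools import groupby

import numpy as np


def split_into_frames(x, sr, frame_ms=20.0, hop_ms=10.0):
    """Cut x into 20 ms frames that overlap by half (10 ms hop)."""
    frame_len = int(round(sr * frame_ms / 1000.0))
    hop = int(round(sr * hop_ms / 1000.0))
    starts = range(0, len(x) - frame_len + 1, hop)
    return [x[s:s + frame_len] for s in starts]


def is_voiced(frame, energy_thr, peak_thr, max_zero_crossings):
    """Apply the three conditions; all thresholds are hypothetical inputs."""
    mean_energy = np.mean(frame ** 2)      # plain average of squared samples
    peak = np.max(np.abs(frame))           # absolute height of max or min peak
    negative = np.signbit(frame)
    zero_crossings = np.count_nonzero(negative[1:] != negative[:-1])
    return bool(
        mean_energy > energy_thr
        and peak > peak_thr
        and zero_crossings < max_zero_crossings
    )


def group_regions(flags):
    """Merge consecutive equal flags into (voiced, first_frame, last_frame)."""
    regions = []
    index = 0
    for voiced, run in groupby(flags):
        n = len(list(run))
        regions.append((voiced, index, index + n - 1))
        index += n
    return regions
```

A caller would compute `flags = [is_voiced(f, ...) for f in split_into_frames(x, sr)]` for each speech segment and pass the result to `group_regions` to obtain the voiced and unvoiced regions as frame index ranges.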
A zero crossing is a location in the speech signal where the sample value changes from positive to negative or vice versa. Voiced regions such as vowels, nasals, etc. exhibit a low number of zero crossings, whereas unvoiced regions, e.g. fricatives, usually have a rather high number of zero crossings.
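To make the contrast concrete, the following minimal sketch counts sign changes in one 20 ms frame of a low-frequency sine (a stand-in for a voiced sound) and in one frame of white noise (a stand-in for a fricative). The function name, the 16 kHz sampling rate and the test signals are assumptions chosen for illustration.

```python
import numpy as np


def count_zero_crossings(frame):
    """Count sign changes between consecutive samples.

    Samples equal to +0.0 are treated as positive (an arbitrary convention).
    """
    negative = np.signbit(frame)
    return int(np.count_nonzero(negative[1:] != negative[:-1]))


sr = 16000                      # assumed sampling rate in Hz
n = np.arange(int(0.020 * sr))  # one 20 ms frame, 320 samples

# Periodic, low-frequency signal: two periods of a 100 Hz sine.
voiced_like = np.sin(2 * np.pi * 100 * n / sr + 0.3)

# Noise-like signal: white Gaussian noise with a fixed seed.
unvoiced_like = np.random.default_rng(0).standard_normal(len(n))

zc_voiced = count_zero_crossings(voiced_like)
zc_unvoiced = count_zero_crossings(unvoiced_like)
print(zc_voiced, zc_unvoiced)
```

The sine changes sign only twice per period, whereas the noise changes sign at roughly every second sample, which is why an upper limit on the zero-crossing count can separate the two cases.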
To compute the mean energy of a frame, we use a more elaborate method than the standard approach, which sums the squares of the frame’s samples and divides the sum by the frame length. The standard approach is not precise if the frame length is not an integer multiple of the frame’s period, i.e. the inverse of its F0 [GLA 15]. The resulting error may distort both the voiced/unvoiced decision and the stable/unstable classification of a frame in a later step (see section 9.2.3), which is also based on the mean energy. However, as the period of a frame is not known at this stage of processing, we compute the mean energy over a range of window lengths, each of which corresponds to a different period length. An optimization step then finds the best window length for each frame. This procedure is similar to pitch-scaled harmonic filtering (PSHF) [ROA 07], where an optimal window length is calculated for finding harmonic and non-harmonic spectra. The window lengths are selected such that, for any F0 between 50 and 500 Hz, a small whole number of periods fits approximately into at least one of the selected windows. The selected window lengths correspond to fundamental frequencies of 50, 55, 60,..., 95 Hz. Each window is centered on the frame’s center position. The optimal window length is the one for which the mean energies of a small number of frames around the frame’s middle position show the least variation [GLA 15].
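The window-length search can be illustrated with the following Python sketch. It is one possible reading of the description above, not the implementation of [GLA 15]. Several details are assumptions made for illustration: evaluating each candidate window at the center positions of two neighboring frames on either side (`n_neighbors=2`), using the coefficient of variation as the measure of variation, clipping windows at the signal boundaries, and all function names.

```python
import numpy as np

# Candidate window lengths correspond to F0 = 50, 55, ..., 95 Hz.
CANDIDATE_F0_HZ = np.arange(50, 100, 5)


def windowed_mean_energy(x, center, win_len):
    """Mean of squared samples in a window of win_len centered at `center`."""
    start = max(0, center - win_len // 2)
    stop = min(len(x), start + win_len)
    segment = x[start:stop]
    return float(np.mean(segment ** 2)) if len(segment) else 0.0


def best_window_energy(x, sr, center, hop, n_neighbors=2):
    """Return (window length, mean energy) for the steadiest candidate.

    For each candidate length, the mean energy is computed at the frame
    center and at the centers of neighboring frames (assumed: +/- 2 frames).
    The candidate whose energies vary least (assumed measure: std / mean)
    is selected.
    """
    best = None
    for f0 in CANDIDATE_F0_HZ:
        win_len = int(round(sr / f0))
        centers = [center + k * hop for k in range(-n_neighbors, n_neighbors + 1)]
        energies = np.array(
            [windowed_mean_energy(x, c, win_len) for c in centers if 0 <= c < len(x)]
        )
        variation = np.std(energies) / (np.mean(energies) + 1e-12)
        if best is None or variation < best[0]:
            best = (variation, win_len, windowed_mean_energy(x, center, win_len))
    _, win_len, energy = best
    return win_len, energy
```

As a usage example, for a steady 110 Hz sine sampled at 16 kHz, `best_window_energy(x, 16000, center, hop=160)` should favor the 55 Hz window, because that window spans almost exactly two periods and its mean energy therefore barely changes from one center position to the next.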