A Parallel, Cognition-oriented Fundamental Frequency Estimation Algorithm
The fundamental frequency F0 plays an important role in human speech perception and is used in all fields of speech research. For instance, humans identify emotional states based on a few features, one of which is F0 [ROD 11]. Accurate estimates of F0 are also a prerequisite for prosody control in concatenative speech synthesis [EWE 10].
Chapter written by Ulrike Glavitsch.
Fundamental frequency detection has been an active field of research for more than 40 years. Early methods used the autocorrelation function and inverse filtering techniques [MAR 72, RAB 76]. In most of these approaches, threshold values are used to decide whether a frame is assumed to be voiced or unvoiced. More advanced methods incorporate a dynamic programming stage to calculate the F0 contour based on frame-level F0 estimates gained from either a conditioned linear prediction residual or a normalized cross-correlation function [SEC 83, TAL 95]. The normalized cross-correlation-based RAPT algorithm is also known as getf0. Praat’s well-known pitch detection algorithm calculates cross-correlation or autocorrelation functions and considers local maxima as F0 hypotheses [BOE 01]. YIN, a fundamental frequency estimator for speech and music with no upper limit on the frequency search range, uses the autocorrelation function and a number of modifications to prevent errors [DEC 02]. In the last decade, techniques like pitch-scaled harmonic filtering (PSHF), nonnegative matrix factorization (NMF) as well as time-domain probabilistic approaches have been proposed for F0 estimation [ACH 05, ROA 07, SHA 05, PEH 11]. In SAFE, F0 estimates are inferred from prominent signal-to-noise ratio (SNR) peaks in the speech spectra [CHU 12]. Pitch and probability-of-voicing estimates gained from a highly modified version of the getf0 (RAPT) algorithm are used in an automatic speech recognition system for tonal languages [GHA 14]. These recent methods achieve low error rates and high accuracies but at a high computational cost, either at run-time or during model training. These computational approaches generally disregard the principles of human cognition, which raises the question of whether F0 estimation can be performed equally well or better by taking these principles into account.
In this chapter, we propose an F0 estimation algorithm based on the elementary appearance and inherent structure of the human speech signal. A period, i.e. the inverse of F0, is primarily defined as the time distance between two successive maximum peaks or two successive minimum peaks, and we use the same term to refer to the speech section between two such peaks. Human speech is a sequence of alternating speech and pause segments. Speech segments are word flows uttered in one breath of air. The speech segments are usually much longer than the pause segments. In speech segments, we distinguish voiced and unvoiced parts. The speech signal is periodic in voiced regions, whereas it is aperiodic in unvoiced regions. The voiced regions can be further divided into stable and unstable intervals [GLA 15]. Stable intervals show a quasi-constant energy or a quasi-flat envelope, whereas unstable intervals exhibit significant energy rises or decays. On stable intervals, the F0 periods are mostly regular, i.e. the sequence of maximum or minimum peaks is more or less equidistant, whereas the F0 periods in unstable intervals are often shortened, elongated, doubled, or may show little similarity with their neighboring periods. Speech signals are highly variable and such special cases occur relatively often. Thus, it makes sense to compute F0 estimates in stable intervals first and use this knowledge to find F0 of unstable intervals in a second step. The F0 estimation method for stable intervals is straightforward as regular F0 periods are expected. The F0 estimation approach for unstable intervals computes variants of possible F0 continuation sequences and evaluates them for highest plausibility. The variants reflect the regular and all the irregular period cases and are calculated using a peak look-ahead strategy. We denote this F0 estimation method for unstable intervals as F0 propagation, since it computes and verifies F0 estimates by considering previously computed ones.
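The peak-based definition of a period can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the `min_height` threshold, and the assumption of a single dominant maximum peak per period are all hypothetical simplifications, and the input is a synthetic sine rather than real speech. On such a regular, stable-like signal, F0 follows directly from the mean distance between successive maximum peaks.

```python
import math

def local_maxima(signal, min_height):
    """Return indices of local maxima whose value exceeds min_height (assumed threshold)."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > min_height and signal[i - 1] < signal[i] >= signal[i + 1]
    ]

def f0_from_peaks(signal, sample_rate, min_height):
    """Estimate F0 as the inverse of the mean distance between successive maximum peaks."""
    peaks = local_maxima(signal, min_height)
    if len(peaks) < 2:
        return None  # fewer than two peaks: no period can be measured
    distances = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_period = sum(distances) / len(distances)  # period length in samples
    return sample_rate / mean_period

# Synthetic stand-in for a stable interval: a 200 Hz sine sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
f0 = f0_from_peaks(signal, sample_rate, min_height=0.5)
```

Real speech usually shows several local maxima per period, so a practical version would need a rule for selecting the relevant peaks. On unstable intervals the equidistance assumption breaks down, which is why a separate propagation step is needed there.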
It turns out that the whole F0 estimation can be performed in parallel on the different speech segments of a recording. The speech segments can be considered as separable units of speech that can be treated as computationally independent entities.
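Because the speech segments are independent, they map naturally onto a worker pool. The following sketch only illustrates that idea under stated assumptions: `estimate_segment_f0` is a hypothetical placeholder (a crude zero-crossing counter, not the chapter's estimator), and `estimate_all`, `tone`, and `max_workers` are names introduced here for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def estimate_segment_f0(samples, sample_rate):
    """Hypothetical per-segment estimator: counts upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    return crossings * sample_rate / len(samples)

def estimate_all(segments, sample_rate, max_workers=4):
    """Process each speech segment independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda seg: estimate_segment_f0(seg, sample_rate), segments))

def tone(freq, sample_rate, n):
    """Synthetic stand-in for a speech segment: a pure tone with a small phase offset."""
    return [math.sin(2 * math.pi * freq * i / sample_rate + 0.5) for i in range(n)]

sample_rate = 8000
segments = [tone(100, sample_rate, 8000), tone(200, sample_rate, 8000)]
results = estimate_all(segments, sample_rate)
```

A thread pool keeps the sketch simple. In CPython, threads do not speed up CPU-bound pure-Python code, so a process pool such as `concurrent.futures.ProcessPoolExecutor` would be the more likely choice for an actual speed-up.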
We consider the proposed algorithm as cognition-oriented inasmuch as it incorporates several principles of human cognition. First, human hearing is also a two-stage process. The inner ear performs a spectral analysis of a speech section, i.e. different frequencies excite different locations along the basilar membrane and, as a result, different neurons with characteristic frequencies [MOO 08]. This spectral analysis delivers the fundamental frequency and the harmonics. The brain then checks the information delivered by the neurons, interpolating and correcting it where necessary. Our proposed F0 estimation algorithm works in a similar way, in that the F0 propagation step proceeds from regions with reliable F0 estimates to those where F0 is not clearly known yet. We observed that F0 is very reliably estimated on high-energy stable intervals, which typically represent vowels. Thus, we always compute F0 for unstable intervals by propagation from high-energy stable intervals to lower energy regions. Second, we have adopted the hypothesis-testing principle of human thinking for generating variants of possible F0 sequences and testing them for the detection of F0 in unstable intervals [KAH 11]. Third, human cognition uses context to resolve ambiguous situations. For instance, in speech perception humans bear the left and right context of a word in mind if its meaning is ambiguous. In an analogous way, our algorithm looks two or three peaks ahead to find the next valid maximum or minimum peak for a given F0 hypothesis. Special cases in unstable intervals can very rarely be disambiguated by just looking a single peak ahead. Finally, performing the tasks of the F0 estimation algorithm in parallel on different speech segments is also adopted from human cognition. The human brain is able to process a huge number of tasks in parallel.
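The look-ahead idea can be sketched as follows. The text only states that two or three peaks are inspected for a given F0 hypothesis; the selection rule used here (closest match to the expected period within a relative `tolerance`) and all names are assumptions made for illustration, not the chapter's actual plausibility test.

```python
def next_peak_by_lookahead(last_peak, expected_period, candidate_peaks,
                           lookahead=3, tolerance=0.2):
    """Among the next `lookahead` peaks after last_peak, return the one whose
    distance best matches expected_period, or None if none is within tolerance.

    candidate_peaks is assumed to be a sorted list of sample indices.
    """
    upcoming = [p for p in candidate_peaks if p > last_peak][:lookahead]
    best, best_error = None, tolerance
    for p in upcoming:
        # Relative deviation of this peak distance from the hypothesized period.
        error = abs((p - last_peak) - expected_period) / expected_period
        if error <= best_error:
            best, best_error = p, error
    return best

# Hypothesis: period of 80 samples. The peak at 140 is a spurious intermediate peak.
peaks = [100, 140, 182, 260]
with_context = next_peak_by_lookahead(100, 80, peaks, lookahead=3)
single_peak = next_peak_by_lookahead(100, 80, peaks, lookahead=1)
```

With three peaks of context the sketch skips the intermediate peak and selects the one at 182. With a single peak of look-ahead it only sees the peak at 140 and finds no valid continuation, which mirrors the remark that one peak of context is rarely enough.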
The resulting algorithm is very efficient, thoroughly extensible and easy to understand, and it has been evaluated on a clean speech database. Recognition rates are better than those of a reference method that uses cross-correlation functions and dynamic programming. In addition, our algorithm structures the speech signal into speech and pause segments, voiced and unvoiced regions, and stable and unstable intervals. This structure may be useful for further speech processing, such as automatic text-to-speech alignment or automatic speech or speaker recognition.