Vol. 32 Issue 4 Reviews

Anssi Klapuri, Manuel Davy, Eds: Signal Processing Methods
for Music Transcription

Hardcover, 2006, ISBN-13 978-0-387-30667-4, US$ 139, 440 pages, illustrated, references, index; Springer, 233 Spring Street, New York, New York 10013, USA; telephone (+1) 212-460-1500 or (+1) 800-SPRINGER; fax (+1) 212-460-1575; electronic mail service-ny@springer.com; Web www.springer.com/engineering/signals/book/978-0-387-30667-4.

Reviewed by George Tzanetakis
Victoria, British Columbia, Canada

Automatic music transcription (AMT) from audio signals remains one of the biggest challenges in computer music analysis and information retrieval. During the last ten years, assisted by enormous advances in computer processing speed, interest in AMT has increased rapidly. This book is a timely addition to the literature on the topic and contains descriptions of state-of-the-art algorithms and systems in that area. AMT is a challenging, multi-faceted, interdisciplinary problem with many subtasks, which are covered in the book. Publications related to AMT appear in a variety of conferences and journals, making it difficult to track progress in the field. This book addresses that difficulty to a large degree, with its comprehensive bibliography of almost 700 entries and its extensive index. Hopefully such a great resource will stimulate more research in this exciting area.

The editors have done a good job of assembling chapters from leading experts in each subtask and organizing the book into a coherent whole. As is frequently the case with edited collections, however, the book is not as well integrated as single-author textbooks usually are. Therefore, the text may be more appropriate for researchers or graduate students familiar with the field than for newcomers.

There are four parts, each consisting of three chapters. Chapters 1 through 3 (Part I) define terminology and lay the foundations for understanding AMT algorithms and systems. Chapters 4 through 6 (Part II) deal with rhythm and timbre analysis, and Chapters 7 through 9 (Part III) with multiple fundamental frequency analysis. Parts II and III cover the majority of approaches, algorithms, and concepts needed to build AMT systems. The last part (IV) of the book describes three existing examples of such systems.

The first chapter provides a comprehensive, well-written introduction to the problem of music transcription and the different approaches to solving it. The chapter also provides a compact summary of all the topics covered in the book and could serve as a quick but thorough introduction to the field for someone who doesn't have the time to read the entire book.

The goal of Chapter 2 is to provide an introduction to the signal processing, statistics, and machine learning techniques that have been applied to music transcription. This is followed by Chapter 3, which describes sparse adaptive representations of audio signals. Together, these chapters provide a comprehensive overview of the majority of the techniques used in automatic music transcription systems. There is some imbalance, with certain topics described in more or less detail than necessary, but this is not a serious problem. Another criticism is that the descriptions are relatively dry and technical. This makes them more suitable for researchers familiar with the topics who need a quick overview than for readers who are encountering them for the first time. In addition, it would be nice if the authors had made more explicit connections to how these techniques are used in the subsequent chapters. More generally, I would have liked to see more links established in both directions between Part I and the other three parts of the book. I would also have liked a different structure, with one chapter devoted to audio representations (merging Section 2.1 of Chapter 2 with Chapter 3) followed by a chapter on statistics, estimation, and machine learning. Finally, I feel that a chapter providing basic information about perceptually informed approaches, covering topics such as auditory filter banks, masking, gestalt grouping cues, and computational auditory scene analysis, would make a valuable addition to Part I. Although these topics are covered in subsequent chapters, I feel that distilling their common fundamental elements, as was done for the signal processing methods, would make the introduction more balanced.

Chapter 4 is, in my opinion, one of the best chapters of the book and provides a compact introduction to beat tracking and musical meter analysis. I especially like the taxonomy of the different methods that have been proposed in the literature and the contrasting of different approaches. Chapter 5 deals with unpitched percussion transcription (also frequently referred to as audio drum transcription). As in the previous chapter, the authors categorize different percussion transcription systems and describe their common processing elements. The authors should probably have mentioned in this chapter early work in this area, such as the Ph.D. thesis "On the Automatic Transcription of Percussive Music: From Acoustic Signal to High Level Analysis" by Andrew Schloss (Stanford University, 1985), which is already listed in the references as entry [567]. The last chapter of Part II (Chapter 6) describes the automatic classification of pitched musical instrument sounds and nicely summarizes existing work in this area in Table 6.3. The description of classification techniques in Section 6.4 should be moved to the introduction, and there is some redundancy with Part I (for example, in the treatment of Mel-Frequency Cepstral Coefficients and Linear Discriminant Analysis, LDA).

Part III describes the core of what typically comes to mind when discussing a music transcription system: the detection of multiple fundamental frequencies. Chapter 7 describes estimation based on generative models. As the authors mention, this is a topic still in its infancy; however, these methods have conceptual appeal due to their elegant mathematical formulation. Another drawback is their computational complexity, although this can to some extent be addressed by domain-specific heuristics. Unfortunately, there are no comparisons or evaluations of the various systems described. The following chapter (8) is one of my favorites, as it is well written and covers the majority of systems that are based on auditory models. The concepts are introduced clearly, and the chapter could easily serve as a stand-alone introduction to auditory model processing, especially for pitch estimation. All the stages are clearly described, and it should be straightforward to understand and implement almost all of the techniques. In addition, there is more experimental evaluation than in the previous chapter. Chapter 9 deals with the slightly different problem of source separation in monaural music signals. This area of research is currently very active, but there is little large-scale evaluation of the described algorithms on real-world polyphonic music recordings. Therefore, the chapter is mostly descriptive in nature and has very little comparative information or experimental evaluation.

The last part of the book (IV) deals with more complete systems that are partly based on the techniques described previously. Chapter 10 is titled "Auditory Scene Analysis in Music Signals." I would have preferred that the first sections of the chapter (10.1 and 10.2) had been placed with the introductory concepts of Part I. The rest of the chapter describes a particular system, OPTIMA, which is based on Bayesian networks and serves as an example of an actual system. Chapter 11 is another of my favorite chapters, mainly because it deals with the interesting idea of extracting "music scene descriptions" that can be useful in various contexts without necessarily performing full music transcription or sound source separation. This is a very general idea and helps expand the way we think about music transcription systems. More specifically, the authors show how the following local and global descriptions can be extracted for Western music: melody and bass lines, hierarchical beat structure, drums and chorus sections, and repeated sections, none of which require full music transcription or source separation. The authors show how such descriptions can be used in music information retrieval (MIR) applications, for synchronizing computer graphics to music, and for "intelligent" music listening stations. The final chapter of the book describes singing voice transcription, assuming a single voice. The most interesting part of the chapter deals with the problems of note segmentation and labeling as well as the utilization of musical context. These are important concepts that take into account more of what we know about music and symbolic representations, and they are necessary ingredients of a full music transcription system.

To summarize, this is a great resource for anyone interested in automatic music transcription and provides a comprehensive snapshot of the current state of the art. The few shortcomings of the book, such as repetition of concepts, inconsistent writing style, and limited connections between the chapters, are to be expected given the emerging nature of the field and the fact that the book is a multi-author volume.