Automatic Perceptual and Cognitive Classification of Everyday Sounds

The aim of this project is to construct a system that automatically categorizes everyday sounds within a perceptually and cognitively plausible taxonomy.

Our auditory system serves several purposes. Besides conveying speech and music, sound also supports object identification: everyday objects can be identified by the sounds they create. Gaver [1993] presents a taxonomy of sounds based on the physical properties of the objects [Susini et al. 2006]. However, sounds are also grouped cognitively according to the context in which they appear. For example, the sounds of peeling carrots and of grinding coffee differ considerably in their physical properties, but both can be perceived as kitchen sounds.

Categorization is based on an underlying similarity measure. For everyday sounds, this measure combines two factors. In a bottom-up manner, physical characteristics (e.g. frequency, energy, irregularity) determine the similarity between sounds. In a top-down manner, induced by enculturation, the context and co-occurrence of sounds also imply similarity: when sounds always occur one after the other, we infer a close relation between them and perceive them as similar.

To construct an automatic classification tool on perceptual and cognitive grounds, we need to model both the bottom-up and the top-down process. For the bottom-up process, we need to select sound descriptors that describe the sound in a perceptually plausible way. Various descriptors are available, developed from the signal processing, physics, or perception point of view. The sound taxonomy presented by Gaver and various psychoacoustic experiments provide a psychological reference. For the top-down process, we need to model the enculturation effect, that is, how we have been exposed to the sounds: whether they always occur in a certain order or context. Hidden Markov models, landmark techniques, and piecewise linear segmentation are the current tools of the trade to be employed here. The categorization experiments recently performed by Lemaitre et al. [2007] serve as a cognitive reference.
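The two-factor similarity described above can be illustrated with a minimal sketch. It is not the project's method: the Jaccard overlap used for the context factor, the mixing parameter `weight`, the descriptor values, the listening history, and the extra sound "sawing wood" are all assumptions made up for illustration.

```python
import math


def feature_similarity(a, b):
    """Bottom-up factor: similarity derived from the Euclidean distance
    between two descriptor vectors (1.0 means identical vectors)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def context_similarity(scenes, a, b):
    """Top-down factor: Jaccard overlap of the scenes that contain each sound
    (an assumed stand-in for a co-occurrence measure)."""
    scenes_a = {i for i, scene in enumerate(scenes) if a in scene}
    scenes_b = {i for i, scene in enumerate(scenes) if b in scene}
    union = scenes_a | scenes_b
    return len(scenes_a & scenes_b) / len(union) if union else 0.0


def combined_similarity(features, scenes, a, b, weight=0.5):
    """Weighted mix of both factors; `weight` is a hypothetical parameter."""
    bottom_up = feature_similarity(features[a], features[b])
    top_down = context_similarity(scenes, a, b)
    return weight * bottom_up + (1.0 - weight) * top_down


# Made-up descriptor vectors (e.g. normalized brightness and irregularity).
FEATURES = {
    "peeling carrots": [0.2, 0.9],
    "grinding coffee": [0.9, 0.1],
    "sawing wood": [0.25, 0.85],  # hypothetical extra sound
}

# Made-up listening history: each inner list is one scene.
SCENES = [
    ["peeling carrots", "grinding coffee"],
    ["grinding coffee", "peeling carrots"],
    ["sawing wood"],
]

if __name__ == "__main__":
    for other in ("grinding coffee", "sawing wood"):
        score = combined_similarity(FEATURES, SCENES, "peeling carrots", other)
        print(other, round(score, 3))
```

With these invented numbers, peeling carrots is physically closer to sawing wood, but the shared kitchen context makes it more similar to grinding coffee once both factors are combined.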

Detailed description:

The work agenda will consist of several steps:

  1. Study of background knowledge about categorization schemes of everyday sounds and perceptual descriptors.
  2. Selection of everyday sounds from databases.
  3. Exploration of sound descriptors to model perceptual similarities and taxonomies of sounds.
  4. Simulations with methods for time sequence analysis to incorporate enculturation and context effects in the cognition of sounds.

Gaver's ecological, hierarchical categorization scheme of everyday sounds first distinguishes between liquids, aerodynamic processes, and vibrating objects. Sounds of vibrating objects can be decomposed into basic-level events (deformation, impact, scraping, rolling) with particular properties (material, force, shape). These events can be arranged according to temporal patterns. Compound sounds can be composed of several basic events and can even be combined with liquid or aerodynamic sounds. The experimental literature gives insight into the audibility of shape (the length of wooden rods, the shape of plates and balls on plates [Kunkler-Peck 2000]), material (e.g. for struck plates and impacted bars [Giordano and McAdams 2006]), and other properties of the object [Freed 1990].
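As a rough illustration, the top levels of this hierarchy can be held in a nested dictionary, with a small lookup that returns the path from the root to a category. The names `GAVER_TAXONOMY`, `EVENT_PROPERTIES`, and `find_path` are hypothetical, and only the categories named in the paragraph above are included.

```python
# Top levels of the hierarchy as described in the text; deeper levels
# (temporal patterns, compound sounds) are deliberately left out.
GAVER_TAXONOMY = {
    "liquids": {},
    "aerodynamic processes": {},
    "vibrating objects": {
        "deformation": {},
        "impact": {},
        "scraping": {},
        "rolling": {},
    },
}

# Properties that qualify a basic-level event of a vibrating object.
EVENT_PROPERTIES = ("material", "force", "shape")


def find_path(taxonomy, target, prefix=()):
    """Return the list of category names from the root down to `target`,
    or None if the category is not part of the taxonomy."""
    for name, children in taxonomy.items():
        path = prefix + (name,)
        if name == target:
            return list(path)
        found = find_path(children, target, path)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    print(find_path(GAVER_TAXONOMY, "scraping"))
```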

First, sounds have to be selected from two databases: "freesound" and "Sound Ideas". The sounds will then be manually segmented and labeled into basic events using the software "Wavesurfer". For the labeling, the systematic description of the sounds in "Sound Ideas" is of great use. Based on around 700 available sound files with 10-40 steps each, we will start by categorizing step sounds according to the shoes (high heels, boots, barefoot, sneakers, leather), the material of the ground (concrete or marble, wood, gravel, dirt, snow, rug, sand), the movement pattern (walking, running, jogging, jumping, going upstairs, going downstairs), and the gender of the walker (male, female). Next, we will perform classification experiments on the material of an object, such as metal, wood, glass, or rubber. The following step would be to consider more complex sounds (such as rolling, opening/closing doors, a key in a lock, a switch, a bounce, peeling carrots, grinding coffee).
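One possible way to keep the manual step labels consistent is a small record type that rejects values outside the category lists given above. This is only a sketch: the class `StepLabel`, its field names, and the example file name are assumptions, and it does not reflect the label format of Wavesurfer or of the "Sound Ideas" descriptions.

```python
from dataclasses import dataclass

# Category lists copied from the labeling plan in the text.
SHOES = {"high heels", "boots", "barefoot", "sneakers", "leather"}
GROUNDS = {"concrete or marble", "wood", "gravel", "dirt", "snow", "rug", "sand"}
MOVEMENTS = {"walking", "running", "jogging", "jumping",
             "going upstairs", "going downstairs"}
GENDERS = {"male", "female"}


@dataclass(frozen=True)
class StepLabel:
    """Label of one segmented step (hypothetical schema)."""
    file: str
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    shoe: str
    ground: str
    movement: str
    gender: str

    def __post_init__(self):
        # Reject empty or reversed segments.
        if not 0.0 <= self.start < self.end:
            raise ValueError("segment must satisfy 0 <= start < end")
        # Reject labels outside the agreed category lists.
        checks = (
            ("shoe", self.shoe, SHOES),
            ("ground", self.ground, GROUNDS),
            ("movement", self.movement, MOVEMENTS),
            ("gender", self.gender, GENDERS),
        )
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unknown {field_name}: {value!r}")

    @property
    def duration(self):
        """Length of the segment in seconds."""
        return self.end - self.start


if __name__ == "__main__":
    # "steps_001.wav" is a made-up file name.
    label = StepLabel("steps_001.wav", 0.50, 0.85,
                      "boots", "gravel", "walking", "female")
    print(label)
```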

Features are extracted from the sound by a physiology-based auditory model and by computing signal descriptors (transients, noise, tonality, tempo) and psychoacoustic descriptors (e.g. sharpness, spectral fluctuation) [Cano et al. 2004]. Of particular interest are physiology-based feature detectors such as gammatone filters, which simulate the motion of the basilar membrane [Smith and Lewicki 2006], in conjunction with the Hilbert transform or a hair cell model [Meddis 1988, Martinez et al. 2007]. Available tools for extracting audio features include "IRCAMdescriptor" [Peeters], "aubio", "jAudio", "MA tools", and "MIRToolbox".
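To make the notion of a sound descriptor concrete, the following sketch computes two simple frame-level descriptors: RMS energy and the spectral centroid, a common brightness measure loosely related to sharpness. These are swapped-in textbook descriptors, not the gammatone or hair cell models named above and not the output of any of the listed tools. The function names, the sample rate, and the test tone are assumptions, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def rms_energy(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame in Hz."""
    magnitudes = magnitude_spectrum(frame)
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0  # silent frame: centroid is undefined, report 0
    bin_width = sample_rate / len(frame)
    return sum(k * bin_width * m for k, m in enumerate(magnitudes)) / total


def sine_frame(frequency, sample_rate, length):
    """Synthetic test tone (stands in for a real recorded frame)."""
    return [math.sin(2 * math.pi * frequency * n / sample_rate)
            for n in range(length)]


if __name__ == "__main__":
    sample_rate = 8000  # assumed sample rate in Hz
    frame = sine_frame(1000, sample_rate, 256)
    print(rms_energy(frame), spectral_centroid(frame, sample_rate))
```

A feature vector for one frame would then simply collect such values, one entry per descriptor.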

To analyze sounds composed of several basic events, as well as sequences of sounds occurring in a particular context (e.g. kitchen, outdoor), we employ time series techniques. The feature vectors previously calculated by the sound descriptors can be quantized by clustering methods, or their probability density can be estimated. Graphical models such as hidden Markov models (HMMs) [Rabiner 1998, Cano et al. 2005] and other latent variable models can then be used to classify and segment sequences of basic events. HTK and the Bayes Net Toolbox for Matlab provide practical implementations of graphical models. These models can be compared to dynamic time warping, landmark similarity techniques [Perng et al. 2000], and a piecewise linear model.
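The classification step can be sketched with the scaled forward algorithm for discrete HMMs: each context gets its own model, and a quantized sequence is assigned to the model with the highest log-likelihood. This is a minimal sketch and does not use HTK or the Bayes Net Toolbox. The two models, all of their probabilities, and the meaning of the codebook symbols (0, 1, 2) are invented for illustration; in practice the parameters would be trained from labeled data.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)  # P(obs[t] | obs[0..t-1])
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob


def classify(obs, models):
    """Name of the model under which the sequence is most likely."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical two-state models over a three-symbol codebook.
MODELS = {
    "kitchen": (
        [0.8, 0.2],
        [[0.6, 0.4], [0.3, 0.7]],
        [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2]],
    ),
    "outdoor": (
        [0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]],
    ),
}

if __name__ == "__main__":
    print(classify([0, 0, 1, 1, 1], MODELS))
    print(classify([2, 2, 2, 2], MODELS))
```

Segmentation into basic events would use the Viterbi algorithm on the same model parameters instead of the forward pass; that variant is not shown here.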

References:

  • Patrick Susini, Nicolas Misdariis, Guillaume Lemaitre, Olivier Houix, Davide Rocchesso, Pietro Polotti, Karmen Franinovic, Yon Visell, Klaus Obermayer, Hendrik Purwins, Kamil Adiloglu: Closing the Loop of Sound Evaluation and Design
  • Houix et al.: Everyday sound classification: Sound perception, interaction and synthesis, p. 34-37
  • Xiaofeng Li, Robert J. Logan, and Richard E. Pastore: Perception of acoustic source characteristics: Walking sounds, JASA 90(6), 1991
  • Alexander Ekimov and James M. Sabatier: Vibration and sound signatures of human footsteps in buildings, JASA 120(2), 2006
  • Robert Annies, Elena Martinez, Kamil Adiloglu, Hendrik Purwins, Klaus Obermayer: Comparison of Biologically Inspired Representation and Classification Schemes for Everyday Sounds. NIPS workshop "Music, Brain, and Cognition" 2007
  • Davide Rocchesso, Federico Fontana: The Sounding Object: http://www.speech.kth.se/prod/publications/files/883.pdf
  • http://obiwannabe.co.uk/tutorials/html/tutorial_footsteps.html