Automatic Perceptual and Cognitive Classification of Everyday Sounds
The aim of this project is to construct a system that automatically
categorizes everyday sounds within a perceptually and cognitively
plausible taxonomy.
Our auditory system serves several purposes.
Besides conveying speech and music, sound also helps
us to identify objects.
Everyday objects can be identified based on the sound they create.
Gaver [1993] presents a taxonomy of sounds based on physical
properties of the objects [Susini et al. 2006].
However, sounds are also cognitively grouped together
according to the context in which they appear.
For example, the sounds of peeling carrots and
of grinding coffee are quite different with respect
to their physical properties, but both can
be perceived as kitchen sounds.
Categorization is based on an underlying similarity
measure. In the case of everyday sounds, this similarity
measure is composed of two factors. In a bottom-up
manner, physical characteristics (e.g. frequency, energy, irregularity)
determine similarity between sounds. In a top-down manner, induced by
enculturation, the context and co-occurrence of sounds imply similarity:
when sounds regularly occur one after the other, we infer a close
relation between them and perceive them as similar.
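As an illustration, the two factors could be combined as follows. This is a minimal sketch and not the project's method: the weighted sum, the Jaccard co-occurrence score, the function names, and the toy descriptor values are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def feature_similarity(a, b):
    """Bottom-up factor: map the Euclidean distance between two
    descriptor vectors into the interval (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def cooccurrence_similarity(scenes):
    """Top-down factor: Jaccard overlap of the scenes in which two
    sounds appear. Returns a function sim(x, y)."""
    appears = Counter()
    together = Counter()
    for scene in scenes:
        labels = set(scene)
        appears.update(labels)
        together.update(frozenset(p) for p in combinations(sorted(labels), 2))

    def sim(x, y):
        if x == y:
            return 1.0
        both = together[frozenset((x, y))]
        either = appears[x] + appears[y] - both
        return both / either if either else 0.0

    return sim


def combined_similarity(x, y, features, context_sim, weight=0.5):
    """Weighted sum of both factors (the weighting is an assumption)."""
    bottom_up = feature_similarity(features[x], features[y])
    top_down = context_sim(x, y)
    return weight * bottom_up + (1.0 - weight) * top_down


# Toy descriptor vectors and scenes, invented for illustration only.
features = {
    "peeling_carrots": (0.2, 0.9),
    "grinding_coffee": (0.9, 0.1),
    "car_horn": (0.85, 0.15),
}
scenes = [
    ["peeling_carrots", "grinding_coffee"],
    ["grinding_coffee", "peeling_carrots"],
    ["car_horn"],
]
context_sim = cooccurrence_similarity(scenes)
kitchen = combined_similarity("peeling_carrots", "grinding_coffee",
                              features, context_sim)
acoustic = combined_similarity("grinding_coffee", "car_horn",
                               features, context_sim)
```

With these toy values, the two kitchen sounds end up more similar than the two acoustically close sounds once context is taken into account, whereas a purely bottom-up measure (weight=1.0) ranks them the other way round.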
For the construction of an automatic classification tool
based on perceptual and cognitive grounds, we need to model
both the bottom-up and the top-down process.
For the bottom-up process, we need to select appropriate
sound descriptors that describe the sound in a perceptually
plausible way. Various descriptors are available, developed
from the signal processing, physics or perception point of view.
The sound taxonomy presented by Gaver and various psychoacoustic
experiments provide a psychological reference.
For the top-down process, we need to model the enculturation effect,
which is manifested in how we have been exposed to the sounds, that is,
whether sounds regularly occur in a certain order or context.
Hidden Markov models, landmark techniques, and
piecewise linear segmentation are the current tools of the trade to be
employed here. The categorization experiments recently performed by
Lemaitre et al. [2007] serve as a cognitive reference.
Detailed description:
The work agenda will consist of several steps:
- Study of background knowledge about categorization schemes of everyday sounds and perceptual descriptors.
- Selection of everyday sounds from databases.
- Exploration of sound descriptors to model perceptual similarities and taxonomies of sounds.
- Simulations with methods for time sequence analysis to incorporate enculturation and context effects in the cognition of sounds.
Gaver's ecological hierarchical categorization scheme of everyday
sounds first distinguishes between liquids, aerodynamic processes, and
vibrating objects. Vibrating objects can be decomposed into basic
level events (deformation, impact, scraping, rolling) of particular
properties (material, force, shape). The particular events can be
arranged according to temporal patterns. Compound sounds can be
composed of several basic events and even combined with liquid or
aerodynamic sounds. The experimental literature gives insight into the
audibility of shape (the length of wooden rods, shape of plates and
balls on plates [Kunkler-Peck 2000]), material (e.g. for struck plates
and impacted bars [Giordano and McAdams 2006]), and other properties
of the object [Freed 1990].
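The part of the hierarchy named above can be written down as a small data structure. The following is only a sketch under assumptions: it encodes just the categories mentioned in this text, not Gaver's full scheme, and the names TAXONOMY, find_path and describe_event are hypothetical.

```python
# Top-level classes and basic-level events exactly as named in the text.
# Empty dicts mark categories whose subdivisions are not listed here.
TAXONOMY = {
    "liquids": {},
    "aerodynamic processes": {},
    "vibrating objects": {
        "deformation": {},
        "impact": {},
        "scraping": {},
        "rolling": {},
    },
}

# Properties that can qualify a basic-level event.
EVENT_PROPERTIES = ("material", "force", "shape")


def find_path(name, tree=TAXONOMY, prefix=()):
    """Return the path from the root to `name`, or None if it is absent."""
    for key, children in tree.items():
        path = prefix + (key,)
        if key == name:
            return path
        found = find_path(name, children, path)
        if found is not None:
            return found
    return None


def describe_event(event, **properties):
    """Attach properties (material, force, shape) to a taxonomy entry."""
    unknown = set(properties) - set(EVENT_PROPERTIES)
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    path = find_path(event)
    if path is None:
        raise KeyError(event)
    return {"path": path, "properties": properties}


# A compound sound is then a temporal pattern (here: a list) of basic events.
compound = [
    describe_event("impact", material="wood"),
    describe_event("rolling", material="wood", shape="ball"),
]
```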
First, sounds have to be selected from two databases:
"freesound" and "Sound Ideas". Then the sounds will have to be
manually segmented and labeled into basic events, using the software
"Wavesurfer". For the labeling, the systematic description of the
sounds in "Sound Ideas" is of great use. Based on around 700 available
sound files with 10-40 steps in each, we will start by categorizing step
sounds according to the shoes (high heels, boots, barefoot, sneakers,
leather), the material of the ground on which we step (concrete or
marble, wood, gravel, dirt, snow, rug, sand), the movement pattern
(walking, running, jogging, jumping, going upstairs, going downstairs)
and the gender (male, female). Next we will perform classification
experiments for the material of an object, such as metal, wood, glass,
or rubber. The next step would be to consider more complex sounds (such
as rolling, opening/closing doors, key in a lock, switch, bounce,
peeling carrots, grinding coffee).
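One possible way to record the manual step labels is sketched below. The category values are copied from the text above. Everything else is an assumption for illustration: the class and field names are hypothetical, and the one-line text layout "start end shoe|ground|movement|gender" is invented here and is not claimed to be the Wavesurfer label format.

```python
from dataclasses import dataclass

# Category values copied from the text; the field names are hypothetical.
SHOES = {"high heels", "boots", "barefoot", "sneakers", "leather"}
GROUNDS = {"concrete or marble", "wood", "gravel", "dirt", "snow", "rug", "sand"}
MOVEMENTS = {"walking", "running", "jogging", "jumping",
             "going upstairs", "going downstairs"}
GENDERS = {"male", "female"}


@dataclass(frozen=True)
class StepLabel:
    """One manually segmented step with its four category labels."""
    start_s: float
    end_s: float
    shoe: str
    ground: str
    movement: str
    gender: str

    def __post_init__(self):
        # Reject empty or reversed segments and labels outside the scheme.
        if not 0.0 <= self.start_s < self.end_s:
            raise ValueError("segment must satisfy 0 <= start < end")
        checks = [(self.shoe, SHOES), (self.ground, GROUNDS),
                  (self.movement, MOVEMENTS), (self.gender, GENDERS)]
        for value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unknown label: {value!r}")


def parse_label_line(line):
    """Parse 'start end shoe|ground|movement|gender' (a hypothetical layout)."""
    start, end, tags = line.strip().split(maxsplit=2)
    shoe, ground, movement, gender = tags.split("|")
    return StepLabel(float(start), float(end), shoe, ground, movement, gender)


label = parse_label_line("0.52 0.91 boots|gravel|going upstairs|male")
```

Validating labels at construction time keeps misspelled categories out of the training data for the later classification experiments.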
Features are extracted from the sound by a physiology-based auditory
model and by the calculation of sound descriptors (transients, noise,
tonality, tempo) and psychoacoustic descriptors (e.g. sharpness,
spectral fluctuation) [Cano et al. 2004]. Of particular interest are
also physiology-based feature detectors such as gammatone filters
(which simulate the motion of the basilar membrane [Smith and Lewicki
2006]) in conjunction with the Hilbert transform or a hair cell model
[Meddis 1988, Martinez et al. 2007]. Available tools to extract audio
features include "IRCAMdescriptor" [Peeters], "aubio", "jAudio",
"MA tools", and "MIRToolbox".
To analyze sounds composed of various basic events and sequences of
sounds occurring in a particular context (e.g. kitchen, outdoor), we
employ time series techniques. Feature vectors calculated previously
by the sound descriptors can be quantized by clustering methods, or
their probability density can be estimated. Then graphical models such
as hidden Markov models (HMMs, Rabiner 1998, Cano et al. 2005) and
other latent variable models can be used to classify and segment
sequences of basic events. HTK and the Bayes Net Toolbox for Matlab
provide practical implementations of graphical models. These can be
compared to dynamic time warping, landmark similarity techniques
[Perng et al. 2000] and a piecewise linear model.
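The quantization and HMM steps can be illustrated with a small, self-contained sketch. It is written in plain Python rather than with HTK or the Bayes Net Toolbox, and it assumes a given codebook and hand-set model parameters instead of learned ones. The state names, probabilities and feature values are invented for the example.

```python
import math


def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
            for v in vectors]


def safe_log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden state sequence of a discrete HMM (log domain)."""
    n_states = len(start_p)
    score = [safe_log(start_p[s]) + safe_log(emit_p[s][obs[0]])
             for s in range(n_states)]
    back = []
    for o in obs[1:]:
        prev, pointers, score = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best = max(range(n_states),
                       key=lambda r: prev[r] + safe_log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + safe_log(trans_p[best][s])
                         + safe_log(emit_p[s][o]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: state 0 = "impact", state 1 = "scraping".
codebook = [(0.0, 0.0), (1.0, 1.0)]
start_p = [0.5, 0.5]
trans_p = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist
emit_p = [[0.8, 0.2], [0.2, 0.8]]    # each state prefers one symbol

vectors = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.0), (1.1, 0.9), (0.8, 1.2)]
symbols = quantize(vectors, codebook)
segmentation = viterbi(symbols, start_p, trans_p, emit_p)
```

Because the transition matrix favours staying in the same state, a single outlying symbol inside a run is smoothed over instead of producing a spurious segment boundary.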
References:
- Patrick Susini, Nicolas Misdariis, Guillaume Lemaitre, Olivier Houix,
Davide Rocchesso, Pietro Polotti, Karmen Franinovic, Yon Visell, Klaus
Obermayer, Hendrik Purwins, Kamil Adiloglu: Closing the Loop of Sound
Evaluation and Design
- Houix et al.: Everyday sound classification: Sound perception,
interaction and synthesis. p. 34-37
- Xiaofeng Li, Robert J. Logan, and Richard E. Pastore: Perception of
acoustic source characteristics: Walking sounds, JASA 90(6), 1991
- Alexander Ekimov and James M. Sabatier: Vibration and sound signatures
of human footsteps in buildings, JASA 120(2), 2006
- Robert Annies, Elena Martinez, Kamil Adiloglu, Hendrik Purwins, Klaus
Obermayer: Comparison of Biologically Inspired Representation and
Classification Schemes for Everyday Sounds. NIPS workshop "Music,
Brain, and Cognition" 2007
- Davide Rocchesso, Federico Fontana: The Sounding Object: http://www.speech.kth.se/prod/publications/files/883.pdf
- http://obiwannabe.co.uk/tutorials/html/tutorial_footsteps.html