
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 1, JANUARY 2008

Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection

Patricia Besson, Vlad Popovici, Jean-Marc Vesin, Member, IEEE, Jean-Philippe Thiran, Senior Member, IEEE, and Murat Kunt, Fellow, IEEE

Abstract: A method that exploits an information theoretic framework to extract optimized audio features using video information is presented. A simple measure of mutual information (MI) between the resulting audio and video features allows the detection of the active speaker among different candidates. This method involves the optimization of an MI-based objective function. No approximation is needed to solve this optimization problem, neither for the estimation of the probability density functions (pdfs) of the features, nor for the cost function itself. The pdfs are estimated from the samples using a nonparametric approach. The challenging optimization problem is solved using a global method: the differential evolution algorithm. Two information theoretic optimization criteria are compared and their ability to extract audio features specific to speech production is discussed. Using these specific audio features, candidate video features are then classified as members of the "speaker" or "non-speaker" class, resulting in a speaker detection scheme. As a result, our method achieves a speaker detection rate of 100% on in-house test sequences, and of 85% on most commonly used sequences.

Index Terms: Audio features, differential evolution, multimodal, mutual information, speaker detection, speech.

Manuscript received February 24, 2006; revised June 12, 2007. This work was supported by the Swiss National Science Foundation through Grant 2000-06-78-59. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. John R. Smith.
P. Besson, J.-M. Vesin, J.-P. Thiran, and M. Kunt are with the Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland.
V. Popovici is with the Bioinformatics Core Facility, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland (e-mail: [email protected]).
Digital Object Identifier 10.1109/TMM.2007.911302

WITH the increasing capabilities of modern computers, both auditive and visual modalities of the speech signal may be used to improve speaker detection, leading to major improvements in the user-friendliness of man-machine interactions. Just consider, for example, a video-conference system. The most interactive current solution requires an audio engineer and a cameraman so that the speaking person can be emphasized both on audio and video. An intelligent system able to detect the speaker on the basis of sound and image information could focus a moving camera on him/her. Another possible application would be multimedia indexing.

Among the different methods that exploit the information contained in each modality, only a few perform the fusion directly at the feature level, even though with such an approach the detection of the current speaker in an audiovisual sequence can be done with only a camera and a single microphone. Moreover, it has been pointed out, in [1] and [2] for example, that such a fusion can greatly help the classification task: the richer and the more representative the features, the more efficient the classifier.

Some audio-video feature fusion approaches try to directly evaluate the synchronism of the two signals [3], [4], [5]. As suggested in [4], the synchronism is here the perceptive effect of the causal relationship between the two signals. Other methods first map the features onto a subspace where this relationship is enhanced and can therefore be estimated [2], [6], [7]. All the approaches rely on explicit or implicit use of mutual information (MI). An estimation of the features' probability density functions (pdfs) is therefore required. There are two main approaches that may be taken: either a parametric or a nonparametric one. In the first case, the pdfs are assumed to follow a parametric law. Most of the time, a Gaussian distribution is considered, which is not necessarily valid. Fisher in [2], as well as Butz in [1] and [8], estimate the probability density functions directly from the available samples during the feature extraction process through Parzen windowing [9].

The problem addressed in this paper is the detection of the current speaker in a given video sequence with two or more candidates. To this end, audio features specific to speech production are extracted using the information content of both the audio and video signals. Taking a similar approach to Butz and Thiran in [1] and [8], as well as Fisher et al. in [10] and [2], the problem is cast in an information theoretic framework to optimize the audio features with respect to the video features. The cost function to be optimized is therefore based on MI, which leads to a highly nonlinear optimization problem. Moreover, an analytical formulation of the gradient of the cost function is difficult to obtain without any parametric approximation of the pdf. For this reason, it is preferable to have a method which does not require such an analytical form of the gradient (gradient-free method). In [2], Fisher and Darrell use a second order Taylor approximation of the MI and the Parzen estimator to cast the optimization problem into a convex one and to derive a closed form of the gradient. However, our purpose here is to avoid such an approximation and to directly solve our optimization problem using a suitable optimization method. It turns out that the best solution for solving the optimization problem is differential evolution [11], an evolutionary algorithm.

Once these specific audio features are obtained, the MI between them and the video features of candidate mouth regions is used as a classifier to determine the class of the mouth region ("speaker" or "non-speaker").
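
The MI-based decision described in the last paragraph can be illustrated with a minimal sketch. It assumes 1-D audio and video feature samples of equal length, a Gaussian Parzen window, and Silverman's rule of thumb for the kernel width; the paper uses a robust median-based estimate from [18] instead, and the function names `bandwidth`, `parzen_mi`, and `detect_speaker` are hypothetical.

```python
import math
import random
import statistics


def bandwidth(sample):
    """Kernel width from Silverman's rule of thumb (an assumption of this sketch)."""
    n = len(sample)
    return max(1.06 * statistics.pstdev(sample) * n ** (-0.2), 1e-12)


def parzen_mi(xs, ys):
    """Resubstitution estimate of I(X;Y), in nats, with Gaussian Parzen windows."""
    n = len(xs)
    hx, hy = bandwidth(xs), bandwidth(ys)
    total = 0.0
    for i in range(n):
        sx = sy = sxy = 0.0
        for j in range(n):
            kx = math.exp(-0.5 * ((xs[i] - xs[j]) / hx) ** 2)
            ky = math.exp(-0.5 * ((ys[i] - ys[j]) / hy) ** 2)
            sx += kx
            sy += ky
            sxy += kx * ky  # product kernel, i.e. diagonal covariance
        # The kernel normalisation constants cancel in p(x, y) / (p(x) p(y)).
        total += math.log(n * sxy / (sx * sy))
    return total / n


def detect_speaker(audio, candidates):
    """Return (index of the candidate with the largest MI, list of MI scores)."""
    scores = [parzen_mi(audio, video) for video in candidates]
    return max(range(len(scores)), key=scores.__getitem__), scores


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [rng.gauss(0, 1) for _ in range(150)]
    silent = [rng.gauss(0, 1) for _ in range(150)]        # unrelated to the audio
    talking = [a + 0.3 * rng.gauss(0, 1) for a in audio]  # shares the audio source
    print(detect_speaker(audio, [silent, talking]))
```

Only one candidate receives the "speaker" label, namely the one whose features share the most information with the audio features. The double loop is quadratic in the sample size, which is acceptable for the small samples considered here.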
The rest of this paper is organized as follows: first, the use of information theory to extract optimized features for classification problems is presented. Then, the chosen representation for the video and audio signals is described. In the third section, the information theoretic optimization approach is applied to obtain audio features optimized for the specific classification task, regardless of the classifier. Different optimization criteria based on MI are defined. The fourth section exposes the optimization problem as well as the optimization methods used to solve it. The last part of the paper deals with the experiments and discusses the different optimization criteria used in the feature extraction, the ability of the method to produce audio features specific to speech production, and finally, the performance of the method as a speaker detector.

II. THEORETICAL FRAMEWORK

A. Information Theoretic Feature Extraction

In the present work, the detection of the current speaker in an audio-visual sequence is viewed as a classification problem. In this classification problem, the class is modelled by a binary random variable (r.v.) defined on a sample space, which encodes membership of the "speaker" or "non-speaker" class with respect to an audio-visual source, itself modelled by an r.v. defined on another sample space.

The goal in a classification process is obviously to minimize the probability of assigning some measurements performed on the original signal to the wrong class, that is, to minimize the classification error probability. This probability depends of course on the classifier and on its ability to deal with the particular problem, but it also depends on all the processing steps leading from the source to the class estimate.

In the problem at hand, the bimodal source is not directly accessible but yields two observed signals of different physical nature: the audio and video signals. For each of these signals, the unimodal classification process leading from the measurements to an estimate of the class can be described through a first order Markov chain [Fig. 1(a)]. Two associated classification error probabilities can be defined. They correspond to the probability of committing an error when estimating the class in the audio Markov chain or in the video Markov chain. However, they are also conditioned by the probability of committing an error when estimating the source. These errors are referred to as the estimation error probabilities. The estimation of one r.v. from another can be understood as a feature extraction step where some specific information must be recovered from the initial r.v. Therefore, the information theoretic framework for extracting features developed in [10] can be applied.

Using Fano's inequality, it is possible to relate the probability of each error to the conditional entropies [12]. For the audio Markov chain, this yields inequality (1), which is expressed in terms of entropies of the r.v.s involved. A similar relationship can be established for the video Markov chain, resulting in a lower bound on the corresponding error probability.

The inequality (1) does not directly help to minimize the error probability. It does indicate, however, that an efficient estimation is conditioned by the minimization of the right hand side of the inequality. The minimization of this term leads to an optimal source estimate. Implicitly, the processing steps leading to this optimal estimate are taken into account, as stated before. A similar reasoning can be followed for the video Markov chain.

B. Extension of the Information Theoretic Feature Extraction to the Multimodal Case

A possibility is to first obtain optimal estimates from each modality separately. Then a fusion at the decision or at the classification level can be performed in order to get a unique estimate of the class from both unimodal processes. However, such an approach would not take advantage of the discriminant information offered by the bimodal nature of the source. A better approach would be to use an extension of the previously described unimodal feature extraction framework to the multimodal case. Such an extension has been proposed by Butz et al. in [1] and applied more particularly to image registration. A similar multimodal feature extraction framework is used here. It is extended to speaker detection and the various processing steps are fully justified.

As already stated, the original source is accessible only through the audio and video measurements. But, as mentioned in [2], these two measurements are affected by independent interferences. A good estimate of the source should include a feature extraction step which discards this noisy information present in each modality and recovers the information coming from the common source, thus shared by both modalities. Obviously, such a goal can only be reached by considering the two modalities jointly. Consider r.v.s modelling audio and video features that contain only the information coming from the source. Since they specifically describe this common source, they are related by their joint probability. Thus an estimate of the feature related to one modality can be inferred from the other modality with a given transition probability. These transition probabilities can be obtained by joint probability estimation. If the features are correctly estimated, the source estimates obtained from each chain can be assumed to be approximately equal. These considerations lead to the definition of the classification problem with the two Markov chains shown in Fig. 1(b). Notice that the source estimates associated to each chain are indexed by AV or VA, to stress that they have been obtained using information present in both modalities, in contrast with the previous case [Fig. 1(a)].

Fig. 1. (a) Graphical representation of the audio and video Markov chains modelling the two unimodal classification processes associated to each modality. (b) Graphical representation of the related Markov chains modelling the multimodal classification process.

Of course, the estimation error probabilities and their associated lower bounds are still defined according to inequality (1). However, since each estimation can be viewed as a feature-extraction step where the estimate r.v. is a function of the previous r.v., the data processing inequality for Markov chains can be used to weaken the bounds on the error probabilities. Inequality (1) for the audio and video Markov chains then becomes (2) and (3), respectively.

As previously stated, the source estimates of the two chains can be assumed to be approximately equal. Using this approximation in (2) and (3), a joint lower bound can finally be defined, referred to as (4).

Minimizing the lower bound on the error probability then amounts to maximizing the MI between the extracted features corresponding to each modality. The feature sets resulting from the maximization of the MI involved in these equations are expected to compactly describe the relationship between the two modalities. The extraction stage therefore produces optimized features.

However, to get a source estimate with a probability of estimation error close to this bound, a suitable estimator must be defined. If the features are correctly estimated, they compactly describe the source. For this last statement to be true, not only must the MI between features extracted from each modality be increased, but the conditional entropies must also be minimized. Indeed, if the entropies increase, they reduce the inter-feature dependencies. Dividing (4) by the joint entropy, a feature efficiency coefficient [1] can be defined, referred to as (5).

This coefficient defines our estimator. Maximizing it minimizes the lower bound on the error probability defined in (4) while constraining inter-feature independencies. In other words, the extracted features will tend to capture just the information related to the common origin of the audio and video signals while discarding the unrelated interferences: they estimate the source.

C. Classifier Definition

Applying this framework to extract features, the estimation error probability comes closer to its minimum. However, the classification error probability must still be minimized: this depends on the choice of a suitable classifier. Previous works in the domain have shown that measuring the synchrony between the audio and video measurements is a good way of classifying them as originating from an audio-visual source or not [4]–[6]. In [4] in particular, the authors interpret synchrony as the degree of MI between audio and video signals. MI also shows good performance in detecting synchronized audio-video sources such as speakers [2]–[4]. Moreover, the feature optimization pre-processing also indicates the MI-based classifier as a good choice.

For these reasons, the chosen classifier consists in the evaluation of the MI between candidate audio and video features. The features that exhibit the largest MI are classified as "speaker", while the other ones are labelled as "non-speaker", only one "speaker" class label being authorized per estimation.

Notice that such a classifier also has the advantage of fusing the information at the classification level in a straightforward way, resulting in a unique class estimate.

III. SIGNAL REPRESENTATION

A. Video Representation

When applying this feature extraction framework in the context of speaker detection, the first decision to be made is the choice of signal representation.

Physiological evidence points to the motion in the mouth region as a visual cue for speech. Therefore, the chosen video features are the estimates of the optical flow in the mouth region. In order to have a local pixel-based representation of these video features, Horn and Schunck's gradient-based algorithm [13] has been chosen. The method is implemented in a two-frame simple forward difference scheme so that the temporal resolution is large enough to capture complex and quickly varying mouth motions. First, a median pre-filtering is used on the raw intensity images to reduce the noise level. The optical flow is computed between each two consecutive frames over a region of pixels including the lips and the chin of each candidate. These regions are referred to as mouth regions and are estimated over the sequence either from a manual extraction on the first frame, or using a face detector such as the one described in [14]. In order to get reliable pdf estimates without using a very large sample, only the magnitude of the optical flow and the sign of the vertical component are kept. Speakers are observed over a window of frames. Therefore, a sample of a 1-D r.v. is obtained, made of the optical flow norm values at each spatio-temporal point. These values are normalized for the subsequent optimization (see [15] for details).

This approach implicitly considers the observations to be independent, which is obviously a simplification of the real world. Indeed, the neighboring pixels are correlated and cannot be truly independent. This simplification is somewhat mitigated by estimating the pdf with the Parzen window approach [9], as we shall see later.
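
As an illustration of the video features described in this section, the following sketch reduces each optical-flow vector to a single value. It assumes that the retained magnitude and the sign of the vertical component are combined into one signed magnitude, which is only one possible reading of the text; the normalization of [15] is omitted, and the names `video_feature` and `region_features` are hypothetical.

```python
import math


def video_feature(u, v):
    """Optical-flow magnitude carrying the sign of the vertical component v.

    Assumption of this sketch: a zero vertical component counts as positive.
    """
    magnitude = math.hypot(u, v)
    return magnitude if v >= 0 else -magnitude


def region_features(flow_frames):
    """Flatten per-frame flow fields [(u, v), ...] of one mouth region into a 1-D sample."""
    return [video_feature(u, v) for frame in flow_frames for (u, v) in frame]


if __name__ == "__main__":
    # Two frames of a toy two-pixel mouth region.
    frames = [[(3.0, 4.0), (0.0, -2.0)], [(1.0, 0.0), (-3.0, -4.0)]]
    print(region_features(frames))
```

Each spatio-temporal point of the mouth region contributes one value, so a region observed over several frames yields the kind of 1-D sample on which the pdf is later estimated.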
B. Audio Representation

The audio signal also needs to be represented in a tractable way. This representation should describe salient aspects of the speech signal, while being robust to variations in speaker or acquisition conditions. Mel-cepstrum analysis is one of the methods that best fits these requirements and, as such, is widely used in speech-processing research [16], [17]. Accordingly, the speech signal is represented as a set of mel-cepstrum coefficients (the first coefficient has been discarded as it pertains to the energy). Notice that the mel-cepstrogram is downsampled to the video feature frame rate to get synchronized audio and video representations.

IV. EXTRACTION OF OPTIMIZED SPEECH AUDIO FEATURES

A. Audio Feature Optimization

In principle, the information theoretic feature extraction discussed in Section II can now be used for the audio and video features. However, the audio representation is multidimensional, thus large samples are required for a correct pdf estimation. The feature extraction framework is then applied so as to decrease the dimensionality of the audio representation while keeping and emphasizing the information related to the video content. Consequently, the 1-D audio features composing the sample of the audio r.v. are built as a linear combination of the mel-frequency cepstral coefficients (MFCCs), with weights to be chosen by the optimization. Each audio observation thereby reduces to a 1-D value.

The minimization of the estimation error given by (4) will lead to the optimized weight vector. This optimization therefore requires the availability of the joint probability density function as well as of the marginal densities of the audio and video r.v.s. Since the audio features are unknown before the optimization, their distribution is obviously unknown too. To avoid any restrictive assumption, the pdfs are estimated using the nonparametric Parzen windowing approach, in which a kernel function whose variance is controlled by a smoothing parameter is centered on each point of the sample. A 2-D Gaussian kernel with diagonal covariance matrix is chosen in our case for its widespread validity. The variances are estimated from the audio and video data respectively, in a robust way based on the median of the data points, as described in [18]. Since the video data remain the same during the optimization of the audio data, the video variance remains constant for a given set of video features, while the audio variance adapts itself to the audio features during the optimization process.

Using the Parzen window to estimate the densities in a nonparametric way yields a better estimate than histogram-based approaches, given the small size of the available samples.

B. Optimization Criteria

As discussed in Section II, minimizing the estimation error probability is equivalent to maximizing the efficiency coefficient considering the audio and video features over a mouth region. The set of weights to be optimized with respect to the efficiency coefficient criterion (ECC) is defined by (9). Note that in our case, the normalization term for the MI involves only the audio feature entropy since the video features remain constant during the optimization process.

To verify the necessity of normalizing the MI by the entropy during the optimization, ECC will be compared with a "simple" mutual information criterion (MIC). The set of weights to be optimized is then defined by (10).

Finally, a more constraining criterion is introduced, which takes into account a pair of mouth regions. This criterion is the squared difference between the efficiency coefficients computed in each of the two mouth regions. This way, the differences between the marginal densities of the video features in each region are taken into account. Moreover, only one optimization is performed for two mouths. With one video r.v. associated to each region, the optimization problem is then given by (11).

V. OPTIMIZATION METHOD

A. Definition of the Optimization Problem

The extraction of optimized audio features with respect to our classification task requires finding the real-valued vector of weights that minimizes the chosen cost function. This function is defined as the negative value of one of the optimization criteria defined in (9), (10), or (11). Moreover, to restrain the set of possible solutions, the weighting coefficients must fulfill the two conditions (12) and (13).

This optimization problem is highly nonlinear and gradient-free. Indeed, an analytical formulation of the gradient of the cost function is difficult to obtain due to the unknown form of the pdf of the extracted audio features. In [2], Fisher and Darrell use a second order Taylor approximation of the MI and the Parzen estimator to cast the optimization problem into a convex one and to derive the gradient in an analytical way. However, our purpose here is to avoid such an approximation and to directly solve our optimization problem using a proper optimization method.

Optimization methods can be classified as either local or global. The first category includes steepest gradient descent and gradient descent-based methods such as Powell's direction set method. They mainly rely on the use of an exact or estimated formulation of the gradient of the cost function to find an optimum. They present the advantage of being fast and easy to use but are very likely to fail to reach the global optimum of the cost function if the latter is not convex.

The second category refers to algorithms which aim at finding the globally best solution, in the possible presence of multiple local minima. We find in this category stochastic and heuristic methods such as simulated annealing (SA) [19], Tabu Search (TS) [20], or evolutionary algorithms (EAs). These have proven their ability to approach the global optimum of highly nonlinear problems, possibly at a high computational cost. Both SA and TS are more dedicated to solving combinatorial problems. EAs, which include genetic algorithms (GAs), look more suitable for our problem. Such optimization procedures, first introduced by Holland in 1962 [21], are based on natural evolution principles: starting from an initial candidate population of chromosomes (or sets of parameters to be optimized), operators mimicking the biological ones of crossover and mutation are used to select and reproduce the fittest solutions, the fitness of a solution being given by a scoring function. Basically, mutation enables the algorithm to explore new regions of the search space by randomly altering some or all genes (components) of some chromosomes in the population. On the other hand, crossover reinforces prior successes by recombining parent-chromosomes so as to produce the fittest offspring.

Although the underlying principles are relatively simple, EAs have proven to be robust and powerful search tools, owing to their remarkable flexibility and adaptability to a given task [22]. As a matter of fact, their tuning relies on a proper selection of values for only a few parameters, which makes them very attractive and easy to use. Furthermore, EAs do not try to provide an exact match but an approximation of the optimal solution within an acceptable tolerance, which improves their effectiveness.

B. Multi-Resolution Approach

Whatever the optimization method, a pre-processing of the cost function can be introduced to improve the efficiency of the optimization. Indeed, the MI-based cost functions are a priori nonconvex and are very likely to present rugged surfaces. To limit the risk of getting trapped in a local minimum, it is common to smooth the cost function. A trade-off has to be found however between smoothness and loss of information, so there is still no guarantee of finding the global optimum.

The cost functions require the estimation of the pdfs: using the nonparametric Parzen windowing approach, not only are fine estimates of the distributions obtained with a small number of observations, but the cost functions are also smoother than what could be expected with histograms. The smoothness of the density estimates, and thus the smoothness of the cost functions, is controlled by the smoothing parameter (see Section IV). This parameter must therefore be carefully chosen: if it is too small, the cost functions are likely to be highly irregular, with a negative impact on the optimization algorithm. On the other hand, if it is too large, the loss of information and in particular, the loss of discrimination between the densities can be dramatic and may lead to a wrong solution. The smoothing parameter defined in (8) is a function of the data points. Therefore it varies during the optimization process as the audio feature data points vary.

These audio feature data points tend to evolve so that their distribution tends away from a uniform distribution. Indeed, the optimization process looks for features which maximize the MI, while possibly minimizing the joint entropy between the audio and video features, and the entropy is maximal for r.v.s with uniform density. Roughly speaking, the smoothing parameter evolves as follows: at the beginning of the optimization, the audio features are scattered in the space and the smoothing parameter is thus large: the pdf, thus implicitly the objective function, is largely smoothed. Then, as the optimization proceeds, the distribution of the data points tends to concentrate in the sample space and the smoothing parameter decreases: fine structures of the pdf, thus of the objective function, appear. The use of an adaptive smoothing parameter as defined by (8) then induces a multi-resolution approach for solving the optimization problem. Multi-resolution schemes have been shown to perform better in the context of optimization problems involving MI, notably, in image registration problems (see, for example, [23]).

C. Local Optimization: Powell's Direction Set Method

In a first set of experiments, the deterministic Powell's direction set method [24] has been used. This local optimization method is well-suited for problems where no analytical formulation of the gradient is available. It presents the advantage of being fast and easy to use, but, as a local optimization method, it is very likely to fail to reach the global optimum of a nonconvex cost function.

Combining both smoothing and different initial guesses of the solution, good results have been obtained, showing that the proposed approach was able to extract audio features specific to speech production and to detect the current speaking mouth in simple audio-video sequences [25].

However, the solutions found by this method were strongly dependent on the initial conditions, a sign that the cost function still exhibited too many local optima. To ensure that the global optimum is reached, an exhaustive trial of all initial points should be performed; an approach which is, obviously, unfeasible. Consequently, a global optimization strategy turned out to be preferable. Moreover, to be efficient, this global optimization method should fulfill the following requirements.
1) Efficiency for highly nonlinear problems without requiring the cost function to be differentiable or even continuous over the search space.
2) Efficiency with cost functions that present a shallow, rough error surface.
3) Ability to deal with real-valued parameters.
4) Ability to handle the two constraints defined by (12) and (13) in the most efficient way.

D. Differential Evolution (DE)

As previously discussed, evolutionary approaches such as GAs present flexibility and simplicity of use in a challenging context and therefore look suitable to solve the problem at hand. In order to deal with the four requirements expressed above, the adaptation of the genetic algorithm in the continuous space (GACS) developed in [26] and [27] was first used as a global optimization approach. The GACS, first described in [28], is an extension of the original GA scheme which uses real-valued parameter vectors instead of bit strings for chromosomes. Thus, the proximity between two points in both the representation and the problem spaces is retained and the third requirement is fulfilled. The adaptation of GACS used here speeds up the convergence of the algorithm by requiring the solution domain, or acceptance domain [in our case, as indicated by (12)], to be convex. It also relates the genetic operators to the constraints on the solution parameters.

Though better than with Powell, the results can still be improved. A loss of population diversity was observed which caused a notable difference between optima reached from one run to another (premature convergence of the algorithm) due to the difficulty of fixing the parameters as well as the difficulty of reaching a solution when this one was located close to the boundaries of the search space.

The choice of an evolutionary strategy (ES) seemed suitable to alleviate the limits of GACS. As an EA, this kind of approach, first developed by Rechenberg [29] and Schwefel [30], presents the same advantages as GACS and operates according to the same general scenario. But a strategy to exploit the local topology of the search space is considered. Usually, this consists of generating the variance of the perturbation distribution using another a priori defined distribution. However, the problem of choosing the right distribution as well as the right variance for this new distribution still remains to be solved.

The differential evolution (DE) approach introduced in 1997 by Storn and Price [11] belongs to this category of EAs. However, rather than applying a perturbation generated by an a priori defined distribution, the perturbation in DE corresponds to the difference of chromosomes (termed vectors in this context) randomly selected from the population. In this way, the distribution of the perturbation is determined by the distribution of the vectors themselves and no a priori defined distribution is required. Since this distribution depends primarily on the response of the population vectors to the topography of the cost function, the biases introduced by DE in the random walk towards the solution match those implicit in the function it is optimizing [31]. In other words, the requirement for an efficient mutation scheme is more closely met: the generated increments move the existing vectors with both suitable displacement magnitude and direction for the given generation.

The exact algorithm used is based on the so-called DE/rand/1/bin algorithm [31]. An initial population of vectors is first generated to lie within the convex acceptance domain, as in the case of GACS optimization, by dividing the search space in predefined quantization levels [32]. A perturbed vector is then generated as a counterpart for each vector of the current population, that is, of the current generation. This perturbed vector, or child vector, results from the linear combination of three vectors randomly picked from the population, whose indices are mutually different and different from the index of the current vector (these conditions ensure that the DE mutation is effective and does not reduce to a classical crossover scheme [31]). In this combination, the difference between two of the vectors is weighted by a scaling factor. A crossover probability controls the number of child vector elements subject to perturbation: random numbers are generated (i.e., one for each element of the vector under consideration); each time one of these random numbers is smaller than the crossover probability, the corresponding vector element is subject to a perturbation. Thereafter, the child vector differs from its parent by at least one element and, at most, by all of its elements.

Both the perturbed and the original populations are evaluated by the cost function and pair competitions are performed between child and parent vectors (so the population size remains constant). At the end of one iteration, a new population eventually emerges, composed of the winners of each local competition.

The constraints defined in (12) and (13) still hold. Therefore, the validity of each vector of the perturbed, or child, population should be verified before starting the decision process. If the element of a child vector does not belong to the acceptance domain, it is replaced by the mean between its pre-mutation value and the bound that is violated [31]. This scheme is more efficient than the simple rejection adopted with GACS. Indeed, it allows the bounds to be asymptotically approached, thus covering efficiently the whole search space. To handle the second constraint (13), a simple normalization is performed on each child vector, as was done with GACS.

A good introduction to DE as well as some rules to tune the parameters in an adequate way can be found in [33] and [31]. Both the generation of the perturbation increment using the population itself instead of a predefined probability density function and the handling of the out-of-range values allow the DE algorithm to achieve outstanding performance in the context of our problem.

VI. AUDIOVISUAL SPEAKER DETECTION RESULTS

A. Experimental Protocol

Two different sets of test sequences have been used. The first set of sequences is part of an in-house data set containing five audio-video sequences of duration 4 s, each shot in PAL format (25 frames/second (fps), 44.1-kHz stereo sound). In each sequence, two individuals are present, only one of whom speaks during the entire sequence. Notice however that both are referred to as "speakers", since either

They give the optimal linear combination of mel-cepstrum coefficients with respect to the optimization criterion (either ECC or MIC). The notation used for them indicates whether these weights result from the optimization performed on the first or second mouth region.
Two corresponding audio feature sets derive from these weightsets: Two pairs of MI values can then be evaluated between the audio features and the video features in each mouth region. If Fig. 2. Typical frame extracted from the in-house test sequences. White rect- denotes the video features of the first mouth region and angles delimits the extracted mouth regions. (a) Frame extracted from sequence those of the second, the two pairs of MI are given by 5. (b) Frame extracted from the third sequence.
one of them may have uttered the recorded audio. These se- quences are of increasing complexity, the fifth being the mostchallenging with the nonspeaking individual moving randomly First, a comparison of both MIC and ECC criteria is per- his head and lips. Frames extracted from two sequences are formed on the in-house sequences. As a result, ECC turned out shown as an example in Fig. 2. These sequences, shot under to be indeed more discriminative than MIC. Therefore, ECC controlled conditions, are used to test the theoretical points alone is then used on the same sequences to analyze the ability developed in this paper. For that purpose, the mouth regions are of the method to extract audio features specific to speech pro- extracted based on manual localization initialized on the first duction and to perform speaker detection. Finally, the discussion frame. The duration of the sequences allows us to obtain a data of the results leads to the definition of a more efficient criterion sample set large enough to correctly estimate the pdfs. It also given by (11) whose performance on both sequence sets allows the speakers to remain still enough for the initial mouth are presented and discussed in Section VI-D.
localization to be valid throughout the sequence.
The second set of sequences is part of the CUAVE database B. Comparison of Optimization Criteria MIC and ECC [34]. The 11 two-speaker sequences considered, g11 to g22,1 The initial hypothesis is that ECC is more effective than the are shot in the NTSC standard (29.97 fps, 44.1 kHz stereo simpler MIC. The first set of experiments, carried out on the sound). On these sequences each speaker utters in turn two in-house sequence set, aims at testing this hypothesis. There- series of digits. The final seconds of the video clips, where fore, the knowledge of the active mouth region is introduced both speakers read simultaneously different digit strings, have a priori so that the optimization is only performed on this re- not been used since signal separation is not in our goal. The gion, with each of the optimization criteria successively. Using first seconds present challenging properties, making the detec- the resulting audio feature sets, the difference of MI between tion task difficult: in some sequences the nonspeaking person the speaking mouth region and the nonspeaking one, normal- moves his lips and chin, sometimes even formulating the words ized by the speaking mouth region MI (i.e., the normalized dif- without sounding them. A tradeoff has to be found between ference of MI), is measured for each of the five test sequences.
the sample size required for correctly estimating the pdf, and a Table I presents the results ( refer to the nor- detection window offering enough flexibility to correctly deal malized difference of MI measured between the speaking and with the speaker changes. Thus, the optimization is done over the nonspeaking mouth regions when using optimization cri- temporal window, shifted in one second steps over the terion MIC and ECC, respectively). Two observations can be whole sequence to make decisions once per second. The mouth made from these results. Firstly, the MI is always greater in the regions are tracked along the sequence using the face detector active mouth region, regardless the optimization criterion used, described in [14].
confirming that our scheme permits the detection of the current For both sequence sets, the mouth regions are ex- speaker. Secondly, we see that in four cases out of five, the ECC varying between 22 and 57 pixels, de- criterion leads to larger difference between MI in the two re- pending on speakers' characteristics and acquisition conditions.
gions. This indicates that the use of the ECC criterion gives rise Thus the video feature set (video sample) is composed of the to more discriminative features. Consequently, normalizing the values of the optical flow norm at each pixel MI by the entropy during the optimization allows to extract more location (T being the number of video frames within the anal- specific information than using simply the MI alone, as stated ysis window, i.e., in Section IV.
From the audio signal, mel-cepstrum coefficients are computed using 30 ms Hamming windows [16], [17].
C. Performance Using ECC Considering each mouth region and its associated video From the first set of experiments, we may conclude that ECC features, the MFCCs are projected on a new 1-D subspace is more suitable as an optimization criterion for active speaker as defined in Section IV. As a result of the optimization, two detection. This is why in the following we will focus only on its sets of weights are obtained (one for each mouth region).
use and analyze its properties in detail. The purpose of the exper- 1g18 has been discarded as it exhibits strong noise due to the compression.
iments described here is to assess the ability of our algorithm to IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 1, JANUARY 2008 D. Results Obtained With on the In-House Data Set NORMALIZED DIFFERENCE OF MI MEASURED IN EACH MOUTH REGION Two optimizations were performed previously to decide who FOR EACH OF THE FIVE IN-HOUSE TEST SEQUENCES, CONSIDERING THE AUDIO FEATURES EXTRACTED WITH OPTIMIZATION CRITERION MIC OR ECC, is the current speaker. They are now combined in a single op- ON THE SPEAKING MOUTH REGION timization problem, which aims at maximizing the discrepancybetween the two mouth regions. For this, the (11), is used. The result of the optimization is a vectorwhich generates a single audio feature vector. The latter is ex-pected to maximize thereafter the MI with the video features ofthe active mouth region. This new detection approach has firstly extract audio features specific to speech production and to per- been tested on the five in-house test sequences. Results are sum- form speaker detection. The tests are carried out on the in-house marized in Table III. The normalized difference of MI is always in favor of the active speaker, i.e., the correct speaking mouth The capacity of the proposed method to act as a speaker de- region is always indicated. It is also interesting to note that the tector is shown first. In contrast to the experiments described in difference of MI here is greater than what was obtained with the Section VI-B, no a priori knowledge of the active speaker is as- previous ECC optimization scheme (Table I). This stresses the sumed. Then the technique described in Section VI-A is applied.
benefit of using the video content related to each mouth region Recall that the optimization is performed on each of the mouth during the optimization.
) and the MI between two pairs of audio and video features is measured as stated by (15) and (16). If the E. Results Obtained with on the CUAVE Database approach is correct, the highest MI value should be measured To validate the results obtained with this simple detection between the video features of the speaking mouth and the audio , experiments on the CUAVE database features resulting from the optimization on the active speaker.
have been performed. Recall that a two second analysis window The values of MI are plotted in Fig. 3. We note that for all se- is shifted in one second steps over a given sequence. Due to quences (including the challenging sequence 5), the MI mea- the resulting overlap between the windows, the evaluation is re- optimized on this same region is stricted to the second half of each detection window, except for always strikingly greater than all the other 3. Indeed, in all these the very first window one. The results are then evaluated based is the speaking mouth, which gives 100% cor- on the experimental framework described in [35]: the ground rect detections, a rather encouraging result.
truth for the evaluation window takes the label that mainly oc- Another issue that it is necessary to investigate is whether the curs over these 30 frames (speaker 1 or speaker 2). Since our features extracted from audio are specific to speech production.
detector is not tuned to detect a silent state, the silent frames are For this, the difference between the normalized MI computed not considered. The results are listed in Table IV. As a compar- on mouth regions and the corresponding audio is measured as ison, the average rate of correct detections over the 11 sequences when using a simple motion-based detector (the highest powervalue of the video features indicating the speaking mouth) is60%. These results indicate that the use of both audio and videoinformation significantly improves the detection.
It is interesting to compare these results to those presented by Nock et al. in [3]. They compute the MI at each pixel location, considering the difference of pixel intensity as video features,and MFCCs as audio features. In a first stage, the highest totalMI value in the left and right of the image is assumed to in-dicate the current speaker (76% of correct detections in such ascheme). In a second experiment, the highest concentration of region indicates the speaking mouth. It is classified as correct if this region falls within a square centered around the true speaking mouth ( The results are listed in Table II. It can be seen that to 200). They obtain 70% of correct detection with this detec- for all the sequences. But tion scheme, which is the most directly comparable to ours since times negative. In other words, when the audio features are ob- we also limit the MI measure to mouth regions. Our results are tained on the nonspeaking mouth region, the difference of MI is significantly better, thus the optimization of the audio features sometimes favoring the nonspeaking mouth (sequences 2, 4, and as presented in this work leads to better classification results.
5). So when optimizing on the nonspeaking region, the features A comparative study of the classification results obtained with extracted cannot (and are not expected to) reflect any underlying and without introducing the feature extraction step prior to the relationship between audio and video. This result also appears classification has been performed in [36]. As a result, the per- in Fig. 3, since the MI measured between formance of the classification process is increased when audio ways smaller than the one measured between features at the input of the classifier are an optimized linear Therefore, the audio features can be said to be specific to speech combination of MFCCs instead of a simple average of the same BESSON et al.: EXTRACTION OF AUDIO FEATURES SPECIFIC TO SPEECH PRODUCTION Fig. 3. MI measured between the M or M mouth region features and the audio featuresF obtained with optimization on mouth region M or M [(15) and (16)] for the five in-house sequences. The normalized difference of MI between the best value found and the corresponding value found in theopposite mouth is indicated.
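The evaluation protocol of Section VI-E and the normalized difference of MI used throughout Section VI can be illustrated by a short sketch. It is only a reading of the text under stated assumptions: the function names, the use of label 0 for silent frames, and the tie-breaking rule are hypothetical and are not taken from the framework of [35].

```python
from collections import Counter


def normalized_mi_difference(mi_reference, mi_other):
    """Difference of MI between two mouth regions, normalized by the
    MI of the reference region (e.g. the speaking mouth)."""
    if mi_reference == 0:
        raise ValueError("reference MI must be non-zero")
    return (mi_reference - mi_other) / mi_reference


def window_ground_truth(frame_labels, silent_label=0):
    """Label that mainly occurs over the evaluation frames.

    Silent frames are ignored (assumption: they carry `silent_label`).
    Returns None when every frame is silent. Ties go to the label seen
    first, which is an arbitrary choice of this sketch.
    """
    voiced = [label for label in frame_labels if label != silent_label]
    if not voiced:
        return None
    return Counter(voiced).most_common(1)[0][0]


def detection_rate(decisions, labels_per_window, silent_label=0):
    """Fraction of non-silent evaluation windows decided correctly."""
    correct = 0
    total = 0
    for decision, labels in zip(decisions, labels_per_window):
        truth = window_ground_truth(labels, silent_label)
        if truth is None:
            continue  # fully silent windows are not considered
        total += 1
        if decision == truth:
            correct += 1
    return correct / total if total else float("nan")
```

With one decision per second, `labels_per_window` would hold the per-frame speaker labels of the evaluated part of each window (about 30 frames at 29.97 fps).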
TABLE II. NORMALIZED DIFFERENCE OF MI MEASURED BETWEEN THE TWO MOUTH REGIONS WITH THE AUDIO FEATURES OBTAINED WITH OPTIMIZATION ON EACH MOUTH REGION. THE OPTIMIZATION CRITERION USED IN BOTH CASES IS ECC

TABLE III. NORMALIZED DIFFERENCE OF MI MEASURED BETWEEN THE SPEAKING AND THE NONSPEAKING MOUTH REGIONS WITH THE AUDIO FEATURES OBTAINED USING ΔECC AS COST FUNCTION (TESTS PERFORMED ON THE IN-HOUSE DATABASE)

TABLE IV. RESULTS ON THE CUAVE SEQUENCES, USING THE EVALUATION FRAMEWORK GIVEN IN [35] WITH EVALUATION ON THE LAST SECOND OF EACH DETECTION WINDOW (SILENT WINDOWS ARE NOT CONSIDERED)

VII. CONCLUSION

We have presented a method that exploits the common content of speech audio and video signals to detect the active speaker among different candidates. This method uses the information theoretic framework in a similar way to that in [1] and [2] to derive optimized audio features with respect to the video ones. No assumption is made about the distributions of the features. They are rather estimated from the samples. Moreover, no approximation of the MI-based cost functions is used but the optimization is performed in a straightforward manner using a global method, the Differential Evolution algorithm.

A study of two optimization criteria that can be used in this information theoretic framework has been carried out. Results show that the best performing criterion (namely, ECC) is able to extract audio features that are specifically related to the speaker video features. Using only these extracted features, the algorithm performs detection of the current speaker with 100% correct detection on five in-house test sequences.

In order to optimize the detection in the case of two-people sequences, a third optimization criterion, ΔECC, has been introduced and tested on the same sequence set as before as well as on the more widely used CUAVE database [34]. This criterion aims at simplifying the detection scheme, as well as improving the audio feature specificity by taking advantage of the video information related to both mouth regions. Indeed, the resulting audio features have been shown to be even more specific than with the previous optimization criterion. A number of experiments have therefore been carried out on 11 sequences of the CUAVE database to assess and compare the performance of this ΔECC-based detection method to the results presented in [3]. In the latter, MFCCs are used as audio features, without any optimization. The better results achieved by our method show that optimizing the features improves the classifier performance.
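As an illustration of the global optimization step that the method relies on (Section V), one generation of a DE/rand/1/bin scheme with the bound repair and normalization described there might be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `F` and `CR` follow common DE notation, the bounds of -1 and 1, the unit Euclidean norm used for normalization, the reading of "pre-mutation value" as the base vector element, and the minimized toy cost are all assumptions.

```python
import math
import random


def de_generation(pop, cost, F=0.5, CR=0.9, lo=-1.0, hi=1.0, rng=random):
    """One DE/rand/1/bin generation (the cost is minimized here).

    pop  : list of parameter vectors (lists of floats), at least 4 of them
    cost : callable mapping a vector to a float
    """
    n, d = len(pop), len(pop[0])
    new_pop = []
    for i, parent in enumerate(pop):
        # Three distinct vectors, all different from the parent.
        r1, r2, r3 = rng.sample([k for k in range(n) if k != i], 3)
        base = pop[r1]
        forced = rng.randrange(d)  # guarantees at least one perturbed element
        child = []
        for j in range(d):
            if j == forced or rng.random() < CR:
                # Perturbation: scaled difference of two population vectors.
                value = base[j] + F * (pop[r2][j] - pop[r3][j])
                # Repair: mean of the pre-mutation value and the violated bound.
                if value < lo:
                    value = 0.5 * (base[j] + lo)
                elif value > hi:
                    value = 0.5 * (base[j] + hi)
            else:
                value = parent[j]
            child.append(value)
        # Normalization of the child vector (assumed: unit Euclidean norm).
        norm = math.sqrt(sum(v * v for v in child))
        if norm > 0.0:
            child = [v / norm for v in child]
        # Pair competition between child and parent keeps the size constant.
        new_pop.append(child if cost(child) <= cost(parent) else parent)
    return new_pop
```

Because each child only replaces its own parent when it is at least as good, the best cost in the population cannot get worse from one generation to the next; to maximize an MI-based criterion, the negated criterion could be passed as `cost`.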
Only two potential speakers are present in these test sequences but the method can easily be extended to sequences containing more speaker candidates using ECC as the optimization criterion. These speakers should remain face on to the camera. However, it is not a problem if they move, provided the mouth detector is able to deal with moving faces. In its current form, the computation time does not allow the algorithm to be used in real-time applications. An example application would be multimedia indexing.

Future work aims at extending this optimization process to the video features, and introducing a silent state detection.

ACKNOWLEDGMENT

The authors would like to thank Prof. P. Vandergheynst, Prof. P. Frossard, Prof. A. C. Davison, Dr. T. Butz, Dr. X. Bresson, G. Monaci, and P. Berlin for fruitful discussions.

REFERENCES

[1] T. Butz and J.-P. Thiran, "From error probability to information theoretic (multi-modal) signal processing," Signal Process., vol. 85, pp. 875–902, 2005.
[2] J. W. Fisher and T. Darrell, "Speaker association with signal-level audiovisual fusion," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 406–413.
[3] H. J. Nock, G. Iyengar, and C. Neti, "Speaker localisation using audio-visual synchrony: An empirical study," in Proc. Int. Conf. Image and Video Retrieval (CIVR), Urbana, IL, Jul. 2003, pp. 488–499.
[4] J. Hershey and J. Movellan, "Audio-vision: Using audio-visual synchrony to locate sounds," in Proc. NIPS, Denver, CO, 1999, vol. 12, pp. 813–819.
[5] G. Monaci, O. D. Escoda, and P. Vandergheynst, "Analysis of multimodal signals using redundant representations," in Proc. IEEE Int. Conf. Image Processing (ICIP'05), Genoa, Italy, Sep. 2005.
[6] M. Slaney and M. Covell, "FaceSync: A linear operator for measuring synchronisation of video facial images and audio tracks," in Proc. NIPS, 2001, vol. 13.
[7] P. Smaragdis and M. Casey, "Audio/visual independent components," in Proc. ICA, Nara, Japan, Apr. 2003, pp. 709–714.
[8] T. Butz and J.-P. Thiran, "Feature space mutual information in speech-video sequences," in Proc. ICME, Lausanne, Switzerland, 2002, vol. 2, pp. 361–364.
[9] E. Parzen, "On estimation of a probability density function and mode," Ann. Mathemat. Statist., vol. 33, pp. 1065–1076, 1962.
[10] J. W. Fisher and J. C. Principe, "A methodology for information theoretic feature extraction," in Proc. Int. Joint Conf. Neural Networks, Anchorage, AK, May 1998, vol. 3, pp. 1712–1716, ser. IEEE World Congress on Computational Intelligence.
[11] R. Storn and K. Price, "Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces," J. Global Optimiz., vol. 11, pp. 341–359, 1997.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, D. L. Schilling, Ed. New York: Wiley, 1991.
[13] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, pp. 185–203, 1981.
[14] J. Meynet, V. Popovici, and J.-P. Thiran, Face Detection with Mixtures of Boosted Discriminant Features, EPFL, 1015 Ecublens, Tech. Rep. 2005-35, Nov. 2005.
[15] P. Besson and M. Kunt, Information theoretic optimization of audio features for multimodal speaker detection, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland, EPFL-ITS Tech. Rep. 08/2005, Feb. 2005.
[16] B. Gold and N. Morgan, Speech and Audio Signal Processing. New York: Wiley, 2000.
[17] J. W. Picone, "Signal modeling techniques in speech recognition," Proc. IEEE, vol. 81, no. 9, Sep. 1993.
[18] A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis. New York: Oxford University Press, 1997.
[19] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, May 1983.
[20] F. Glover, "Future paths for integer programming and links to artificial intelligence," Comput. and Oper. Res., vol. 13, no. 5, pp. 533–549.
[21] J. H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor: Univ. of Michigan Press, 1975.
[22] T. Spalek, P. Pietrzyk, and Z. Sojka, "Application of the genetic algorithm joint with the Powell method to nonlinear least-squares fitting of powder EPR spectra," J. Chem. Inf. Model., vol. 45, pp. 18–29, 2005.
[23] A. A. Cole-Rhodes, K. L. Johnson, J. LeMoigne, and I. Zavorin, "Multiresolution registration of remote sensing imagery by optimization of mutual information using a stochastic gradient," IEEE Trans. Image Process., vol. 12, no. 12, pp. 1495–1511, Dec. 2003.
[24] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed. Cambridge, U.K.: Cambridge University Press, 1992.
[25] P. Besson, M. Kunt, T. Butz, and J.-P. Thiran, "A multimodal approach to extract optimized audio features for speaker detection," in Proc. European Signal Processing Conf. (EUSIPCO), Antalya, Turkey, Sep.
[26] P. Schroeter, J.-M. Vesin, T. Langenberger, and R. Meuli, "Robust parameter estimation of intensity distributions for brain magnetic resonance images," IEEE Trans. Med. Imag., vol. 17, no. 2, pp. 172–186.
[27] V. Vaerman, "Multi-dimensional object modeling with application to medical image coding," Ph.D. dissertation, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland, 1999.
[28] X. Qi and F. Palmieri, "Theoretical analysis of evolutionary algorithms with an infinite population size in continuous space, Part I: Basic properties of selection and mutation. Part II: Analysis of the diversification role of the crossover," IEEE Trans. Neural Netw., vol. 5, no. 1, pp. 102–129, Jan. 1994.
[29] I. Rechenberg, Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart: Frommann-Holzboog, 1973.
[30] H.-P. Schwefel, Numerical Optimization of Computer Models. Chichester, U.K.: Wiley, 1981.
[31] K. V. Price, "An introduction to differential evolution," in New Ideas in Optimization. New York: McGraw-Hill, 1999, ch. 6, pp. 79–108.
[32] Y.-W. Leung and Y. Wang, "An orthogonal genetic algorithm with quantization for global numerical optimization," IEEE Trans. Evol. Comput., vol. 5, no. 1, pp. 41–53, Feb. 2001.
[33] R. Joshi and A. C. Sanderson, "Minimal representation multisensor fusion using differential evolution," IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 29, no. 1, pp. 63–76, 1999.
[34] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Orlando, FL, May 2002, vol. 2, pp. 2017–2020.
[35] P. Besson, G. Monaci, P. Vandergheynst, and M. Kunt, Experimental evaluation framework for speaker detection on the CUAVE database, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland, Tech. Rep. TR-ITS-2006.003, Jan. 2006.
[36] P. Besson and M. Kunt, "Hypothesis testing as a performance evaluation method for multimodal speaker detection," in Proc. 2nd Int. Workshop on Biosignal Processing and Classification (BPC2006), ICINCO, Setubal, Portugal, 2006, pp. 106–115.

Patricia Besson was born in Charenton-le-Pont, France, in July 1977. She received the M.Sc. degree in biomedical engineering from the University of Lyon (UCBL), Lyon, France, in June 2001 and the Ph.D. degree from the Signal Processing Institute (ITS), Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in June 2007.
She spent the 2000–2001 academic year as an exchange student at the University of Montreal (UdM), Montreal, QC, Canada. In June 2002, she joined the ITS, EPFL, where she started a thesis on the multimodal detection of the speaker in audiovisual sequences under the supervision of Prof. Murat Kunt. In October 2007, she will join the Laboratory "Motion and Perception" at the National Center for Scientific Research (CNRS), Marseille, France, as a Postdoctoral Researcher. Her research will focus on modeling the multisensory perception and control of self-orientation in space by humans.
Vlad Popovici received [...] in computer science from the Technical University of Cluj-Napoca, Romania, in 1998 and 1999, respectively, and the Ph.D. degree from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 2004.
He is a Researcher with the Swiss Institute of Bioinformatics (SIB), working on different theoretical and applied aspects of statistical pattern recognition. Before joining SIB, he was with the Digital Media Institute, Tampere University of Technology, as a Research Scientist, and with the Signal Processing Institute, EPFL, as an Assistant Researcher and later as a Postdoctoral Researcher. His current research focuses on sparse classifiers and multiple classifier systems, and their applications to life sciences.

Jean-Marc Vesin (M'98) graduated from the Ecole Nationale Superieure d'Ingenieurs Electriciens de Grenoble (ENSIEG), Grenoble, France, in 1980. He received the M.Sc. degree from Laval University, Quebec City, QC, Canada, in 1984, where he spent four years on research projects. He received the Ph.D. degree from the Signal Processing Institute (ITS), Swiss Federal Institute of Technology (EPFL), [...].
He was previously involved in research projects at Laval University and later spent two years working in industry. He is currently in charge of the activities in 1-D signal processing at ITS, EPFL. His main interests are biomedical signal processing, adaptive signal modelling and analysis, and the applications of genetic algorithms in signal processing. He has authored or co-authored more than 30 publications in renowned peer-reviewed journals, as well as several book chapters.

Murat Kunt (SM'70–M'74–SM'80–F'86) was born in Ankara, Turkey, in 1945. He received the M.S. degree in physics and the Ph.D. degree in electrical engineering, both from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 1969 and 1974, respectively.
From 1974 to 1976, he was a Visiting Scientist at the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, where he developed compression techniques for X-ray images and electronic image files. In 1976, he returned to the EPFL where he is presently a Professor of Electrical Engineering and Director of the Signal Processing Institute (ITS), one of the largest at EPFL. He conducts teaching and research in digital signal and image processing with applications to modeling, coding, pattern recognition, scene analysis, industrial developments and biomedical engineering. His laboratory participates in a large number of European projects under various programmes such as Esprit, Eureka, Race, HCM, Commett and Cost. He is the author or the co-author of more than 200 research papers and 15 books, and holds seven patents. He supervised more than 60 Ph.D. students, some of them being today university professors. He consults for governmental offices, including the French General Assembly.
Dr. Kunt has been the Editor-in-Chief of Signal Processing for 28 years and is a founding member of EURASIP, the European Association for Signal Processing. He is now the Editor-in-Chief of Signal, Image and Video Processing (Springer) and serves as a Chairman and/or a member of the Scientific Committees of several international conferences and on the editorial boards of the PROCEEDINGS OF THE IEEE, Pattern Recognition Letters, and Traitement du Signal. He was the Co-Chairman of the first European Signal Processing Conference held in Lausanne in 1980 and the General Chairman of the International Image Processing Conference (ICIP'96) held in Lausanne in 1996. He was the President of the Swiss Association for Pattern Recognition from its creation until 1997. He received the Gold Medal of EURASIP for meritorious services, the IEEE Acoustics, Speech, and Signal Processing's Technical Achievement Award, the IEEE Third Millennium Medal, an honorary doctorate from the Catholic University of Louvain, the Technical Achievement Award of EURASIP, and the Imaging Scientist of the Year Award of the IS&T and SPIE in 1983, 1997, 2000, 2001, and 2003, respectively.
Jean-Philippe Thiran (M'93–SM'05) was born in Namur, Belgium, in 1970. He received the Elect.Eng. and Ph.D. degrees from the Universite catholique de Louvain (UCL), Louvain-la-Neuve, Belgium, in 1993 and 1997, respectively.
He joined the Signal Processing Institute (ITS), Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in February 1998 as a Senior Lecturer. Since January 2004, he has been an Assistant Professor, responsible for the Image Analysis Group. His current scientific interests include image segmentation, prior knowledge integration in image analysis, partial differential equations and variational methods in image analysis, multimodal signal processing, medical image analysis, including multimodal image registration, segmentation, computer-assisted surgery, and diffusion MRI. He is author or co-author of two book chapters, 51 journal papers, and some 90 peer-reviewed papers published in proceedings of international conferences. He holds four international patents.
Dr. Thiran was Co-Editor-in-Chief of Signal Processing (published by Elsevier Science) from 2001 to 2005. He is currently an Associate Editor of the International Journal of Image and Video Processing (published by Hindawi), and member of the Editorial Board of Signal, Image and Video Processing (published by Springer). He will be the General Chairman of the 2008 European Signal Processing Conference (EUSIPCO 2008).

